Downloading an HTML page (UTF-8)

**dionisoft** · 09-04-2010, 13:51

Hi, I've found that the method I've always used to download web pages doesn't handle Cyrillic, Polish and similar characters correctly.
Here is my method:

code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public String readHTML(String path) {
        String result = null;
        try {
                URL url = new URL(path);
                StringBuilder sbuf = new StringBuilder();
                HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
                // Plain GET: no request body is sent, so setDoOutput(true) is not needed
                httpURLConnection.setDoInput(true);
                httpURLConnection.setUseCaches(false);
                httpURLConnection.setRequestProperty("Referer", "http://www.google.com");
                httpURLConnection.setRequestProperty("User-Agent", "Internet Explorer");
                // The charset has to be UTF-8!!
                // try-with-resources closes the reader and the underlying stream
                try (BufferedReader br = new BufferedReader(new InputStreamReader(
                                httpURLConnection.getInputStream(), StandardCharsets.UTF_8))) {
                        String line;
                        // readLine() strips line terminators, so lines are joined without separators
                        while ((line = br.readLine()) != null) {
                                sbuf.append(line);
                        }
                }
                // Parse the file manually
                result = sbuf.toString();
        }
        catch (Exception e) {
                e.printStackTrace();
        }
        return result;
}
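
The method above hardcodes UTF-8, which is only right if the page really is served as UTF-8. A minimal sketch of an alternative, assuming the server declares the encoding in the `Content-Type` response header: read the charset from that header and fall back to UTF-8 otherwise. The class `ContentTypeCharset` and its method `fromContentType` are hypothetical names made up for this illustration, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=ISO-8859-1".
    // Falls back to UTF-8 when the header is missing, has no charset
    // parameter, or names a charset this JVM does not support.
    public static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

In `readHTML` this could be used as `ContentTypeCharset.fromContentType(httpURLConnection.getContentType())` in place of the fixed charset. It does not cover pages that declare their encoding only in a `<meta>` tag.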

The output I get looks like this:

OK : Communauté de communes du Bassin de Pompey -> Communauté_de_communes_du_Bassin_de_Pompey (the 'é' is correctly replaced by the entity '&eacute;')
ERR: Столична община -> ????????_??????
ERR: OŠ Vojke Šmuc Izola - SE Vojka Šmuc Isola -> O?_Vojke_?muc_Izola_-_SE_Vojka_?muc_Isola
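
One way to get exactly this pattern (é survives, Š and Cyrillic turn into `?`) is for the text to be encoded to ISO-8859-1 somewhere after it has been read, for example when it is printed or written out. This is only an assumption about where the loss might happen, not a diagnosis. A minimal sketch that reproduces the symptom, with the hypothetical class `QuestionMarkDemo` invented for the illustration:

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encodes the text to ISO-8859-1 bytes and decodes it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character it cannot represent; for ISO-8859-1 that is '?'.
    public static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Passing the three strings above through `roundTripLatin1` leaves 'é' alone, because it exists in ISO-8859-1, while 'Š' and every Cyrillic letter are replaced, because they do not.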

Can anyone help? Thanks!

PS: I add the underscores myself.
