[Java] Web Crawler: extracting the links from an HTML page and following them

**AnnaVerde1984** · 20-06-2007, 21:39

Yes, the code does work. The only problem is that when I try to iterate, i.e. pass it the link it has just discovered as the new source, I get an error... Here is the code with my changes (the Graph and Node classes belong to a library, unina2, that I have to use, and the main is invoked separately from another class):

Code:

package unina2.parsers.html;


import unina2.graph.gui.*;

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;


public class translateHTML {

    public translateHTML() {}

    Graph webGraph;

    public void crawlweb(String source) {

        webGraph = new Graph();

        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet
        // handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {

            Reader rd = getReader(source);

            int startPos = source.indexOf(".") + 1;
            int endPos = source.length();

            String webName = source.substring(startPos, endPos);

            Node root = webGraph.addNode(webName, 150, 100, 10);

            // Parse the HTML.
            kit.read(rd, doc, 0);

            // Iterate through the elements
            // of the HTML document.
            ElementIterator it = new ElementIterator(doc);
            javax.swing.text.Element elem;
            while ((elem = it.next()) != null) {
                SimpleAttributeSet s = (SimpleAttributeSet) elem.getAttributes().getAttribute(HTML.Tag.A);
                if (s != null) {
                    System.out.println(s.getAttribute(HTML.Attribute.HREF));

                    String r = (String) s.getAttribute(HTML.Attribute.HREF);

                    Node nodofiglio = webGraph.addNode(r, 100, 50, 15);
                    webGraph.addEdge(root, nodofiglio, 90);

                    crawlweb(r);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    public static Reader getReader(String uri) throws IOException {
        if (uri.startsWith("http:")) {
            // Retrieve from Internet.
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        } else {
            // Retrieve from file.
            int startPos = uri.indexOf("/") + 1;
            int endPos = uri.length();

            String fileName = uri.substring(startPos, endPos);

            return new FileReader(fileName);
        }
    }

    public Graph getSchemaGraph() {
        return webGraph;
    }
}
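
A possible explanation for the error on recursion, offered only as a guess since the stack trace is not shown: the `href` values passed back into `crawlweb` are often relative (`page2.html`), non-HTTP (`mailto:...`) or missing altogether, and `getReader` treats anything that does not start with `http:` as a local file name. On top of that, `webGraph = new Graph()` runs on every call, so each recursive step replaces the graph, and nothing prevents the same page from being visited again and again. The sketch below shows one way to resolve each link against the page it was found on and to remember the URLs already seen. `LinkNormalizer`, `resolve` and `markVisited` are hypothetical names introduced here for illustration; they are not part of unina2 or of the code above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of unina2): turns raw href values into
// absolute http/https URLs and keeps track of the URLs already crawled.
class LinkNormalizer {

    private final Set<String> visited = new HashSet<>();

    // Resolves 'href' against the URL of the page it was found on.
    // Returns null when the link is missing, malformed, or not http/https.
    static String resolve(String base, String href) {
        if (href == null || href.trim().isEmpty()) {
            return null;
        }
        try {
            URI baseUri = new URI(base);
            // A base such as "http://host" has an empty path; give it "/"
            // so that relative links resolve under the site root.
            if (baseUri.getPath() != null && baseUri.getPath().isEmpty()) {
                baseUri = baseUri.resolve("/");
            }
            URI resolved = baseUri.resolve(href.trim());
            String scheme = resolved.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
                return null; // mailto:, javascript:, ftp:, ...
            }
            // Drop the fragment: "#section" points into the same page.
            String url = resolved.toString();
            int hash = url.indexOf('#');
            return hash >= 0 ? url.substring(0, hash) : url;
        } catch (URISyntaxException | IllegalArgumentException e) {
            return null; // base or href could not be parsed as a URI
        }
    }

    // Returns true the first time a URL is seen, false on every later call.
    boolean markVisited(String url) {
        return visited.add(url);
    }
}
```

In `crawlweb`, one could call `LinkNormalizer.resolve(source, r)` before recursing and skip the link when the result is `null` or when `markVisited` returns `false`. The `Graph` would then be created once, outside the recursive method, so that all pages end up in the same graph. Note that `getReader` as written only recognises the `http:` prefix, so `https:` links would need a matching change there.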
