Scraping links in PHP: I only want the domain name, how do I strip the rest?

**Sireal** · 21-01-2008, 17:25

I found this PHP script that scrapes the links from a page I choose and inserts them into a database. However, I don't want the scraping results to be links like:

http://nomdedominio.it/members.php

I want just:

http://nomedominio.it

I've found several ways to do the stripping, but I couldn't get any of them to work, and I don't have the slightest idea how to do it. Here's the code:

PHP code:


<?php

$db_host = "localhost";
$db_user = "ODBC";
$db_password = "";
$db_name = "ODBC";

$db = mysql_connect($db_host, $db_user, $db_password);
mysql_select_db($db_name, $db)
    or die("Error selecting the database.");

function storeLink($url, $gathered_from) {
    // escape both values so a quote inside a URL cannot break the query
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

$target_url = "http://notsecurity.com/";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "\ncURL error number:" . curl_errno($ch);
    echo "\ncURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "\nLink stored: $url";
}

mysql_close($db);

?>

How do I strip them?

**gianiaz** · 21-01-2008, 17:40

I didn't understand the term "scraping", but I think what you're looking for is this function:

http://fr.php.net/manual/en/function.parse-url.php
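
As a minimal sketch of how `parse_url` could be applied here: the helper below keeps only the scheme and host of a URL. The name `baseUrl` is hypothetical and not part of the original script, and falling back to `http` when a link has no scheme is an assumption.

```php
<?php
// Hypothetical helper (not in the original script): reduces a URL to
// "scheme://host", or returns null when the URL has no host part.
function baseUrl($url) {
    $parts = parse_url($url);

    // parse_url() returns false for seriously malformed URLs; relative
    // links such as "members.php" parse fine but have no 'host' key.
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Assumption: default to http when the link carries no scheme.
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';

    return $scheme . '://' . $parts['host'];
}

echo baseUrl('http://nomedominio.it/members.php'), "\n"; // prints http://nomedominio.it
```

Inside the `for` loop of the script, one option would be to pass each `href` through this helper before calling `storeLink`, skipping the links for which it returns `null` (relative links, `mailto:` addresses, in-page anchors).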

**Sireal** · 21-01-2008, 17:44

By scraping I mean capturing all the links from the text, in this case.
