QUICK TIP - Crawling Web Pages Made Easy with PHPCrawl

Uwe Hunfeld provides an object-oriented PHP library called PHPCrawl, available at http://phpcrawl.cuab.de. The class can crawl web pages with many different parameters, and it lets you process each page to do whatever manipulation or scraping you need.
What is PHPCrawl?
PHPCrawl is a framework for crawling/spidering websites, written in PHP; in short, it is a webcrawler library or crawler engine for PHP.
PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files and so on) to users of the library for further processing.
It is highly configurable and provides several options to specify the behaviour of the crawler, such as URL and content-type filters, cookie handling, robots.txt handling, limiting options, multiprocessing and much more.
PHPCrawl is completely free open-source software, licensed under the GNU GENERAL PUBLIC LICENSE v2.
To get a first impression of how to use the crawler, you may want to take a look at the quickstart guide or at one of the examples in the manual section.
A complete reference and documentation of all available options and methods of the framework can be found in the class-references section.
The current version of the phpcrawl package and older releases can be downloaded from a SourceForge mirror.
Requirements
At least the following requirements are necessary to run phpcrawl (v0.8) in basic single-process mode:
- PHP 5.2.1 or a later version
- PHP with OpenSSL support for SSL connections (https); not necessary for plain HTTP connections
In order to run phpcrawl in multi-process mode, some additional requirements have to be met (see the sketch after this list):
- The multi-process mode only works on unix-based systems (e.g. Linux)
- Scripts using the crawler in multi-process mode have to be run from the command line (PHP CLI)
- The PCNTL extension for PHP (process control) has to be installed and activated
- The SEMAPHORE extension for PHP has to be installed and activated
- The POSIX extension for PHP has to be installed and activated
- The PDO extension together with the SQLite driver (PDO_SQLITE) has to be installed and activated
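If those requirements are met, starting a multi-process crawl looks much like the single-process case. Here is a minimal sketch, assuming the MyCrawler subclass defined in the quickstart below and a CLI environment:

```php
<?php
include("libs/PHPCrawler.class.php");

$crawler = new MyCrawler(); // subclass of PHPCrawler, defined in the quickstart below
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");

// Start the crawl distributed over 5 processes instead of calling go().
// goMultiProcessed() only works from the PHP CLI on unix-based systems.
$crawler->goMultiProcessed(5);
```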
Installation & Quickstart
The following steps show how to use phpcrawl:
- Unpack the phpcrawl package somewhere. That's all you have to do for installation.
- Include the phpcrawl main class in your script or project. It's located in the "libs" path of the package.
```php
include("libs/PHPCrawler.class.php");
```
There are no other includes needed.
- Extend the PHPCrawler class and override the handleDocumentInfo method with your own code to process the information of every document the crawler finds on its way.
```php
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // Your code comes here!
        // Do something with the $PageInfo-object that contains all
        // information about the currently received document.

        // As an example we just print out the URL of the document
        echo $PageInfo->url."\n";
    }
}
```
For a list of all available information about a page or file within the handleDocumentInfo-method see the PHPCrawlerDocumentInfo-reference.
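As a concrete illustration, the handler below scrapes page titles. This is a minimal sketch: TitleCrawler is a hypothetical name, and it assumes the raw HTML of a received document is available in $PageInfo->source (it is, once the document's content type matches a receive rule):

```php
<?php
include("libs/PHPCrawler.class.php");

class TitleCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // Only parse documents whose content was actually received
        if ($PageInfo->received == true) {
            // Naive title extraction from the raw HTML in $PageInfo->source
            if (preg_match("#<title>(.*?)</title>#si", $PageInfo->source, $matches)) {
                echo $PageInfo->url.": ".trim($matches[1])."\n";
            }
        }
    }
}
```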
Note to users of phpcrawl 0.7x or earlier: the old, overridable method "handlePageData()", which receives the document information as an array, is still present and gets called. PHPCrawl 0.8 is fully compatible with scripts written for earlier versions.
- Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling process.
```php
$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");
$crawler->addContentTypeReceiveRule("#text/html#");
// ...
$crawler->go();
```
For a list of all available setup options/methods of the crawler, take a look at the PHPCrawler class reference.
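One family of options worth calling out is the URL-follow rules, which restrict where the crawler is allowed to go. The sketch below confines the crawl to one section of a site; the URL and regex are placeholder assumptions, not part of the original quickstart:

```php
$crawler = new MyCrawler();
$crawler->setURL("www.foo.com");

// Only follow links that stay inside the /blog/ section of the site
$crawler->addURLFollowRule("#^http://www\.foo\.com/blog/#i");

$crawler->go();
```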
A Complete Example
You accomplish this by extending the base class and implementing your own functionality in the handleDocumentInfo() and handleHeaderInfo() methods. Use the code below as an example of how to create your own web crawler.
```php
<?php
set_time_limit(1000); // Set execution length in seconds

include("libs/PHPCrawler.class.php");

// Extend the provided base class and override the handler functions
class NanoCrawler extends PHPCrawler
{
    // Process the document contents in $DocInfo->source here
    function handleDocumentInfo($DocInfo)
    {
        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")\n";
        echo "Referer-page: ".$DocInfo->referer_url."\n";

        if ($DocInfo->received == true) {
            echo "Content received: ".$DocInfo->bytes_received." bytes\n";
        } else {
            echo "Content not received\n";
        }

        echo "Links found: ".count($DocInfo->links_found_url_descriptors)."\n\n";
        flush();
    }

    // Process the headers like http_status_code, content_type, content_length,
    // content_encoding, transfer_encoding, cookies, source_url
    function handleHeaderInfo($header)
    {
        print_r($header);
    }
}

// Instantiate the new custom crawler class we defined
$crawler = new NanoCrawler();

// Set rules and params
$crawler->setURL("example.com");
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to roughly 10 MB (in bytes)
$crawler->setTrafficLimit(1000 * 1024 * 10);

// Set the user agent. It's not polite to lie about your user agent.
// If you are creating a bot or crawler, it is good to set the user agent to
// something unique that includes a way to contact you, like a website.
// This allows people to report any problems or block your user agent
// if it is causing problems. There are other situations where a website
// will reject anything but a familiar user agent; sometimes just putting
// "firefox" as a user agent will bypass filters.
$crawler->setUserAgentString("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36");

// Delay between requests in seconds
$crawler->setRequestDelay(0);

// Max number of pages to crawl
$crawler->setPageLimit(5, true);

// Whether to obey robots.txt; false means robots.txt is ignored
$crawler->obeyRobotsTxt(false);

// Follow-mode 1: follow links, but stay within the same domain
$crawler->setFollowMode(1);

// Run it. May take a while.
$crawler->go();

// Output crawl report
$report = $crawler->getProcessReport();
echo "Summary:\n";
echo "Links followed: ".$report->links_followed."\n";
echo "Documents received: ".$report->files_received."\n";
echo "Bytes received: ".$report->bytes_received." bytes\n";
echo "Process runtime: ".$report->process_runtime." sec\n";
echo "Memory peak usage: ".(($report->memory_peak_usage / 1024) / 1024)." MB\n";
```
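To try it, save the script under any name you like (e.g. nanocrawler.php, a placeholder) and run it with `php nanocrawler.php`. Running from the command line rather than through a web server avoids request timeouts on longer crawls, and it is mandatory if you later switch to multi-process mode.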