[webkit-help] Using webkit for headless browsing

Wed Jul 11 13:08:33 PDT 2012

Hi,

I am using webkit-based tools to build a headless browser for crawling
webpages (I need this rather than curl because I want to evaluate the
javascript found on pages and fetch the final rendered page). However, both
systems I have implemented so far exhibit very poor performance. Both use
webkit as the backend:

   1. Using Google Chrome: I start Google Chrome and communicate with each
   tab over the WebSockets that Chrome exposes for remote debugging
   (debugging over wire
   <https://developers.google.com/chrome-developer-tools/docs/remote-debugging>).
   This way I can control each tab, load a new page and, once the page has
   loaded, fetch the DOM of the loaded webpage.
   2. Using phantomjs: phantomjs
   <http://code.google.com/p/phantomjs/wiki/QuickStart> uses webkit to load
   pages and provides a headless browsing option. As explained in the
   phantomjs examples, I use page.open to open a new URL and then, once the
   page has loaded, fetch the DOM by evaluating javascript on the page.
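
To make the Chrome approach in item 1 concrete, here is a minimal sketch of the JSON commands a client might send over a tab's debugging WebSocket. It is only an illustration: the method names (Page.enable, Page.navigate, Runtime.evaluate) are taken from the current Chrome DevTools Protocol and are an assumption here, since they may differ from whatever the 2012 protocol version accepted, and the helper names command and load_and_dump_commands are hypothetical.

```python
import itertools
import json

# Each command carries a unique id so the reply can be matched to the request.
_ids = itertools.count(1)


def command(method, **params):
    """Serialize one remote-debugging command as a JSON text frame."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def load_and_dump_commands(url):
    """Return the frames to send, in order, to load `url` and read its DOM.

    Assumed method names from the current DevTools Protocol; a real client
    would wait for the page load event between the navigate and evaluate
    steps instead of sending all three back to back.
    """
    return [
        command("Page.enable"),            # turn on page events for this tab
        command("Page.navigate", url=url),  # start loading the page
        command(
            "Runtime.evaluate",             # run javascript in the page
            expression="document.documentElement.outerHTML",
            returnByValue=True,
        ),
    ]
```

Sending the frames needs a WebSocket client connected to the tab's debugger URL, which is left out here because it depends on the library chosen.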

My goal is to crawl pages as fast as I can and if the page does not load in
the first 10 seconds, declare it failed and move on. I understand that each
page takes a while to load, so to increase the number of pages I load per
second, I open many tabs in Chrome or start multiple parallel processes
using phantomjs. The following is the performance that I observe:

   1. If I open more than 20 tabs in Chrome / 20 phantomjs instances, the
   CPU usage rockets up.
   2. Due to the high CPU usage, a lot of pages take more than 10 seconds to
   load and hence I have a higher failure rate (~80% of page load requests
   failing).
   3. If I want to keep failures below 5% of the total requests, I cannot
   load more than 1 URL per second.
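
The crawl policy described above (many loads in flight, each abandoned after 10 seconds) and the failure-rate measurement can be sketched roughly as follows. This is not the code I run; it is a minimal illustration in Python, where load_page is a hypothetical coroutine standing in for whichever backend (a Chrome tab or a phantomjs process) fetches the rendered page, and crawl and failure_rate are made-up names.

```python
import asyncio


async def crawl(urls, load_page, max_concurrent=20, timeout_s=10.0):
    """Load every URL with at most `max_concurrent` loads in flight.

    `load_page` is a hypothetical coroutine: url -> rendered page content.
    Returns a dict mapping each url to its content, or to None if the load
    raised an error or did not finish within `timeout_s` seconds.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def load_one(url):
        async with semaphore:
            try:
                # The deadline starts only once a slot is free, so time spent
                # queued behind other loads does not count against the page.
                results[url] = await asyncio.wait_for(load_page(url), timeout_s)
            except Exception:
                # Covers the timeout as well as any loader error.
                results[url] = None

    await asyncio.gather(*(load_one(u) for u in urls))
    return results


def failure_rate(results):
    """Fraction of URLs that failed or timed out (0.0 for an empty run)."""
    if not results:
        return 0.0
    failed = sum(1 for content in results.values() if content is None)
    return failed / len(results)
```

Running the same URL list with different max_concurrent values and comparing failure_rate is one way to find the highest concurrency that still stays under a target such as 5%.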

After trying out both webkit-based systems, it looks like the performance
bottleneck is the webkit rendering engine, so I would like to hear from
other users here what performance I can expect. My hardware configuration
is:

   1. Processor: Intel® Core™ i7-2635QM (1 processor, 4 cores)
   2. Graphics card: AMD Radeon HD 6490M (256MB)
   3. Memory: 4GB
   4. Network bandwidth: sufficient to load pages faster than the rate I am
   observing

My question for this mailing list: does anyone have experience using webkit
to crawl a random set of URLs (say, 10k URLs picked from the Twitter
stream), and what performance can I expect?

Thanks,
Bhanu