[webkit-dev] Writing a new XML parser with no external libraries

Wed Jun 29 06:55:57 PDT 2011

On Tue, Jun 28, 2011 at 6:50 PM, Eric Seidel <eric at webkit.org> wrote:
>
> I'm in general in favor of this effort (having worked extensively on
> the existing XML parsers).
>
> But I would caution you that xml is a ridiculously tiny fraction of
> the web.  And it may not be worth the engineering effort to make a
> better parser.
>
> http://www.google.com/search?q=filetype:html = 25,270,000,000
> http://www.google.com/search?q=filetype:xml = 71,000,000
>

I can't let this one just pass by! ;)

First, Google's filetype: operator matches on file extension, not
media type [1], so those counts are not an accurate accounting of the
amount of XML on the web.  Second, even using only file extensions,
you'd have to enumerate and sum all the extensions used by all XML
media types (e.g. .xhtml, .svg, etc.).  Third, there is plenty of
content on the web that Google does not crawl (the "dark web") where
there are petabytes of XML waiting for browsers to do something with
it (e.g. astronomical data cone search services).
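
To illustrate the first two points, here is a rough sketch (not a
measurement of anything) of how counting by extension and counting by
media type can disagree.  The helper name, the sample URLs and the
Content-Type values are all made up for illustration; the media type
test assumes the RFC 3023 conventions (text/xml, application/xml, and
the +xml suffix).

```python
# Hypothetical helper: decide whether a resource is XML from its media
# type rather than from its file extension. Assumes RFC 3023 conventions.
def is_xml_media_type(content_type: str) -> bool:
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return (
        media_type in ("text/xml", "application/xml")
        or media_type.endswith("+xml")
    )


# Illustrative (made-up) URL / Content-Type pairs.
samples = [
    ("http://example.org/feed", "application/atom+xml; charset=utf-8"),
    ("http://example.org/logo.svg", "image/svg+xml"),
    ("http://example.org/page.html", "application/xhtml+xml"),
    ("http://example.org/index.html", "text/html"),
    ("http://example.org/data.xml", "text/plain"),
]

# Counting the way an extension-based filter for ".xml" would.
xml_by_extension = [url for url, _ in samples if url.endswith(".xml")]

# Counting by what the server says the content actually is.
xml_by_media_type = [url for url, ctype in samples if is_xml_media_type(ctype)]

print(len(xml_by_extension), len(xml_by_media_type))
```

In this toy sample the extension filter finds one resource, and that
one is not even served as XML, while the media type test finds three
that the extension filter misses entirely.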

I know the parser's speed is terrible as I've measured it recently.
This is partially due to some of the things we are doing to deal with
Unicode decoding to work around libxml2 issues.  I think moving to
native strings and decoding would improve the speed by a huge amount.
It would be well worth it to some of us to fix this.

[1] http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35287

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics