[webkit-dev] Writing a new XML parser with no external libraries

Tue Jun 28 20:07:57 PDT 2011

Consolidating replies to avoid spamming the thread:

On Jun 28, 2011, at 6:26 PM, Adam Barth wrote:

> A question and a comment:
> 
> 1) Will this let us remove the code for both the libxml2 and the
> QtXml parsers?  I'd certainly much rather have one XML parser than
> three.

If the new parser pans out, I think it would be up to individual ports to decide when/whether to migrate to it. I hope it can obsolete at least the libxml2 parser, though it won't by itself eliminate the need to link libxml2 due to XSLT. If it doesn't pan out, then presumably we won't leave dead code in the tree.

On Jun 28, 2011, at 6:44 PM, Wyatt Carss wrote:

> If that were all, would it be possible to patch libxml2 to use UTF-16? That might be less of an undertaking than writing a new xml library, but that could just be my youthful naivety..

Maintaining such a patch set in the face of upstream libxml2 security fixes would probably be challenging, enough so that investing some up-front effort in writing our own parser may be a better solution.

On Jun 28, 2011, at 6:30 PM, Dirk Pranke wrote:

> Can you expand a bit more on "using libxml2 exposes its own share of problems"?

Besides the charset conversion performance issue mentioned by Jeffrey, and the need for hacks mentioned by Eric, here are some others:

- Our code to glue libxml2 to WebCore is a surprisingly frequent source of crashers and security bugs.
- libxml2 has security bugs reasonably often, and creates the need for an extra upstream update to pull those fixes.
- Improving performance or security of libxml2 in a systematic way, or adding features, has relatively high barriers to entry for WebKit developers.
- libxml2 is yet another dependency, and it would be nice to have fewer. While this project won't eliminate the need for it entirely, it is one required step.
- libxml2 contains a bunch of stuff we don't need, like its own HTML4 parser, XPointer, XInclude, XML Schemas, RelaxNG, XML Catalogs…

On Jun 28, 2011, at 6:50 PM, Eric Seidel wrote:

> I'm in general in favor of this effort (having worked extensively on
> the existing XML parsers).
> 
> But I would caution you that xml is a ridiculously tiny fraction of
> the web.  And it may not be worth the engineering effort to make a
> better parser.

The XML parser has uses beyond what is obvious from files named .xml on the Web. It is used on the websites of some notable Web experts, for non-inline SVG images (which can be found on Wikipedia, a rather popular site), for processing XML responses returned to XMLHttpRequest, and for processing non-Web content using the Web engine, such as ePub books. So it punches above its weight in a number of ways. Also, security bugs don't care how popular or unpopular a format is.

Also, based on your methodology, one would conclude we should spend 5 times as much effort on XML as on PNG, which, while less than the effort spent on HTML, is a good deal more than 0:

http://www.google.com/search?q=filetype:png

So I think it is worth some investment. Much like XBL2, this is one of those longstanding items that need to rise to the top of someone's list.

Additional note: a native XML parser was a fairly heavily discussed potential project at the WebKit contributors' meeting.

Regards,
Maciej