[webkit-dev] Writing a new XML parser with no external libraries

Eric Seidel eric at webkit.org
Tue Jun 28 18:50:51 PDT 2011


Correct. We convert from UTF-16 to UTF-8 (for libxml2) and then back to UTF-16.
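
To make the cost of that round trip concrete, here is a minimal standalone sketch of the two conversions. The function names utf16ToUtf8 and utf8ToUtf16 are hypothetical and are not the WebKit or libxml2 APIs; the sketch assumes well-formed input and does no error recovery.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not WebKit or libxml2 code.

// Encode UTF-16 code units as UTF-8 bytes (assumes well-formed input).
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 bytes back into UTF-16 code units (assumes well-formed input).
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80) {
            cp = lead;
            extra = 0;
        } else if ((lead >> 5) == 0x6) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead >> 4) == 0xE) {
            cp = lead & 0x0F;
            extra = 2;
        } else {
            cp = lead & 0x07;
            extra = 3;
        }
        // Sketch only: truncated sequences are treated as a programming error.
        assert(i + extra < in.size());
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += extra + 1;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split code points above U+FFFF into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Both directions visit every code unit and allocate a new buffer, so text passed to the parser and handed back from it is copied and re-encoded twice.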

There has been at least one libxml-related security fix to WebCore in
recent memory.

We have various hacks in our libxml2-based parser because libxml2 was
designed to be used directly by applications, not by another library
like WebKit:
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L373
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L488
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1093
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1182
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1273


I'm in general in favor of this effort (having worked extensively on
the existing XML parsers).

But I would caution you that XML is a ridiculously tiny fraction of
the web, and it may not be worth the engineering effort to make a
better parser.

http://www.google.com/search?q=filetype:html = 25,270,000,000
http://www.google.com/search?q=filetype:xml = 71,000,000

(Naively) judging by those numbers, we should be spending 356 times as
much effort on our HTML support as on our XML support. :)

-eric

On Tue, Jun 28, 2011 at 6:36 PM, Jeffrey Pfau <jpfau at apple.com> wrote:
> I don't know all of the problems libxml2 has, but one of the ones I've heard is that WebCore uses UTF-16 internally, and libxml2 uses UTF-8, so the data is perpetually converted between the two formats--and this is slow. If there are any other big ones, I haven't been told them, only that it would be good to have a replacement.
>
> Jeffrey Pfau
>
> On Jun 28, 2011, at 6:30 PM, Dirk Pranke wrote:
>
>> Can you expand a bit more on "using libxml2 exposes its own share of problems"?
>>
>> -- Dirk
>>
>> On Tue, Jun 28, 2011 at 6:12 PM, Jeffrey Pfau <jpfau at apple.com> wrote:
>>> Currently, WebCore uses libxml2, or, if available, QtXml to parse incoming XML. However, QtXml isn't always available, and using libxml2 exposes its own share of problems. As such, I'm undertaking writing an XML parser that uses no external libraries.
>>>
>>> The first step to doing this is to add a new flag that switches off the other two parsers. As the parsers are already independent and can be switched between by checking USE(QXMLSTREAM), I am adding USE(LIBXML2) checks, replacing the #else conditionals, and also a new ENABLE check, tentatively called NEW_XML (although names such as NATIVE_XML or XML_NATIVE, etc, may be more appropriate).
>>>
>>> As there will probably be a new slew of files pertaining to XML parsing, I will put these files in WebCore/xml/parser, and move the existing XMLDocumentParser* files into this new directory. As far as I know, the placement of these files in WebCore/dom/ is legacy, and, assuming the build on each platform is changed, it makes sense to move them.
>>>
>>> Once all the files are in a logical place, I plan to make a new file for a skeleton of the new XMLDocumentParser, at least to get it to link until a real one is in place, even if the XML parser at that point is just a data sink.
>>>
>>> From there, I plan to copy and modify a good chunk of the lower level HTML tokenization and parsing code, and make changes as necessary to make it work on generalized XML, at least until I can generalize the common code in such a way that the HTML and XML tokenizers can be subclasses and use common code. I'd probably do the refactoring at the end.
>>>
>>> I'm still exploring the existing parsing code, but I'd probably work my way up from there. I've read a lot of the XML 1.0 spec in preparation, as well, but it doesn't have much on implementation itself. If QtWebKit or parsing people have any comments, concerns, or help, I'd be more than willing to listen--I'm just starting here, and I'm not completely familiar with the codebase.
>>>
>>> Although no code is checked in so far, I've started on this list already and have gotten as far as the new flags, a skeleton XMLDocumentParserNew.cpp, and making a tokenizer that compiles and links, but is completely untested.
>>>
>>> Jeffrey Pfau
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev at lists.webkit.org
>>> http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
>>>
>

