[webkit-dev] libxml2 "override encoding" support

Patrick Gansterer paroga at paroga.com
Wed Jan 5 05:07:57 PST 2011

Alex Milowski:

> On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov <ap at webkit.org> wrote:
>> On 04.01.2011, at 18:40, Alex Milowski wrote:
>>> Looking at the libxml2 API, I've been baffled myself about how to
>>> control the character encoding from the outside.  This looks like a
>>> serious lack of an essential feature.  Does anyone know about the
>>> "hack" above and can provide more detail?
>> Here is some history: <http://mail.gnome.org/archives/xml/2006-February/msg00052.html>, <https://bugzilla.gnome.org/show_bug.cgi?id=614333>.
> Well, that is some interesting history.  *sigh*
> I take it the "workaround" is that data is read and decoded into an
> internal string, which is represented as a sequence of UChar.  As such,
> we treat it as UTF-16 encoded data and feed that to the
> parser, forcing it to use UTF-16 every time.
> Too bad we can't just tell it the proper encoding--possibly the one
> from the transport--and let it do the decoding on the raw data.  Of
> course, that doesn't guarantee a better result.

Is there a reason why we can't pass the "raw" data to libxml2?
E.g. when the input file is UTF-8 we convert it into UTF-16 and then libxml2 converts it back into UTF-8 (its internal format). This is a real performance problem when parsing XML [1].
Is there some (required) magic involved when detecting the encoding in WebKit? AFAIK XML always defaults to UTF-8 if there's no encoding declared.
Can we make libxml2 do the encoding detection and provide all of our decoders so it can use them?
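
To make the round trip concrete, here is a minimal sketch in Python (not WebKit's C++). It uses the expat-based ElementTree parser from the standard library as a stand-in for libxml2, so it only illustrates the two data paths being discussed and says nothing about libxml2's actual API. The sample document and all function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a UTF-8 document with no encoding declaration,
# which an XML parser reads as UTF-8 by default.
RAW_UTF8 = "<root><item>caf\u00e9</item></root>".encode("utf-8")


def parse_raw(data: bytes) -> ET.Element:
    """Hand the undecoded bytes to the parser and let it detect the encoding."""
    return ET.fromstring(data)


def to_utf16(data: bytes, transport_encoding: str = "utf-8") -> bytes:
    """Decode up front and re-encode as UTF-16 (with a BOM).

    This mimics the current path: decode into a UTF-16 string first,
    then feed that to the parser.
    """
    return data.decode(transport_encoding).encode("utf-16")


def parse_via_utf16(data: bytes, transport_encoding: str = "utf-8") -> ET.Element:
    """Parse the re-encoded UTF-16 bytes; the parser picks up UTF-16 from the BOM."""
    return ET.fromstring(to_utf16(data, transport_encoding))


if __name__ == "__main__":
    direct = parse_raw(RAW_UTF8)
    round_trip = parse_via_utf16(RAW_UTF8)
    # Both paths yield the same tree; the second one converts every byte
    # of the input before the parser even starts.
    print(direct.find("item").text == round_trip.find("item").text)
    print(len(RAW_UTF8), len(to_utf16(RAW_UTF8)))
```

Both paths produce the same tree, but the second one first expands the mostly-ASCII input to roughly twice its size, and a parser with a UTF-8 internal format then has to convert it back again.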

[1] https://bugs.webkit.org/show_bug.cgi?id=43085

- Patrick 
