[webkit-dev] UTF-16 default mappings in WebKit

Sergey Groznyh Sergey.Groznyh at Sun.COM
Mon Mar 31 04:20:11 PDT 2008


Hi,

I'm working on Unicode support for the Java port.  While writing
support for character set encoding/decoding, I caught the following
message:

ERROR: alias ISO-10646-UCS-2 maps to UTF-16LE already, but someone is
trying to make it map to UTF-16BE
(../../../WebCore/platform/TextEncodingRegistry.cpp:137 void
WebCore::checkExistingName(const char*, const char*))

That is, WebKit maps ISO-10646-UCS-2 to UTF-16LE internally, while Java
has a different idea of how this alias should be mapped.
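
For reference, a small snippet along these lines shows how a particular
JRE resolves the alias (just a sketch; the class name is arbitrary, and
the alias may not be registered on every runtime, hence the catch):

    import java.nio.charset.Charset;
    import java.nio.charset.UnsupportedCharsetException;

    public class AliasCheck {
        public static void main(String[] args) {
            try {
                // Print whichever canonical charset the local JRE maps
                // the alias to, plus its full alias set.
                Charset cs = Charset.forName("ISO-10646-UCS-2");
                System.out.println("Canonical name: " + cs.name());
                System.out.println("Aliases: " + cs.aliases());
            } catch (UnsupportedCharsetException e) {
                System.out.println("Alias not registered on this JRE");
            }
        }
    }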

Looking further through the code, I found that WebKit internally assumes
little-endian encoding even for UTF-16 itself
(WebCore/platform/TextEncodingUTF16.cpp defines a mapping from UTF-16 to
UTF-16LE.)

This means that if WebKit receives BOM-less data marked as UTF-16, the
data will be treated as little-endian, which is contrary to the
standard.  RFC 2781, Section 4.3, says:

    If the first two octets of the text is not 0xFE followed by 0xFF,
    and is not 0xFF followed by 0xFE, then the text SHOULD be
    interpreted as being big-endian.
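
In code, a decoder following that rule would look roughly like this (a
minimal sketch of the RFC heuristic, not the actual code of WebKit or
the Java port):

    import java.nio.ByteOrder;

    public final class Utf16ByteOrder {
        // RFC 2781 Section 4.3: honour a BOM when present; for BOM-less
        // data labelled "UTF-16", fall back to big-endian.
        static ByteOrder detect(byte[] data) {
            if (data.length >= 2) {
                int b0 = data[0] & 0xFF;
                int b1 = data[1] & 0xFF;
                if (b0 == 0xFE && b1 == 0xFF) return ByteOrder.BIG_ENDIAN;
                if (b0 == 0xFF && b1 == 0xFE) return ByteOrder.LITTLE_ENDIAN;
            }
            return ByteOrder.BIG_ENDIAN; // the default the RFC recommends
        }
    }

WebKit's current behaviour, as far as I can tell, corresponds to
returning LITTLE_ENDIAN on that last line instead.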

So my question is: why does WebKit treat UTF-16 data as little-endian by
default?  Since the standard says SHOULD rather than MUST, there may be
valid reasons to ignore this recommendation; does anybody have an idea
of what those reasons might be?

Thanks,
Sergey
