[webkit-gtk] Status of the glib unicode backend in webkit

Fri Dec 10 01:29:51 PST 2010

For the last two weeks I've been working on the glib unicode backend.
There were two main obstacles to switching to glib unicode: the glib
backend was incomplete/broken, and there were performance concerns (glib
uses utf8 and webkit uses utf16, so we have to convert to utf8 to use
glib and then convert the result back to utf16).

Note: Long mail, go to "Summarizing" if you are not interested in the
details. 

 - Tests still failing with glib unicode:

 fast/encoding/frame-default-enc.html

I'm not sure why this one fails.

 fast/encoding/yentest2.html
 fast/encoding/yentest.html

This is a mess. It seems that byte 0x5c, which is a backslash in ascii,
is ambiguous in some Japanese encodings like Shift-JIS: it can be a yen
sign (U+00A5) or a backslash (U+005C). There's already a workaround for
this in webkit, but it doesn't seem to work for us because ICU decodes
0x5c as U+00A5 and iconv decodes it as U+005C. We could just add a
workaround to always map 0x5c to U+005C when the encoding is Shift-JIS,
but I'm not sure that's correct because I don't know whether ICU always
does it or whether it depends on the current locale or something else.
More info:

https://bugs.webkit.org/show_bug.cgi?id=24906
http://blogs.msdn.com/b/michkap/archive/2005/09/17/469941.aspx

 fast/encoding/GBK/EUC-CN.html
 fast/encoding/GBK/chinese.html
 fast/encoding/GBK/cn-gb.html
 fast/encoding/GBK/csgb2312.html
 fast/encoding/GBK/csgb231280.html
 fast/encoding/GBK/gb2312.html
 fast/encoding/GBK/gb_2312-80.html
 fast/encoding/GBK/gbk.html
 fast/encoding/GBK/iso-ir-58.html
 fast/encoding/GBK/x-euc-cn.html
 fast/encoding/GBK/x-gbk.html
 fast/encoding/hebrew/8859-8-e.html
 fast/encoding/hebrew/8859-8-i.html
 fast/encoding/hebrew/csISO88598I.html
 fast/encoding/hebrew/logical.html

These either use encodings that are not supported by iconv or contain an
invalid character that ICU replaces with a special substitute character.
They are skipped in qt.

 fast/encoding/char-encoding-mac.html

Most of the encodings used in this test are not supported by iconv.
Skipped in qt too.

 fast/encoding/hebrew/8859-8-e.html
 fast/encoding/hebrew/8859-8-i.html
 fast/encoding/hebrew/csISO88598I.html
 fast/encoding/hebrew/logical.html

Not supported by iconv either, skipped in qt too.

 fast/js/sputnik/Unicode/Unicode_320/S15.5.4.16_A1.html
 fast/js/sputnik/Unicode/Unicode_500/S15.5.4.16_A1.html
 fast/js/sputnik/Unicode/Unicode_500/S15.5.4.18_A1.html
 fast/js/sputnik/Unicode/Unicode_510/S15.5.4.16_A1.html
 fast/js/sputnik/Unicode/Unicode_510/S15.5.4.18_A1.html

The problem is that g_unichar_tolower() only works for characters that
are G_UNICODE_UPPERCASE_LETTER or G_UNICODE_TITLECASE_LETTER. These
tests are using 0x2160..0x216F (G_UNICODE_LETTER_NUMBER) and
0x24B6..0x24CF (G_UNICODE_OTHER_SYMBOL). I filed a bug in glib: 
https://bugzilla.gnome.org/show_bug.cgi?id=633436
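
As a cross-check of what the Unicode data itself says about these two
ranges, here is a minimal sketch using Python's built-in Unicode
database. It does not call glib, so it only shows the mappings the tests
presumably expect, not glib's behaviour. The category codes Nl and So
correspond to G_UNICODE_LETTER_NUMBER and G_UNICODE_OTHER_SYMBOL; the
helper name lowercase_report is made up for this illustration.

```python
import unicodedata

# Ranges used by the failing sputnik tests (taken from the text above).
RANGES = [(0x2160, 0x216F), (0x24B6, 0x24CF)]


def lowercase_report(first, last):
    """Return (code point, general category, lowercase code points) tuples."""
    report = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # str.lower() can return more than one character in general,
        # so the result is kept as a list of code points.
        lowered = [ord(c) for c in ch.lower()]
        report.append((cp, unicodedata.category(ch), lowered))
    return report


if __name__ == "__main__":
    for first, last in RANGES:
        for cp, category, lowered in lowercase_report(first, last):
            targets = " ".join(f"U+{c:04X}" for c in lowered)
            print(f"U+{cp:04X} {category} -> {targets}")
```

Every code point in both ranges has a lowercase mapping even though none
of them is categorised as an uppercase or titlecase letter, which is why
a category check before lowercasing drops them.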

 fast/text/find-kana.html
 fast/text/find-russian.html
 fast/text/find-soft-hyphen.html
 fast/xsl/sort-unicode.xml

The problem here is the algorithm used when searching text in
case-insensitive mode. We just use casefold() to convert the string into
a form that is independent of case, which seems to be what firefox does
too. But ICU implements the search algorithm of strength 3, which means
that, for example, accented characters match their unaccented versions.
More info:
http://www.unicode.org/reports/tr10/#Searching
http://userguide.icu-project.org/collation/icu-string-search-service
https://bugs.webkit.org/show_bug.cgi?id=48056
This is probably the most difficult bug to fix.
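
To make the difference concrete, here is a small sketch in Python
contrasting the two behaviours. The first key function only folds case,
like our casefold() approach. The second also strips combining marks,
which is only a rough stand-in for accent-insensitive matching and is
not the UCA/ICU search algorithm. All function names are hypothetical.

```python
import unicodedata


def fold_case_only(text):
    """Case-insensitive key: case folding only, accents are preserved."""
    return text.casefold()


def fold_case_and_accents(text):
    """Looser key: case folding plus removal of combining marks.

    Rough approximation of accent-insensitive matching, not real collation.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def find(haystack, needle, key):
    """Return True if needle occurs in haystack under the given key function."""
    return key(needle) in key(haystack)


if __name__ == "__main__":
    page_text = "Caf\u00e9 au lait"  # contains U+00E9 (e with acute)
    print(find(page_text, "CAF\u00c9", fold_case_only))
    print(find(page_text, "CAFE", fold_case_only))
    print(find(page_text, "CAFE", fold_case_and_accents))
```

With case folding alone, searching for the unaccented "CAFE" does not
find the accented word, which is the kind of mismatch these tests
exercise.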

 fast/dom/Range/range-expand.html

This one fails only for the Chinese words due to this pango bug:
https://bugzilla.gnome.org/show_bug.cgi?id=97545
The bug has been open since 2002, so...

 fast/url/host.html

This is a bug in glib:
https://bugzilla.gnome.org/show_bug.cgi?id=633350

 - Performance improvements

 JavaScriptCore/wtf/unicode/glib/UnicodeGLib.cpp

The problematic functions are the versions of foldCase(), toLower() and
toUpper() that convert a string. The functions that convert a single
character are not a problem because glib has a g_unichar_ function for
them, except for foldCase. When converting a string we need to convert
between utf8 and utf16. I haven't done any benchmarks, so I don't know
the real impact of these conversions on performance. I've already
proposed an improvement here:
https://bugs.webkit.org/show_bug.cgi?id=48625
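
The shape of the round trip can be sketched in Python as below. This is
only an illustration of the extra conversions, not the code in
UnicodeGLib.cpp; the function name is made up, and the comments name the
glib calls (g_utf16_to_utf8(), g_utf8_strdown(), g_utf8_to_utf16()) that
I assume play the corresponding roles.

```python
def to_lower_utf16(utf16_le: bytes) -> bytes:
    """Lowercase a UTF-16LE buffer by going through UTF-8 and back."""
    text = utf16_le.decode("utf-16-le")

    # Conversion 1: utf16 -> utf8 (role of g_utf16_to_utf8()).
    utf8 = text.encode("utf-8")

    # Case conversion on the utf8 data (role of g_utf8_strdown()).
    lowered_utf8 = utf8.decode("utf-8").lower().encode("utf-8")

    # Conversion 2: utf8 -> utf16 (role of g_utf8_to_utf16()).
    return lowered_utf8.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    source = "\u00c0BC".encode("utf-16-le")
    result = to_lower_utf16(source)
    print(result.decode("utf-16-le"))

    # The result can be longer than the input: U+0130 lowercases to
    # two code points, so the output buffer cannot be assumed to have
    # the same size as the input buffer.
    print(len("\u0130".encode("utf-16-le")), len(to_lower_utf16("\u0130".encode("utf-16-le"))))
```

Every string case conversion pays for both conversions plus the
intermediate buffers, which is the cost that still needs to be measured.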

 UChar32 foldCase(UChar32 ch)

The problem here is that we don't have an equivalent function in glib,
because the case folding of some characters is represented by more than
one character. ICU and qt have a foldCase() method for a single
character that only works for characters that have a 1-to-1 mapping. I
talked to behdad to see whether we could do the same in glib:

<KaL> behdad: I'm wondering why there isn't g_unichar_casefold,
wouldn't it make sense even though it only works for single-character
mapping?
<behdad> KaL: you know the answer already :)
<KaL> behdad: no, I don't
<KaL> :-P
<behdad> KaL: you said it.  it's hack...
<behdad> slightly better than one that only works for ascii...
<KaL> yes well, what it's a hack is what we have to do in webkit to
emulate it
<behdad> yes, unfortunately our unicode support is far from complete :(
<KaL> behdad: it's actually tolower + a few special cases
<behdad> oh, I see what you mean...
<behdad> KaL: well, that webkit has wrong design for unicode is not
glib's problem

A workaround might be to copy (or generate our own) the table of
special case folding. 
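
A minimal sketch of that idea in Python, assuming the "tolower + a few
special cases" approach from the chat above. The table is deliberately
incomplete and only lists three entries whose simple case folding in
Unicode's CaseFolding.txt differs from their lowercase mapping; the names
SPECIAL_FOLDS and fold_case_char are hypothetical.

```python
# Illustrative, incomplete table: code points whose simple case folding
# differs from their lowercase mapping (values from CaseFolding.txt).
SPECIAL_FOLDS = {
    0x00B5: 0x03BC,  # MICRO SIGN -> GREEK SMALL LETTER MU
    0x017F: 0x0073,  # LATIN SMALL LETTER LONG S -> LATIN SMALL LETTER S
    0x03C2: 0x03C3,  # GREEK SMALL LETTER FINAL SIGMA -> GREEK SMALL LETTER SIGMA
}


def fold_case_char(cp: int) -> int:
    """Fold a single code point, using only 1-to-1 mappings."""
    if cp in SPECIAL_FOLDS:
        return SPECIAL_FOLDS[cp]
    lowered = chr(cp).lower()
    # A multi-character result cannot be returned as one code point,
    # so such characters are left unchanged.
    return ord(lowered) if len(lowered) == 1 else cp


if __name__ == "__main__":
    for cp in (0x0041, 0x00B5, 0x00DF, 0x0130):
        print(f"U+{cp:04X} -> U+{fold_case_char(cp):04X}")
```

Characters such as U+00DF, whose full folding is two characters, simply
come back unchanged, which is the limitation discussed above.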

 - Summarizing:

Most of the test cases that are failing are corner cases or bugs in
pango/glib. We would need to measure timings to know whether performance
is actually an important issue. So maybe we are not ready to switch to
the glib unicode backend by default, but we can probably remove the
message that says it's slow and incomplete. Or we could try making it
the default and see what happens.

Sorry for the long mail.
-- 
Carlos Garcia Campos
http://pgp.rediris.es:11371/pks/lookup?op=get&search=0xF3D322D0EC4582C3