[webkit-gtk] Status of the glib unicode backend in webkit
Carlos Garcia Campos
cgarcia at igalia.com
Fri Dec 10 01:29:51 PST 2010
For the last two weeks I've been working on the glib unicode backend.
There were mainly two problems blocking the switch to glib unicode: the
glib backend was incomplete/broken, and there were performance issues
(glib uses utf8 and webkit uses utf16, so we have to convert to utf8 to
use glib and then convert the result back to utf16).
Note: Long mail, go to "Summarizing" if you are not interested in the
- Tests still failing with glib unicode:
I'm not sure why this one fails
This is a mess. It seems that character 0x5c, which is a backslash in
ascii, is an ambiguous character in some Japanese encodings like
Shift-JIS and can be a yen sign (U+00A5) or a backslash (U+005C).
There's a workaround for this in webkit already but it doesn't seem to
work for us, because ICU decodes 0x5c as U+00A5 and iconv as U+005C. We
could just add a workaround to always encode 0x5c as U+005C when the
encoding is Shift-JIS, but I'm not sure it's correct because I don't
know whether ICU always does it or whether it depends on the current
locale or something else. More info:
These are either not supported by iconv or contain an invalid
character that ICU replaces with another special one. They are
skipped in qt.
Most of the encodings used in this test are not supported by iconv.
Skipped in qt too.
fast/encoding/hebrew/8859-8-e.html
fast/encoding/hebrew/8859-8-i.html
Not supported by iconv either, skipped in qt too.
The problem is that g_unichar_tolower() only works for characters that
are G_UNICODE_UPPERCASE_LETTER or G_UNICODE_TITLECASE_LETTER. These
tests are using 0x2160..0x216F (G_UNICODE_LETTER_NUMBER) and
0x24B6..0x24CF (G_UNICODE_OTHER_SYMBOL). I filed a bug in glib:
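To make the limitation concrete, here is a minimal sketch of the
mappings those tests expect. lower_extra_ranges() is a hypothetical
helper, not a glib function; it only covers the two ranges mentioned
above and relies on the fact that the lowercase forms sit at a fixed
offset in Unicode. A real port would presumably fall through to
g_unichar_tolower() for everything else.

```c
#include <stdint.h>

/* Hypothetical helper (not part of glib): lowercase mapping for the two
 * ranges that are not uppercase or titlecase letters. */
uint32_t lower_extra_ranges(uint32_t c)
{
    /* ROMAN NUMERAL ONE..ONE THOUSAND -> SMALL ROMAN NUMERAL forms */
    if (c >= 0x2160 && c <= 0x216F)
        return c + 0x10;

    /* CIRCLED LATIN CAPITAL LETTER A..Z -> CIRCLED LATIN SMALL LETTER A..Z */
    if (c >= 0x24B6 && c <= 0x24CF)
        return c + 26;

    /* Everything else is returned unchanged in this sketch; real code
     * would call g_unichar_tolower(c) here. */
    return c;
}
```

With this, lower_extra_ranges(0x2160) gives 0x2170 and
lower_extra_ranges(0x24B6) gives 0x24D0.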
The problem here is the algorithm used when searching text in
case-insensitive mode. We just use casefold() to convert the string into
a form that is independent of case, which seems to be what firefox does
too. But ICU implements the search algorithm of strength 3, which means
that, for example, accented characters match their unaccented
versions. More info:
This is probably the most difficult bug to fix.
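As an illustration of why folding alone is not enough, here is a rough
sketch of a fold-and-compare search over code points. All names are
hypothetical, and fold_ascii() is only a stand-in for a real case
folding function. The point is that folding removes case differences
but leaves U+00E9 (e with acute) different from a plain "e", so an
accent-insensitive match never happens.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a real case folding function: folds ASCII letters only. */
static uint32_t fold_ascii(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Returns the index of the first match of needle in hay, comparing
 * folded code points, or -1 if there is no match. */
long find_folded(const uint32_t *hay, size_t hay_len,
                 const uint32_t *needle, size_t needle_len)
{
    if (needle_len > hay_len)
        return -1;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t j = 0;
        while (j < needle_len &&
               fold_ascii(hay[i + j]) == fold_ascii(needle[j]))
            j++;
        if (j == needle_len)
            return (long)i;
    }
    return -1;
}
```

Searching for "cAF" in "Caf" followed by U+00E9 matches at index 0,
while searching for "cafe" in the same text finds nothing.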
This one fails only for the Chinese words due to this pango bug:
The bug has been open since 2002, so...
This is a bug in glib:
- Performance improvements
The problematic functions are foldCase(), toLower() and toUpper(), in
the versions that convert a string. The functions that convert a single
character are not a problem because we have a g_unichar_ function in
glib, except for foldCase(). When converting a string we need to convert
between utf8 and utf16. I haven't done any benchmarks so I don't know
the real impact of these conversions on performance. I've already
proposed an improvement here:
https://bugs.webkit.org/show_bug.cgi?id=48625
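To give an idea of the work involved in each conversion, here is a
hand-rolled sketch of the utf16 to utf8 half of the round trip. glib
provides g_utf16_to_utf8() and g_utf8_to_utf16() for this, so
sketch_utf16_to_utf8() is only a hypothetical stand-in and not the code
used in the port. Every character has to be decoded and re-encoded, and
the same happens again in the other direction after the glib call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts len UTF-16 code units to UTF-8.
 * out must have room for 3 * len bytes. Returns the bytes written.
 * Unpaired surrogates are replaced by U+FFFD. */
size_t sketch_utf16_to_utf8(const uint16_t *in, size_t len,
                            unsigned char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Combine a surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out[n++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

Only benchmarks can tell whether this extra pass over the string
matters in practice.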
UChar32 foldCase(UChar32 ch)
The problem here is that we don't have an equivalent version in glib,
because the case folding of some characters is represented by more than
one character. ICU and qt have a foldCase() method for a single
character that only works for characters that have a 1 to 1 mapping. I
talked to behdad to see whether we could do the same in glib:
<KaL> behdad: I'm wondering why there isn't g_unichar_casefold,
wouldn't it make sense even though it only works for single-character
<behdad> KaL: you know the answer already :)
<KaL> behdad: no, I don't
<behdad> KaL: you said it. it's hack...
<behdad> slightly better than one that only works for ascii...
<KaL> yes well, what it's a hack is what we have to do in webkit to
<behdad> yes, unfortunately our unicode support is far from complete :(
<KaL> behdad: it's actually tolower + a few special cases
<behdad> oh, I see what you mean...
<behdad> KaL: well, that webkit has wrong design for unicode is not
A workaround might be to copy (or generate our own) the table of
special case folding.
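A minimal sketch of that workaround, assuming a small hand-written
table. fold_case_single(), base_tolower() and the table itself are
hypothetical names. The entries are an excerpt of 1 to 1 foldings from
Unicode's CaseFolding.txt where the folded form differs from the
lowercase form, not the complete list. base_tolower() handles ASCII
only here; a real port would call g_unichar_tolower() instead.

```c
#include <stddef.h>
#include <stdint.h>

struct fold_entry {
    uint32_t from;
    uint32_t to;
};

/* Illustrative excerpt only, not the full set of special cases. */
static const struct fold_entry special_folds[] = {
    { 0x00B5, 0x03BC }, /* MICRO SIGN -> GREEK SMALL LETTER MU */
    { 0x017F, 0x0073 }, /* LATIN SMALL LETTER LONG S -> s */
    { 0x0345, 0x03B9 }, /* COMBINING GREEK YPOGEGRAMMENI -> iota */
    { 0x03C2, 0x03C3 }, /* GREEK SMALL LETTER FINAL SIGMA -> sigma */
    { 0x1FBE, 0x03B9 }, /* GREEK PROSGEGRAMMENI -> iota */
};

/* Stand-in for g_unichar_tolower(): lowercases ASCII letters only. */
static uint32_t base_tolower(uint32_t c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Single-character case folding: special cases first, then tolower.
 * Characters whose folding needs more than one character (for example
 * U+00DF, which folds to "ss") are returned unchanged. */
uint32_t fold_case_single(uint32_t c)
{
    size_t count = sizeof special_folds / sizeof special_folds[0];

    for (size_t i = 0; i < count; i++) {
        if (special_folds[i].from == c)
            return special_folds[i].to;
    }
    return base_tolower(c);
}
```

A linear scan is fine for a handful of entries; a complete table would
probably want a sorted array and a binary search.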
- Summarizing
Most of the test cases that are failing are corner cases or bugs in
pango/glib. We would need to measure times to know whether performance
is actually an important issue or not. So, maybe we are not ready to
switch to the glib unicode backend by default, but we can probably
remove the message that says it's slow and incomplete. Or we could try
to make it the default and see what happens.
Sorry for the long mail.
Carlos Garcia Campos