[webkit-dev] Webkit compatibility in India - Transcoding Indic fonts

Thu Nov 20 15:30:55 PST 2008

On Thu, Nov 20, 2008 at 3:00 AM, Prunthaban Kanthakumar <prunthaban at
google.com> wrote:

>
>
> On Thu, Nov 20, 2008 at 12:01 AM, Jungshik Shin (신정식, 申政湜) <
> jungshik at google.com> wrote:
>
>>
>>
>> 2008/11/6 Prunthaban Kanthakumar <prunthaban at google.com>
>>
>> Hi All,
>>>
>>> This is a continuation of the mail thread
>>> https://lists.webkit.org/pipermail/webkit-dev/2008-October/005495.html
>>>
>>> I would like to discuss some ways to implement mjs'
>>> ideas.
>>>
>>> As mjs says in the above mail,
>>>
>>> *In case you look into implementing this, what I'd suggest is an extra
>>> CSS property that can be set based the font property at style resolution
>>> time. (since I think the computed font list will strip EOT fonts, so it
>>> might be too late to look at it once you are on the rendering side).
>>> Something like -webkit-indic-text-decode. *
>>>
>>> When the code reaches the RenderText::styleDidChange method, the font
>>> information still remains in the RenderStyle object associated with the
>>> RenderText (because this happens while the HTML file is being parsed, well
>>> before font resolution happens). This method checks whether the style
>>> includes any text transformations and, if it does, calls setText, which
>>> forces the RenderText to modify its 'internal text' if needed.
>>>
>>> We can then do the following:
>>> 1. Add a condition to the styleDidChange method that checks whether the
>>> font-family is supported by our transcoder. (For now a fast look-up table
>>> should do, because we plan to support only a limited set of fonts.) This
>>> condition will be #ifdefed on ENABLE(TRANSCODER_SUPPORT).
>>> 2. In the setTextInternal method, based on the font-family, get the
>>> corresponding transcoder (probably from a map) and perform the
>>> transcoding.
>>>
>>> Later, when font resolution happens, the particular font is ignored
>>> because it is EOT, and WebKit chooses a default font based on the code
>>> points of the characters, so the correct characters appear on the screen.
>>> Also, a layout and width recalculation is done after the setTextInternal
>>> method, which is important for us because we modify the characters. So
>>> RenderText::setTextInternal seems to be the ideal place to plug in
>>> the transcoder.
>>>
>>> On a related note, we cannot go with the approach of one look-up table
>>> per font face and a single transcoder that does the look-up for all
>>> fonts. The problem is that many Indic languages use multiple code points
>>> to represent one character, and different fonts use different standards!
>>> For example, there are situations where one glyph in an EOT font needs to
>>> be transcoded to 5+ Unicode code points. The reverse situation is also
>>> possible. Because of this, a simple look-up table will not work for all
>>> fonts, which forces us to write some specialized code for each font
>>> (though for some fonts a one-to-one look-up table might be enough).
>>
>>
>>
>> In October, I listed two alternatives for this transformation. One is
>> adding ICU converters for Indic font encodings (they can deal with m-to-n
>> mappings) and the other is implementing your own. The first was ruled out
>> because it's not easy to add new converters on Mac OS X, where ICU is a part
>> of the OS. There's another approach you can take: you can build ICU
>> transliterator rules, which seems to be the cleanest way to do this. You
>> don't need to port or implement conversion code (from another project, e.g.
>> Padma); you just need to 'port' the conversion tables to ICU transliterator
>> rules.
>>
>> This transcoding will be invoked on the content of a text node that is
>> already in Unicode, just as 'text-transform: capitalize' or
>> 'text-transform: lowercase' is. An ICU transformer transforms a chunk of
>> text in Unicode into another chunk of text in Unicode
>> ( http://www.icu-project.org/userguide/Transform.html ), so it appears to
>> be almost a perfect fit.
>>
>
> I do not know much about ICU transformers. But from the link above, what
> I understand is that transformers perform 'transliteration', like
> converting from English to Hindi. I am not sure how this can be used to
> transcode Indic fonts. (ICU converters are the ones that do transcoding
> from one script to another. But from what you have said, it looks like ICU
> converters are not the way to go.)
>

'Transliteration' is just one of the applications of ICU transformers. (My use
of the two terms almost interchangeably must have confused you.) Perhaps I
should have given you this link instead of the one above:

http://www.icu-project.org/userguide/Transformations.html

>
>
> Also, what we are trying to do is transcode characters that are actually
> in the ASCII range (whose glyphs are "hacked" by font designers to render
> Indic characters) to the Unicode characters of the corresponding language. So
> to what extent is a transformer going to be helpful to us? In our case each
> font (or in some cases a set of fonts, due to some standardization efforts in
> the past) will have its own ASCII-to-Unicode mapping (which is m-to-n), and
> the purpose of ICU transformers seems to be different from this.
>

I'm very well aware of how Indic font encoding works. :-) An ICU transformer
is rule-based and can do many kinds of magic when transforming one sequence
of characters into another. The input sequence can be made of any Unicode
characters (ASCII or not), and so can the output sequence.
This is exactly what 'text-transform' is about. 'text-transform:
{capitalize, lowercase, uppercase}' is a very limited form of such a
transform, while 'text-transform: devanagari-font-foo' would be more
complex. Note also that when it's time to apply this transcoding, what you
get from WebKit is a Unicode string (not a byte sequence) that can include
characters like U+201C and U+00AD, because font-encoded Indian pages are
treated as windows-1252.
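
To make the last point concrete, here is a minimal Python sketch (Python only for brevity; this is not WebKit code) of what a font-encoded page looks like once its bytes have been decoded as windows-1252. The byte values are made up for illustration.

```python
# Hypothetical bytes from a font-encoded page; the values are illustrative only.
raw = bytes([0x61, 0x93, 0xAD])

# The page is treated as windows-1252, so by the time a transcoder runs it
# sees Unicode code points, not the original bytes.
text = raw.decode("cp1252")

# 0x93 decodes to U+201C and 0xAD to U+00AD, so the rules have to be written
# against those code points rather than against 0x93 and 0xAD.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
```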

You can write rules like:

a > \u0904;
b > \u0915\u0940;    # if the glyph for 'b' in a font is actually the glyph for Devanagari 'KII'

Another example, for Tamil (with the two-part vowel signs U+0BCA and U+0BCB,
whose right parts are identical and are represented by 'h' in 'tamil-font-bar'):

f g h > \u0b95\u0bca;    # Tamil 'KO', if the glyph for 'f' is the left part of U+0BCA, 'g' is U+0B95 (Tamil 'KA') and 'h' is the right part of U+0BCA
i g h > \u0b95\u0bcb;    # Tamil 'KOO', where the glyph for 'i' is the left part of U+0BCB
f g k > \u0b95\u0bcc;    # Tamil 'KAU', where the glyph for 'k' is the right part of U+0BCC
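
The effect of the Tamil rules can be imitated without ICU. The following is only a sketch in Python, not ICU and not a proposed implementation: `RULES` and `transcode` are hypothetical names, the ASCII assignments are the made-up 'tamil-font-bar' ones from above, and the stand-alone rule for 'g' is an extra assumption. ICU tries rules in the order they are written, so more specific rules are normally listed first; the sketch uses longest match instead, which gives the same result for this small table.

```python
# Rule table mirroring the hypothetical 'tamil-font-bar' rules.
# Keys are font-encoded input sequences, values are the Unicode output.
RULES = {
    "fgh": "\u0b95\u0bca",  # Tamil KO
    "igh": "\u0b95\u0bcb",  # Tamil KOO
    "fgk": "\u0b95\u0bcc",  # Tamil KAU
    "g": "\u0b95",          # Tamil KA on its own (assumed extra rule)
}
MAX_KEY = max(len(key) for key in RULES)


def transcode(text: str) -> str:
    """Greedy longest-match rewrite; unmatched characters pass through."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'fgh' wins over a bare 'g'.
        for n in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            # No rule matched at this position: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


result = transcode("fgh igh fgk")
```

Three input characters become two output code points here, which is the m-to-n behaviour a one-to-one table cannot express.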

I guess you now have a better grasp of the capabilities of ICU transformers.

Having said that, the WebKit ports that do not use ICU are something of a
blocker to using ICU transformers. It may be possible to write a transcoder
for Indic fonts in such a way that the guts of the transcoder are replaceable
(i.e. ports using ICU use it, and other ports can easily use something
else). But then one of my arguments for ICU (less code to add to WebKit) gets
a little weaker, and I don't have as strong an opinion as before. Anyway,
the google-url library (used in Chrome) takes such an approach for IDN
handling and encoding conversion: it uses ICU for them, but that part can be
replaced by other implementations. (
http://code.google.com/p/google-url/ )
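
A replaceable core could look roughly like the sketch below, written in Python for brevity. Everything in it is hypothetical (`register_backend`, `transcode_for_font`, `table_backend` are illustrative names, not WebKit or google-url APIs). The idea is only that callers go through one entry point keyed by font-family, and each port registers whatever backend it has: an ICU-based one where ICU is available, a hand-written table elsewhere.

```python
from typing import Callable, Dict

# A backend is anything that maps a Unicode string to a Unicode string.
Backend = Callable[[str], str]

# Hypothetical registry: lower-cased font-family name -> backend.
_backends: Dict[str, Backend] = {}


def register_backend(font_family: str, backend: Backend) -> None:
    """Install the transcoder a port wants to use for one font family."""
    _backends[font_family.lower()] = backend


def transcode_for_font(font_family: str, text: str) -> str:
    """Transcode text if the font is supported; otherwise return it unchanged."""
    backend = _backends.get(font_family.lower())
    return backend(text) if backend else text


def table_backend(text: str) -> str:
    # Stand-in for a port without ICU: a plain one-to-one table.
    # A port with ICU would register a function wrapping a transliterator.
    return text.translate({ord("a"): "\u0904"})


register_backend("devanagari-font-foo", table_backend)
```

With this shape, the calling code never needs to know which backend is installed, and text in an unsupported font passes through untouched.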

Jungshik

>
>
>
>>
>>
>> Jungshik
>>
>> P.S. BTW, I filed https://bugs.webkit.org/show_bug.cgi?id=22339 for this
>> task.
>> If you haven't filed one, why don't you use 22339 for uploading a
>> prototype patch for one (site, font) pair as Brett suggested?
>>
>
> Thanks. I will use that. Once we decide on the approach, I will go ahead
> and implement it and attach a patch to the bug you filed.
>
>
>>
>>
>>
>>
>>
>>>
>>>
>>> I would like to hear from you about this. Is this approach fine or do you
>>> have any issues or suggestions?
>>>
>>> Regards,
>>> Prunthaban
>>>
>>>
>>>
>>>
>>
>