[Webkit-unassigned] [Bug 24342] Cannot insert a Thai character after a Thai prepend character when using ICU 4.0

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Mar 5 04:13:32 PST 2009


------- Comment #6 from hbono at chromium.org  2009-03-05 04:13 PDT -------
Thank you for your comments.
Sorry for not providing enough background. Yes, it is very hard to understand
this problem without examples.
I may not fully understand this issue myself, because I am neither a native
Thai speaker nor a WebKit expert, but I would like to describe its background
as thoroughly as possible.

(In reply to comment #5)
> +    // For example, we should not insert a character break after a Thai preposed
> +    // character (a Thai vowel character) because it breaks a syllable.
> +    // On the other hand, we should be able to move an input cursor even after
> +    // a Thai preposed character.
> Where should the former character break iterator come into play? Should we
> allow selections that only cover a prepend character, for example?

To start with the conclusion: yes, Thai users would like to be able to select
only a prepend character, and to delete it or replace it with another prepend
character.

In brief, prepend characters (U+0E40,...,U+0E44 in Thai and U+0EC0,...,U+0EC4
in Lao) are vowel characters, but they are NOT COMBINING CHARACTERS. Each one
is an ordinary letter placed BEFORE a consonant to denote the first vowel of a
syllable. So, a Thai syllable may look like two characters to our eyes.
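
As a minimal sketch, the code point ranges above can be written as a predicate.
The name isThaiLaoPrependVowel is hypothetical (it is not a WebKit or ICU
function); only the two ranges come from this comment.

```cpp
// Hypothetical helper, not a WebKit or ICU API.
// Returns true for the prepend vowels listed in this comment:
// U+0E40..U+0E44 (Thai) and U+0EC0..U+0EC4 (Lao).
constexpr bool isThaiLaoPrependVowel(char32_t c)
{
    return (c >= 0x0E40 && c <= 0x0E44) || (c >= 0x0EC0 && c <= 0x0EC4);
}
```

A consonant such as U+0E01 falls outside both ranges, so the predicate returns
false for it.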

The original problem is this:
  Users cannot insert a Thai syllable (which consists of a prepend character 'A'
  and a consonant 'B', and looks like two characters 'AB' to our eyes) before
  another Thai syllable (which consists of a consonant and combining characters,
  and looks like one character 'C' to our eyes).
  When they type the prepend character 'A', the input cursor jumps to the end
  of the syllable 'C', and the consonant 'B' is then inserted after 'C'. So, the
  actual result looks like 'ACB' to our eyes. (The expected result looks like
  'ABC'.)
  Also, they cannot delete only the prepend character 'A' from 'ACB', because
  they cannot move the input cursor to just after 'A'.
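
The typing sequence above can be reproduced with a toy model. This is only a
sketch under stated assumptions: the letters 'A', 'B', 'C' follow the notation
of this comment, 'm' stands for a combining character, and ClusterMode,
isCursorPosition and typeCharacters are hypothetical names, not WebKit or ICU
code.

```cpp
#include <cstddef>
#include <string>

// Toy alphabet: 'A' = prepend vowel, 'B' and 'C' = consonants,
// 'm' = combining character.
enum class ClusterMode { Legacy, Extended };

// Returns true if the cursor may rest at the given offset.
bool isCursorPosition(const std::u32string& text, std::size_t offset, ClusterMode mode)
{
    if (offset == 0 || offset >= text.size())
        return true;
    if (text[offset] == U'm')
        return false; // never stop between a base character and its mark
    if (mode == ClusterMode::Extended && text[offset - 1] == U'A')
        return false; // extended clusters glue a prepend vowel to what follows
    return true;
}

// Inserts each typed character at the cursor, then moves the cursor forward
// to the next position where it is allowed to rest.
std::u32string typeCharacters(std::u32string text, std::size_t cursor,
                              const std::u32string& typed, ClusterMode mode)
{
    for (char32_t c : typed) {
        text.insert(cursor, 1, c);
        ++cursor;
        while (!isCursorPosition(text, cursor, mode))
            ++cursor;
    }
    return text;
}
```

Typing "AB" at offset 0 of "Cm" gives "ACmB" in Extended mode (the 'ACB' result
described above) and "ABCm" in Legacy mode (the expected 'ABC').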

Investigating further, we found that this issue is caused by a design change
to the character-break iterator in ICU 4.0.

Before ICU 4.0, the character-break iterator split a string into grapheme
clusters. A grapheme cluster is a set of characters that looks like a single
character to our eyes. So, it split a Thai syllable 'AB', which consists of
a prepend character and a consonant, into two pieces, 'A' and 'B'.
The only exception was Japanese half-width katakana voiced marks: it did not
split a Japanese syllable which consists of a half-width katakana and a
half-width katakana voiced mark into pieces.

On the other hand, ICU 4.0 changed its character-break iterator to split a
string into "extended" grapheme clusters. For most languages, an "extended"
grapheme cluster is the same as a grapheme cluster. Unfortunately, for Thai and
Lao, an "extended" grapheme cluster becomes a syllable. So, even if a Thai
syllable 'AB' looks like two characters from our eyes, it does not split into
On the other hand, for some reason, the character-break iterator of ICU 4.0
drops the exception for Japanese half-width katakana voiced marks, i.e. it does
split a Japanese syllable which consists of a half-width katakana and a
half-width katakana voiced mark into pieces, as ICU 3.2 did. So, we have to re-enable
a workaround for ICU 3.2 in RenderText::previousOffset() and
for ICU 4.0.
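
The difference between the two kinds of clusters for Thai can be sketched as
follows. This is a toy model of the behaviour described in this comment, not
ICU's actual rule set: clusterBoundaries and its helpers are hypothetical
names, and the list of combining marks is a small, incomplete sample of Thai
marks chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Prepend vowels named in this comment (Thai and Lao ranges).
static bool isPrependVowelSample(char32_t c)
{
    return (c >= 0x0E40 && c <= 0x0E44) || (c >= 0x0EC0 && c <= 0x0EC4);
}

// A partial sample of Thai combining marks (assumption: enough for the demo).
static bool isThaiCombiningMarkSample(char32_t c)
{
    return c == 0x0E31 || (c >= 0x0E34 && c <= 0x0E3A) || (c >= 0x0E47 && c <= 0x0E4E);
}

// Returns the offsets at which a character break is reported.
// extended == false models the pre-4.0 behaviour described above,
// extended == true models the "extended" grapheme clusters of ICU 4.0.
std::vector<std::size_t> clusterBoundaries(const std::u32string& text, bool extended)
{
    std::vector<std::size_t> boundaries;
    boundaries.push_back(0);
    for (std::size_t i = 1; i < text.size(); ++i) {
        if (isThaiCombiningMarkSample(text[i]))
            continue; // a mark stays attached to its base character
        if (extended && isPrependVowelSample(text[i - 1]))
            continue; // a prepend vowel stays attached to the next character
        boundaries.push_back(i);
    }
    if (!text.empty())
        boundaries.push_back(text.size());
    return boundaries;
}
```

For the sequence U+0E40 U+0E01 U+0E01 U+0E34 (prepend vowel, consonant,
consonant, combining mark), the non-extended mode reports breaks at offsets
0, 1, 2 and 4, while the extended mode reports only 0, 2 and 4, so offset 1
(just after the prepend vowel) is no longer reachable.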

Even though future ICU may change the behaviors of its character-break
we thought we should use a custom iterator for cursor iteration to avoid
related to ICU versions.

> I'd like to have as many examples of use cases for either iterator as possible,
> to avoid future confusion.

Sorry for the confusion. My descriptions tend to leave out important background
information, which makes them confusing.

> What about my AtomicString comment? I still don't think that you need atomic
> strings here. The reason for them to exist is quick comparison - two
> AtomicStrings are equal if and only if their impl pointers are equal.

Sorry again for this problem. I thought I had changed this AtomicString to
String, but I actually forgot to change it.
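
For reference, the property the reviewer describes (equal strings share one
impl, so equality is a pointer comparison) can be sketched with a small
interning class. InternedString is a hypothetical stand-in written for this
illustration; it is not WebKit's AtomicString implementation.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the idea behind AtomicString.
class InternedString {
public:
    explicit InternedString(const std::string& s) : m_impl(intern(s)) {}

    // Equal contents share one pooled object, so comparing pointers suffices.
    bool operator==(const InternedString& other) const { return m_impl == other.m_impl; }

    const std::string& string() const { return *m_impl; }

private:
    // Returns the address of the single pooled copy of s.
    // Element addresses in std::unordered_set stay valid across rehashing.
    static const std::string* intern(const std::string& s)
    {
        static std::unordered_set<std::string> pool;
        return &*pool.insert(s).first;
    }

    const std::string* m_impl;
};
```

The cost is a hash lookup on every construction, which only pays off when the
strings are compared often; that is why a plain String is enough here.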

> +    // The only difference from the original rules is:
> +    //   added "!!chain" to change this rule set to chained.
> Please expand this comment to explain why we need to make the rule set chained.

This line is meant to prevent two or more rules from matching an input string.
However, reading the latest rule set, no two rules match at once. Even though
this line is harmless, I'm going to remove it to avoid
