[Webkit-unassigned] [Bug 8738] Text should be always normalized to NFC

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Fri Mar 23 21:37:19 PDT 2007


http://bugs.webkit.org/show_bug.cgi?id=8738





------- Comment #7 from robburns1 at mac.com  2007-03-23 21:37 PDT -------
(In reply to my own comment #6 to clarify some things)
fix typo:
> Relevant to this discussion there has been some discussion on bug 13150. Keep

fixed some typos (in all caps)
> (and perhaps much of this relies
> on the text system that WEBKIT is BUILT on and maybe WebKit is NOT accessing
> the text system just right):

Some examples for my last two bullet points:
> • When input is not explicitly a compatibility character, the core
> (non-compatibility) unicode character should be used

This relates more to the input manager, but again, it's worth mentioning here
for clarification purposes. For example, U+03BC (GREEK SMALL LETTER MU) should
be used as the canonical character in preference to U+00B5 (MICRO SIGN), the
non-canonical compatibility character. The compatibility characters are there
for legacy reasons and are 'discouraged' (Unicode Standard's word) for newly
entered text. Neither the canonical-equivalent characters nor the
compatibility characters are deprecated (meaning "strongly discouraged") by
the Unicode Standard. Rather:

• compatibility characters are discouraged for newly created text
• compatibility characters should be preserved if lossless round-trip
translations are expected to occur
• canonical-equivalent characters are not deprecated (there was some
confusion on bug 13150). Canonical-equivalent characters are "canon" because
they are NOT deprecated: they are canonically equivalent. The algorithm for
normalization in 

<http://www.unicode.org/unicode/reports/tr15/#Canonical_Equivalence>

is non-normative; the conformance chapter is normative. That means that for
singletons (like the U+2329 / U+3008 example below) one could translate the
string from one canonical equivalent to the other or vice versa, as long as the
system is internally consistent. Neither of those canonical-equivalent
characters is deprecated or discouraged. The Unicode requirements for the
algorithm require that any newly added canonical-equivalent character is the
one used in the algorithm's description.
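
To make the canonical vs. compatibility distinction concrete, here is a minimal sketch using Python's standard unicodedata module (this is not WebKit code, and the helper name equivalence_kind is made up for illustration). It assumes the Unicode data shipped with the Python interpreter. Note that NFC leaves the compatibility character U+00B5 alone, while it does replace the canonical singleton U+2329:

```python
import unicodedata

def equivalence_kind(ch):
    """Classify a single character by its decomposition mapping.

    Hypothetical helper: returns "compatibility" when the mapping carries
    a <tag> (e.g. "<compat> 03BC"), "canonical" when it is a plain mapping
    (e.g. "3008"), and "none" when the character has no decomposition.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"

MICRO_SIGN = "\u00B5"      # compatibility character
GREEK_MU = "\u03BC"        # the character it maps to
LEFT_POINTING = "\u2329"   # canonical singleton
LEFT_ANGLE = "\u3008"      # its canonical equivalent

# Canonical normalization (NFC) does not touch compatibility characters;
# only the compatibility forms (NFKC/NFKD) fold U+00B5 into U+03BC.
nfc_micro = unicodedata.normalize("NFC", MICRO_SIGN)
nfkc_micro = unicodedata.normalize("NFKC", MICRO_SIGN)

# Canonical singletons are replaced by every normalization form.
nfc_bracket = unicodedata.normalize("NFC", LEFT_POINTING)
```

With this data, preserving U+00B5 for round-tripping is compatible with normalizing to NFC, whereas U+2329 would not survive NFC.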

> • When glyphs exist for canonical-equivalent characters (and don't exist for
> the stored or input character), the view should render the canonical-equivalent
> character's glyph

For example, if someone inputs, or the stored deserialized string contains,
U+2329 (left-pointing angle bracket), which is canonical-equivalent to U+3008
(left angle bracket), and no system font has a glyph for U+2329, then turn to
U+3008 as a fallback for glyphs. This might even be handled font by font as
WebKit moves through each font in the CSS declaration. There are no semantic
differences between these two characters; however, the font glyph differences
should be respected whenever possible.
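
The font-by-font fallback could be sketched roughly as follows, again in Python rather than in WebKit's own code. Everything here is an assumption for illustration: the function names are hypothetical, and a font is modeled simply as a set of the characters it has glyphs for. The stored character is tried first in each font, so its own glyph is respected when one exists, and a canonical equivalent is used only when that font lacks it:

```python
import unicodedata

def glyph_candidates(ch):
    """Return ch followed by any distinct canonical equivalents.

    Equivalents are found via NFC and NFD, so this only discovers
    mappings in the direction normalization goes (U+2329 -> U+3008,
    not the reverse); a real implementation would need a reverse table.
    """
    candidates = [ch]
    for form in ("NFC", "NFD"):
        alt = unicodedata.normalize(form, ch)
        if alt not in candidates:
            candidates.append(alt)
    return candidates

def pick_glyph_source(ch, font_stack):
    """Walk the font stack in order; return (font_index, text_to_render).

    font_stack is a list of sets, each holding the characters one
    hypothetical font covers. Returns None if no font can render ch
    or any of its canonical equivalents.
    """
    for index, coverage in enumerate(font_stack):
        for candidate in glyph_candidates(ch):
            if all(c in coverage for c in candidate):
                return index, candidate
    return None

# First font has neither bracket; second font only has U+3008.
fonts = [{"a", "b"}, {"\u3008"}]
result = pick_glyph_source("\u2329", fonts)
```

Here result selects the second font and renders U+3008 in place of the stored U+2329, without changing the stored string.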

I'd be happy to clarify further if anyone has any questions on this research.

