[Webkit-unassigned] [Bug 13150] Character entity references should produce same result as numeric

Fri Mar 23 14:53:44 PDT 2007

http://bugs.webkit.org/show_bug.cgi?id=13150

------- Comment #13 from robburns1 at mac.com  2007-03-23 14:53 PDT -------
(In reply to comment #2)
> This change was intentional, and was a result of a conflict between different
> specs. In Unicode, code points U+2329 and U+232A are deprecated in favor of
> U+3008 and U+3009, respectively. 

I've been researching this further and I think it’s incorrect to treat the
canonical equivalent decomposable singletons as deprecated. The compatibility
characters are discouraged (those that have a decomposition with the keyword
<compat>) but not the canonical equivalent decomposable characters. As far as
the relevance to this bug, it means that there is no conflict between the W3C
and the Unicode spec. However, there's also probably nothing wrong with
translating the &lang; and &rang; character entities into U+3008 and U+3009
respectively.

If these had the keyword <compat>, then Unicode prohibits changing the meaning
of the text, which NFKC normalization does (that is, NFKC does change the
meaning of the text). NFC normalization does not change the meaning of the text.
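
A minimal sketch of the canonical versus compatibility distinction, assuming
Python's standard unicodedata module as a stand-in for the Unicode Character
Database (it is not what WebKit uses). The helper name decomposition_kind and
the choice of U+FB01 as an example of a <compat> character are assumptions made
for illustration only.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify a character's decomposition mapping from the Unicode database."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <compat> or <font>;
    # canonical mappings have no tag at all.
    return "compatibility" if mapping.startswith("<") else "canonical"

LEFT_ANGLE = "\u2329"      # LEFT-POINTING ANGLE BRACKET
CJK_LEFT_ANGLE = "\u3008"  # LEFT ANGLE BRACKET
FI_LIGATURE = "\ufb01"     # LATIN SMALL LIGATURE FI (illustrative <compat> case)

# U+2329 is a canonical singleton: NFC maps it to U+3008.
print(decomposition_kind(LEFT_ANGLE))
print(unicodedata.normalize("NFC", LEFT_ANGLE) == CJK_LEFT_ANGLE)

# A <compat> character survives NFC and is only replaced by NFKC.
print(decomposition_kind(FI_LIGATURE))
print(unicodedata.normalize("NFC", FI_LIGATURE) == FI_LIGATURE)
print(unicodedata.normalize("NFKC", FI_LIGATURE))
```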

From Unicode 3 3.6 (D21) <http://www.unicode.org/book/ch03.pdf>, the following
definition:
"Compatibility characters are included in the Unicode Standard to represent
distinctions in other base standards. They support transmission and processing
of legacy data. Their use is discouraged other than for legacy data."

There is nothing about canonical equivalent characters being deprecated or
discouraged. In fact, it looks to me like the distinction between compat and
canonical is that the canonical characters are not part of the discouraged
legacy compatibility characters.

I also read this to imply that the normalization should only happen in memory.
That is to say, the NFC normalization could be serialized after completion,
but it need not. In many ways I think it would be better not to serialize the
normalization.
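
One way to picture normalizing in memory without serializing the result is
sketched below. StoredText is a hypothetical wrapper invented for this
illustration, not a WebKit class: it compares by NFC form but writes back the
code points it was given.

```python
import unicodedata

class StoredText:
    """Hypothetical wrapper: keeps the original code points, compares by NFC."""

    def __init__(self, original):
        self.original = original
        # The normalized form lives only in memory and is never written out.
        self._nfc = unicodedata.normalize("NFC", original)

    def __eq__(self, other):
        if not isinstance(other, StoredText):
            return NotImplemented
        # Canonical-equivalent sequences are treated as the same text.
        return self._nfc == other._nfc

    def __hash__(self):
        return hash(self._nfc)

    def serialize(self):
        # Serialization returns exactly what was stored, un-normalized.
        return self.original

a = StoredText("\u2329x\u232a")
b = StoredText("\u3008x\u3009")
print(a == b)
print(a.serialize() == "\u2329x\u232a")
```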

From the same URL (C9):
"A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct." 

and

"Ideally, an implementation would always interpret two canonical-equivalent
character sequences identically. There are practical circumstances under which
implementations may reasonably distinguish them."

I read this (along with the W3C document on normalization) to say that WebKit
should normalize strings to NFC for in-memory processing. There are other
issues that this does not settle.

• The issue over whether an editable WebView should serialize substituted
canonical-equivalent characters is not clear from my reading of the Unicode
Standard. 
• Also, whether WebKit should render glyphs for a canonical-equivalent or the
original is not clear either. I would say from a reading of this conformance
chapter that it might be best for WebKit to render the glyphs from a font from
either the stored string code point or any canonical-equivalent code point for
which there was a glyph (as a fallback mechanism).
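
The glyph fallback idea in the second bullet could look roughly like the sketch
below. Everything here is an assumption for illustration: font_coverage is a
hypothetical set of characters a font has glyphs for, pick_glyph_code_point is
an invented name, and only the NFC direction is tried (a stored U+3008 would
not fall back to U+2329 with this code).

```python
import unicodedata

def pick_glyph_code_point(ch, font_coverage):
    """Return the character to draw for ch, or None if the font covers neither.

    font_coverage is a hypothetical set of characters the font has glyphs for.
    """
    # Prefer the code point that is actually stored in the string.
    if ch in font_coverage:
        return ch
    # Otherwise try its canonical equivalent as a fallback.
    equivalent = unicodedata.normalize("NFC", ch)
    if len(equivalent) == 1 and equivalent in font_coverage:
        return equivalent
    return None

# A font that only has the U+3008 glyph still renders a stored U+2329.
print(pick_glyph_code_point("\u2329", {"\u3008"}) == "\u3008")
```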

Anyway, the compatibility characters are discouraged (though I'm not sure if
the term deprecated applies). The canonical-equivalent characters are not
deprecated or discouraged (as far as I can tell).

Only ten code points have been deprecated (from
<http://www.unicode.org/Public/UNIDATA/PropList.txt>):
0340..0341    ; Deprecated # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
17A3          ; Deprecated # Lo       KHMER INDEPENDENT VOWEL QAQ
17D3          ; Deprecated # Mn       KHMER SIGN BATHAMASAT
206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
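
The count of ten can be checked with a short sketch that parses the listing
above. PROPLIST_EXCERPT is just the four lines quoted in this comment, not the
current contents of PropList.txt, and code_points_with_property is a
hypothetical helper name.

```python
PROPLIST_EXCERPT = """\
0340..0341    ; Deprecated # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
17A3          ; Deprecated # Lo       KHMER INDEPENDENT VOWEL QAQ
17D3          ; Deprecated # Mn       KHMER SIGN BATHAMASAT
206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
"""

def code_points_with_property(text, prop):
    """Expand 'XXXX' and 'XXXX..YYYY' entries that carry the given property."""
    points = []
    for line in text.splitlines():
        # Everything after '#' is a comment in the UCD property files.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        range_field, name = fields[0], fields[1]
        if name != prop:
            continue
        first, _, last = range_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        points.extend(range(start, end + 1))
    return points

deprecated = code_points_with_property(PROPLIST_EXCERPT, "Deprecated")
print(len(deprecated))
# Neither angle bracket pair appears in the quoted listing.
print(any(cp in deprecated for cp in (0x2329, 0x232A, 0x3008, 0x3009)))
```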
