[Webkit-unassigned] [Bug 13150] Character entity references should produce same result as numeric

Sat Mar 24 10:42:30 PDT 2007

http://bugs.webkit.org/show_bug.cgi?id=13150

------- Comment #15 from robburns1 at mac.com  2007-03-24 10:42 PDT -------
(In reply to comment #14)
> Talking about characters with singleton decompositions being deprecated, I
> agree that there is no formal official provision saying this. However, that
> fact that such characters can not appear in any normalized text speaks for
> itself. See, for example, the definition of "singleton decomposition" at
> <http://safari.oreilly.com/0201700522/idd1e35240>.

I'm saying that I think that's a misconception about canonical-equivalent
characters. It applies more to compatibility characters (where the Unicode
Standard says this explicitly).

> I am not completely sure what you are suggesting for this bug. Should we return
> to the HTML mapping for ⟩ and ⟨? Then they won't render with default
> OS X fonts, and any text using them would become denormalized. Neither is a
> positive outcome.

I'm not sure anything needs to be done about this bug (I did leave it
closed). But I think that eventually (once the late normalization and perhaps
glyph issues are dealt with) WebKit should not change characters input by
a user or read from storage in a lossy way, and this bug should then be reopened
(or marked as dependent on bug 8738).

> Your comment has some interesting ideas about in-memory processing. It would be
> really great if you could verify those against
> <http://www.w3.org/TR/charmod-norm/>, and file separate bugs (preferably with
> tests) for cases where we do not conform.

Well, I think bug 8738 covers things for now (or needs to be fixed first). Keep
in mind that there are two types of normalization from the W3C: late and early.
Given the nature of WebKit (as often a read-only user agent), it would be best
if WebKit processed strings using late normalization, specifically full NFKC
late normalization.

Early normalization only arises in a role as a content creation user agent. As
a content creation user agent (through HTML editing), I'm not sure WebKit
should adopt early normalization by default. As a developer using WebKit
in this way, I can use HTMLTidy to do an early normalization of my text if the
user, or I as the developer, want that. Without normalization flags set, I want
WebKit to load the text as is and accept input (say from the Mac OS X character
palette) without arbitrarily changing the characters to their canonical
equivalents. Since WebKit needs to do late normalization anyway (in its
read-only mode), there's no reason to force early normalization on users.
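
To make the late/early distinction concrete, here is a minimal Python sketch of
late normalization as I understand it. This is not WebKit code: the helper name
late_equal is made up for illustration, and Python's unicodedata module stands
in for whatever normalization library WebKit would actually use. The stored
text is never rewritten; normalization happens only at comparison time.

```python
import unicodedata

def late_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings under a normalization form.

    Hypothetical helper: both inputs are normalized into temporaries,
    so the caller's strings keep their original code points.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Text as the author typed it: U+2329 / U+232A angle brackets.
stored = "\u2329x\u232a"
# The same text written with the canonical equivalents U+3008 / U+3009.
query = "\u3008x\u3009"

print(stored == query)            # False: the code points differ
print(late_equal(stored, query))  # True: equal after late normalization
print([f"U+{ord(c):04X}" for c in stored])  # stored text is untouched
```

Early normalization would instead overwrite stored with its normalized form
at input time, which is the step I'd rather leave to the author or developer.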

As I've suggested on bug 8738, I think WebKit should use canonical-equivalent
characters for fallback glyphs. So with proper late normalization implemented,
the lack of a glyph for ⟩ and ⟨ would not even be an issue. Earlier
you suggested not going against Unicode on these canonical mappings of lang and
rang. However, Unicode has a problem in going against the vast majority of
authors and content creators who, when looking for a left angle bracket fence
or right angle bracket fence, will look in the mathematical symbols category
(and not the CJK punctuation category) and find the character and the glyph
appearance they're looking for. Changing this on the author is not appropriate
except for fallback. I think the Unicode Standard got these backwards when they
added these characters in 1.1. However, that only becomes an issue if we
interpret (like O'Reilly) canonical-equivalent singletons with decomposition
mappings as deprecated. Obviously if they're canonical equivalents then U+2329
being equivalent to U+3008 means that U+3008 is equivalent to U+2329 too.
They can't both be deprecated singletons.
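
For reference, the mappings in question can be inspected with Python's
unicodedata module. This is only a sketch that prints what the Unicode
Character Database records: the equivalence itself is symmetric, but the
decomposition mapping is stored in one direction only, and that direction is
what normalization follows.

```python
import unicodedata

# U+2329/U+232A: angle brackets in the technical symbols area.
# U+3008/U+3009: angle brackets in CJK Symbols and Punctuation.
for cp in (0x2329, 0x232A, 0x3008, 0x3009):
    ch = chr(cp)
    # Empty string means the character has no decomposition mapping.
    mapping = unicodedata.decomposition(ch) or "(none)"
    nfc = unicodedata.normalize("NFC", ch)
    print(f"U+{cp:04X}  decomposition={mapping:<7} NFC=U+{ord(nfc):04X}")
```

U+2329 and U+232A carry singleton mappings to U+3008 and U+3009, while the CJK
characters carry none, so any normalized text ends up with the CJK code points.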

The compatibility characters are actually "discouraged" by the Unicode Standard
though not "strongly discouraged" (i.e., "deprecated"). The canonical
equivalents are not discouraged at all by the Unicode Standard. However,
despite compatibility characters being discouraged, early NFKC normalization is
not promoted by the W3C because NFKC normalization is semantically lossy.
However, both NFKC and NFC normalizations are presentation (glyph) lossy.
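
A small Python illustration of that difference, using unicodedata and a few
sample characters of my own choosing (the show helper is made up for the
example). NFC leaves compatibility characters alone but still replaces the
canonical singleton; NFKC replaces both.

```python
import unicodedata

def show(text: str) -> str:
    """Render a string as a list of code points for easy comparison."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

samples = {
    "superscript two": "x\u00b2",      # compatibility character
    "fi ligature": "\ufb01",           # compatibility character
    "left angle bracket": "\u2329",    # canonical singleton
}

for name, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    print(f"{name}: original {show(text)} | NFC {show(nfc)} | NFKC {show(nfkc)}")
```

Under NFKC the superscript becomes a plain digit 2, which changes what the text
says; under either form U+2329 becomes U+3008, which changes only which glyph
the author gets.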

This is a lot of writing on this esoteric issue. For the most part, I don't
think early normalization is a problem except for the glyph presentation issue.
Font makers are often trying to make more glyphs available to content creators
than the Unicode Standard has provided character mappings for (user selected
and somewhat semantic glyph variants). The attempt to deprecate canonical
equivalents only contributes to that problem, since font makers could render
the glyph difference through author selection of canonical equivalent
characters.
