[webkit-dev] Fwd: Fwd: Fwd: HTML5 & MathML3 entities

Alexey Proskuryakov ap at webkit.org
Fri Sep 17 15:35:49 PDT 2010



Begin forwarded message:

> From: David Carlisle <davidc at nag.co.uk>
> Date: September 17, 2010 15:32:45 PDT
> To: Alexey Proskuryakov <ap at webkit.org>
> Subject: Re: [webkit-dev] Fwd: Fwd: HTML5 & MathML3 entities
> 
> On 17/09/2010 22:57, Alexey Proskuryakov wrote:
>> 
>> On 17.09.2010, at 14:28, David Carlisle wrote:
>> 
>>> No, the code point is in the math symbols block and was always
>>> intended for math usage. Some time after the code point was added
>>> (I think; I don't have the data to hand) it was given a canonical
>>> mapping to the 3xxx block. That was an error that the Unicode
>>> consortium is now trying to correct (or at least was, back when
>>> Unicode 3.x added this new character).
>> 
>> I cannot follow this argument. My understanding is that adding a
>> single character canonical decomposition implies deprecation in
>> Unicode, so describing this as a two-step process confuses me.
> 
> Adding a canonical decomposition doesn't imply deprecation.
> Depending on which canonical form is chosen, the canonicalisation mapping can go either way. Loosely speaking, some forms prefer composite characters and some use combining characters in preference (not that combining characters are involved here).
> 
> 2329 was deprecated some years after the canonical mapping was added,
> because it was realised that the mapping was wrong, but mappings are never changed once added. It did not become deprecated when the mapping to 3008 was added; it became deprecated when it was replaced by 27E8.
> I described it as a two-step process because it happened in two stages.
> 
> 
> 
>> At the time I looked at this (and also currently) the deprecated
>> character had a canonical decomposition that made it equivalent to a
>> CJK character. Any software that treats this character as a math one
>> clearly violates many versions of the Unicode specs, including the
>> current one. It might have been conformant to Unicode 2.0 or some
>> earlier version though.
> 
> It was conformant to Unicode 2, yes. The fact that Unicode then added a canonical mapping to 3xxx doesn't make those systems non-conformant: systems don't have to use NFC form, and they don't have to use any particular glyph, so for either reason it's perfectly conformant to use a math character for 2329.
> 
>> 
>>> the lang and rang entity names come from the ISO math entity sets,
>>> where they denote math angle brackets. These sets and these names
>>> predate Unicode and predate HTML. It's unfortunate that after the
>>> names were mapped to Unicode a canonical mapping to a different
>>> character was added, but
>> 
>> I don't see how the origins of the debate change the fact that these
>> Unicode fonts you mentioned violated the Unicode spec. They may have
>> been doing "the right thing" or not, but arguing that they didn't
>> violate the letter of the spec seems strange.
> 
> I don't think they violate the spec at all, except insofar as the spec was internally inconsistent once it had added a canonical mapping between two separate characters.
>> 
>> Clearly, I have a different perspective, since I don't think that
>> things that pre-date HTML and Unicode should have much weight in
>> today's decisions.
> 
> The point is that there have been documents using those entities as math character names in continuous use since the '80s. Why should they all be broken? Not to mention the fact that the vast majority of uses of those entities in HTML will also be expecting a mathematical bracket (even if on some systems, with some fonts, the glyph used was actually designed for CJK punctuation).
> 
> In fact, where classical ISO usage and HTML usage differed, I followed HTML usage in all cases (for all the obvious reasons), even when the HTML definitions make no sense at all (e.g. asymp), but in this case
> external factors (i.e. Unicode moving the goalposts) meant that the "new" Unicode 3.2 character should be used here.
> 
>> 
>>> the only fix the UTC suggest for that is just not using 2329 at all
>>> and use 27E8 instead. Which is what the entity spec recommends.
>> 
>> 
>> Did they actually suggest to use it for the lang entity in HTML, or
>> did they suggest to use it when a math character is desired?
> 
> XHTML entities have document scope, so it is not possible for an XHTML+MathML document to have different definitions for HTML and MathML use. But even for pure HTML use it is fairly clear that 27E8 is the correct choice.
> 
> rang was never defined to be 3009; it was defined to be 232A and documented as being a math angle bracket. Unicode has deprecated 232A and suggests that any uses of it be replaced by 27E9, because 232A is effectively unusable: it is subject to an essentially accidental and incorrect normalisation to 3009.
> 
> It would be bizarre in the extreme to redefine rang to be 3009 (is there any evidence of anyone ever having used that entity name and wanting a CJK character?). The choices are to do what Unicode has suggested (since Unicode 3.2) and use 27E9 instead, or to declare that changing the HTML entities is just too scary, leave it as 232A, and live with the fact that this will be inconsistently rendered, violates the W3C/Unicode charmod normal form rules, and goes directly against the deprecation of this character in the Unicode specification. Of those two choices, defining it to be 27E9 seems to be the lesser of two evils.
> 
> David
> 
>> 
>> - WBR, Alexey Proskuryakov
>> 
>> 
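
The normalisation behaviour described in the quoted message can be checked directly. What follows is only a minimal sketch, assuming Python 3 and its standard unicodedata and html.entities modules; the constant and function names (DEPRECATED_LANG, survives_normalization, describe) are made up for illustration and do not come from the thread or from any spec. It shows that 2329 is folded into 3008 by normalisation while 27E8 is left alone, and it looks up how the entity tables shipped with Python (an HTML 4 table and an HTML5 table) define lang and rang.

```python
import unicodedata
from html.entities import html5, name2codepoint

# Code points named in the thread (illustrative constant names, not from any spec).
DEPRECATED_LANG = "\u2329"  # LEFT-POINTING ANGLE BRACKET (deprecated)
CJK_LANG = "\u3008"         # LEFT ANGLE BRACKET (CJK punctuation)
MATH_LANG = "\u27e8"        # MATHEMATICAL LEFT ANGLE BRACKET


def survives_normalization(ch: str) -> bool:
    """Return True if ch is unchanged by all four Unicode normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)


def describe(ch: str) -> str:
    """Summarise what NFC does to a single character."""
    nfc = unicodedata.normalize("NFC", ch)
    return (f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
            f"NFC U+{ord(nfc):04X} {unicodedata.name(nfc)}")


if __name__ == "__main__":
    for ch in (DEPRECATED_LANG, CJK_LANG, MATH_LANG):
        print(describe(ch), "| stable:", survives_normalization(ch))

    # Entity definitions as shipped in Python's standard tables:
    # name2codepoint follows HTML 4, html5 follows the HTML5 named references.
    for name in ("lang", "rang"):
        print(f"{name}: HTML 4 U+{name2codepoint[name]:04X}, "
              f"HTML5 U+{ord(html5[name + ';']):04X}")
```

Because the decomposition of 2329 is a single-character (singleton) mapping, both composed and decomposed normal forms rewrite it to 3008, which is why text containing 2329 cannot round-trip through a normalising system.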

- WBR, Alexey Proskuryakov



