[webkit-dev] Fwd: Fwd: Fwd: HTML5 & MathML3 entities

Fri Sep 17 18:08:10 PDT 2010

Begin forwarded message:

> From: David Carlisle <davidc at nag.co.uk>
> Date: September 17, 2010 18:02:39 Pacific Daylight Time
> To: Alexey Proskuryakov <ap at webkit.org>
> Subject: Re: [webkit-dev] Fwd: Fwd: HTML5 & MathML3 entities
> 
> On 18/09/2010 00:05, Alexey Proskuryakov wrote:
>> 
>> On 17.09.2010, at 15:32, David Carlisle wrote:
>> 
>>> adding a canonical decomposition doesn't imply deprecation.
>>> Depending on which canonical form is chosen, the canonicalisation
>>> mapping can go either way, loosely speaking some forms prefer
>>> composite characters, some use combining characters in preference
>>> (not that combining characters are involved here)
>> 
>> This is not accurate. For singleton decomposition, both NFC and NFD
>> contain the decomposed form. See Unicode 5.2.0 section D113 (full
>> composition exclusion) for details.
> 
> Yes, NFC and NFD are the same in these cases, but that doesn't really change the main point: deprecation here has nothing to do with the character having a different normal form. Compare
> ANGSTROM SIGN (212B) and
> LATIN CAPITAL LETTER A WITH RING ABOVE (00C5).
> These are similarly related by canonical equivalence, so clearly C5 is preferred, but 212B is not deprecated in the way that 2329 is.
> see the entry for 212B in
> 
> http://www.unicode.org/charts/PDF/U2100.pdf
> 
> 2329 is deprecated because it is replaced by 27E8, not because it
> maps to something else in NFC.
> 
>> 
>>> 2329 was deprecated some years after the canonical mapping was
>>> added because it was realised that that mapping was wrong, but
>>> mappings are never changed once added. It became deprecated not
>>> when the mapping to 3008 was added; it became deprecated when it
>>> was replaced by 27E8. I described it as a two step process because
>>> it happened in two stages.
>> 
>> Because of the above, I don't see how it could happen in two stages.
>> Adding a singleton decomposition logically implies deprecation. And
>> it wasn't until Unicode 5.2 that "deprecated" had a clearly defined
>> meaning anyway.
> If that were the meaning of deprecated in this case, the deprecated character would be deprecated in favour of its canonically equivalent character, but that isn't the case. It is deprecated _because_ that incorrect decomposition exists, and it is deprecated in favour of a new character added specifically to avoid the problem.
>> 
>>> It was conformant to Unicode 2, yes. The fact that Unicode then
>>> added a canonical form to 3xxx doesn't make them non-conformant:
>>> systems don't have to use NFC form and they don't have to use any
>>> particular glyph, so for either reason it's perfectly conformant to
>>> use a math character for 2329.
>> 
>> Again, both composition and decomposition of U+2329 produce U+3008.
> Yes, but a system isn't obliged to compose or decompose (and most do not do so automatically, in my experience).
>> 
>>> The point is that there have been documents using those entities as
>>> math character names in continuous use since the '80s; why should
>>> they all be broken? Not to mention the fact that the vast majority
>>> of uses of those entities in HTML will also be expecting a
>>> mathematical bracket (even if on some systems, with some fonts, the
>>> character glyph used was actually designed for CJK punctuation).
>>> 
>>> In fact, where classical ISO usage and HTML usage differed I
>>> followed HTML usage in all cases (for all the obvious reasons), even
>>> when the HTML definitions make no sense at all (e.g. asymp), but in
>>> this case external factors (i.e. Unicode moving the goalposts) meant
>>> that the "new" Unicode 3.2 character should be used here.
>> 
>> Do these documents use the entities with the same "&...;" notation?
> 
> yes, of course.
> 
>> MathML didn't exist in the 80's, so what are the documents that
>> actually conflict with HTML, or with compound XHTML documents?
> 
> Well, the point of breaking the MathML (and HTML) entities into a separate spec was to get a uniform set of definitions across different uses. If (as was the case) the same entity name (used via the same syntax) means different things in DocBook, MathML and HTML, then formally you may argue that everything is OK and consistent, since each document obeys its own language definition, but in practice moving fragments between documents results in silent data corruption.
> The entity spec was separated out from the MathML spec in 2003 and went through numerous public revisions. People in the old and the new HTML groups were asked to comment on it, as were people on the UTC/Unicode list and people on the original ISO working groups who defined the entity names. After 7 years of open review it went to REC earlier this year (and MathML3, which depends on it, will hopefully go to REC this month).
> 
> 
>>>>> the only fix the UTC suggest for that is to not use 2329 at
>>>>> all and to use 27E8 instead, which is what the entity spec
>>>>> recommends.
>>>> 
>>>> 
>>>> Did they actually suggest to use it for the lang entity in HTML,
>>>> or did they suggest to use it when a math character is desired?
> 
> The comments were in relation to the entities draft, which has the explicit intention of being a common set of definitions for any use of these entity names.
> 
>>> XHTML entities have document scope, so it is not possible for an
>>> XHTML+MathML document to have different definitions for HTML and
>>> MathML use, but even for pure HTML use it is fairly clear that 27E8
>>> is the correct choice.
>> 
>> I wasn't asking about HTML vs. XHTML - both used to define &rang; in
>> the same way.
> 
> The same way as MathML2, actually. This change isn't about matching XHTML or MathML2, it's about tracking changes to Unicode.
> 
>> I can re-phrase my question as "Did they actually
>> suggest to use it for the lang entity in (X)HTML, or did they suggest
>> to use it when a math character is desired?"
>> 
> 
>> 
>> I don't think that characterizing what we did in WebKit as bizarre in
>> the extreme is fair.
> 
> Fair or not, I think it was clearly the wrong thing to do (even if well intentioned); nothing in the HTML or XHTML specifications would license such a definition. You could perhaps claim that you were using HTML followed by NFC normalisation, but that's a very weak argument, I think.
> 
> The Unicode spec (or at least the code chart page at http://www.unicode.org/charts/PDF/U2300.pdf, which is what I have to hand) doesn't say it is deprecated in favour of 3009; it says that it is deprecated _because_ of the equivalence to CJK punctuation, and that mathematical use is strongly recommended to use 27E8 instead.
> 
> It is very hard to believe that anyone using CJK characters (and so presumably with access to some convenient keyboarding scheme for those code ranges) suddenly requires an ASCII entity name reference to access a punctuation character. Conversely, mathematical usage habitually uses long ASCII names for characters. It is clear that rang and lang have always been intended as mathematical characters, and I ask again whether you really think that (barring artificial test cases) anyone writing in CJK languages uses these English ASCII entity references for just those two characters. I don't see how it is possible to read Unicode as saying anything other than that rang ought to point to 27E9, the closing counterpart of 27E8.
> 
> Unicode Technical Report 25 says:
> 
> Unicode 3.2 added two new mathematical angle bracket characters ⟨ ⟩ (U+27E8 and U+27E9) that are unequivocally intended for mathematical use and should be used instead of U+2329 and U+232A.
> 
> 
> 
> David
> 
> 
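
The normalization behaviour argued over above can be checked directly. What follows is only a minimal sketch in Python, assuming the standard unicodedata module reflects the canonical mappings discussed in the thread; the constant and function names are made up for illustration and come from neither message.

```python
import unicodedata

# Code points named in the thread (constant names are illustrative only).
LEFT_ANGLE_2329 = "\u2329"  # LEFT-POINTING ANGLE BRACKET (the deprecated one)
CJK_LEFT_3008 = "\u3008"    # LEFT ANGLE BRACKET (CJK punctuation)
MATH_LEFT_27E8 = "\u27e8"   # MATHEMATICAL LEFT ANGLE BRACKET
ANGSTROM_212B = "\u212b"    # ANGSTROM SIGN
A_RING_00C5 = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE


def normal_forms(ch):
    """Return the NFC and NFD forms of a single character."""
    return {form: unicodedata.normalize(form, ch) for form in ("NFC", "NFD")}


lang_forms = normal_forms(LEFT_ANGLE_2329)
angstrom_forms = normal_forms(ANGSTROM_212B)
math_forms = normal_forms(MATH_LEFT_27E8)

# 2329 never survives normalization: both forms give the CJK bracket.
print([hex(ord(c)) for c in lang_forms["NFC"]])
# 212B does not survive either, although it is not deprecated like 2329.
print([hex(ord(c)) for c in angstrom_forms["NFC"]])
# 27E8 has no decomposition, so it is stable under both forms.
print(math_forms["NFC"] == MATH_LEFT_27E8)
```

Under these assumptions, a pipeline that normalizes its text silently turns a 2329 into CJK punctuation, while a 27E8 passes through unchanged.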

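The two competing definitions of lang and rang can also be seen side by side. This is a rough sketch, assuming that Python's html.entities tables follow the HTML 4 definitions (name2codepoint) and the HTML5 definitions (html5) respectively; it says nothing about what any particular browser does.

```python
import html
import unicodedata
from html.entities import html5, name2codepoint

# Older table: lang is the now-deprecated 2329.
html4_lang = chr(name2codepoint["lang"])

# Newer table: lang is the mathematical bracket 27E8.
html5_lang = html5["lang;"]

# html.unescape resolves named references using the newer table.
resolved = html.unescape("&lang;x&rang;")

# An old-style resolution followed by NFC ends up at the CJK bracket,
# which is the "HTML followed by NFC normalisation" reading from the thread.
html4_then_nfc = unicodedata.normalize("NFC", html4_lang)

for label, text in [
    ("HTML 4 lang", html4_lang),
    ("HTML5 lang", html5_lang),
    ("HTML 4 lang after NFC", html4_then_nfc),
]:
    print(label, [hex(ord(c)) for c in text])
```

A fragment moved between two systems that disagree on these tables changes meaning without any error being raised, which is the silent data corruption described in the message.
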
- WBR, Alexey Proskuryakov