# [webkit-dev] HTML5 & MathML3 entities

Sat Jul 10 11:00:35 PDT 2010

On Sat, Jul 10, 2010 at 4:49 AM, Maciej Stachowiak <mjs at apple.com> wrote:
> On Jul 10, 2010, at 3:47 AM, Sausset François wrote:
>> I'm currently working on the MathML3 implementation and I noticed that new XML entities have been defined by the W3C:
>> http://www.w3.org/TR/xml-entity-names/
>>
>> They are supposed to be used by both HTML 5 & MathML 3.
>>
>> I would like to include them in WebCore/html/HTMLEntityNames.gperf.
>> However there is one conflict with the existing XHTML 1.0 entities: \rangle (and \langle) do not map to the same Unicode character in the XHTML 1.0 and HTML 5 entity definitions.
>> For instance, HTML 5 maps \rangle to U+27E9 ("⟩") instead of U+3009 ("〉").
>>
>> There are two possibilities:
>> - either update WebCore/html/HTMLEntityNames.gperf and overwrite the two conflicting cases with the new standard, but it won't respect the XHTML 1.0 specification anymore.
>> - or use two sets of HTML entities depending on the DTD of the document. It would be the cleanest way, but I don't know how to make WebCore handle two such sets.
>>
>> I think the best solution is the second one, but I'll need help to make WebCore handle two entity sets and switch depending on the DTD. It is outside of my present skills.
>
> Go with the HTML5 / MathML 3 definitions for everything. Our XHTML implementation targets XHTML5, not XHTML 1.0.

Implementing MathML entities is not as easy as adding them to
HTMLEntityNames.gperf.  The problem is that our entity parsing code (both
the legacy entity parser and the new HTML5 one we're using) assumes
that all named entities are at most 8 characters long:

http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194

Rather than just bumping up that number, we need to change the data
structure we use to store entities.  Instead of a perfect hash, we
should use a prefix tree.  In order to parse entities correctly
according to the spec, we need to know whether a given string is a
prefix of a named entity, which is what the prefix tree would tell us.
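
To make the prefix tree idea concrete, here is a minimal sketch in plain
C++.  It is not WebKit code: the class name EntityTrie and its methods
(insert, isPrefix, lookup, longestMatch) are hypothetical, it uses the
standard library rather than WTF types, and it assumes every entity maps
to a single code point, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical prefix tree for named entities (illustration only).
class EntityTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool isEntity = false;   // true if the path to this node spells a full name
        char32_t codePoint = 0;  // only meaningful when isEntity is true
    };

    Node m_root;

    // Walks the tree along `text`; returns null if no entity name starts with it.
    const Node* find(const std::string& text) const
    {
        const Node* node = &m_root;
        for (char c : text) {
            auto it = node->children.find(c);
            if (it == node->children.end())
                return nullptr;
            node = it->second.get();
        }
        return node;
    }

public:
    // Registers an entity name (without '&' or ';') of any length.
    void insert(const std::string& name, char32_t codePoint)
    {
        assert(!name.empty());
        Node* node = &m_root;
        for (char c : name) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child)
                child = std::make_unique<Node>();
            node = child.get();
        }
        node->isEntity = true;
        node->codePoint = codePoint;
    }

    // True if at least one entity name starts with `text`.
    bool isPrefix(const std::string& text) const { return find(text) != nullptr; }

    // True if `text` is exactly an entity name; its code point is stored in `out`.
    bool lookup(const std::string& text, char32_t& out) const
    {
        const Node* node = find(text);
        if (!node || !node->isEntity)
            return false;
        out = node->codePoint;
        return true;
    }

    // Length of the longest entity name at the start of `input` (0 if none).
    // `out` is only written when a match is found.
    std::size_t longestMatch(const std::string& input, char32_t& out) const
    {
        const Node* node = &m_root;
        std::size_t best = 0;
        for (std::size_t i = 0; i < input.size(); ++i) {
            auto it = node->children.find(input[i]);
            if (it == node->children.end())
                break;
            node = it->second.get();
            if (node->isEntity) {
                best = i + 1;
                out = node->codePoint;
            }
        }
        return best;
    }
};
```

With "rang" and "rangle" both inserted, isPrefix("rangl") is true while
lookup("rangl", ...) is false, so a parser knows it has to keep consuming
characters.  Name length no longer matters either: a long name such as
"CounterClockwiseContourIntegral" is simply a deeper path in the tree.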