[webkit-dev] HTML5 & MathML3 entities

Sat Jul 10 11:17:31 PDT 2010

On Sat, Jul 10, 2010 at 11:10 AM, Sausset François <sausset at gmail.com> wrote:
> I just saw that when looking at the code myself.
> What exactly do you mean by a prefix tree?

http://en.wikipedia.org/wiki/Trie
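
For illustration, here is a minimal sketch of such a prefix tree. It is
not WebKit code: the class name EntityTrie and its methods are
hypothetical, and entity names are stored without the leading '&' or
trailing ';' to keep the example short. The point is that one walk over
the tree answers both "is this string a prefix of some entity name?" and
"what is the longest entity name starting here?", with no fixed limit on
name length.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical prefix tree (trie) over entity names.
// A sketch for illustration only, not WebKit's actual data structure.
class EntityTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool isEntity = false; // true if the path from the root spells a full name
    };

public:
    // Adds one entity name, e.g. "amp". Names must be non-empty.
    void insert(const std::string& name)
    {
        assert(!name.empty());
        Node* node = &m_root;
        for (char c : name) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child)
                child = std::make_unique<Node>();
            node = child.get();
        }
        node->isEntity = true;
    }

    // True if at least one entity name starts with s.
    // A complete name counts as a prefix of itself.
    bool isPrefix(const std::string& s) const { return find(s) != nullptr; }

    // True if s is exactly an entity name.
    bool contains(const std::string& s) const
    {
        const Node* node = find(s);
        return node && node->isEntity;
    }

    // Length of the longest entity name beginning at input[start],
    // or 0 if no entity name begins there.
    std::size_t longestMatch(const std::string& input, std::size_t start = 0) const
    {
        const Node* node = &m_root;
        std::size_t best = 0;
        for (std::size_t i = start; i < input.size(); ++i) {
            auto it = node->children.find(input[i]);
            if (it == node->children.end())
                break; // no entity name continues with this character
            node = it->second.get();
            if (node->isEntity)
                best = i - start + 1;
        }
        return best;
    }

private:
    // Returns the node reached by following s from the root, or nullptr.
    const Node* find(const std::string& s) const
    {
        const Node* node = &m_root;
        for (char c : s) {
            auto it = node->children.find(c);
            if (it == node->children.end())
                return nullptr;
            node = it->second.get();
        }
        return node;
    }

    Node m_root;
};
```

With "not" and "notin" both inserted, longestMatch("notit;") stops at
"not", because "noti" is still a prefix of an entity name but "notit" is
not. A std::map per node is chosen here for brevity; a production table
would more likely be a compact, statically generated structure.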

> I also noticed that the entity parser does not take into account combined
> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
> In addition, even without entities, combined characters are displayed as
> separate ones.

My understanding is that this is the correct behavior w.r.t. the HTML5
specification of entity parsing.  Our entity processing aims for
perfect compliance with this algorithm:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

My belief is that the only things we're missing for perfect compliance
are the expanded list of entity names:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

and the prefix tree.

Adam

> On 10 Jul 2010, at 21:00, Adam Barth wrote:
> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf.  The problem is our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
>
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities.  Instead of a perfect hash, we
> should use a prefix tree.  In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
>
> Adam
>