[webkit-reviews] review denied: [Bug 65803] Initial pass at a new XML tree builder : [Attachment 103137] Patch

Fri Aug 5 19:25:29 PDT 2011

Adam Barth <abarth at webkit.org> has denied Jeffrey Pfau <jeffrey at endrift.com>'s
request for review:
Bug 65803: Initial pass at a new XML tree builder
https://bugs.webkit.org/show_bug.cgi?id=65803

Attachment 103137: Patch
https://bugs.webkit.org/attachment.cgi?id=103137&action=review

------- Additional Comments from Adam Barth <abarth at webkit.org>
View in context: https://bugs.webkit.org/attachment.cgi?id=103137&action=review

Overall, I would split most of this code out into a new object. 
NewXMLDocumentParser is going to be a large complex class.  The more we can
move off into smaller classes, the better off our lives will be.  For HTML, we
split this code into two pieces (the TreeBuilder and the ConstructionSite), but
that's probably not necessary for XML because this part of the algorithm is
simpler.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:124
>  void NewXMLDocumentParser::finish()
>  {
> +    m_finishWasCalled = true;

Should we assert that !m_finishWasCalled ?  I think finish can only be called
once.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:171
> +void NewXMLDocumentParser::pushCurrentNode(ContainerNode* n)

n => node ?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:213
> +    ExceptionCode ec = 0;
> +
> +    document()->setXMLVersion(String::adopt(token.xmlVersion()), ec);
> +    document()->setXMLStandalone(token.xmlStandalone(), ec);

Does this ec just vanish?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:221
> +    String extId(token.publicIdentifier().data(),
token.publicIdentifier().size());
> +    String sysId(token.systemIdentifier().data(),
token.systemIdentifier().size());

We should use complete words in variable names.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:232
> +    if ((extId == "-//W3C//DTD XHTML 1.0 Transitional//EN")
> +	   || (extId == "-//W3C//DTD XHTML 1.1//EN")
> +	   || (extId == "-//W3C//DTD XHTML 1.0 Strict//EN")
> +	   || (extId == "-//W3C//DTD XHTML 1.0 Frameset//EN")
> +	   || (extId == "-//W3C//DTD XHTML Basic 1.0//EN")
> +	   || (extId == "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN")
> +	   || (extId == "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG
1.1//EN")
> +	   || (extId == "-//WAPFORUM//DTD XHTML Mobile 1.0//EN")

These string constants should be AtomicStrings (as should extId) so we can do
this comparison much faster.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:237
> +void NewXMLDocumentParser::processStartTag(const AtomicXMLToken& token)

This function is too long and should be broken up into it's component pieces.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:247
> +	       if (attribute->name().prefix() == "xmlns")

This should also be an atomic string comparison.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:274
> +    if (token.attributes()) {
> +	   for (unsigned i = 0; i < token.attributes()->length(); ++i) {
> +	       Attribute* attribute = token.attributes()->attributeItem(i);
> +	       ExceptionCode ec = 0;
> +	       if (attribute->name().prefix() == "xmlns")
> +		   newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI,
"xmlns:" + attribute->name().localName(), attribute->value(), ec);
> +	       else if (attribute->name() == "xmlns")
> +		   newElement->setAttributeNS(XMLNSNames::xmlnsNamespaceURI,
xmlnsAtom, attribute->value(), ec);
> +	       else {
> +		   QualifiedName qName(attribute->prefix(),
attribute->localName(), namespaceForPrefix(attribute->prefix(), nullAtom));
> +		   newElement->setAttribute(qName, attribute->value());
> +	       }
> +	   }
> +    }

For example, this is a self-contained, logical unity.  Maybe this should be a
method on the token to give its attributes to the element?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:290
> +    if (isFirstElement && document()->frame())
> +	   document()->frame()->loader()->dispatchDocumentElementAvailable();
> +
> +    m_currentNodeItem = m_currentNodeStack.last();

dispatchDocumentElementAvailable is likely to do a lot of work.  Do we want to
make sure this object is in a consistent state before calling out into a bunch
of complex code that my re-enter us or assume that we're in a consistent state?

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:298
> +    ContainerNode* n = m_currentNode;
> +    n->finishParsingChildren();

finishParsingChildren can run arbitrary code and re-enter this function. 
There's no guantee that n will still be alive after calling this.  At the very
least, we should grab a RefPtr to n before calling out to arbitrary code, but
we should also ensure that we're in a consistent state.  (Also, we should
rename n to something like node.)

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:358
> +	   HTMLEntitySearch search;
> +	   const AtomicString& name = token.name();
> +	   for (unsigned i = 0; i < name.length(); ++i) {
> +	       search.advance(name[i]);
> +	       if (!search.isEntityPrefix())
> +		   stopParsing();
> +		   return;
> +	   }
> +	   search.advance(';');
> +	   ExceptionCode ec = 0;
> +	   UChar32 entityValue = search.currentValue();
> +	   if (entityValue <= 0xFFFF)
> +	      
m_leafTextNode->appendData(String(reinterpret_cast<UChar*>(&entityValue), 1),
ec);
> +	   else {
> +	       UChar utf16Pair[2] = { U16_LEAD(entityValue),
U16_TRAIL(entityValue) };
> +	       m_leafTextNode->appendData(String(utf16Pair, 2), ec);
> +	   }

This also looks like a self-contained block of code that should be factored
out.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:380
> +	   AtomicString amp("amp");
> +	   AtomicString apos("apos");
> +	   AtomicString gt("gt");
> +	   AtomicString lt("lt");
> +	   AtomicString quot("quot");
> +	   ExceptionCode ec = 0;
> +
> +	   if (token.name() == amp)
> +	       m_leafTextNode->appendData("&", ec);
> +	   else if (token.name() == apos)
> +	       m_leafTextNode->appendData("'", ec);
> +	   else if (token.name() == gt)
> +	       m_leafTextNode->appendData(">", ec);
> +	   else if (token.name() == lt)
> +	       m_leafTextNode->appendData("<", ec);
> +	   else if (token.name() == quot)
> +	       m_leafTextNode->appendData("\"", ec);
> +	   else {
> +	       stopParsing();
> +	       return;
> +	   }

I'd also factor this out into a static free function.  The constants should be
DEFINE_STATIC_LOCAL so we only need to allocate / hash them once.

> Source/WebCore/xml/parser/NewXMLDocumentParser.cpp:395
> +void NewXMLDocumentParser::enterText()
> +{
> +    if (!m_sawFirstElement) {
> +	   // FIXME: ensure it's just whitespace
> +	   return;
> +    }
> +
> +    if (!m_leafTextNode) {
> +	   m_leafTextNode = Text::create(document(), "");
> +	   m_currentNode->parserAddChild(m_leafTextNode.get());
> +    }
> +}

Can we optimize this to allocate the text node with the first text token
directly?  I'm not sure if it's expensive to allocate text node with an empty
string and then fill it, but it could well cost an extra malloc.  This
operation happens many, many times while parsing.

> Source/WebCore/xml/parser/NewXMLDocumentParser.h:67
> +	   ContainerNode* m_node;
> +	   AtomicString m_namespace;

Data members should be private.  You can expose getter/setting functions if you
like.