[webkit-dev] Why so many text nodes in the DOM? (especially ones with just whitespace)

Thu Jun 17 09:53:15 PDT 2010

On 16 Jun 2010, at 23:12, David Hyatt wrote:

> On Jun 14, 2010, at 7:00 PM, Matt 'Murph' Finnicum wrote:
> 
>> Why are there so many Text nodes in the DOM? I had a look at the initial DOM tree from rendering slashdot, and there are 1959 Text nodes. Of those 1959, 1246 were whitespace-only nodes.
>> 
>> Does there need to be this many nodes? Why can't whitespace be combined with the nodes next to it?

> Whitespace nodes most commonly occur between elements, so they can't be coalesced.

Hmm, this touches on a very interesting topic...

Strictly speaking, a basic parser, be it XML or HTML, should never, ever report anything to downstream consumers that was not in the original source document, and by the same token it should not silently drop anything that was. The software is doing its job pretty accurately in that respect. All it needs is a little help from the consumer/user/developer. Think of JS minification: it saves on bandwidth and even, IIRC, makes the compiler's job easier.
Basically, almost all of that whitespace serves only one purpose: to make the source human-readable. That is all well and good while you're developing a website or webapp, but come deployment, you will always fare better if the input stream is guaranteed to be processor-friendly to begin with. Fewer ghosts to chase.

If the input is at least XHTML, and one is a tiny bit versed in XSLT, adding a preprocessing whitespace stripper stylesheet could be a quick-fix solution to reduce the waste of resources. That does consume resources elsewhere, obviously, so you may want to check if it's really worth the effort.
You would get xml:space handling for free if the processor is compliant in that respect. The rest is basically a plain copy template for all nodes, with text nodes as the exception: for those, you can use normalize-space() to see if they contain anything other than XML whitespace, and thus need copying.
The limitation is that you do not have access to the resolved CSS, if I'm correct. In other words, if you have elements that can have #PCDATA content and that get a class assigned that sets properties related to whitespace preservation (white-space: pre, for instance), the XSLT stylesheet will not see it (although there may exist extensions for CSS parsing, not sure).
Then again, whitespace within, before or after text nodes is no problem, since that is presumed significant by default (but that gets coalesced with the other text later on, so no issue at all).
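
To make the preprocessing idea above a bit more concrete, here is a rough sketch of the same logic in Python rather than XSLT, using the standard xml.dom.minidom module. The function names are made up for illustration, and it only works on well-formed XML/XHTML input. Note one caveat it shares with the naive stylesheet: a whitespace-only text node between two inline elements (say, between a </b> and an <i>) is dropped as well, which would join the words.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"


def strip_ignorable_whitespace(node, preserve=False):
    """Remove whitespace-only text nodes below `node`, in place.

    Honors xml:space="preserve" / "default" on ancestor elements.
    """
    if node.nodeType == Node.ELEMENT_NODE:
        space = node.getAttribute("xml:space")
        if space == "preserve":
            preserve = True
        elif space == "default":
            preserve = False
    # Iterate over a copy, since children are removed while walking.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            # Same spirit as testing normalize-space(.) = '' in XSLT.
            if not preserve and child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        else:
            strip_ignorable_whitespace(child, preserve)


def count_text_nodes(node):
    """Count all Text nodes below `node`."""
    return sum(
        1 if child.nodeType == Node.TEXT_NODE else count_text_nodes(child)
        for child in node.childNodes
    )
```

Calling count_text_nodes() on a minidom document before and after strip_ignorable_whitespace() gives a quick idea of how many nodes a given page would save. Like the stylesheet, this knows nothing about CSS, so anything relying on white-space: pre without xml:space would be affected.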

Part of it could, maybe, remotely, be implemented in WebKit itself.
If WebKit chooses, for example, to ignore character events from the parser in nodes where logically it doesn't make sense to have stray characters (which, incidentally, is the strategy Apache FOP uses, but that may be a slightly different story since that is pure XML), it could mean a significant reduction of the above 1246 nodes... perhaps even to 0? 
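
Purely as an illustration of that strategy (this is neither WebKit nor FOP code), a SAX-style handler in Python could drop whitespace-only character events whenever the current parent is an element that cannot sensibly contain text. The ELEMENT_ONLY set below is an assumed, incomplete list picked for the example; a real implementation would have to derive it from the content models.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Assumption: a small illustrative set of elements with element-only
# content. Not exhaustive, and not taken from any real engine.
ELEMENT_ONLY = {
    "html", "head", "table", "thead", "tbody", "tfoot", "tr",
    "ul", "ol", "dl", "select",
}


class WhitespaceDroppingHandler(ContentHandler):
    """Collects character data, ignoring stray whitespace events."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.kept = []     # character chunks passed downstream
        self.dropped = 0   # number of ignored character events

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        parent = self.stack[-1] if self.stack else None
        if parent in ELEMENT_ONLY and content.strip(" \t\r\n") == "":
            # Whitespace directly inside e.g. <ul> or <tr>: no text
            # node would ever be created for it.
            self.dropped += 1
            return
        self.kept.append(content)


def parse_dropping_whitespace(source):
    """Parse an XML string and return the populated handler."""
    handler = WhitespaceDroppingHandler()
    xml.sax.parseString(source.encode("utf-8"), handler)
    return handler
```

Keep in mind that a SAX parser may deliver one run of characters as several events, so `dropped` counts events, not would-be text nodes.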

Downsides? The live DOM no longer *exactly* reflects the input, so it would definitely need to be configurable, just in case one does need that functionality. OTOH, let's say that 95% of a site's visitors are not interested at all in what the HTML source looks like. If you really want to share your code with the other 5%, there are far better ways to do that than relying on 'View Source', no? For the remainder, I must admit I am having a hard time imagining scenarios where ignorable whitespace would be desirable to keep around. In the worst case, it could even needlessly complicate certain layout- and rendering-related tasks...

Regards,

Andreas Delmelle