[Webkit-unassigned] [Bug 35831] New: WebCore PreloadScanner Entity Detection Bug - Non-HTML Entities are being treated as entities

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Sat Mar 6 12:46:02 PST 2010


https://bugs.webkit.org/show_bug.cgi?id=35831

           Summary: WebCore PreloadScanner Entity Detection Bug - Non-HTML
                    Entities are being treated as entities
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: Macintosh Intel
               URL: http://www.vistaprint.com/gallery.aspx
        OS/Version: Mac OS X 10.6
            Status: UNCONFIRMED
          Severity: Normal
          Priority: P2
         Component: Page Loading
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: mirthy at gmail.com


The entity detector in WebCore's PreloadScanner is broken.

The HTML tokenizer it uses accepts text that looks like an entity but isn't
one, and converts it into a Unicode character.

For example, in scanning the HTML to pull out IMG tags, we might have a case
like this:

<img src="http://www.webkit.org/getImage.aspx?id=12345&lang_id=1"/>

The tokenizer spots &lang_id=1 and thinks it might be an entity (it isn't!).
The test for entities in the PreloadScanner is incorrect, unlike the one in
HTMLTokenizer, which handles this case correctly.

Code area:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L257

The actual problematic line:
http://trac.webkit.org/browser/trunk/WebCore/html/PreloadScanner.cpp#L268

The loop halts and the text is checked for entities as soon as a
non-alphanumeric character is reached.  It should really only be checking when
a semicolon is reached.

This causes the entity-like part of a query string to be dropped and replaced
with a Unicode < symbol.  The mangled URL is then passed back to the preloader
looking like:
<img src="http://www.webkit.org/getImage.aspx?id=12345<_id=1"/>

The preloader then tries to fetch it with an invalid URL (which will most
likely 404).
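
The mangling can be reproduced with a small standalone model.  This is only a
sketch of the behaviour described above, not the PreloadScanner code itself:
decodeEntitiesLoosely, sampleEntities and the three-entry table are
hypothetical names made up for illustration, and &lang is mapped to U+2329 as
in the HTML 4 entity set (shown as < in this report).

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Tiny stand-in for the real entity table (assumption: only the names
// mentioned in this report are listed).
static const std::map<std::string, std::string>& sampleEntities()
{
    static const std::map<std::string, std::string> table = {
        {"amp", "&"},
        {"lt", "<"},
        {"lang", "\xE2\x8C\xA9"}, // U+2329 encoded as UTF-8
    };
    return table;
}

// Models the reported behaviour: the name after '&' ends at the first
// non-alphanumeric character, and a table hit is decoded even when no
// semicolon follows.
std::string decodeEntitiesLoosely(const std::string& input)
{
    std::string output;
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] != '&') {
            output += input[i++];
            continue;
        }
        // Collect the alphanumeric run that follows the '&'.
        std::size_t end = i + 1;
        while (end < input.size()
               && std::isalnum(static_cast<unsigned char>(input[end])))
            ++end;
        auto found = sampleEntities().find(input.substr(i + 1, end - i - 1));
        if (found == sampleEntities().end()) {
            output += input[i++]; // not a known name: keep the '&' literally
            continue;
        }
        output += found->second;
        i = end;
        if (i < input.size() && input[i] == ';')
            ++i; // swallow the terminating semicolon when there is one
    }
    return output;
}
```

Passing the src value from the example above through decodeEntitiesLoosely
replaces &lang with the bracket character and leaves _id=1 behind, which
matches the mangled URL seen by the preloader.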

Other examples where this might be problematic:
&amp_energy=100
&lt-now=10

Basically, any query string variable name that starts like an HTML entity name
and is followed by a non-alphanumeric separator is affected.

The proposed fix would be to just remove the alphanumeric check.  The semicolon
check above it should be sufficient: if there are bad entities that are too
long or don't contain a semicolon, then leave them be.
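
A minimal sketch of the semicolon-only rule, under stated assumptions: this is
not a patch against PreloadScanner.cpp, and decodeEntitiesWithSemicolon,
knownEntities and the length cap maxEntityNameLength are hypothetical names and
values chosen for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Tiny stand-in for the real entity table (assumption: only the names
// mentioned in this report are listed).
static const std::map<std::string, std::string>& knownEntities()
{
    static const std::map<std::string, std::string> table = {
        {"amp", "&"},
        {"lt", "<"},
        {"lang", "\xE2\x8C\xA9"}, // U+2329 encoded as UTF-8
    };
    return table;
}

// Hypothetical cap on the name length; the real limit may differ.
static const std::size_t maxEntityNameLength = 8;

// Decodes "&name;" only when a semicolon terminates the name.  Anything
// that is too long, unknown, or lacks a semicolon is copied through as-is.
std::string decodeEntitiesWithSemicolon(const std::string& input)
{
    std::string output;
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] == '&') {
            std::size_t semicolon = input.find(';', i + 1);
            if (semicolon != std::string::npos
                && semicolon - i - 1 <= maxEntityNameLength) {
                auto found = knownEntities().find(
                    input.substr(i + 1, semicolon - i - 1));
                if (found != knownEntities().end()) {
                    output += found->second;
                    i = semicolon + 1; // skip past "&name;"
                    continue;
                }
            }
        }
        output += input[i++]; // ordinary character or unterminated '&'
    }
    return output;
}
```

With this rule the src value from the example comes back unchanged, while
properly terminated entities such as &amp; and &lt; are still decoded.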

Build Info:
SVN Rev: 55620
Regular WebKit on Mac OS X 
Xcode 3.2.1

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


