[Webkit-unassigned] [Bug 66084] New: The XML parser doesn't ignore the BOM when HTTP Content-Type: charset=foo conflicts with it

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Aug 11 12:24:36 PDT 2011


https://bugs.webkit.org/show_bug.cgi?id=66084

           Summary: The XML parser doesn't ignore the BOM when HTTP
                    Content-Type: charset=foo conflicts with it
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: All
               URL: http://malform.no/testing/html5/bom/xml.html
        OS/Version: All
            Status: UNCONFIRMED
          Severity: Major
          Priority: P2
         Component: XML
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: xn--mlform-iua at xn--mlform-iua.no
                CC: ap at webkit.org


ISSUE: 

   WebKit fails to ignore the BOM when the charset=foo parameter of the HTTP Content-Type: header conflicts with it. In other words, it lets the BOM take precedence over the HTTP Content-Type: header. NB: This bug is present for *both* HTML and XML. Of course, the BOM is not actually a BOM unless the document uses an encoding that includes the BOM. Hence, if the HTTP Content-Type: header says "ISO-8859-1" while the document contains the BOM, then, according to the current specs, the parser should emit a FATAL ERROR.

BACKGROUND:

  See section 4.3.3 of the XML 1.0 spec.

WAYS TO REPRODUCE THIS BUG:

* Visit http://malform.no/testing/html5/bom/xml.html
    That page is served with an HTTP Content-Type: header which says "application/xhtml+xml;charset=KOI8-r". Internally, however, the page is actually UTF-8 encoded and, importantly, it also contains the BOM. When the page is read as KOI8-R encoded, as XML 1.0 requires, the BOM bytes become illegal characters before the DOCTYPE, which in turn should cause a FATAL ERROR.
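
    To illustrate why the BOM turns into illegal content, here is a minimal Python sketch. The byte string is a hypothetical stand-in for the test page (it is not fetched from the URL above), and the variable names are made up for this example. It only shows what the three UTF-8 BOM bytes look like under each decoding; it does not model a real XML parser.

```python
import codecs

# Hypothetical stand-in for the test page: UTF-8 BOM (EF BB BF) followed by a DOCTYPE.
page_bytes = codecs.BOM_UTF8 + b'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml"/>'

# Decode the way the HTTP header asks for (charset=KOI8-r).
as_koi8r = page_bytes.decode("koi8-r")

# Decode the way the BOM suggests (UTF-8, with the BOM stripped).
as_utf8 = page_bytes.decode("utf-8-sig")

# Under KOI8-R the three BOM bytes are three ordinary, non-whitespace characters
# sitting in front of the DOCTYPE, which is not allowed in a well-formed document.
junk_before_doctype = as_koi8r[: as_koi8r.index("<")]
print(len(junk_before_doctype), repr(junk_before_doctype))

# Under UTF-8 the BOM disappears and the document starts with the DOCTYPE.
print(as_utf8.startswith("<!DOCTYPE"))
```

    A parser that honours the HTTP charset sees the first form and should stop with a fatal error; a parser that honours the BOM sees the second form and renders the page.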

EXPECTED RESULT:  WebKit should obey the charset info in the HTTP Content-Type: header w.r.t. the encoding. Hence it should emit a FATAL ERROR.

ACTUAL RESULT:  WebKit instead ignores the charset info in the HTTP Content-Type: header and obeys the BOM.
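
The difference between the expected and the actual result comes down to the order in which the two encoding signals are consulted. The following Python sketch contrasts the two orders. All function names are hypothetical, the UTF-8 fallback is a simplification, and the XML encoding declaration is deliberately left out, so this is not a description of WebKit's actual code.

```python
import codecs
from typing import Optional

# Byte order marks this sketch knows about (UTF-32 omitted for brevity).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def encoding_expected(http_charset: Optional[str], data: bytes) -> str:
    """Order this report asks for: the HTTP charset wins, the BOM is a fallback."""
    if http_charset:
        return http_charset.lower()
    return sniff_bom(data) or "utf-8"  # assumed default for this sketch


def encoding_observed(http_charset: Optional[str], data: bytes) -> str:
    """Order this report describes as the actual behaviour: the BOM wins."""
    bom_encoding = sniff_bom(data)
    if bom_encoding:
        return bom_encoding
    return http_charset.lower() if http_charset else "utf-8"


page = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(encoding_expected("KOI8-r", page))  # koi8-r
print(encoding_observed("KOI8-r", page))  # utf-8
```

With the first order, the page from the test case is decoded as KOI8-R and the stray characters before the DOCTYPE trigger the fatal error; with the second, the conflict is silently resolved in favour of the BOM.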

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera also does not have this bug (unless the user manually overrides the encoding, which is another bug, and one that it shares with WebKit). libxml2 does not have this bug either. IE9, however, seems to have it too. In fact, a few XML parsers have similar issues; for more data, read http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 (As noted in that bug, I wonder whether the WebKit behaviour should become the correct one. So far that hasn't happened, and I know of at least one parser [Xerces C++] which is aligning with the specs.)

