[Webkit-unassigned] [Bug 66085] New: The HTML parser doesn't ignore the BOM when the HTTP charset=foo conflicts with it

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Aug 11 12:42:46 PDT 2011


https://bugs.webkit.org/show_bug.cgi?id=66085

           Summary: The HTML parser doesn't ignore the BOM when the HTTP
                    charset=foo conflicts with it
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: All
               URL: http://malform.no/testing/html5/bom/htm.html
        OS/Version: All
            Status: UNCONFIRMED
          Severity: Major
          Priority: P2
         Component: Layout and Rendering
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: xn--mlform-iua at xn--mlform-iua.no
                CC: ap at webkit.org


ISSUE: 

   Webkit fails to ignore the BOM when the charset=foo parameter of the HTTP Content-Type: header conflicts with it. In other words: it lets the BOM take precedence over the HTTP Content-Type: header. NB: This bug is present for *both* HTML and XML. Of course, the BOM is not actually a BOM unless the document uses an encoding which includes the BOM. Hence, if the HTTP Content-Type: says "ISO-8859-1" while the document contains the BOM, then, according to current specs, the parser should land in *QUIRKS-MODE*, due to the presence of the illegal "BOM" before the DOCTYPE.

BACKGROUND:

  HTML5 requires the charset=FOO parameter of the HTTP Content-Type header to take precedence over page-internal information, including the BOM.
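
  The two precedence orders at stake can be sketched in a few lines of Python. This is only an illustration of the rule as this report describes it, not the full HTML5 encoding-sniffing algorithm (no meta prescan, no user override, no defaults). The function names are hypothetical and exist only for this sketch.

```python
import codecs

# Byte order marks recognised by this sketch, mapped to an encoding label.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(body):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None


def encoding_per_report(http_charset, body):
    """Order this report says HTML5 requires: the HTTP charset wins over the BOM."""
    if http_charset:
        return http_charset.lower()
    return sniff_bom(body)


def encoding_observed_in_webkit(http_charset, body):
    """Order this report observes in Webkit: the BOM wins over the HTTP charset."""
    bom_encoding = sniff_bom(body)
    if bom_encoding:
        return bom_encoding
    return http_charset.lower() if http_charset else None


page = codecs.BOM_UTF8 + b"<!DOCTYPE html><title>t</title>"
print(encoding_per_report("KOI8-R", page))          # koi8-r
print(encoding_observed_in_webkit("KOI8-R", page))  # utf-8
```

  The two functions only disagree when both a charset parameter and a BOM are present, which is exactly the case the test page exercises.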

WAYS TO REPRODUCE THIS BUG:

* Visit http://malform.no/testing/html5/bom/htm.html
    That page is served with an HTTP Content-Type: header which says "text/html;charset=KOI8-r". However, the page is actually UTF-8 encoded, and - importantly - it also contains the BOM. But when read as KOI8-R encoded, as HTML5 requires, the BOM bytes become illegal characters before the DOCTYPE, which in turn should cause QUIRKS-MODE.
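
    To make the byte-level effect concrete, here is a small Python sketch. It decodes the same bytes once as UTF-8 and once as KOI8-R and shows what ends up in front of the DOCTYPE. The sample markup and the helper name text_before_doctype are made up for illustration; the real test page's content is not reproduced here.

```python
# The UTF-8 BOM is the three bytes EF BB BF.
BOM = b"\xef\xbb\xbf"
page = BOM + b"<!DOCTYPE html><title>t</title>"


def text_before_doctype(text):
    """Return whatever characters precede the DOCTYPE in the decoded text."""
    return text[: text.index("<!DOCTYPE")]


as_utf8 = page.decode("utf-8")    # what the BOM itself suggests
as_koi8r = page.decode("koi8_r")  # what the HTTP header asks for

# Decoded as UTF-8, the three bytes collapse into the single character U+FEFF.
print(repr(text_before_doctype(as_utf8)))

# Decoded as KOI8-R, each byte becomes its own visible, non-whitespace
# character, so the DOCTYPE is no longer the first thing the parser sees.
print(repr(text_before_doctype(as_koi8r)))
```

    Since only whitespace may be skipped before the DOCTYPE, three ordinary characters in that position are what the report expects to push the parser into quirks mode.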

EXPECTED RESULT:  Webkit should obey the charset info in the HTTP Content-Type: header w.r.t. the encoding. Hence it should land in QUIRKS-MODE.

ACTUAL RESULT:  Webkit instead ignores the charset info in the HTTP Content-Type: header and obeys the BOM.

COMMENTS:

[BOM CAUSES UAs TO NOT PERMIT USERS TO OVERRIDE THE ENCODING:] 
    For HTML, unlike for XML, the user is permitted to override the encoding. However, when the page includes the BOM, IE (IE6 to IE9) and Webkit browsers do not allow the user to override the encoding. This is, in my view, a good thing - and I don't want to change it! However, to be in accordance with what is currently specified in HTTP and in HTML5, the charset info coming from HTTP should actually take precedence when it differs from the BOM - so that's a detail that perhaps should be changed.

[OVERVIEW - OTHER PARSERS:]
    Firefox does not have this bug - Firefox also lets users override the encoding even when there is a BOM. Opera behaves like Firefox (but Opera makes a special exception for ISO-8859-1 for some reason ... see http://malform.no/testing/html5/bom/). It seems that IE6 to IE9 behave like Webkit too. In fact, there are a few HTML parsers with similar issues; for more data, read http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 (As noted in that bug, I wonder if the Webkit behaviour should become the correct one ... )

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


