[Webkit-unassigned] [Bug 66055] New: The XML parser doesn't (always) default to UTF-8 when HTTP charset or encoding declaration is lacking

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Aug 11 07:16:37 PDT 2011


https://bugs.webkit.org/show_bug.cgi?id=66055

           Summary: The XML parser doesn't (always) default to UTF-8 when
                    HTTP charset or encoding declaration is lacking
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: All
               URL: http://malform.no/testing/html5/bom/normal-XML-BOMless-HTTPcharsetLESS
        OS/Version: All
            Status: UNCONFIRMED
          Severity: Major
          Priority: P2
         Component: XML
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: xn--mlform-iua at xn--mlform-iua.no


ISSUE: 

   Webkit fails to *always* assume that UTF-8 is the default encoding of an XML file for which explicit external or internal encoding information is lacking.

BACKGROUND:

   According to section 4.3.3 of the XML 1.0 spec, documents that are not served with - or do not contain - explicit encoding information MUST be encoded as either UTF-16 or UTF-8:

]]
  In the absence of external character encoding information (such as MIME 
  headers), parsed entities which are stored in an encoding other than 
  UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text 
  Declaration) containing an encoding declaration
[[ 

(Note that the encoding known as  "UTF-16" always includes the BOM - which in principle is a form of explicit encoding declaration.) 

Further down in the same section it is stated that  when a page is not served with - or does not contain -  explicit encoding information, including when it does not contain the BOM, then it is a FATAL ERROR if the page is not encoded as UTF-8:

]]
   In the absence of information provided by an external transport protocol 
   (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding
   declaration to be presented to the XML processor in an encoding other 
   than that named in the declaration, or for an entity which begins with 
   neither a Byte Order Mark nor an encoding declaration to use an encoding
   other than UTF-8.
[[
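
For illustration, the decision order described above can be sketched in Python. This is only a minimal sketch, not how Webkit or any particular XML parser implements it: the function name xml_encoding and its parameters (http_charset, user_default) are hypothetical, only the UTF-8 and UTF-16 BOMs are recognised, and the declaration is only looked for in ASCII-compatible input. The point is the last line: the user's default encoding is never consulted.

```python
import re

# BOM signatures handled by this sketch (UTF-32 is deliberately left out).
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

# Encoding declaration inside an XML/text declaration at the very start.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_encoding(body, http_charset=None, user_default=None):
    """Pick the encoding of an XML entity (hypothetical helper).

    user_default stands for the browser's "Text Encodings" menu choice.
    It is accepted only to show that it plays no role in the decision.
    """
    # 1) External information, e.g. charset= in the HTTP Content-Type header.
    if http_charset:
        return http_charset.lower()
    # 2) Byte Order Mark.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3) Encoding declaration in the document itself.
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4) Nothing explicit: XML 1.0 section 4.3.3 leaves only UTF-8.
    return "utf-8"
```

With these assumptions, a BOM-less document without declaration and without HTTP charset yields "utf-8" no matter what is passed as user_default.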

STEPS TO REPRODUCE THIS BUG:

1) In a browser in the Webkit family (including nightly build), go to the "Text Encodings" submenu of the View menu and select something other than "Default" or "UTF-8". (I will assume that you select "KOI8-R".) This step changes - for the current window or tab - the default encoding from "Default/Automatic" to the encoding that you selected.

2) Now, within the same window or tab, visit this page:
    http://malform.no/testing/html5/bom/normal-XML-BOMless-HTTPcharsetLESS

    That page has the following features:
      a) XHTML page
      b) served as application/xhtml+xml in the HTTP Content-Type: header
      c) served *without* the charset=foo attribute in the HTTP Content-Type: header
      d) *no* BOM (byte order mark) in the document
      e) *no* encoding declaration (<?xml version="1.0" encoding="UTF-8" ?>) in the document

EXPECTED RESULTS:  Webkit should ignore the fact that the user changed the default encoding to KOI8-R and instead, in accordance with section 4.3.3 of XML 1.0, assume that the encoding of the page is "UTF-8".

ACTUAL RESULTS:  Webkit instead respects the user's choice of default encoding (i.e. it renders the page as KOI8-R), and it does not display a fatal error.
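
To illustrate what the wrong default does to the text, here is a small Python sketch that decodes the same UTF-8 bytes once as KOI8-R and once as UTF-8. The sample string is hypothetical and is not taken from the test page; "koi8_r" is the name of Python's standard KOI8-R codec.

```python
# Hypothetical sample text with non-ASCII letters (not from the test page).
text = "blåbærsyltetøy"

# What the server sends: UTF-8 bytes, with no encoding information attached.
utf8_bytes = text.encode("utf-8")

# What the user sees when the bytes are interpreted as KOI8-R:
# every byte of a multi-byte UTF-8 sequence becomes its own character.
as_koi8r = utf8_bytes.decode("koi8_r")

# What the user should see according to XML 1.0 section 4.3.3.
as_utf8 = utf8_bytes.decode("utf-8")

print("KOI8-R:", as_koi8r)
print("UTF-8: ", as_utf8)
```

Only the UTF-8 interpretation gives back the original string. The KOI8-R interpretation is longer than the original, because each non-ASCII letter turns into two unrelated characters.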

COMMENTS:

[OTHER PARSERS:] Firefox does not have this bug. Opera *does* have a similar bug. I don't know if IE9 has this bug. I don't think XML parsers in general (e.g. libxml2) have this bug.

[RELEVANCE:] Because XML must default to UTF-8 in the absence of other information from the server or from the page itself, Polyglot Markup [1] states that one does not need to declare the encoding for XML parsers. However, as long as Webkit does not abide by XML 1.0's default of UTF-8, Polyglot Markup's advice does not really hold. Thus the only way that works is to use the BOM - which, however, some are against using. [2]

[1] http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
[2]  http://www.w3.org/Bugs/Public/show_bug.cgi?id=13392
