[Webkit-unassigned] [Bug 66185] New: Sniff UTF-8 instead of defaulting to WINDOWS-1252 (or other locale defaults)

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Sat Aug 13 06:30:55 PDT 2011


https://bugs.webkit.org/show_bug.cgi?id=66185

           Summary: Sniff UTF-8 instead of defaulting to WINDOWS-1252 (or
                    other locale defaults)
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: All
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
        OS/Version: All
            Status: UNCONFIRMED
          Severity: Major
          Priority: P2
         Component: HTML DOM
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: xn--mlform-iua at xn--mlform-iua.no
                CC: ap at webkit.org


ISSUE: 

   When an HTML page is UTF-8 encoded, but there is no page-internal encoding declaration, no BOM and no accompanying external encoding info in HTTP or MIME, WebKit defaults to WINDOWS-1252 instead of sniffing the encoding as UTF-8.

BACKGROUND:

  HTML5's encoding sniffing algorithm, step 7, states:

]]
    The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]

    [Note:] The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User-agents are therefore encouraged to search for this common encoding. [PPUTF8] [UTF8DET]
[[
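
A minimal sketch, in Python, of the kind of check the note above describes. The function name looks_like_utf8 is hypothetical; this is neither WebKit code nor the algorithm from the spec. It only illustrates that a buffer containing bytes above 0x7F either decodes as valid UTF-8 or does not.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data contains non-ASCII bytes and is valid UTF-8.

    Hypothetical helper for illustration only.
    """
    # Pure ASCII gives no evidence for or against UTF-8.
    if not any(b > 0x7F for b in data):
        return False
    try:
        # Strict decoding rejects any byte sequence that breaks the UTF-8 pattern.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A real sniffer would usually look at only a prefix of the stream, which can end in the middle of a multi-byte sequence; this sketch assumes the whole buffer is available and would report such a truncated prefix as not UTF-8.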

HOW TO REPRODUCE THIS BUG:

1. Verify that WebKit's encoding choice is set to Default (or Automatic).
2. Open the page http://malform.no/testing/html5/bom/normal-HTML-BOMless-HTTPcharsetLESS
   (That HTML page has no BOM, no encoding info in the HTTP Content-Type: header, and no internal encoding declaration.)
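
The three preconditions for the test page can be checked roughly as follows. This is a sketch in Python under stated assumptions: the function name declared_encoding_sources is hypothetical, the meta check is a crude regular expression rather than the spec's prescan, and the 1024-byte window is my reading of the prescan limit. The caller is assumed to have already fetched the Content-Type header value and the response body.

```python
import codecs
import re
from typing import List

# Crude stand-in for the spec's <meta> prescan (assumption, not the real algorithm).
META_CHARSET = re.compile(rb"<meta[^>]*charset", re.IGNORECASE)

BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def declared_encoding_sources(content_type: str, body: bytes) -> List[str]:
    """List which explicit encoding declarations are present (hypothetical helper)."""
    found = []
    if "charset=" in content_type.lower():
        found.append("http")   # external info in the Content-Type header
    if body.startswith(BOMS):
        found.append("bom")    # byte order mark at the start of the body
    if META_CHARSET.search(body[:1024]):
        found.append("meta")   # page-internal declaration near the top
    return found
```

An empty list means the page has none of the three declarations, which is the situation this bug is about.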

EXPECTED RESULT:  WebKit should sniff the page as UTF-8 encoded.

ACTUAL RESULT:  WebKit instead defaults to WINDOWS-1252 (more correctly: to the default encoding for the current locale)
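
For illustration, the visible effect of this fallback on non-ASCII text can be reproduced in Python. The helper name misread_as_windows_1252 is hypothetical; it encodes text as UTF-8 and then decodes the bytes as windows-1252, which is what a browser effectively does when it guesses wrong.

```python
def misread_as_windows_1252(text: str) -> str:
    """Show how UTF-8 text looks when decoded as windows-1252 (hypothetical helper)."""
    raw = text.encode("utf-8")
    # errors="replace" because a few byte values are unassigned in Python's cp1252 table.
    return raw.decode("windows-1252", errors="replace")


print(misread_as_windows_1252("æ"))  # prints "Ã¦"
```

Each non-ASCII character turns into two or more unrelated characters, while ASCII text is unaffected.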

COMMENTS:

 * By default, Chrome, Opera and IE (at least version 8) do *NOT* have this bug
 * By default, Firefox *DOES* have this bug
