[Webkit-unassigned] [Bug 21990] New: When a rare EUC-JP character is present, explicitly (and correctly) labelled EUC-JP document is mistreated as Shift_JIS

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Oct 30 16:24:57 PDT 2008


https://bugs.webkit.org/show_bug.cgi?id=21990

           Summary: When a rare EUC-JP character is present, explicitly (and
                    correctly) labelled EUC-JP document is mistreated as
                    Shift_JIS
           Product: WebKit
           Version: 528+ (Nightly build)
          Platform: All
               URL: http://www.google.com/search?hl=en&inlang=ja&ie=EUC-JP&oe=EUC-JP&q=%8F%A2%C3&btnG=Search
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: P2
         Component: Page Loading
        AssignedTo: webkit-unassigned at lists.webkit.org
        ReportedBy: jshin at chromium.org
 BugsThisDependsOn: 16482


1. Go to 
http://www.google.com/search?hl=en&inlang=ja&ie=EUC-JP&oe=EUC-JP&q=%8F%A2%C3&btnG=Search

(The page is explicitly and correctly labelled as EUC-JP in the HTTP
Content-Type header field.)

2. You'll see '召テ' instead of the expected '¦'.

3. The latter is represented as the three bytes 0x8F 0xA2 0xC3 in EUC-JP.

The Japanese encoding detector in TextResourceDecoder.cpp is fooled by the
0x8F byte and misdetects the document as Shift_JIS.
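
To illustrate why these bytes are ambiguous, here is a minimal sketch of two
structural validators. It is not the detector in TextResourceDecoder.cpp; the
function names are hypothetical, and the checks cover byte ranges only (they
do not verify that a code point is actually assigned). In EUC-JP, 0x8F is the
SS3 prefix that introduces a three-byte JIS X 0212 character, while in
Shift_JIS 0x8F is an ordinary lead byte and 0xC3 is a single-byte half-width
katakana, so both validators accept 0x8F 0xA2 0xC3.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if `bytes` is structurally valid EUC-JP.
bool isStructurallyValidEucJp(const std::vector<std::uint8_t>& bytes)
{
    auto isHigh = [](std::uint8_t b) { return b >= 0xA1 && b <= 0xFE; };
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t b = bytes[i];
        const std::size_t remaining = bytes.size() - i;
        if (b <= 0x7F) {
            i += 1; // ASCII
        } else if (b == 0x8E) {
            // SS2: one half-width katakana byte (0xA1..0xDF) follows.
            if (remaining < 2 || bytes[i + 1] < 0xA1 || bytes[i + 1] > 0xDF)
                return false;
            i += 2;
        } else if (b == 0x8F) {
            // SS3: two JIS X 0212 bytes (0xA1..0xFE each) follow.
            if (remaining < 3 || !isHigh(bytes[i + 1]) || !isHigh(bytes[i + 2]))
                return false;
            i += 3;
        } else if (isHigh(b)) {
            // JIS X 0208: two bytes, both in 0xA1..0xFE.
            if (remaining < 2 || !isHigh(bytes[i + 1]))
                return false;
            i += 2;
        } else {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: true if `bytes` is structurally valid Shift_JIS.
bool isStructurallyValidShiftJis(const std::vector<std::uint8_t>& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const std::uint8_t b = bytes[i];
        // ASCII range or single-byte half-width katakana.
        if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF)) {
            i += 1;
            continue;
        }
        const bool isLead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        if (!isLead || i + 1 >= bytes.size())
            return false;
        const std::uint8_t t = bytes[i + 1];
        const bool isTrail = (t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFC);
        if (!isTrail)
            return false;
        i += 2;
    }
    return true;
}
```

Calling both functions on {0x8F, 0xA2, 0xC3} returns true in each case. Byte
structure alone therefore cannot settle the question, and any heuristic that
reads 0x8F as a Shift_JIS lead byte will pick the wrong encoding here.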

I think this logic for invoking JapaneseEncoding detector is too liberal:

if (m_source != UserChosenEncoding && m_source != AutoDetectedEncoding && encoding().isJapanese())

No encoding detector is perfect, and I'd rather not invoke any encoding
detector (Unicode BOM detection can be an exception) for documents with an
explicit charset declaration (HTTP header or meta tag).  After resolving bug
16482 (ICU encoding detector hook-up), I'll revisit this issue.
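
As a rough sketch of the stricter policy suggested above, the two predicates
below contrast the quoted condition with one that also skips detection when
the charset was declared explicitly. The enum and function names are
hypothetical stand-ins for illustration, not WebKit's actual declarations, and
this is not a proposed patch.

```cpp
// Hypothetical stand-in for the decoder's notion of where the encoding came from.
enum class EncodingSource {
    Default,        // nothing declared; fallback encoding in use
    AutoDetected,   // a detector already chose the encoding
    FromMetaTag,    // declared in a <meta> element
    FromHTTPHeader, // declared in the HTTP Content-Type header
    UserChosen,     // picked manually by the user
};

// Mirrors the condition quoted in this report: the detector runs for any
// Japanese encoding unless the user chose it or it was already auto-detected.
bool shouldRunJapaneseDetectorCurrent(EncodingSource source, bool isJapanese)
{
    return source != EncodingSource::UserChosen
        && source != EncodingSource::AutoDetected
        && isJapanese;
}

// Stricter variant (an assumption about what the fix could look like):
// additionally skip detection when the charset was explicitly declared.
bool shouldRunJapaneseDetectorStricter(EncodingSource source, bool isJapanese)
{
    const bool explicitlyDeclared = source == EncodingSource::FromHTTPHeader
        || source == EncodingSource::FromMetaTag;
    return shouldRunJapaneseDetectorCurrent(source, isJapanese) && !explicitlyDeclared;
}
```

For the page in this report (EUC-JP declared in the HTTP header), the first
predicate returns true and the second returns false, so the declared encoding
would be left alone under the stricter variant.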

