[Webkit-unassigned] [Bug 179303] New: UMBRELLA: text encoding oddities

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Sun Nov 5 15:20:19 PST 2017


https://bugs.webkit.org/show_bug.cgi?id=179303

            Bug ID: 179303
           Summary: UMBRELLA: text encoding oddities
           Product: WebKit
           Version: Safari Technology Preview
          Hardware: Unspecified
                OS: Unspecified
            Status: NEW
          Severity: Normal
          Priority: P2
         Component: Text
          Assignee: webkit-unassigned at lists.webkit.org
          Reporter: mjs at apple.com
                CC: mmaxfield at apple.com

Created attachment 326079

  --> https://bugs.webkit.org/attachment.cgi?id=326079&action=review

The Encoding Standard's set of encodings

WebKit has numerous oddities in its text encoding support, both compared to the definitions in the Encoding Standard <https://encoding.spec.whatwg.org> and compared across platforms.

I made a script to spot inconsistencies, and I'll attach both that and its output.

Here are some of the odd things I noticed:

(1) WebKit thinks the canonical name for EUC-KR/windows-949 is windows-949, but the standard thinks the canonical name is EUC-KR. Not sure if this matters. Probably WebKit should be consistent with the standard unless there is some reason not to.

(2) WebKit thinks big5-hkscs is Big5-HKSCS, while the standard thinks it is Big5. (WebKit has a bunch of other names for Big5-HKSCS which the standard doesn't seem to recognize.) If Big5 and Big5-HKSCS are not the same, then this is almost surely an interop bug and must be fixed either in WebKit or in the Encoding Standard.

(3) WebKit recognizes various extra names for encodings that are in the standard. For example, we recognize "windows-10007" and "maccyrillic" as "x-maccyrillic", but the Encoding Standard only recognizes "x-maccyrillic" and "x-mac-ukrainian" as names for that encoding. There are many more like this. It's not clear if these are WebKit bugs or spec bugs.

(4) WebKit on all platforms knows some encodings that the Encoding Standard doesn't recognize at all: x-mac-turkish, UTF-32BE, UTF-32LE, UTF-32, Big5-HKSCS, x-mac-centraleurroman and x-mac-greek. At least UTF-32 is likely to be an interop and security problem. Big5-HKSCS is a problem for reasons noted in (2). I don't know if the others are WebKit bugs or spec bugs.

(5) macOS WebKit knows extra aliases for encodings that are known to both cross-platform WebKit and the Encoding Standard. For example, macOS WebKit recognizes ['iso-8859-14', 'iso8859141998', 'isoceltic', 'isoir199', 'l8', 'latin8'] as extra names for "ISO-8859-14", but the standard only recognizes ["iso-8859-14", "iso8859-14", "iso885914"]. There are quite a few like this. It's not clear why these names are needed, and why only on macOS.

(6) macOS WebKit knows a number of extra encodings that aren't known to other WebKit ports and which are not in the Encoding Standard. There are about 30 of these so I won't list them all here. These are all implemented via TEC rather than ICU, though it's possible some have ICU implementations available. Maybe some of these are required for legacy reasons but it's not at all clear which.

(7) macOS WebKit uses TEC for some encodings that use ICU on all other platforms (including iOS), for example iso-8859-16. It's not clear why. These should probably be switched to use the ICU implementation on macOS.

(8) iOS WebKit supports a number of extra encodings via ICU, with a comment claiming this is due to lack of TEC on the iOS platform. However, these encodings aren't supported in any other WebKit port, not even macOS via TEC. It's not clear if these are required for anything.
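
To make the kind of comparison behind (3) and (5) concrete, here is a minimal sketch of how two label tables could be diffed. It is not the attached script. The function name compare_labels and the two dictionaries are hypothetical, and the only data used is the ISO-8859-14 example quoted in (5).

```python
def compare_labels(webkit, standard):
    """Diff two {encoding name: [labels]} tables.

    Returns {name: (extra, missing)} for every encoding whose label sets
    differ, where `extra` lists labels only in `webkit` and `missing`
    lists labels only in `standard`. Labels are compared case-insensitively.
    """
    report = {}
    for name in sorted(set(webkit) | set(standard)):
        ours = {label.lower() for label in webkit.get(name, ())}
        theirs = {label.lower() for label in standard.get(name, ())}
        if ours != theirs:
            report[name] = (sorted(ours - theirs), sorted(theirs - ours))
    return report


# Hypothetical tables holding only the labels quoted in item (5).
MACOS_WEBKIT = {
    "ISO-8859-14": ["iso-8859-14", "iso8859141998", "isoceltic",
                    "isoir199", "l8", "latin8"],
}
ENCODING_STANDARD = {
    "ISO-8859-14": ["iso-8859-14", "iso8859-14", "iso885914"],
}

report = compare_labels(MACOS_WEBKIT, ENCODING_STANDARD)
for name, (extra, missing) in report.items():
    print(name, "extra:", extra, "missing:", missing)
```

With real tables for each port and for the standard, the same diff would also surface encodings known to only one side, as in (4), (6) and (8), since a name absent from one table is treated as having no labels there.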

-- 
You are receiving this mail because:
You are the assignee for the bug.