[Webkit-unassigned] [Bug 179303] UMBRELLA: text encoding oddities

Sun Nov 5 17:02:22 PST 2017

https://bugs.webkit.org/show_bug.cgi?id=179303

--- Comment #7 from Darin Adler <darin at apple.com> ---
I have noticed these anomalies, too. I was looking into this myself recently after the work I did on bug 178207. Many can be explained by some of this history:

- The TEC decoder was the first one in WebKit. We created it before the rest of them, and at the time our goal was to support all encodings that TEC knew how to decode, rather than to select an appropriate set for the web. Many strange things about encodings are due to the fact that we have treated it as the "main" decoder on Mac, rather than using it only for a few special purposes. It seems likely that we would not need it at all, since ICU should have the support we need.

- Most of the character set names used by the TEC decoder were based on the IANA character set assignments <https://www.iana.org/assignments/character-sets/>. A snapshot of the IANA assignments (about 11 years old) still exists in the source tree at Source/WebCore/platform/text/mac/character-sets.txt and is used by the script that generates the character set name table used by the TEC decoder. This file is where most of the aliases came from.

- There is a separate list of additional encoding names at Source/WebCore/platform/text/mac/mac-encodings.txt that are also used for the TEC encoder on the Mac. Even back when this was last modified in 2009, the status of this file was "we would like to get rid of it".

I think we should eliminate as many encoding names as we can, and synchronize with the encoding specification. Any encoding name that we decide to continue to support but that is not mentioned in the specification needs a really good rationale; perhaps such encoding names can be limited to the context where they are required or, alternatively, added to the encoding specification and to other web browser engines.

I don’t know how to best determine how eliminating support for certain encoding names or changing canonical names (which I think mainly affects encoding names in form submission?) will affect website and app compatibility.

I suspect that if we eliminate those unneeded encoding names, we will find that we can easily eliminate the TEC decoder entirely, and that many of the anomalies above will simply disappear. When doing this work and removing the TEC decoder, we should be aware that a decoder can add aliases that affect even encodings that are actually supported only by other decoders.
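
To illustrate that last point, here is a minimal sketch of how one decoder's aliases can affect an encoding that only another decoder implements. It is a simplified, hypothetical model, not WebKit's actual registry code: the class, the function names, the "modern" and "legacy" decoder labels, and the alias "x-hypothetical-big5" are all made up for illustration.

```python
# Simplified, hypothetical model of an encoding-name registry.
# Names and structure are illustrative assumptions, not WebKit's real code.

class EncodingRegistry:
    def __init__(self):
        self.alias_to_canonical = {}    # lowercased name or alias -> canonical name
        self.canonical_to_decoder = {}  # canonical name -> decoder that implements it

    def register_names(self, canonical, aliases=()):
        # Any decoder may contribute names, even for encodings it does not decode.
        for name in (canonical, *aliases):
            self.alias_to_canonical[name.lower()] = canonical

    def register_decoder(self, canonical, decoder):
        self.canonical_to_decoder[canonical] = decoder

    def decoder_for(self, label):
        canonical = self.alias_to_canonical.get(label.strip().lower())
        if canonical is None:
            return None  # unknown name: no decoder can be selected
        return self.canonical_to_decoder.get(canonical)


def build_registry(include_legacy):
    registry = EncodingRegistry()
    # The "modern" decoder registers Big5 and is the only one that decodes it.
    registry.register_names("Big5", ["csbig5"])
    registry.register_decoder("Big5", "modern")
    if include_legacy:
        # The "legacy" decoder only contributes an extra alias for Big5.
        registry.register_names("Big5", ["x-hypothetical-big5"])
    return registry


with_legacy = build_registry(include_legacy=True)
without_legacy = build_registry(include_legacy=False)

# With the legacy decoder present, its alias resolves to the modern decoder.
print(with_legacy.decoder_for("x-hypothetical-big5"))
# After removing the legacy decoder, the same name no longer resolves.
print(without_legacy.decoder_for("x-hypothetical-big5"))
```

In this model, removing the legacy decoder changes which names resolve to Big5 even though the legacy decoder never decoded Big5 itself, so an audit of the names each decoder registers would have to precede the removal.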

As background for why item (2) might currently be as it is, there are three Big5-encoding-related changes in 2003, all done by me and reviewed by you, Maciej; easy to find by searching for "Big5" in Source/WebCore/ChangeLog-2003-10-25.
