[webkit-dev] WebKit compatibility in India

Brett Wilson brettw at chromium.org
Wed Oct 22 14:21:02 PDT 2008


Hi everybody,

There was recently somewhat of a controversy regarding Embedded
OpenType (EOT) support in WebKit. The most important reason to support
this technology is not for web designers who want custom fonts, but
because some sites built on legacy technology use a custom encoding
with a custom embedded font to display their non-Latin characters. Most
of these sites are based in India or written in Indic languages.

I am very much not an expert in this area. My goal is to start a
discussion about "what to do about Indic compatibility" rather than
"should EOT be supported" in WebKit. Just supporting EOT in WebKit
would make the sites appear correctly, but it would not address some
of the basic problems like copy and paste or Google Chrome's full-text
indexing feature.

Waiting for the sites to fix themselves and evangelizing (basically what
all browsers are doing now) is one option, and it has apparently had
some success. However, some sites seem to be stuck on old technology, so
not using Unicode may not be a deliberate choice on their part. Sticking
with this plan may make WebKit adoption possible in the long term, but
it would not help very much in the short term.

Google Search does some special detection to transcode sites that use
these custom encodings. One approach would be to do the same in the
browser. The browser would contain a list of problem domains and, for
each custom 8-bit encoding, a character map table that maps it to
Unicode (hopefully there are many fewer encodings than sites).
Alternatively, it could key off the font name, if those names turn out
to be unique enough to identify the encoding (does anybody know whether
this is the case?). Every incoming page would first be checked against
this list, and a match would trigger the converter. I found a list of
~100 popular sites that require special encodings that we could start
with.
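
To make the shape of that concrete, here is a minimal sketch, assuming a
simple one-to-one byte-to-code-point mapping; none of the names below
are real WebKit APIs, and real legacy Indic fonts often need one-to-many
mappings and reordering, so treat it purely as an illustration of the
lookup:

#include <map>
#include <string>

// Hypothetical types only; not real WebKit classes.
struct LegacyEncoding {
    // codePoint[b] is the Unicode code point for legacy byte b;
    // 0 means "no mapping", in which case the byte passes through.
    char32_t codePoint[256];
};

// Keyed by domain; could equally be keyed by the legacy font name if
// those turn out to be unique enough.
static const std::map<std::string, const LegacyEncoding*>& siteEncodings()
{
    // The real data would come from the character map tables for the
    // custom fonts; left empty in this sketch.
    static const std::map<std::string, const LegacyEncoding*> table;
    return table;
}

std::u32string transcodeLegacyText(const std::string& domain,
                                   const std::string& bytes)
{
    auto it = siteEncodings().find(domain);
    if (it == siteEncodings().end())
        return std::u32string();  // Not a known legacy site; leave the text alone.

    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        char32_t c = it->second->codePoint[b];
        out.push_back(c ? c : b);
    }
    return out;
}

Keying the outer table off the font name instead of the domain would
only change which string gets looked up.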

Doing this conversion has several challenges:

- It could not be blindly applied to all pages on a site. Many of the
sites have English pages that we wouldn't want to convert, and if a site
ever fixes itself to use a standard encoding, we would want to pick that
up automatically. Some pages declare the charset as "x-user-defined",
while others list something else (I saw ISO-8859-1, but there may be
more). I think there would need to be a somewhat smart encoding detector
here (similar to today's automatic charset detection).

- It could not be blindly applied to all content on a single page. Many
of the pages are a mix of custom-encoded text using an EOT font and
English (or another language) using a different font. For example, see
http://www.futuresamachar.com/fs/hindi/index.htm ("Duration", "By Post",
etc. on the right are coded to use "Verdana" to get the regular encoding
and would be corrupted if a transcoder were applied to the entire page).
This makes me wonder what integration with WebKit would look like, since
depending on CSS means the conversion couldn't just be applied in the
normal character set conversion phase during parsing (a rough sketch
follows this list).
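
Very roughly, and with stand-in types rather than anything that exists
in WebKit today (the font list and mapping table are assumptions, not
real data), keying the conversion off the resolved font might look like
this:

#include <set>
#include <string>
#include <vector>

struct TextRun {
    std::string fontFamily;  // Font family resolved from CSS for this run.
    std::string bytes;       // 8-bit text exactly as the page's charset decoded it.
    std::u32string unicode;  // Filled in only if the run gets transcoded.
};

void transcodePage(std::vector<TextRun>& runs,
                   const std::set<std::string>& legacyFonts,
                   const char32_t (&toUnicode)[256])
{
    for (auto& run : runs) {
        // Runs styled with an ordinary font (e.g. the "Verdana" labels on
        // futuresamachar.com) are left exactly as they are.
        if (!legacyFonts.count(run.fontFamily))
            continue;

        run.unicode.clear();
        for (unsigned char b : run.bytes)
            run.unicode.push_back(toUnicode[b] ? toUnicode[b] : b);
    }
}

The awkward consequence is that the conversion can only happen after
style resolution, one text run at a time, rather than in the decoder
during parsing, which is exactly what makes the integration question
hard.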

Are there other approaches that WebKit-based browsers could take to get
better compatibility with Indic sites? What problems do people more
familiar with this area see with the transcoding approach? Could it be
implemented cleanly, and would a whitelist ever have a hope of covering
the sites that Indian users care about? Or should we continue with
evangelism and wait?

Brett

