[Webkit-unassigned] [Bug 24420] Consider doing encoding detection before decoding starts

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Mar 12 10:38:25 PDT 2009


------- Comment #2 from jshin at chromium.org  2009-03-12 10:38 PDT -------
(In reply to comment #1)

> bug 16482 comment #31
> Yes, it may well be. Are you well familiar with how Gecko encoding detector
> works? How much data does it normally need to choose an encoding with
> confidence (and how much data would you expect our detector to need)?

I have an overall understanding of Gecko's algorithms (byte unigram and byte
bigram), but I don't know the answers to the questions you asked. However, I
know two (or three) of the people who developed them and can ask them as a
shortcut instead of reading the source code.

As for ICU's encoding detector, it uses byte trigrams, and my not-so-scientific
experiment indicates that it can be pretty reliable after 1kB or so. Needless
to say, if the first 1kB is (almost) entirely made up of ASCII bytes, it will
not work as well. ICU gives back a confidence level, but the way it's
calculated is not very elaborate, so we need to employ some heuristics when
using it.
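
To make the kind of heuristic I have in mind concrete, here is a rough Python
sketch. It is only an illustration, not what WebKit or ICU actually does. The
detector is passed in as a callable returning (name, confidence) with
confidence on ICU's 0-100 scale; the names choose_encoding, MIN_NON_ASCII and
MIN_CONFIDENCE and both threshold values are made up for the example. It also
ignores 7-bit encodings such as ISO-2022-JP, which would need separate
handling.

```python
from typing import Callable, Optional, Tuple

# All three constants are assumptions for illustration, not tuned values.
SAMPLE_SIZE = 1024      # roughly the 1kB mentioned above
MIN_NON_ASCII = 16      # hypothetical: minimum non-ASCII bytes in the sample
MIN_CONFIDENCE = 50     # hypothetical: cutoff on the detector's 0-100 scale

# A detector takes raw bytes and returns (encoding name, confidence 0-100).
Detector = Callable[[bytes], Tuple[str, int]]


def non_ascii_count(data: bytes) -> int:
    """Count bytes with the high bit set."""
    return sum(1 for b in data if b >= 0x80)


def choose_encoding(data: bytes, detect: Detector) -> Optional[str]:
    """Return a detected encoding name, or None if the evidence is too weak.

    None means the caller should wait for more data or fall back to a
    default encoding.
    """
    sample = data[:SAMPLE_SIZE]
    # An (almost) all-ASCII sample gives the detector nothing to work with.
    if non_ascii_count(sample) < MIN_NON_ASCII:
        return None
    name, confidence = detect(sample)
    # Do not trust a low-confidence guess.
    if confidence < MIN_CONFIDENCE:
        return None
    return name
```

A caller would wrap whatever detector is available (ICU's, for instance) in
the detect callable and keep feeding data until choose_encoding stops
returning None or the document ends.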
