[Webkit-unassigned] [Bug 22166] HTML entities for surrogate pair codepoints cause rendering issues

Tue Nov 25 11:48:16 PST 2008

https://bugs.webkit.org/show_bug.cgi?id=22166

jshin at chromium.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|Mac OS X 10.5               |All
           Platform|Macintosh Intel             |All

------- Comment #14 from jshin at chromium.org  2008-11-25 11:48 PDT -------
Unicode 5.1 section 3.2 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf)
has the following conformance requirement:

C10 When a process interprets a code unit sequence which purports to be in a
Unicode character encoding form, it shall treat ill-formed code unit sequences
as an error condition and shall not interpret such sequences as characters.
• For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
is ill-formed and must never be generated. When faced with this ill-formed
code unit sequence while transforming or interpreting text, a conformant
process must treat the first code unit 110xxxxx₂ as an illegally terminated
code unit
sequence—for example, by signaling an error, filtering the code unit out, or
representing the code unit with a marker such as U+FFFD replacement
character.
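
To make the three options in C10 concrete, here is a minimal Python sketch (not WebKit code; the function names are made up for illustration). It feeds the ill-formed pair 110xxxxx 0xxxxxxx to Python's built-in UTF-8 decoder and handles it by signaling an error, by filtering the code unit out, or by substituting U+FFFD:

```python
# 0xC3 has the form 110xxxxx and 0x41 ('A') has the form 0xxxxxxx, so this
# pair is the ill-formed sequence used as the example in C10.
ILL_FORMED = bytes([0xC3, 0x41])


def decode_signaling_error(data: bytes) -> str:
    """Option 1: signal an error (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


def decode_filtering(data: bytes) -> str:
    """Option 2: filter the offending code unit out."""
    return data.decode("utf-8", errors="ignore")


def decode_with_marker(data: bytes) -> str:
    """Option 3: represent the offending code unit with U+FFFD."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(decode_filtering(ILL_FORMED)))
    print(repr(decode_with_marker(ILL_FORMED)))
    try:
        decode_signaling_error(ILL_FORMED)
    except UnicodeDecodeError as exc:
        print("error signaled:", exc.reason)
```

In all three cases only the first code unit is treated as bad; the following 0x41 is still decoded as 'A'.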

Section 3.9 (Unicode Encoding Forms) has the following (2nd bullet point in
D93):

Encoding Form Conversion
D93 Encoding form conversion: A conversion defined directly between the code
unit sequences of one Unicode encoding form and the code unit sequences of
another Unicode encoding form.

A conformant encoding form conversion will treat any ill-formed code unit
sequence as an error condition. (See conformance clause C10.) This guarantees
that it will neither interpret nor emit an ill-formed code unit sequence. Any
implementation of encoding form conversion must take this requirement into
account, because an encoding form conversion implicitly involves a verification
that the Unicode strings being converted do, in fact, contain well-formed code
unit sequences.
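
As a rough illustration of that "implicit verification", here is a sketch of a UTF-16 to UTF-8 conversion in Python that refuses to emit anything for ill-formed input. The function name and the choice to raise ValueError are my assumptions for the example, not anything taken from WebKit or ICU:

```python
def utf16_units_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes.

    Hypothetical helper: raises ValueError on an isolated surrogate instead
    of interpreting or emitting an ill-formed sequence.
    """
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                code_point = (
                    0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                )
                i += 2
            else:
                raise ValueError(f"isolated high surrogate at index {i}")
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError(f"isolated low surrogate at index {i}")
        else:
            code_point = unit
            i += 1
        out += chr(code_point).encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    # 'A' followed by U+1D11E encoded as a surrogate pair.
    print(utf16_units_to_utf8([0x0041, 0xD834, 0xDD1E]))
```

The check cannot be skipped: the surrogate arithmetic only makes sense once the pairing has been verified.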

--------------
I'm not quoting D91 (which defines UTF-16 and what is ill-formed in UTF-16)
because it's obvious that an isolated surrogate code point is ill-formed.

BTW, the corresponding Firefox bug is
http://bugzilla.mozilla.org/show_bug.cgi?id=317216

Using the last resort glyph for an isolated surrogate code point can arguably
be considered a way of signaling an error, but it's not just rendering that is
at stake. Other parts of WebKit need to deal with isolated surrogates as well.
By replacing isolated surrogate code points with U+FFFD at the earliest stage,
we can spare them from having to do that check themselves. IMHO, it's always a
good idea to validate what's coming from an external source before doing
anything with it.
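
A sketch of what "replace at the earliest stage" could look like, written in Python over a list of UTF-16 code units. This is only an illustration under my own assumptions (the function name is hypothetical and this is not a proposed patch): well-formed surrogate pairs pass through untouched, and every isolated surrogate becomes U+FFFD.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def replace_isolated_surrogates(units):
    """Return a copy of `units` with each isolated surrogate set to U+FFFD."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed pair: keep both code units as they are.
            out.append(unit)
            out.append(units[i + 1])
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # High surrogate without a following low surrogate, or a low
            # surrogate without a preceding high surrogate.
            out.append(REPLACEMENT_CHARACTER)
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


if __name__ == "__main__":
    cleaned = replace_isolated_surrogates([0x0041, 0xD800, 0x0042])
    print([hex(u) for u in cleaned])
```

Code that runs after this step can assume well-formed UTF-16 and never needs to repeat the surrogate check.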
