[Webkit-unassigned] [Bug 4920] Non-BMP characters in JavaScript identifiers

Tue Jun 14 02:38:22 PDT 2011

https://bugs.webkit.org/show_bug.cgi?id=4920

--- Comment #14 from Gavin Barraclough <barraclough at apple.com>  2011-06-14 02:38:22 PST ---
> I do not think that this is how it works. Only a Unicode character can have a category, so the halves in UTF-16 encoding don't have categories at all. In other words, U+D801 has category Cs, but bytes 0xD801 don't have a category.

I'm afraid I don't have quite the same reading of the Unicode spec as you.  The Unicode spec defines types (general categories) for Unicode Code Points, which are specified as simply being any integer in the range 0..0x10FFFF.  Is it valid to request a Code Point type for the value 0xD801?  Yes: it has a type, which is Cs.
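
To illustrate the point, here is a minimal sketch in Python (not WebKit code; it assumes only the standard `unicodedata` module, whose tables follow the Unicode Character Database). It asks for the general category of the lone code point 0xD801:

```python
import unicodedata

# 0xD801 is a surrogate code point. Python allows a str that holds it
# on its own, so we can look up its general category directly.
lone_surrogate = chr(0xD801)

print(unicodedata.category(lone_surrogate))  # Cs
```

The lookup succeeds and reports Cs, which matches the reading that every integer in 0..0x10FFFF is a code point with a category.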

And I think that a clearer point here is that this seems unequivocally to be the intention of the ES5 spec.  ES5 clearly specifies that the source text should be lexed one UTF-16 code unit at a time.  The specification defines both of the terms 'SourceCharacter' and 'character' to correspond to a single UTF-16 Code Unit (deliberately drawing a distinction with "Unicode characters", below).  Arising from this definition, the specification of the elements within an IdentifierName considers only the Unicode Code Point type of a single Code Unit at a time.
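
A rough sketch of what that per-code-unit reading implies, again in Python and purely illustrative: the helper names `utf16_code_units` and `is_letter_per_code_unit` are made up for this example and are not taken from JavaScriptCore. The set of categories is the one ES5 section 7.6 lists under UnicodeLetter. The input is U+10400, a non-BMP letter whose UTF-16 encoding starts with 0xD801:

```python
import unicodedata

# General categories that ES5 (section 7.6) groups under "UnicodeLetter".
ES5_UNICODE_LETTER = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def is_letter_per_code_unit(text):
    """Classify each UTF-16 code unit on its own, treating it as a code point."""
    return [
        unicodedata.category(chr(unit)) in ES5_UNICODE_LETTER
        for unit in utf16_code_units(text)
    ]

deseret = "\U00010400"  # DESERET CAPITAL LETTER LONG I

print(unicodedata.category(deseret))                 # Lu
print([hex(u) for u in utf16_code_units(deseret)])   # ['0xd801', '0xdc00']
print(is_letter_per_code_unit(deseret))              # [False, False]
```

Taken as a whole character, U+10400 is an uppercase letter (Lu). Taken one code unit at a time, both halves are Cs, so neither qualifies as a UnicodeLetter and the character cannot appear in an IdentifierName under this reading.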

Unless and until the ECMA spec changes to instead define 'SourceCharacter' and 'character' to be a "Unicode character" (which may be represented by more than one code unit), we shouldn't change our behaviour here - I think it's pretty clear that we are in compliance with the ES5 spec as it currently stands.
