On Dec 2, 2009, at 9:07 PM, Darin Fisher wrote:

On Wed, Dec 2, 2009 at 8:44 PM, Maciej Stachowiak <mjs@apple.com> wrote:

On Dec 2, 2009, at 8:14 PM, Darin Fisher wrote:

What about Maciej's comment?  JS strings are often used to store binary values.  Obviously, if people stick to octets, then it should be fine, but perhaps some folks leverage all 16 bits?

I think some people do use JavaScript strings this way, though not necessarily with LocalStorage. This kind of use will probably become obsolete once the platform provides a proper way to store binary data.
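
To make that kind of use concrete, here is the sort of trick I have in mind (a sketch of my own, not code taken from any real page): pack two bytes into each UTF-16 code unit, so arbitrary byte pairs, including ones that happen to form unpaired surrogates, can end up in the string.

// Purely illustrative; these helper names are made up.
function bytesToString(bytes: number[]): string {
  let s = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    // Two bytes per code unit, so every 16-bit value is possible.
    // (A trailing odd byte is ignored here for brevity.)
    s += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return s;
}

function stringToBytes(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    out.push(unit >> 8, unit & 0xff);
  }
  return out;
}

// For example, the bytes 0xD8 0x00 become "\uD800", an unpaired lead surrogate.

A store that rejects or lossily converts such strings would break this pattern.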

Most Web-related APIs are fully accepting of JavaScript strings that are not proper UTF-16. I don't see a strong reason to make LocalStorage an exception. It does make sense for WebSocket to be an exception, since the protocol requires charset transcoding, and since it is desirable there to prevent any funny business that may trip up the server.

Also, looking at UTF-16 more closely, it seems that any UTF-16 string can be transcoded to UTF-8 and round-tripped, if one is willing to allow technically invalid UTF-8 that encodes unpaired surrogate code units as if they were characters. It's not clear to me why Firefox and IE choose to reject instead of doing this. This also removes my original objection to storing strings as UTF-8.
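
To make that concrete, here is a rough sketch (purely illustrative; the names and details are mine, and the encoder knowingly emits invalid UTF-8 for lone surrogates):

// Encode a JavaScript string, letting unpaired surrogates through as
// technically invalid 3-byte sequences instead of failing or substituting.
function encodeLossless(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const lead = s.charCodeAt(i);
    let cp = lead;
    // Combine a proper surrogate pair into a supplementary code point.
    if (lead >= 0xd800 && lead <= 0xdbff && i + 1 < s.length) {
      const trail = s.charCodeAt(i + 1);
      if (trail >= 0xdc00 && trail <= 0xdfff) {
        cp = 0x10000 + ((lead - 0xd800) << 10) + (trail - 0xdc00);
        i++;
      }
    }
    // Standard UTF-8 byte layout; a lone surrogate falls into the 3-byte case.
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
               0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    }
  }
  return out;
}

// Decode bytes produced by encodeLossless (no validation, for brevity);
// 3-byte sequences in the surrogate range come back as single code units.
function decodeLossless(bytes: number[]): string {
  let s = "";
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let cp: number;
    if (b < 0x80) {
      cp = b; i += 1;
    } else if (b < 0xe0) {
      cp = ((b & 0x1f) << 6) | (bytes[i + 1] & 0x3f); i += 2;
    } else if (b < 0xf0) {
      cp = ((b & 0x0f) << 12) | ((bytes[i + 1] & 0x3f) << 6) | (bytes[i + 2] & 0x3f); i += 3;
    } else {
      cp = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3f) << 12) |
           ((bytes[i + 2] & 0x3f) << 6) | (bytes[i + 3] & 0x3f); i += 4;
    }
    s += String.fromCodePoint(cp);
  }
  return s;
}

// An unpaired lead surrogate survives the round trip:
const original = "a\uD800b";
console.log(decodeLossless(encodeLossless(original)) === original); // true

A strict transcoder would fail (or substitute U+FFFD) at the unpaired \uD800, which is presumably what Firefox and IE do.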


I think it is typical for UTF-16 to UTF-8 conversion to involve the intermediate step of forming a Unicode code point.  If that cannot be done, then conversion fails.  This may actually be a security thing.  If something expects UTF-8, it is safer to ensure that it gets valid UTF-8 (even if that involves loss of information).

These security considerations seem important for WebSocket where the protocol uses UTF-8 per spec, but not for the internal storage representation of JavaScript strings in LocalStorage (where observable input and output are both possibly-invalid UTF-16).

Regards,
Maciej