[webkit-reviews] review granted: [Bug 45594] Add AtomicString::fromUTF8 : [Attachment 73819] Patch

Mon Nov 15 12:11:32 PST 2010

Darin Adler <darin at apple.com> has granted Patrick R. Gansterer
<paroga at paroga.com>'s request for review:
Bug 45594: Add AtomicString::fromUTF8
https://bugs.webkit.org/show_bug.cgi?id=45594

Attachment 73819: Patch
https://bugs.webkit.org/attachment.cgi?id=73819&action=review

------- Additional Comments from Darin Adler <darin at apple.com>
View in context: https://bugs.webkit.org/attachment.cgi?id=73819&action=review

Looks good. Please consider some of my suggestions for improvement. For now, I
set commit-queue- so you could think over if you want to make these changes
before you land.

> JavaScriptCore/wtf/text/AtomicString.cpp:233
> +	   UChar *target;

The * is positioned incorrectly here. It should be next to the UChar.

> JavaScriptCore/wtf/text/AtomicString.cpp:234
> +	   location = StringImpl::createUninitialized(buffer.utf16Length,
target).releaseRef();

Please use the new name, leakRef, not the old name, releaseRef, in new code.

> JavaScriptCore/wtf/text/AtomicString.h:112
> +    // the input data contains invalid UTF-8 characters.

Comment should say “invalid UTF-8 sequences” not “invalid UTF-8 characters”.

> JavaScriptCore/wtf/text/WTFString.h:313
> +    // the input data contains invalid UTF-8 characters.

Same comment.

> JavaScriptCore/wtf/unicode/UTF8.cpp:238
> +    UChar32 ch = 0;

I’d much prefer “character” for this local.

> JavaScriptCore/wtf/unicode/UTF8.cpp:243
> +	   case 6: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; //
remember, illegal UTF-8
> +	   case 5: ch += static_cast<unsigned char>(*sequence++); ch <<= 6; //
remember, illegal UTF-8

Who are we saying “remember” to in these comments? I know you just moved them,
but I think these comments are unhelpful and unclear. We should remove the
comments.

> JavaScriptCore/wtf/unicode/UTF8.cpp:262
> +	   if (source + utf8SequenceLength > sourceEnd) {

This should be written like this:

    if (sourceEnd - source < utf8SequenceLength)

It’s not great to compute an address past the end of the buffer, and doing it
this way guarantees we don’t try to do that.

> JavaScriptCore/wtf/unicode/UTF8.cpp:313
> +unsigned calculateHashFromUTF8(const char* data, const char* dataEnd,
unsigned& utf16Length)

This function needs a policy for handling lengths that do not fit in unsigned.
Even if that policy is simply to call CRASH when encountering a giant length.
Not doing so can lead to security vulnerabilities. Alternatively, the type of
the out argument could be changed to a size_t, but then the callers would need
to check if they narrowed it to an unsigned.

> JavaScriptCore/wtf/unicode/UTF8.cpp:322
> +	   if (!(*data & 0x80)) {

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:337
> +	   ASSERT(ch >= 0x80);

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:339
> +	   if (ch <= 0xFFFF) {

It would be better to use U_IS_BMP for this. This is a macro from ICU that we
may already implement for non-ICU platforms, but if not we can do so.

> JavaScriptCore/wtf/unicode/UTF8.cpp:341
> +	       if (ch >= 0xD800 && ch <= 0xDFFF)

It would be better to use U_IS_SURROGATE for this. Another ICU macro.

> JavaScriptCore/wtf/unicode/UTF8.cpp:348
> +	   } else if (ch <= 0x10FFFF) {
> +	       ch -= 0x0010000UL;
> +	       stringHasher.addCharacters(static_cast<UChar>((ch >> 10) +
0xD800),
> +					  static_cast<UChar>((ch & 0x03FF) +
0xDC00));

It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this.
More ICU macros.

> JavaScriptCore/wtf/unicode/UTF8.cpp:360
> +	   if (!(*b & 0x80)) {

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:361
> +	       if (*a++ != static_cast<UChar>(*b++))

The static_cast is unfortunate. Can we use a local variable to avoid the
typecast?

> JavaScriptCore/wtf/unicode/UTF8.cpp:375
> +	   ASSERT(ch >= 0x80);

It would be better to use isASCII from ASCIICType.h for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:377
> +	   if (ch <= 0xFFFF) {

It would be better to use U_IS_BMP for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:379
> +	       if (ch >= 0xD800 && ch <= 0xDFFF)

It would be better to use U_IS_SURROGATE for this.

> JavaScriptCore/wtf/unicode/UTF8.cpp:381
> +	       if (*a++ != static_cast<UChar>(ch))

I don’t think we need the static_cast here. Please remove it.

> JavaScriptCore/wtf/unicode/UTF8.cpp:389
> +	   } else if (ch <= 0x10FFFF) {
> +	       ch -= 0x0010000UL;
> +
> +	       if (*a++ != static_cast<UChar>((ch >> 10) + 0xD800))
> +		   return false;
> +	       if (*a++ != static_cast<UChar>((ch & 0x03FF) + 0xDC00))
> +		   return false;

It would be better to use U_IS_SUPPLEMENTARY, U16_LEAD, and U16_TRAIL for this.

> JavaScriptCore/wtf/unicode/UTF8.h:73
> +    unsigned calculateHashFromUTF8(const char* data, const char* dataEnd,
unsigned& utf16Length);

Still unfortunate to have this here in this file. At the very least this needs
a comment to explain what kind of hash we are talking about.

> JavaScriptCore/wtf/unicode/UTF8.h:75
> +    bool equalUTF8(const UChar* a, const UChar* aEnd, const char* b, const
char* bEnd);

This name should mention UTF-16, not just UTF-8, since the function compares
UTF-16 to UTF-8.