[webkit-reviews] review denied: [Bug 45594] Add AtomicString::fromUTF8 : [Attachment 70357] Patch

Sat Oct 30 15:23:28 PDT 2010

Darin Adler <darin at apple.com> has denied Patrick R. Gansterer
<paroga at paroga.com>'s request for review:
Bug 45594: Add AtomicString::fromUTF8
https://bugs.webkit.org/show_bug.cgi?id=45594

Attachment 70357: Patch
https://bugs.webkit.org/attachment.cgi?id=70357&action=review

------- Additional Comments from Darin Adler <darin at apple.com>
View in context: https://bugs.webkit.org/attachment.cgi?id=70357&action=review

Basic idea here is quite good, but we should improve the execution at least a
bit before landing.

One thing we could do to cut down the size of the patch a bit is to add a
non-optimized AtomicString::fromUTF8 first along with changes to adopt it, and
then add the second patch which adds the optimized code path.

> JavaScriptCore/wtf/text/AtomicString.cpp:217
> +    mutable RefPtr<StringImpl> string;

It’s pretty strange to have a mutable member in a struct with no member
functions. It might be better to use a class. I think this idiom deserves at
least a comment. Using a mutable structure so we can allocate the string during
the lookup process and then reuse it when putting the string into the hash
table is cleverness. Cleverness is a bad thing without comments to explain to
those who follow.

Quite a bit of additional complexity is here with the intention of speeding up
cases that I don’t think are common enough to deserve it. Storing a StringImpl
in the structure optimizes the case where we compare and discover the string is
a new string, by removing one pass of UTF-8-to-16 conversion, but at the cost
of extra string allocations where we compare and discover the string is not
new, so I think it’s not a good tradeoff.

With no comments about rationale, this all is a bit mysterious.

> JavaScriptCore/wtf/text/AtomicString.cpp:220
> +static inline bool equal(StringImpl* string, const char* characters,
unsigned length)

Since this function is only legal for ASCII characters, then the name needs to
express that limitation, and I’d like to see an assertion as well. I’m not sure
a fast path for all-ASCII is needed, though.

> JavaScriptCore/wtf/text/AtomicString.cpp:234
> +static inline void addStringImplIntoBuffer(const HashAndUTF8Characters&
buffer)

The function name here is pretty confusing. This seems like a perfect candidate
for a member function.

The more-typical idiom for this would be to use a private member for the
string, and then have a public accessor function that does this work of
creating the string. Then we avoid having this awkwardly-named function, and we
also have an interface that's harder to use incorrectly.

> JavaScriptCore/wtf/text/AtomicString.cpp:258
> +	   addStringImplIntoBuffer(buffer);
> +	   return WTF::equal(string, buffer.string->characters(),
buffer.string->length());

I think we easily can write code to compare a UTF-8 string with a UTF-16 string
without allocating memory. This would come up in the relatively common case of
looking for a non-ASCII string that is already in the table. It would be faster
not to allocate memory in that case.

> JavaScriptCore/wtf/text/AtomicString.cpp:377
> +	   return AtomicString(StringImpl::empty());

I believe it’s faster to return emptyAtom here. Other call sites in the same
file should do the same if they don’t already. Unless it’s not faster!

> JavaScriptCore/wtf/text/AtomicString.cpp:383
> +    // We'll use a StringImpl as a buffer; if the source string only
contains ascii this should be
> +    // the right length, if there are any multi-byte sequences this buffer
will be too large.
> +    UChar* buffer;
> +    RefPtr<StringImpl> stringBuffer =
StringImpl::createUninitialized(length, buffer);
> +    UChar* bufferEnd = buffer + length;

This is not the right approach. We should write code that can look up the
string in the hash table without allocating memory at all. There’s no reason we
need a buffer to convert UTF-8 to UTF-16 until we have proven the string is not
already in the table. I don’t think we need a special case for ASCII either; it
will come out fast automatically.

In this patch, we reserve the optimization of not allocating the string for
null-terminated strings. That’s not the tradeoff we usually make in WebKit.

> JavaScriptCore/wtf/text/AtomicString.cpp:399
> +AtomicString AtomicString::fromUTF8(const char* characters)

This should call the other fromUTF8 function after first calling strlen rather
than having separate logic.

> JavaScriptCore/wtf/text/AtomicString.h:112
> +    static AtomicString fromUTF8(const char*, size_t);
> +    static AtomicString fromUTF8(const char*);

I think these functions and the ones already in the String class deserve a
comment indicating what that the function returns a null string if the
characters are not valid UTF-8.

> JavaScriptCore/wtf/unicode/UTF8.h:73
> +    bool calculateHashAndLength(const char* rawData, unsigned& hash,
unsigned& rawLength, unsigned& utf16Length);

This function is unclear. It has an argument name “rawData” and nothing in the
function name or argument name make it clear that it’s expected to be UTF-8
data. I think you are using “raw” here to mean “UTF-8”, which is not a usual
term of art. I also find the boolean result unclear. Without a comment I have
no idea what its significance is. Maybe true if the function fails, or false if
it fails?

It’s not right to have this function work on null-terminated strings. Our
normal approach is that all functions work on strings with an explicit separate
length argument, except for a few convenience functions. Those can use strlen
and then call through to the general case. Since this function works only on
null-terminated strings, we can’t use it at all in functions that have strings
with an explicit length: the function would run off the end of the buffer
looking for a null character and also would misinterpret null characters as
end-of-string indicators.

A single function that computes a hash for a UTF-8 string and also returns the
UTF-16 length as a side effect is OK. But we should add special cases only if
we can prove with profiling they are beneficial. More branches mean the code is
harder to test. Knowing the length in advance can be useful so we can allocate
a perfectly-sized buffer without scanning the string again. It also serves the
dual purpose of being an all-ASCII flag, but it’s not clear we need such a
flag. The specifici motivation and value in having oblique functions with
multiple purposes like this one means we need at least a brief mention in a
comment of the benefit of doing these two things at once.

The location of the function in this header isn’t perfect either, and creates a
bit of a modularity problem within WTF. The function computes a hash, but the
UTF8.h file is a lower level Unicode building block and this level knows
nothing about the hashing algorithm. The code that knows about both UTF-8 and
the hash functions probably belongs in StringHashFunctions.h rather than
UTF8.h.

I’d like us to add a function that compares a UTF-16 string to a UTF-8 one so
we can find non-ASCII strings that are already present in the AtomicString
table without allocating memory.