[webkit-reviews] review granted: [Bug 179573] Clean up old URL parser remnants : [Attachment 326678] Patch

Sun Nov 12 17:54:14 PST 2017

Darin Adler <darin at apple.com> has granted Alex Christensen
<achristensen at apple.com>'s request for review:
Bug 179573: Clean up old URL parser remnants
https://bugs.webkit.org/show_bug.cgi?id=179573

Attachment 326678: Patch

https://bugs.webkit.org/attachment.cgi?id=326678&action=review

--- Comment #4 from Darin Adler <darin at apple.com> ---
Comment on attachment 326678
  --> https://bugs.webkit.org/attachment.cgi?id=326678
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=326678&action=review

Looks OK, except I think this breaks the context menu’s handling of some
special characters.

> Source/WebCore/platform/URL.cpp:547
> +    for (size_t i = 0; i < input.length(); ++i) {
> +	   if (UNLIKELY(shouldEncode(input[i]))) {
> +	       CString utf8 = input.utf8();
> +	       StringBuilder builder;
> +	       builder.reserveCapacity(utf8.length() + 2);
> +	       for (size_t j = 0; j < utf8.length(); j++) {
> +		   auto c = utf8.data()[j];
> +		   if (shouldEncode(c)) {
> +		       builder.append('%');
> +		       builder.append(upperNibbleToASCIIHexDigit(c));
> +		       builder.append(lowerNibbleToASCIIHexDigit(c));
> +		   } else
> +		       builder.append(c);
> +	       }
> +	       return builder.toString();
> +	   }
> +    }
> +    return input;

It’s inelegant to have the algorithm for actually encoding nested inside an if
statement in the middle of a loop (containing a second loop!). We should
separate the two loops, putting them into separate functions or lambdas if we
want to avoid booleans. Another idea (see below) is to use one loop to compute
the length, and another to do the encoding, then there’s a clean place to exit
between the two.

It’s not quite right to use size_t to iterate a String; indices into
WTF::String are unsigned, not size_t, so it’s slightly wasteful to explicitly
use the bigger type.

This loop looks to me like it will be pretty slow, but I guess that’s OK. Here
are some thoughts on performance:

- Calling a function pointer once per character seems like it will be pretty
expensive. But maybe it’s OK for the way we are using this.

- Converting to UTF-8 using String::utf8 means allocating and destroying a
CStringBuffer. That’s costly and we can make sure we have code that converts to
UTF-8 in streaming fashion without a buffer.

- Returning a String that is then used to concatenate with other strings means
allocating and destroying a StringImpl. If this was turned into a
StringConcatenate StringTypeAdapter then callers could concatenate this with
other strings and only allocate one buffer for the entire result, eliminating
this extra StringImpl.

- Calling CString::length and CString::data each time through a loop is not so
great. They both do null checks on m_buffer and dereference it each time; I am
not sure the compiler will be smart enough to hoist all of that out of the
loop, given that we are calling a function through a function pointer each time
through.

- I don’t fully understand the rationale for the StringBuilder::reserveCapacity
call. It seems to make the assumption that exactly one character needs to be
percent-encoded. That makes it likely we will need to reallocate the
StringBuffer. If we split the function into two halves as I suggested above, we
could compute the exact correct length the first time through instead of
stopping on the first character we should encode. Then we could use
String::createUninitialized instead of StringBuilder to make the string, if we
aren’t doing the StringTypeAdapter thing yet.

> Source/WebCore/platform/URL.cpp:666
> +    auto questionMarkOrNumberSign = [] (UChar c) {
> +	   return c == '?' || c == '#';
> +    };

Could use the word "character". I don’t think it would be any harder to read.

> Source/WebCore/platform/URL.cpp:667
> +    URLParser parser(makeString(m_string.left(m_portEnd),
percentEncodeCharacters(path, questionMarkOrNumberSign),
m_string.substring(m_pathEnd)));

makeString can work with StringView arguments instead of String arguments, so
we could cut down on string allocations here by using StringView::substring
instead of String::left and String::substring.

> Source/WebKit/WebProcess/WebCoreSupport/WebContextMenuClient.cpp:65
> -    String encoded = encodeWithURLEscapeSequences(searchString);
> -    encoded.replace(ASCIILiteral { "%20" }, ASCIILiteral { "+" });
>  
> -    String url = "http://www.google.com/search?q=" + encoded +
"&ie=UTF-8&oe=UTF-8";
> +    URL url(URL(), "https://www.google.com/search?q=" + searchString +
"&ie=UTF-8&oe=UTF-8");

This doesn’t seem quite right.

I understand that the URL parser will escape characters that would make the URL
illegal otherwise, such as control characters, non-ASCII characters, and
spaces. But there is other escaping that is needed to successfully convey an
arbitrary search string to the server. Four examples:

"#" needs to be represented as "%23", otherwise it will be treated as the end
of the query, and the part after it will be treated as the fragment
"%" needs to be represented as "%25" if it comes before valid hex digits,
otherwise it will be treated as a the start of an escape sequence
"&" needs to be represented as "%26", otherwise it will be treated as the end
of the string, delimiting another argument
"+" needs to be represented as "%2B", otherwise it will be treated as a space

After this change I don’t think we can correctly do searches for strings
containing those characters. If we wanted the URL class to help us, using
URL::setQuery instead of concatenating would take care of "#", but not the
other characters. I don’t think the URL class has the functions needed to deal
with these other issues.

The comment in the ChangeLog does not offer rationale for why we now decided to
use "%20" rather than "+". We went out of our way to send the shorter form
before; this patch removes the code without comment.