[Webkit-unassigned] [Bug 38572] [WTFURL] Add core URL parser

Fri May 7 00:39:30 PDT 2010

https://bugs.webkit.org/show_bug.cgi?id=38572

--- Comment #23 from Adam Barth <abarth at webkit.org>  2010-05-07 00:39:29 PST ---
> +    // Compatability data points. I list "host", "path" extracted:
> 
> Typo: should be "compatibility".

Fixed.

> I don't think that having the "Us" column is a
> good idea, there is too much danger of it getting out of sync. Data for IE6 an
> unspecified version of Firefox is of questionable value, too.

Removed.

> +        // Found "//<some data>", looks like an authority section. Treat
> everything
> 
> Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is
> only called for standard URLs, a comment should explain that.

Done.

> Similar comment for parsePath() - '?' and "#" characters don't break
> non-standard URLs into pieces.
> 
> +            case '#':
> +                // Record the first # sign only.
> +                if (refSeparator < 0)
> +                    refSeparator = i;
> +                break;
> 
> Seems appropriate to break out of the loop here for slightly better
> performance, and possibly more readable code.

I'm not sure how to break out of the loop without a goto, and I'm not sure
having a goto improve readability.  :)

> +        // components. The code below words from the end back to the
> beginning,
> 
> "words"?

Presumably "works".  Fixed.

> +    // Initializes a path URL which is merely a scheme followed by a path.
> Examples
> +    // include "about:foo" and "javascript:alert('bar');"
> +    static void parsePathURL(const CHAR* spec, int specLength, URLSegments&
> parsed)
> 
> It's quite unclear which functions are API, and which are internal
> implementation. In particular, why does parsePathURL() trim spaces, given that
> some code has already checked the URL and found it to be a path URL,
> necessarily trimming spaces just for that?

I'm slightly hesitant to remove this code until we've got more of the library
landed so we can see which methods are exposed publicly.  I've added a FIXME to
reflect our current understanding.

> A related general question: what kind of input do these functions operate on?
> Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are
> all const CHAR*.

See brettw's answer above.

> +        } else {
> +            // No scheme found, just path.
> 
> What cases is this expected to catch? We seem to fall back on this case for
> /foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a
> UNIX-style path would fool us, too.

See brettw's answer above.

> Same comments about mailto parser.
> 
> +        if (!component.isNonempty())
> 
> Not part of this patch, but "empty" should be capitalized.

Done.

> And double negation isn't great.

Yeah, that doesn't seem ideal.  Changed to isEmptyOrInvalid().

> +        // Skip over any leading 0s.
> 
> A drive-by comment: do we have a test for spaces between ':' and port, e.g.
> "http://foo.com:   80/"?

I think so:
http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js#L20

I'm not sure if that test covers what you're interested in.

> +        // Copy valid digits to the buffer.
> +        char digits[maxDigits + 1]; // +1 for null terminator
> +        for (int i = 0; i < nonZeroDigits.length(); i++) {
> +            CHAR ch = spec[nonZeroDigits.begin() + i];
> +            if (!isPortDigit(ch)) {
> +                // Invalid port digit, fail.
> +                return InvalidPort;
> +            }
> +            digits[i] = static_cast<char>(ch);
> +        }
> 
> We're doing all this work just to call atoi in the end? We could just as easily
> parse the number directly.

Indeed.  Done.

> +        // parameter. The path should start with a slash, so we don't need to
> check
> +        // the first one.
> 
> This comment slightly clashes with a later one: "No slash found, this means the
> input was degenerate."

Fixed.

> +        if (!query.isNonempty())
> 
> More double negation.

Fixed.

> +    // We treat slashes and backslashes the same for IE compatability.
> 
> Should be "compatibility".

Fixed.

> +    // Given an already-initialized begin index and length, this shrinks the
> range
> +    // to eliminate "should-be-trimmed" characters. Note that the length does
> *not*
> +    // indicate the length of untrimmed data from |begin|, but rather the
> position
> +    // in the input string (so the string starts at character |begin| in the
> spec,
> +    // and goes until |length|).
> 
> Then it shouldn't be called length.

What name do you suggest?  I've changed it to "specLength" because that's the
name of the parameter that's always supplied.  (Note that although "end" is a
tempting name, it's actually end+1.)

> +} // namespace WTF
> +
> +#endif // URLParser_h
> 
> If there is anything in this file to use outside of WTF, there should be using
> decarations for these symbols.

The exported symbols will use concrete types.  Client aren't meant to call into
this code directly.

> I expect that you'll want to address some of the above comments, so marking r-.

Thanks for the review.  Updated patch coming shortly.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.