[Webkit-unassigned] [Bug 38572] [WTFURL] Add core URL parser

Thu May 6 13:13:08 PDT 2010

https://bugs.webkit.org/show_bug.cgi?id=38572

--- Comment #20 from Brett Wilson (Google) <brettw at chromium.org>  2010-05-06 13:13:07 PST ---
(In reply to comment #18)
> (From update of attachment 55188 [details])
> > I'm happy to use whatever license block you like.  > Can you point me to an
> example file?
> 
> I'm still deferring to Maciej, but an example of what I did last time when
> importing a Mozilla tri-license file is in LayoutTests/java/lc3.
> 
> +    // Compatability data points. I list "host", "path" extracted:
> 
> Typo: should be "compatibility". I don't think that having the "Us" column is a
> good idea, there is too much danger of it getting out of sync. Data for IE6 an
> unspecified version of Firefox is of questionable value, too.
> 
> +        // Found "//<some data>", looks like an authority section. Treat
> everything
> 
> Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is
> only called for standard URLs, a comment should explain that.

It's correct this is called only for standard URLs.

> Similar comment for parsePath() - '?' and "#" characters don't break
> non-standard URLs into pieces.

A "path URL" is javascript: or data: with no breaking, named so because the
data is contained in the "path" component of the URL. This function parses the
path of a URL, which is used for both standard and file URLs. Probably the
comment should state this.

> +            case '#':
> +                // Record the first # sign only.
> +                if (refSeparator < 0)
> +                    refSeparator = i;
> +                break;
> 
> Seems appropriate to break out of the loop here for slightly better
> performance, and possibly more readable code.
> 
> +        // components. The code below words from the end back to the
> beginning,
> 
> "words"?
> 
> +    // Initializes a path URL which is merely a scheme followed by a path.
> Examples
> +    // include "about:foo" and "javascript:alert('bar');"
> +    static void parsePathURL(const CHAR* spec, int specLength, URLSegments&
> parsed)
> 
> It's quite unclear which functions are API, and which are internal
> implementation. In particular, why does parsePathURL() trim spaces, given that
> some code has already checked the URL and found it to be a path URL,
> necessarily trimming spaces just for that?

I agree this trimming looks unnecessary.

There are three layers of API, here. The top layer is the object-oriented layer
(KURL, GURL) which isn't relevant to this work. The bottom layer is the low
level "given this file URL parse it as a file URL". There is a middle layer
that figures out what type of URL it is, knows about known schemes, and then
dispatches to the correct low-level parser and canonicalizer.

Chrome uses all of these layers in different places. For example, our URL bar
calls ParseStandardURL directly because the user is typing URL fragments and we
want to know "given this input, what is the most reasonable possible thing that
this could be if it was a standard URL."

> A related general question: what kind of input do these functions operate on?
> Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are
> all const CHAR*.

They operate on both byte arrays and UTF-16. It's templatized/overloaded to
support both. The canonical version of a URL we use is 8-bit, which has a nice
space advantage and also matches the network format. But many sources of
uncanonicalized URL data have 16-bit encoding, for example, the DOM, which is
why there is this duality.

> +        } else {
> +            // No scheme found, just path.
> 
> What cases is this expected to catch? We seem to fall back on this case for
> /foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a
> UNIX-style path would fool us, too.

Yes, that's correct. If you ever hit this code, the URL is invalid. But this
code is to make the most reasonable possible parsing of the invalid URL
possible. This is necessary, for example, in the above example of the URL,
where the URL may be invalid but we don't care (because we can fix it once we
know where things are, for example, by adding "http".

> Same comments about mailto parser.
> 
> +        if (!component.isNonempty())
> 
> Not part of this patch, but "empty" should be capitalized. And double negation
> isn't great.

The "problem" is that there are three states: (0,-1) meaning invalid or not
present (x,0) meaning valid but 0-length, and (x,y) meaning valid and nonzero
length. The name IsEmpty gives some ambiguity about whether it treats invalid
parts as empty, and I didn't want people to write code like if (!isValid() &&
!isEmpty()). Nonempty is unambiguously "there are characters" here which also
corresponds to the thing we most often want to test.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.