[Webkit-unassigned] [Bug 38572] [WTFURL] Add core URL parser

Thu May 6 12:03:39 PDT 2010

https://bugs.webkit.org/show_bug.cgi?id=38572

Alexey Proskuryakov <ap at webkit.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #55188|review?, commit-queue?      |review-, commit-queue-
               Flag|                            |

--- Comment #18 from Alexey Proskuryakov <ap at webkit.org>  2010-05-06 12:03:38 PST ---
(From update of attachment 55188)
> I'm happy to use whatever license block you like.  > Can you point me to an
example file?

I'm still deferring to Maciej, but an example of what I did last time when
importing a Mozilla tri-license file is in LayoutTests/java/lc3.

+    // Compatability data points. I list "host", "path" extracted:

Typo: should be "compatibility". I don't think that having the "Us" column is a
good idea, there is too much danger of it getting out of sync. Data for IE6 an
unspecified version of Firefox is of questionable value, too.

+        // Found "//<some data>", looks like an authority section. Treat
everything

Even if it's "javascript://%20Nothing%20but%20a%20comment"? If this function is
only called for standard URLs, a comment should explain that.

Similar comment for parsePath() - '?' and "#" characters don't break
non-standard URLs into pieces.

+            case '#':
+                // Record the first # sign only.
+                if (refSeparator < 0)
+                    refSeparator = i;
+                break;

Seems appropriate to break out of the loop here for slightly better
performance, and possibly more readable code.

+        // components. The code below words from the end back to the
beginning,

"words"?

+    // Initializes a path URL which is merely a scheme followed by a path.
Examples
+    // include "about:foo" and "javascript:alert('bar');"
+    static void parsePathURL(const CHAR* spec, int specLength, URLSegments&
parsed)

It's quite unclear which functions are API, and which are internal
implementation. In particular, why does parsePathURL() trim spaces, given that
some code has already checked the URL and found it to be a path URL,
necessarily trimming spaces just for that?

A related general question: what kind of input do these functions operate on?
Is it byte arrays, or not? I'd expect "spec" to be a byte array, but they are
all const CHAR*.

+        } else {
+            // No scheme found, just path.

What cases is this expected to catch? We seem to fall back on this case for
/foo/bar UNIX-style paths, but not for C:\foo\bar ones. And a colon in a
UNIX-style path would fool us, too.

Same comments about mailto parser.

+        if (!component.isNonempty())

Not part of this patch, but "empty" should be capitalized. And double negation
isn't great.

+        // Skip over any leading 0s.

A drive-by comment: do we have a test for spaces between ':' and port, e.g.
"http://foo.com:   80/"?

+        // Copy valid digits to the buffer.
+        char digits[maxDigits + 1]; // +1 for null terminator
+        for (int i = 0; i < nonZeroDigits.length(); i++) {
+            CHAR ch = spec[nonZeroDigits.begin() + i];
+            if (!isPortDigit(ch)) {
+                // Invalid port digit, fail.
+                return InvalidPort;
+            }
+            digits[i] = static_cast<char>(ch);
+        }

We're doing all this work just to call atoi in the end? We could just as easily
parse the number directly.

+        // parameter. The path should start with a slash, so we don't need to
check
+        // the first one.

This comment slightly clashes with a later one: "No slash found, this means the
input was degenerate."

+        if (!query.isNonempty())

More double negation.

+    // We treat slashes and backslashes the same for IE compatability.

Should be "compatibility".

+    // Given an already-initialized begin index and length, this shrinks the
range
+    // to eliminate "should-be-trimmed" characters. Note that the length does
*not*
+    // indicate the length of untrimmed data from |begin|, but rather the
position
+    // in the input string (so the string starts at character |begin| in the
spec,
+    // and goes until |length|).

Then it shouldn't be called length.

+} // namespace WTF
+
+#endif // URLParser_h

If there is anything in this file to use outside of WTF, there should be using
decarations for these symbols.

I expect that you'll want to address some of the above comments, so marking r-.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.