[Webkit-unassigned] [Bug 69083] wrong CSS lexer rules

Thu Sep 29 09:38:48 PDT 2011

https://bugs.webkit.org/show_bug.cgi?id=69083

--- Comment #3 from Darin Adler <darin at apple.com>  2011-09-29 09:38:47 PST ---
(In reply to comment #0)
> The "original" comes form: http://www.w3.org/TR/CSS21/grammar.html "G.2 Lexical scanner"
> The "wk" comes form css/tokenizer.flex
> 
> original:
> nonascii    [\240-\377]
> wk:
> nonascii        [\200-\377]
> 
> They start nonascii from 160 not 128. Not sure why.

The characters in the range U+0080-009F are considered “control characters” so I can imagine they might not be permitted in some contexts. But we need to be careful about what kind of text is passed in. If this tokenizer is used on UTF-8 text then it’s critical to allow 80-9F bytes.

> original:
> string1        \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
> wk:
> string1         \"([\t !#$%&(-~]|\\{nl}|\'|{nonascii}|{escape})*\"
> 
> Basically we disallow 127 (DELETE) and <32 non-newline chars while the original grammar allows them.

Another batch of what are considered control characters.

> original:
> unicode        \\{h}{1,6}(\r\n|[ \t\r\n\f])?
> wk:
> unicode         \\{h}{1,6}[ \t\r\n\f]?
> 
> This can be exploited by a \r\n newline: A\41\r\nB should be "AAB" but it will be "AA" and "B" in WK.

Seems likely to just be a bug.

> original:
> {num}%            {return PERCENTAGE;}
> wk:
> {num}%+                 {yyTok = PERCENTAGE; return yyTok;}
> 
> Why do we allow multpile percent sign? Although we still treat them as one...

Subversion indicates that I made this change 8 years ago in <http://trac.webkit.org/changeset/4059>.

The problem being solved was a table element had a content attribute width="90%%". Other browsers were treating this as 90%. So this wasn’t about CSS itself, but rather about parsing the value of a width content attribute.

Here’s what Hyatt advised me back in 2003 (things may have changed since then):

> (1) We could just patch the tokenizer of the CSS parser to make the percent token be %+ (i.e., 1 or more percent signs). 
> (2) The yacc grammar could be patched.
> (3) A simple function could be written that parses HTML widths without using the CSS parser at all, which is what we did for color.
>
> (3) is IMO where we want to go ultimately, but either (1) or (2) would do as a beta fix.

I think we are looking at three or four separate issues here, and we should use separate bugs for them. We’ll need to fix and test each separately. Perhaps the two control character issues could be grouped, though.

Even if we make no change, for each of these we should construct test cases demonstrating the behavior. As I said, running those cases in other browsers can give us valuable information too.

When replacing a piece of code like the lexer, an important first step is to increase test coverage sufficiently. That’s the first order of business. There’s no reason to tie the behavior change to the time we do a rewrite. In fact, it’s better to do it independently either before or after the rewrite.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.