[Webkit-unassigned] [Bug 38117] Differences between subpattern matching in use of pcre and Yarr Interpreter

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Wed May 19 11:46:15 PDT 2010


https://bugs.webkit.org/show_bug.cgi?id=38117





--- Comment #1 from Gavin Barraclough <barraclough at apple.com>  2010-05-19 11:46:14 PST ---
Hi Peter,

In all three cases here, I believe Yarr is correct, and its results match those of Firefox.

Where the spec appeared ambiguous, I based Yarr's behaviour on an average of the behaviours of other browsers and on what seemed to make sense.  I seem to recall that I generally found IE to have the most spec-compliant and (in my opinion) sensible results.  In all the cases I'm aware of where PCRE and Yarr differ, we believe Yarr to be correct, though of course there may well be bugs in Yarr that we're not aware of (e.g. I spotted a bug you raised the other day re ?? where Yarr clearly is wrong).

The rule here, I think, is that if you have a quantified set of capturing parentheses, the capture should be the value from the last successful match.  For a match to count as successful the pattern obviously has to match, but also for optional matches (matches after the minimum extent of the quantification has been reached) the result of the match must not be the empty string.

E.g.:
    /(a?){3}(b)/.exec("aab")
This should produce the result aab,,b since the first capture matches three times, the third time matching an empty string.  That third match is not optional, so it is recorded as the first subpattern despite being empty.
    /(a?){2,3}(b)/.exec("aab")
This should produce the result aab,a,b since the first capture only successfully matches twice (a third match is optional, and since it would match the empty string is considered a failed match, and isn't recorded).

Firefox gets these examples right; PCRE gets the latter wrong.  I haven't had a chance to build the Yarr interpreter, but hopefully it should get this right too!
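
A minimal sketch for checking the two quantifier examples above in a JavaScript engine.  The variable names are illustrative only, and the expected strings in the comments assume an engine that follows the rule described here (for example a current Node.js), not PCRE-era JavaScriptCore.

```javascript
// Fixed count {3}: all three iterations are mandatory, so the empty
// third match is still recorded as capture 1.
const fixed = /(a?){3}(b)/.exec("aab");

// Ranged count {2,3}: the third iteration is optional, so its empty
// match counts as a failure and capture 1 keeps the second "a".
const ranged = /(a?){2,3}(b)/.exec("aab");

console.log(String(fixed));   // aab,,b
console.log(String(ranged));  // aab,a,b
```

The only difference between the two patterns is whether the third iteration falls inside the minimum extent of the quantifier, which is what decides whether the empty match is kept.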

(And I think the relevant quote from the spec, if you want to try to find this, is "Step 1 of the RepeatMatcher's closure d states that, once the minimum number of repetitions has been satisfied, any more expansions of Atom that match the empty string are not considered for further repetitions. This prevents the regular expression engine from falling into an infinite loop on patterns such as:")


The other rule is that the results of any captures nested in an outer set of parens should reflect the value from the last successful match of the outer parens.  e.g.:
    /((a)|(b))*/.exec("ab")
Should result in ab,b,,b since in the last iteration of the outer parentheses the first nested capture (a) does not match but the second nested capture (b) does.  PCRE does not reset its nested captures between iterations, so again it gets this wrong.  Firefox gets this right, and again I've not had time to test the Yarr interpreter, but I believe it should get this right!
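
A small sketch of the nested-capture rule, again with illustrative variable names and assuming an engine that resets nested captures on each iteration.  The four-element result ab,b,,b corresponds to a capturing outer group; with a non-capturing outer group the nested values are the same but the array is one element shorter.

```javascript
// Capturing outer group: index 1 is the outer group, 2 is (a), 3 is (b).
const capturing = /((a)|(b))*/.exec("ab");

// Non-capturing outer group: index 1 is (a), 2 is (b).
const nonCapturing = /(?:(a)|(b))*/.exec("ab");

// In both cases (a) took part only in the first iteration, so it is
// reset to undefined when the second iteration matches (b) instead.
console.log(String(capturing));     // ab,b,,b
console.log(String(nonCapturing));  // ab,,b
```

An engine that does not reset nested captures between iterations would leave "a" in the (a) slot instead of undefined.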


Our general approach towards fixing PCRE right now is that we'd like to just remove it altogether!  We'd rather not have multiple RE engines in the codebase, so our plan is, when we have time, to just work on optimizing the YARR interpreter, with a view that once it is fast enough we'll be able to switch all builds to use it instead.  That said, patches to fix PCRE are always welcome too!

cheers,
G.
