[webkit-dev] regular expression notes (mostly about performance optimization)

Sat Jan 5 18:27:12 PST 2008

I've been spending some time recently on our JavaScript regular  
expression code.

For background, our regular expression code is a much-modified fork of  
PCRE 6.5. When I talk about "latest" PCRE here, I'm talking about PCRE  
version 7.4.

Here are a few notes I thought others might find interesting:

SunSpider's regular expression test coverage is limited.
- The regexp-dna test covers only a small part of the regular  
expression engine.

PCRE has first character bitmap optimization that is only done as part  
of "study" in the real PCRE.
- But it's part of study, not compile -- maybe because it's too costly  
for many expressions?
- Doing this could help eliminate annoyingly complex code to handle  
cased vs. caseless firstByte and reqByte values.
- I'd like to do it for reqByte too, not just firstByte, but there may  
be a reason that's not a good idea.
- This would be a huge win for the regexp-dna test.

Latest PCRE has an optimized form of bracket for cases where the  
compiler determines it can never be empty.
- For this form the execution engine can use tail recursion, which is  
faster.
- For this form the execution engine can skip the "bracket chain"  
work, also faster.
- Brackets that are known to never be empty don't have to increment  
the matchCount and check against matchLimit.
- Will be a win for the regexp-dna test.

Latest PCRE handles the opcodes for capturing brackets in a cleaner  
way, eliminates including the number in the opcode.
- Not necessarily faster, but a precondition for the optimized form of  
bracket.
- If we do this we can remove the code to do opcode jump table  
initialization in the match function.
- That might make things a bit faster.

Latest PCRE has a function that counts capturing subpatterns.
- We need to know this to make the JavaScript semantic for references  
vs. octal values work.
- Using a function liek the PCRE one would no doubt be more efficient  
than calling the calculate length function twice as we currently do.

Latest PCRE uses the real compilation code to compute length rather  
than having a separate function.
- According to Philip Hazel this is slower than the separate length  
computing function was.
- But it's easier to maintain; I think the buffer overrun bugs PCRE  
had may have motivated this change.

Latest PCRE converts a*?b into a*b for speed.
- Maybe we want this optmimization.
- The function to do it is called check_auto_possessive.
- Will not affect the regexp-dna test.

The regexp-dna expressions all could be translated from brackets into  
character classes.
- They have the form: /a|b/ where "a" and "b" are expressions solely  
containing letters and character classes of the same length.
- They could be converted into a sequence of character classes without  
a bracket.
- Would this be faster?

     -- Darin

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.webkit.org/pipermail/webkit-dev/attachments/20080105/ac5b5b80/attachment-0001.html