[webkit-dev] regular expression notes (mostly about performance
optimization)
Darin Adler
darin at apple.com
Sat Jan 5 18:27:12 PST 2008
I've been spending some time recently on our JavaScript regular
expression code.
For background, our regular expression code is a much-modified fork of
PCRE 6.5. When I talk about "latest" PCRE here, I'm talking about PCRE
version 7.4.
Here are a few notes I thought others might find interesting:
SunSpider's regular expression test coverage is limited.
- The regexp-dna test covers only a small part of the regular
expression engine.
PCRE has first character bitmap optimization that is only done as part
of "study" in the real PCRE.
- But it's part of study, not compile -- maybe because it's too costly
for many expressions?
- Doing this could help eliminate annoyingly complex code to handle
cased vs. caseless firstByte and reqByte values.
- I'd like to do it for reqByte too, not just firstByte, but there may
be a reason that's not a good idea.
- This would be a huge win for the regexp-dna test.
Latest PCRE has an optimized form of bracket for cases where the
compiler determines it can never be empty.
- For this form the execution engine can use tail recursion, which is
faster.
- For this form the execution engine can skip the "bracket chain"
work, also faster.
- Brackets that are known to never be empty don't have to increment
the matchCount and check against matchLimit.
- Will be a win for the regexp-dna test.
Latest PCRE handles the opcodes for capturing brackets in a cleaner
way, eliminates including the number in the opcode.
- Not necessarily faster, but a precondition for the optimized form of
bracket.
- If we do this we can remove the code to do opcode jump table
initialization in the match function.
- That might make things a bit faster.
Latest PCRE has a function that counts capturing subpatterns.
- We need to know this to make the JavaScript semantic for references
vs. octal values work.
- Using a function liek the PCRE one would no doubt be more efficient
than calling the calculate length function twice as we currently do.
Latest PCRE uses the real compilation code to compute length rather
than having a separate function.
- According to Philip Hazel this is slower than the separate length
computing function was.
- But it's easier to maintain; I think the buffer overrun bugs PCRE
had may have motivated this change.
Latest PCRE converts a*?b into a*b for speed.
- Maybe we want this optmimization.
- The function to do it is called check_auto_possessive.
- Will not affect the regexp-dna test.
The regexp-dna expressions all could be translated from brackets into
character classes.
- They have the form: /a|b/ where "a" and "b" are expressions solely
containing letters and character classes of the same length.
- They could be converted into a sequence of character classes without
a bracket.
- Would this be faster?
-- Darin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.webkit.org/pipermail/webkit-dev/attachments/20080105/ac5b5b80/attachment-0001.html
More information about the webkit-dev
mailing list