What regular expressions do we spend time evaluating?
Since SunSpider time now is about 10% inside the regular expression matcher, it's time to reconsider regular expression optimizations. In particular, we should see if we can find a subset of regular expressions where we can implement a more efficient matching algorithm.

I wrote code to dump a histogram of time spent matching regular expressions: <https://bugs.webkit.org/show_bug.cgi?id=19801>.

Here's the result from running SunSpider on my computer. The column on the left is the number of seconds spent matching an expression, and the column on the right is the expression:

0.080214 - \b\w+\b
0.079685 - agggtaa[cgt]|[acg]ttaccct (ignore case)
0.077910 - [cgt]gggtaaa|tttaccc[acg] (ignore case)
0.073445 - agggta[cgt]a|t[acg]taccct (ignore case)
0.073439 - a[act]ggtaaa|tttacc[agt]t (ignore case)
0.073387 - agggt[cgt]aa|tt[acg]accct (ignore case)
0.072898 - ag[act]gtaaa|tttac[agt]ct (ignore case)
0.072552 - aggg[acg]aaa|ttt[cgt]ccct (ignore case)
0.072440 - agg[act]taaa|ttta[agt]cct (ignore case)
0.072167 - agggtaaa|tttaccct (ignore case)
0.057528 - >.*\n|\n
0.045544 - "[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?
0.024901 - (?:^|:|,)(?:\s*\[)+
0.008831 - ^[a-zA-Z0-9\-\._]+@[a-zA-Z0-9\-_]+(\.?[a-zA-Z0-9\-_]*)\.[a-zA-Z]{2,3}$
0.001060 - ^[\],:{}\s]*$
0.000517 - \\.
0.000264 - [\0\t\n\v\f\r\xa0'"!-]
0.000158 - !\d\d?\d?!
0.000062 - -.*
0.000026 - ^
0.000024 - ('|\\)
0.000010 - ^[^-]*-

I wish we had a test with some better regular expression coverage!

Here's one other case with some significant regexp time in it, I think perhaps from JSON validation? Loading 280slides.com and looking at the test presentation:

0.028335 - \/\/.*(\r|\n)?|\/\*(?:.|\n|\r)*?\*\/|\w+\b|[+-]?\d+(([.]\d+)*([eE][+-]?\d+))?|"[^"\\]*(\\.[^"\\]*)*"|'[^'\\]*(\\.[^'\\]*)*'|\s+|.
0.006968 - \S
0.001098 - "[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?
0.000422 - \\.
0.000263 - (?:^|:|,)(?:\s*\[)+
0.000259 - [\:\+\-\*\/\=\<\>\&\|\!\.\%]
0.000245 - [^\s]
0.000193 - [\+\-\*\/\=\<\>\&\|\!\.\[\^\(]
0.000119 - \s
0.000109 - ^[0-9a-zA-Z_\-+~!$\^{}|.%'`#&*]+/[0-9a-zA-Z_\-+~!$\^{}|.%'`#&*]+\+xml$ (ignore case)
0.000042 - ^\w
0.000023 - ^[\],:{}\s]*$

I had a hard time finding pages and tests where regular expressions took enough time to be worth looking at.

-- Darin
On Jun 28, 2008, at 11:39 AM, Darin Adler wrote:
> Since SunSpider time now is about 10% inside the regular expression matcher, it's time to reconsider regular expression optimizations. In particular, we should see if we can find a subset of regular expressions where we can implement a more efficient matching algorithm.

> I wrote code to dump a histogram of time spent matching regular expressions: <https://bugs.webkit.org/show_bug.cgi?id=19801>.

> Here's the result from running SunSpider on my computer. The column on the left is the number of seconds spent matching an expression, and the column on the right is the expression:

> I wish we had a test with some better regular expression coverage!
Besides the big all-regexp test, there's some other regexp usage on SunSpider, including JSON validation.
> Here's one other case with some significant regexp time in it, I think perhaps from JSON validation? Loading 280slides.com and looking at the test presentation:
Some of these look like the usual JSON suspects.
> I had a hard time finding pages and tests where regular expressions took enough time to be worth looking at.
We could use the stress tester to find pages that use regexps at load time.

- Maciej
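The "more efficient algorithm for a subset of expressions" idea from the start of the thread can be illustrated with a toy fast path. Several of the hottest SunSpider patterns (the DNA ones like `agggtaaa|tttaccct`) are just alternations of plain literals, which plain substring search can answer without a general backtracking matcher. The classifier and counter below are a deliberately naive sketch of what such a fast path might check, not WebKit's design, and the counter assumes the branches never overlap in the subject text.

```python
import re

# A pattern qualifies for the fast path only if it is nothing but
# alphanumeric literal branches joined by '|': no classes, anchors,
# escapes, or quantifiers.
_LITERAL_ALTERNATION = re.compile(r"^[A-Za-z0-9]+(\|[A-Za-z0-9]+)*$")

def is_literal_alternation(pattern):
    """True if the pattern is only literal branches joined by '|'."""
    return _LITERAL_ALTERNATION.match(pattern) is not None

def count_matches_fast(pattern, text):
    """Count matches with plain substring search instead of a regex
    engine; valid only for non-overlapping literal alternations."""
    assert is_literal_alternation(pattern)
    return sum(text.count(branch) for branch in pattern.split("|"))
```

For example, `is_literal_alternation("agggtaaa|tttaccct")` holds while `is_literal_alternation(r"\b\w+\b")` does not, so a dispatcher could route the former to `count_matches_fast` and fall back to the full matcher for everything else. A case-insensitive variant would lowercase both pattern and text first.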
participants (2)
- Darin Adler
- Maciej Stachowiak