What regular expressions do we spend time evaluating?
Since SunSpider time now is about 10% inside the regular expression matcher, it's time to reconsider regular expression optimizations. In particular, we should see if we can find a subset of regular expressions where we can implement a more efficient matching algorithm.

I wrote code to dump a histogram of time spent matching regular expressions: <https://bugs.webkit.org/show_bug.cgi?id=19801>.

Here's the result from running SunSpider on my computer. The column on the left is the number of seconds spent matching an expression, and the column on the right is the expression:

0.080214 - \b\w+\b
0.079685 - agggtaa[cgt]|[acg]ttaccct (ignore case)
0.077910 - [cgt]gggtaaa|tttaccc[acg] (ignore case)
0.073445 - agggta[cgt]a|t[acg]taccct (ignore case)
0.073439 - a[act]ggtaaa|tttacc[agt]t (ignore case)
0.073387 - agggt[cgt]aa|tt[acg]accct (ignore case)
0.072898 - ag[act]gtaaa|tttac[agt]ct (ignore case)
0.072552 - aggg[acg]aaa|ttt[cgt]ccct (ignore case)
0.072440 - agg[act]taaa|ttta[agt]cct (ignore case)
0.072167 - agggtaaa|tttaccct (ignore case)
0.057528 - >.*\n|\n
0.045544 - "[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?
0.024901 - (?:^|:|,)(?:\s*\[)+
0.008831 - ^[a-zA-Z0-9\-\._]+@[a-zA-Z0-9\-_]+(\.?[a-zA-Z0-9\-_]*)\.[a-zA-Z]{2,3}$
0.001060 - ^[\],:{}\s]*$
0.000517 - \\.
0.000264 - [\0\t\n\v\f\r\xa0'"!-]
0.000158 - !\d\d?\d?!
0.000062 - -.*
0.000026 - ^
0.000024 - ('|\\)
0.000010 - ^[^-]*-

I wish we had a test with some better regular expression coverage!

Here's one other case with some significant regexp time in it, I think perhaps from JSON validation? Loading 280slides.com and looking at the test presentation:

0.028335 - \/\/.*(\r|\n)?|\/\*(?:.|\n|\r)*?\*\/|\w+\b|[+-]?\d+(([.]\d+)*([eE][+-]?\d+))?|"[^"\\]*(\\.[^"\\]*)*"|'[^'\\]*(\\.[^'\\]*)*'|\s+|.
0.006968 - \S
0.001098 - "[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?
0.000422 - \\.
0.000263 - (?:^|:|,)(?:\s*\[)+
0.000259 - [\:\+\-\*\/\=\<\>\&\|\!\.\%]
0.000245 - [^\s]
0.000193 - [\+\-\*\/\=\<\>\&\|\!\.\[\^\(]
0.000119 - \s
0.000109 - ^[0-9a-zA-Z_\-+~!$\^{}|.%'`#&*]+/[0-9a-zA-Z_\-+~!$\^{}|.%'`#&*]+\+xml$ (ignore case)
0.000042 - ^\w
0.000023 - ^[\],:{}\s]*$

I had a hard time finding pages and tests where regular expressions took enough time to be worth looking at.

-- Darin
On Jun 28, 2008, at 11:39 AM, Darin Adler wrote:
> Since SunSpider time now is about 10% inside the regular expression matcher, it's time to reconsider regular expression optimizations. In particular, we should see if we can find a subset of regular expressions where we can implement a more efficient matching algorithm.

> I wrote code to dump a histogram of time spent matching regular expressions: <https://bugs.webkit.org/show_bug.cgi?id=19801>.

> Here's the result from running SunSpider on my computer. The column on the left is the number of seconds spent matching an expression, and the column on the right is the expression:

> I wish we had a test with some better regular expression coverage!
Besides the big all-regexp test, there's some other regexp usage on SunSpider, including JSON validation.
> Here's one other case with some significant regexp time in it, I think perhaps from JSON validation? Loading 280slides.com and looking at the test presentation:
Some of these look like the usual JSON suspects.
> I had a hard time finding pages and tests where regular expressions took enough time to be worth looking at.
We could use the stress tester to find pages that use regexps at load time.

- Maciej
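The "more efficient algorithm for a subset of expressions" idea from the start of the thread can be illustrated with a toy fast path. Several of the hottest SunSpider patterns (the DNA ones like `agggtaaa|tttaccct`) are just alternations of plain literals, which plain substring search can answer without a general backtracking matcher. The classifier and counter below are a deliberately naive sketch of what such a fast path might check, not WebKit's design, and the counter assumes the branches never overlap in the subject text.

```python
import re

# A pattern qualifies for the fast path only if it is nothing but
# alphanumeric literal branches joined by '|': no classes, anchors,
# escapes, or quantifiers.
_LITERAL_ALTERNATION = re.compile(r"^[A-Za-z0-9]+(\|[A-Za-z0-9]+)*$")

def is_literal_alternation(pattern):
    """True if the pattern is only literal branches joined by '|'."""
    return _LITERAL_ALTERNATION.match(pattern) is not None

def count_matches_fast(pattern, text):
    """Count matches with plain substring search instead of a regex
    engine; valid only for non-overlapping literal alternations."""
    assert is_literal_alternation(pattern)
    return sum(text.count(branch) for branch in pattern.split("|"))
```

For example, `is_literal_alternation("agggtaaa|tttaccct")` holds while `is_literal_alternation(r"\b\w+\b")` does not, so a dispatcher could route the former to `count_matches_fast` and fall back to the full matcher for everything else. A case-insensitive variant would lowercase both pattern and text first.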
participants (2)
- Darin Adler
- Maciej Stachowiak