[webkit-changes] [WebKit/WebKit] 12c34e: [Yarr] Improve processing of an alternation of str...

Michael Saboff noreply at github.com
Fri Feb 21 07:31:17 PST 2025


  Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: 12c34ef5e3058990f7e9997bc07ea1883c2a9eab
      https://github.com/WebKit/WebKit/commit/12c34ef5e3058990f7e9997bc07ea1883c2a9eab
  Author: Michael Saboff <msaboff at apple.com>
  Date:   2025-02-21 (Fri, 21 Feb 2025)

  Changed paths:
    A JSTests/microbenchmarks/regexp-keyword-parsing.js
    A JSTests/stress/regexp-parsing-tokens.js
    M Source/JavaScriptCore/yarr/YarrJIT.cpp
    M Source/JavaScriptCore/yarr/YarrPattern.cpp
    M Source/JavaScriptCore/yarr/YarrPattern.h

  Log Message:
  -----------
  [Yarr] Improve processing of an alternation of strings
https://bugs.webkit.org/show_bug.cgi?id=288102
rdar://145222010

Reviewed by Yusuke Suzuki.

Added the notion of a string list to a parsed RegExp that is in the form of
  /^(?:break|case|which|do|for)/ with an optional trailing $.
Such a RegExp will not backtrack and therefore we can streamline the code we emit for such a pattern.

This change involves recognizing beginning of string anchored alternations of strings while parsing and
then treating the generation of JIT code differently for these patterns.  This includes changing how
conditional branching works, specifically that instead of the "fall through on match" for each term,
to a "jump on match" for the whole alternation.

The current code generated for the "case" elternative is:
   8:Term PatternCharacter checked-offset:(3) 'c'
               <156> 0x11381430c:    add      w1, w1, #2
               <160> 0x113814310:    cmp      w1, w2
               <164> 0x113814314:    b.hi     0x113814444 -> <468>
  10:Term PatternCharacter checked-offset:(4) 'c'
               <168> 0x113814318:    sub      x17, x0, #4
               <172> 0x11381431c:    ldr      w17, [x17, x1]
               <176> 0x113814320:    movz     w16, #0x6163
               <180> 0x113814324:    movk     w16, #0x6573, lsl #16 -> 0x65736163
               <184> 0x113814328:    cmp      w17, w16
               <188> 0x11381432c:    b.ne     0x113814444 -> <468>
  11:Term PatternCharacter checked-offset:(4) 'a' already handled
  12:Term PatternCharacter checked-offset:(4) 's' already handled
  13:Term PatternCharacter checked-offset:(4) 'e' already handled
  14:NestedAlternativeNext minimum-size:(5),checked-offset:(5)
               <192> 0x113814330:    movz     x16, #0x4444
               <196> 0x113814334:    movk     x16, #0x1381, lsl #16
               <200> 0x113814338:    movk     x16, #0x8001, lsl #32
               <204> 0x11381433c:    movk     x16, #0xc973, lsl #48 -> 0x113814444 JIT PC
               <208> 0x113814340:    stur     x16, [sp, #8]
               <212> 0x113814344:    b        0x113814404 -> <404>
With some additional backtracking code:
   9:NestedAlternativeNext minimum-size:(4),checked-offset:(4)
               <468> 0x113814444:    sub      w1, w1, #2
               <472> 0x113814448:    b        0x113814348 -> <216>

With this change, the processing of "case" becomes:
   9:StringListAlternativeNext minimum-size:(4),checked-offset:(4)
               <132> 0x12a8285c4:    sub      w1, w1, #1
               <136> 0x12a8285c8:    cmp      w1, w2
               <140> 0x12a8285cc:    b.hi     0x12a8285e8 -> <168>
  10:Term PatternCharacter checked-offset:(4) 'c'
               <144> 0x12a8285d0:    sub      x17, x0, #4
               <148> 0x12a8285d4:    ldr      w17, [x17, x1]
               <152> 0x12a8285d8:    movz     w16, #0x6163
               <156> 0x12a8285dc:    movk     w16, #0x6573, lsl #16 -> 0x65736163
               <160> 0x12a8285e0:    cmp      w17, w16
               <164> 0x12a8285e4:    b.eq     0x12a82866c -> <300>
  11:Term PatternCharacter checked-offset:(4) 'a' already handled
  12:Term PatternCharacter checked-offset:(4) 's' already handled
  13:Term PatternCharacter checked-offset:(4) 'e' already handled
  14:StringListAlternativeNext minimum-size:(5),checked-offset:(5)
With no backtracking code.

We are able to eliminate one branch and the saving of the continuation PC for backtracking.
The code size to process these string list RegExp is reduces.  For the example RegExp above,
the prior version created 1940 bytes (485 instructions) of code while the code created with this
1392 bytes (345 instructions) of code, a nearly 30% reduction in code.

This change is a ~18% progression on the new regexp-keyword-parsing microbenchmark:

                                 Baseline               YarrStringList

regexp-keyword-parsing      136.7065+-0.9807     ^    116.0161+-1.1791        ^ definitely 1.1783x faster

<geometric>                 136.7065+-0.9807     ^    116.0161+-1.1791        ^ definitely 1.1783x faster

* JSTests/microbenchmarks/regexp-keyword-parsing.js: Added.
(arrayToString):
(objectToString):
(dumpValue):
(compareArray):
(compareGroups):
(testRegExp):
(testRegExpSyntaxError):
(let.re.break.case.catch.continue.debugger.default.else.finally.if):
(let.re1.break.case.catch.continue.debugger.default.else.finally.if):
* JSTests/stress/regexp-parsing-tokens.js: Added.
(arrayToString):
(objectToString):
(dumpValue):
(compareArray):
(compareGroups):
(testRegExp):
(testRegExpSyntaxError):
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
* Source/JavaScriptCore/yarr/YarrPattern.cpp:
(JSC::Yarr::YarrPatternConstructor::atomParenthesesEnd):
(JSC::Yarr::YarrPatternConstructor::checkForTerminalParentheses):
(JSC::Yarr::PatternAlternative::dump):
(JSC::Yarr::PatternTerm::dump):
* Source/JavaScriptCore/yarr/YarrPattern.h:
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::PatternAlternative::PatternAlternative):

Canonical link: https://commits.webkit.org/290791@main



To unsubscribe from these emails, change your notification settings at https://github.com/WebKit/WebKit/settings/notifications


More information about the webkit-changes mailing list