[webkit-changes] [WebKit/WebKit] 46e6b3: Add support for RegExp lookbehind assertions

Michael Saboff noreply at github.com
Tue Dec 13 19:06:47 PST 2022


  Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: 46e6b3f97425a4a7bb16fc175288903a5f74d5f2
      https://github.com/WebKit/WebKit/commit/46e6b3f97425a4a7bb16fc175288903a5f74d5f2
  Author: Michael Saboff <msaboff at apple.com>
  Date:   2022-12-13 (Tue, 13 Dec 2022)

  Changed paths:
    A JSTests/stress/regexp-lookbehind.js
    M JSTests/test262/config.yaml
    M Source/JavaScriptCore/runtime/RegExp.cpp
    M Source/JavaScriptCore/yarr/YarrInterpreter.cpp
    M Source/JavaScriptCore/yarr/YarrInterpreter.h
    M Source/JavaScriptCore/yarr/YarrJIT.cpp
    M Source/JavaScriptCore/yarr/YarrJIT.h
    M Source/JavaScriptCore/yarr/YarrParser.h
    M Source/JavaScriptCore/yarr/YarrPattern.cpp
    M Source/JavaScriptCore/yarr/YarrPattern.h
    M Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp
    M Source/WTF/wtf/PrintStream.cpp
    M Source/WTF/wtf/PrintStream.h
    M Source/WebCore/contentextensions/URLFilterParser.cpp

  Log Message:
  -----------
  Add support for RegExp lookbehind assertions
https://bugs.webkit.org/show_bug.cgi?id=174931
rdar://33183185

This change implements RegExp lookbehind in the Yarr interpreter.

This change introduces the notion of match direction, either forward or backward.
The forward match direction is the way the current code works, matching disjunction terms and the subject
string in a left to right manner.  Lookbehind assertions, as defined in the ECMAScript spec, process disjunction
terms right to left, matching the corresponding subject string right to left as well.
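
The effect of match direction is visible from JavaScript through greedy captures.  The snippet below is only an
illustration written for this message, not part of the new test file; the function names are made up for the example.

```javascript
// Illustration of match direction using plain JavaScript regular expressions.
// Requires an engine with lookbehind support.

// Forward matching: terms run left to right, so the first greedy group
// takes as many digits as it can and leaves one for the second group.
function forwardCaptures(subject) {
    const m = /^(\d+)(\d+)$/.exec(subject);
    return m && [m[1], m[2]];
}

// Backward matching: inside the lookbehind, terms run right to left, so the
// second group is matched first and is the greedy one.
function backwardCaptures(subject) {
    const m = /(?<=(\d+)(\d+))$/.exec(subject);
    return m && [m[1], m[2]];
}

console.log(forwardCaptures("1053"));  // ["105", "3"]
console.log(backwardCaptures("1053")); // ["1", "053"]
```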

Except for the Yarr JIT, almost all of the Yarr code has been touched to account for this backward matching.
An additional ByteTerm has been added, HaveCheckedInput, which checks that at least the required number of
characters are available in the input stream.  It is basically a CheckInput that doesn't move the input
position.  For variable counted terms, we still need to check that we won't try to access
characters before the first character of the subject string.  For functions like readSurrogatePairChecked(),
we check for input before calling the function.  For new input functions with a try prefix, like tryReadBackward,
the function itself checks for available input.  After these checks prove that it is safe to access an offset
to the left of the current input position, the actual matching can be performed.
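
A rough way to picture the difference between the two kinds of check is the toy model below.  It is a sketch written
for this message, not the Yarr code: the class ToyInputStream and the helper matchLiteralBackward are hypothetical,
and the method signatures only borrow their names from the terms described above.

```javascript
// Toy model only, not the Yarr implementation.  Method names echo the commit
// message; their signatures and exact behavior here are assumptions.
class ToyInputStream {
    constructor(subject) {
        this.subject = subject;
        this.pos = 0;
    }

    // CheckInput-style: require `count` more characters, then advance past them.
    checkInput(count) {
        if (this.pos + count > this.subject.length)
            return false;
        this.pos += count;
        return true;
    }

    // HaveCheckedInput-style: the same availability test, position untouched.
    haveCheckedInput(count) {
        return this.pos + count <= this.subject.length;
    }

    // try-prefixed read: the bounds check lives inside the function.  Returns
    // the character `offset` places left of the position, or null when that
    // would fall before the first character of the subject.
    tryReadBackward(offset) {
        const index = this.pos - offset;
        if (offset < 1 || index < 0)
            return null;
        return this.subject[index];
    }
}

// Match `literal` right to left, ending at the current position.
function matchLiteralBackward(stream, literal) {
    for (let i = 0; i < literal.length; ++i) {
        const expected = literal[literal.length - 1 - i];
        if (stream.tryReadBackward(i + 1) !== expected)
            return false;
    }
    return true;
}

const stream = new ToyInputStream("abc");
stream.checkInput(3);                              // position is now 3
console.log(matchLiteralBackward(stream, "bc"));   // true
console.log(matchLiteralBackward(stream, "xabc")); // false: runs off the left edge
```

Because the bounds test sits inside tryReadBackward in this sketch, the caller never indexes before the start of
the subject, which is the property the try-prefixed functions are described as providing.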

The Yarr parser parses regular expressions in left to right order.  It also computes character offsets in forward
order.  During ByteTerm compilation, we process backward matching disjunctions right to left.  The parser also has
special handling of forward references within a backward matching parenthetical group.  All such forward references
are saved for that parenthetical group and are processed at the end of the group.  Each of these forward
references is checked to see if a capture to the right of the forward reference was found; if so, the forward
reference is converted to a back reference.
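
At the JavaScript level, the reason for this conversion can be seen with a reference that precedes its group in
source order.  The example is an illustration written for this message, not taken from the new test file, and the
function name is made up.

```javascript
// In source order, \1 precedes group 1, so a left to right parser first sees
// it as a forward reference.  Inside a lookbehind the terms are matched right
// to left, so (a) has already captured by the time \1 is matched.
function refBeforeGroup(subject) {
    return /(?<=\1(a))b/.exec(subject);
}

const hit = refBeforeGroup("aab");
console.log(hit[0], hit[1], hit.index);   // b a 2
console.log(refBeforeGroup("ab"));        // null: \1 needs a second "a" further left
```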

As part of this work, the ByteTerm dumping code was significantly updated to allow not only dumping of the
ByteCode after it has been generated, but also dumping of ByteCode while it is being interpreted.  This ByteTerm
dumping while interpreting is enabled with the Interpreter::verbose compile-time constant.

Reviewed by Yusuke Suzuki.

* JSTests/stress/regexp-lookbehind.js: New tests.
(arrayToString):
(dumpValue):
(compareArray):
(testRegExp):
* JSTests/test262/config.yaml:
* Source/JavaScriptCore/runtime/RegExp.cpp:
(JSC::RegExp::compile):
(JSC::RegExp::compileMatchOnly):
* Source/JavaScriptCore/yarr/YarrInterpreter.cpp:
(JSC::Yarr::ByteTermDumper::ByteTermDumper):
(JSC::Yarr::ByteTermDumper::unicode):
(JSC::Yarr::Interpreter::InputStream::readForCharacterDump):
(JSC::Yarr::Interpreter::InputStream::tryReadBackward):
(JSC::Yarr::Interpreter::InputStream::tryUncheckInput):
(JSC::Yarr::Interpreter::InputStream::isValidNegativeInputOffset):
(JSC::Yarr::Interpreter::InputStream::dump const):
(JSC::Yarr::Interpreter::checkCharacter):
(JSC::Yarr::Interpreter::checkSurrogatePair):
(JSC::Yarr::Interpreter::checkCasedCharacter):
(JSC::Yarr::Interpreter::checkCharacterClass):
(JSC::Yarr::Interpreter::checkCharacterClassDontAdvanceInputForNonBMP):
(JSC::Yarr::Interpreter::tryConsumeBackReference):
(JSC::Yarr::Interpreter::matchAssertionWordBoundary):
(JSC::Yarr::Interpreter::backtrackPatternCharacter):
(JSC::Yarr::Interpreter::backtrackPatternCasedCharacter):
(JSC::Yarr::Interpreter::matchCharacterClass):
(JSC::Yarr::Interpreter::backtrackCharacterClass):
(JSC::Yarr::Interpreter::matchBackReference):
(JSC::Yarr::Interpreter::backtrackBackReference):
(JSC::Yarr::Interpreter::recordParenthesesMatch):
(JSC::Yarr::Interpreter::matchParenthesesOnceBegin):
(JSC::Yarr::Interpreter::matchParenthesesOnceEnd):
(JSC::Yarr::Interpreter::backtrackParenthesesOnceEnd):
(JSC::Yarr::Interpreter::matchParentheticalAssertionBegin):
(JSC::Yarr::Interpreter::backtrackParentheticalAssertionBegin):
(JSC::Yarr::Interpreter::matchDisjunction):
(JSC::Yarr::ByteCompiler::compile):
(JSC::Yarr::ByteCompiler::haveCheckedInput):
(JSC::Yarr::ByteCompiler::assertionWordBoundary):
(JSC::Yarr::ByteCompiler::atomPatternCharacter):
(JSC::Yarr::ByteCompiler::atomCharacterClass):
(JSC::Yarr::ByteCompiler::atomBackReference):
(JSC::Yarr::ByteCompiler::atomParenthesesOnceBegin):
(JSC::Yarr::ByteCompiler::atomParenthesesTerminalBegin):
(JSC::Yarr::ByteCompiler::atomParenthesesSubpatternBegin):
(JSC::Yarr::ByteCompiler::atomParentheticalAssertionBegin):
(JSC::Yarr::ByteCompiler::atomParentheticalAssertionEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesSubpatternEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesOnceEnd):
(JSC::Yarr::ByteCompiler::atomParenthesesTerminalEnd):
(JSC::Yarr::ByteCompiler::emitDisjunction):
(JSC::Yarr::ByteCompiler::isSafeToRecurse):
(JSC::Yarr::ByteTermDumper::dumpTerm):
(JSC::Yarr::ByteTermDumper::dumpDisjunction):
(JSC::Yarr::Interpreter::InputStream::readPair): Deleted.
(JSC::Yarr::ByteCompiler::dumpDisjunction): Deleted.
* Source/JavaScriptCore/yarr/YarrInterpreter.h:
(JSC::Yarr::ByteTerm::ByteTerm):
(JSC::Yarr::ByteTerm::HaveCheckedInput):
(JSC::Yarr::ByteTerm::WordBoundary):
(JSC::Yarr::ByteTerm::BackReference):
(JSC::Yarr::ByteTerm::isCharacterType):
(JSC::Yarr::ByteTerm::isCasedCharacterType):
(JSC::Yarr::ByteTerm::isCharacterClass):
(JSC::Yarr::ByteTerm::matchDirection):
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
(JSC::Yarr::dumpCompileFailure):
* Source/JavaScriptCore/yarr/YarrJIT.h:
* Source/JavaScriptCore/yarr/YarrParser.h:
(JSC::Yarr::Parser::parseParenthesesBegin):
* Source/JavaScriptCore/yarr/YarrPattern.cpp:
(JSC::Yarr::YarrPatternConstructor::resetForReparsing):
(JSC::Yarr::YarrPatternConstructor::assertionBOL):
(JSC::Yarr::YarrPatternConstructor::atomPatternCharacter):
(JSC::Yarr::YarrPatternConstructor::atomBuiltInCharacterClass):
(JSC::Yarr::YarrPatternConstructor::atomParenthesesSubpatternBegin):
(JSC::Yarr::YarrPatternConstructor::atomParentheticalAssertionBegin):
(JSC::Yarr::YarrPatternConstructor::atomParenthesesEnd):
(JSC::Yarr::YarrPatternConstructor::atomBackReference):
(JSC::Yarr::YarrPatternConstructor::copyDisjunction):
(JSC::Yarr::YarrPatternConstructor::quantifyAtom):
(JSC::Yarr::YarrPatternConstructor::disjunction):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::SavedContext::SavedContext):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::SavedContext::restore):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::ParenthesisContext):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::push):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::pop):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::setInvert):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::invert const):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::setMatchDirection):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::matchDirection const):
(JSC::Yarr::YarrPatternConstructor::ParenthesisContext::reset):
(JSC::Yarr::YarrPatternConstructor::pushParenthesisContext):
(JSC::Yarr::YarrPatternConstructor::popParenthesisContext):
(JSC::Yarr::YarrPatternConstructor::setParenthesisInvert):
(JSC::Yarr::YarrPatternConstructor::parenthesisInvert const):
(JSC::Yarr::YarrPatternConstructor::setParenthesisMatchDirection):
(JSC::Yarr::YarrPatternConstructor::parenthesisMatchDirection const):
(JSC::Yarr::YarrPattern::YarrPattern):
(JSC::Yarr::dumpCharacterClass):
(JSC::Yarr::PatternTerm::dump):
* Source/JavaScriptCore/yarr/YarrPattern.h:
(JSC::Yarr::PatternTerm::PatternTerm):
(JSC::Yarr::PatternTerm::convertToBackreference):
(JSC::Yarr::PatternTerm::setMatchDirection):
(JSC::Yarr::PatternTerm::matchDirection const):
(JSC::Yarr::PatternAlternative::PatternAlternative):
(JSC::Yarr::PatternAlternative::matchDirection const):
(JSC::Yarr::PatternDisjunction::addNewAlternative):
(JSC::Yarr::YarrPattern::resetForReparsing):
* Source/JavaScriptCore/yarr/YarrSyntaxChecker.cpp:
(JSC::Yarr::SyntaxChecker::atomParentheticalAssertionBegin):
* Source/WTF/wtf/PrintStream.cpp:
(WTF::printInternal):
* Source/WTF/wtf/PrintStream.h:
* Source/WebCore/contentextensions/URLFilterParser.cpp:
(WebCore::ContentExtensions::PatternParser::atomParentheticalAssertionBegin):

Canonical link: https://commits.webkit.org/257823@main



