[Webkit-unassigned] [Bug 255467] isASCIISpace vs isHTMLSpace

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Fri Apr 14 18:10:06 PDT 2023


https://bugs.webkit.org/show_bug.cgi?id=255467

--- Comment #1 from Darin Adler <darin at apple.com> ---
As documented in the header (there's a comment right at the top explaining this), functions in ASCIICType.h started as non-locale-sensitive versions of the functions in <ctype.h>, safe to use even if setLocale("UTF-8") has been called. So isASCIISpace does exactly what the C library isspace function does in the default C locale, and does not innovate or get adapted to the needs of the web platform.
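
For readers who want the character set spelled out, here is a minimal sketch of what "the C library isspace function in the default C locale" means. The names isCLocaleSpaceSketch and agreesWithCLibraryInDefaultLocale are made up for illustration; this is not a copy of the code in ASCIICType.h.

```cpp
#include <cctype>

// Hypothetical name: the six characters that isspace accepts in the "C" locale.
// U+0009 TAB, U+000A LF, U+000B VT, U+000C FF, U+000D CR, U+0020 SPACE.
constexpr bool isCLocaleSpaceSketch(char32_t character)
{
    return character == 0x20 || (character >= 0x09 && character <= 0x0D);
}

// Hypothetical helper: checks the sketch against std::isspace for all ASCII values.
// The comparison is only meaningful while the "C" locale is in effect, which is
// the state at program startup; the sketch itself never consults the locale.
inline bool agreesWithCLibraryInDefaultLocale()
{
    for (int character = 0; character < 128; ++character) {
        bool fromLibrary = std::isspace(character) != 0;
        if (isCLocaleSpaceSketch(static_cast<char32_t>(character)) != fromLibrary)
            return false;
    }
    return true;
}
```

The point of the locale-independent form is that its answer cannot change after a locale call elsewhere in the process.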

Later when we found that HTTP and HTML used different space definitions (each different from the other and both *subsets* of C's isspace) we added new functions rather than changing the behavior of the underlying one. It’s undeniable that it’s tempting to call isASCIISpace when the HTML documentation says "ASCII whitespace" since sadly HTML uses that name for its own ASCII whitespace subset!

One property of isASCIISpace is that it has the same definition as Unicode whitespace, and returns the same thing as ICU’s u_isspace would when the argument is an ASCII character. And it returns true for all ASCII characters that have the Unicode White_Space property.

It would be OK to remove the isASCIISpace function if nothing in WebKit needs it. And leave a comment behind explaining why it does not exist. But first, we must change all the callers to not use it! That might take a while.

Another possibility would be to rename isASCIISpace in a way that tries to make it clear it should rarely be used. There is precedent for this in the project. I don’t immediately have an idea for the name. Maybe isLegacyASCIISpace? Maybe isASCIIUnicodeSpace? Maybe isASCIISpaceButCallIsHTMLSpaceInstead.

There are quite a few other functions that call isASCIISpace and are almost never correct for the web platform because they treat U+000B as whitespace. This is a complicated subject because we have at least 5 interesting definitions of whitespace:

- Unicode whitespace (u_isspace)
- ASCII whitespace (isASCIISpace), same as Unicode whitespace but only for ASCII characters
- HTML’s "ASCII whitespace" <https://infra.spec.whatwg.org/#ascii-whitespace> (isHTMLSpace)
- HTTP’s whitespace <https://www.rfc-editor.org/rfc/rfc9110.html#name-whitespace> (isHTTPSpace)
- CSS’s document white space characters <https://www.w3.org/TR/css-text-3/#white-space>
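
To make the U+000B difference concrete, here is a rough sketch of three of the ASCII-range definitions above as predicates. All three names are hypothetical. The HTML set follows the Infra "ASCII whitespace" definition and the HTTP set follows the SP / HTAB whitespace of RFC 9110; WebKit's own isHTMLSpace and isHTTPSpace are not reproduced here and may differ in detail.

```cpp
// Hypothetical name: C-locale isspace set (TAB, LF, VT, FF, CR, SPACE).
constexpr bool isCSpaceSketch(char32_t character)
{
    return character == 0x20 || (character >= 0x09 && character <= 0x0D);
}

// Hypothetical name: Infra "ASCII whitespace" (TAB, LF, FF, CR, SPACE).
// Note that U+000B VT is deliberately absent.
constexpr bool isInfraASCIIWhitespaceSketch(char32_t character)
{
    return character == 0x20 || character == 0x09 || character == 0x0A
        || character == 0x0C || character == 0x0D;
}

// Hypothetical name: RFC 9110 whitespace (SPACE and TAB only).
constexpr bool isRFC9110WhitespaceSketch(char32_t character)
{
    return character == 0x20 || character == 0x09;
}

// U+000B is the one character accepted by the C set but not by the Infra set.
static_assert(isCSpaceSketch(0x0B) && !isInfraASCIIWhitespaceSketch(0x0B),
    "VT separates the C definition from the Infra definition");
```

Under these assumptions each narrower set is contained in the wider one, so a caller that picks the C definition by accident only misbehaves on the few characters the narrower definition excludes.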

Given this, it’s hard to write functions that handle spaces correctly without parameterizing them on which definition of whitespace to use. The following all currently use isASCIISpace and so at least some callers probably get incorrect behavior.

String/StringImpl::stripWhiteSpace
String/StringImpl::simplifyWhiteSpace
StringView::stripWhiteSpace
isSpaceOrNewline in StringImpl.h
isNotSpaceOrNewline in StringImpl.h
isNotASCIISpace in ParsingUtilities.h
parseInteger
parseIntegerAllowingTrailingJunk
charactersToDouble
charactersToFloat
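
As a rough illustration of what parameterizing on the whitespace definition could look like, here is a sketch of a strip function that takes the predicate as an argument. It uses std::string_view rather than WTF types, and stripMatching and both predicates are hypothetical names, not existing WebKit API.

```cpp
#include <string_view>

// Hypothetical predicate: C-locale isspace set, which includes U+000B VT.
constexpr bool isCSpaceExample(char character)
{
    return character == ' ' || (character >= '\t' && character <= '\r');
}

// Hypothetical predicate: Infra "ASCII whitespace", which excludes U+000B VT.
constexpr bool isHTMLSpaceExample(char character)
{
    return character == ' ' || character == '\t' || character == '\n'
        || character == '\f' || character == '\r';
}

// Removes leading and trailing characters for which isSpace returns true.
// The caller chooses the whitespace definition explicitly.
template<typename Predicate>
constexpr std::string_view stripMatching(std::string_view text, Predicate isSpace)
{
    while (!text.empty() && isSpace(text.front()))
        text.remove_prefix(1);
    while (!text.empty() && isSpace(text.back()))
        text.remove_suffix(1);
    return text;
}
```

With this shape, stripMatching("\v abc \v", isCSpaceExample) yields "abc", while the same call with isHTMLSpaceExample leaves the string unchanged because the outermost characters are U+000B. Making the choice visible at the call site is the design point; the sketch says nothing about how the real functions should be changed.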
