[Webkit-unassigned] [Bug 233921] New: TextDecoder delays the streaming output for invalid UTF-8 sequences

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Tue Dec 7 03:55:22 PST 2021


https://bugs.webkit.org/show_bug.cgi?id=233921

            Bug ID: 233921
           Summary: TextDecoder delays the streaming output for invalid
                    UTF-8 sequences
           Product: WebKit
           Version: WebKit Nightly Build
          Hardware: Unspecified
                OS: Unspecified
            Status: NEW
          Severity: Normal
          Priority: P2
         Component: New Bugs
          Assignee: webkit-unassigned at lists.webkit.org
          Reporter: andreu at andreubotella.com

WPT tests: https://wpt.fyi/results/encoding/textdecoder-eof.any.html?label=experimental&label=master&aligned (stream: true case), https://wpt.fyi/results/encoding/textdecoder-streaming.any.html?label=experimental&label=master&aligned, https://wpt.fyi/results/encoding/streams/decode-utf8.any.html?label=experimental&label=master&aligned (non-SharedArrayBuffer cases)

Related Chromium bug: https://bugs.chromium.org/p/chromium/issues/detail?id=796697

When the TextCodecUTF8 decoder finds a non-ASCII lead byte, it waits until it has consumed as many bytes as a valid sequence starting at that position would need, and only then starts processing them. This goes against the Encoding spec, which requires the replacement character (U+FFFD) to be emitted as soon as enough bytes have been consumed to tell that the sequence is invalid.
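For illustration, here is a minimal sketch of a streaming decoder that emits U+FFFD at the first byte that cannot continue the current sequence. It is modelled on the UTF-8 decoder steps in the Encoding spec as I read them; it is not WebKit's TextCodecUTF8, the name createStreamingUtf8Decoder is made up for this example, and BOM handling and fatal mode are left out.

```javascript
// Hypothetical helper (not WebKit code): a streaming UTF-8 decoder that
// reports an invalid sequence as soon as it is known to be invalid.
function createStreamingUtf8Decoder() {
  let codePoint = 0;   // code point being assembled
  let bytesSeen = 0;   // continuation bytes consumed so far
  let bytesNeeded = 0; // continuation bytes the lead byte asked for
  let lower = 0x80;    // allowed range for the next continuation byte
  let upper = 0xBF;

  function reset() {
    codePoint = 0;
    bytesSeen = 0;
    bytesNeeded = 0;
    lower = 0x80;
    upper = 0xBF;
  }

  return function decode(bytes, { stream = false } = {}) {
    let out = "";
    for (let i = 0; i < bytes.length; i++) {
      const b = bytes[i];
      if (bytesNeeded === 0) {
        if (b <= 0x7F) {
          out += String.fromCharCode(b);
        } else if (b >= 0xC2 && b <= 0xDF) {
          bytesNeeded = 1;
          codePoint = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
          if (b === 0xE0) lower = 0xA0; // reject overlong forms
          if (b === 0xED) upper = 0x9F; // reject surrogates
          bytesNeeded = 2;
          codePoint = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) {
          if (b === 0xF0) lower = 0x90; // reject overlong forms
          if (b === 0xF4) upper = 0x8F; // reject values above U+10FFFF
          bytesNeeded = 3;
          codePoint = b & 0x07;
        } else {
          out += "\uFFFD"; // byte can never start a sequence
        }
        continue;
      }
      if (b < lower || b > upper) {
        // The pending sequence is now known to be invalid: emit U+FFFD
        // immediately, then reprocess this byte as a fresh lead byte.
        reset();
        out += "\uFFFD";
        i--;
        continue;
      }
      lower = 0x80;
      upper = 0xBF;
      codePoint = (codePoint << 6) | (b & 0x3F);
      bytesSeen++;
      if (bytesSeen === bytesNeeded) {
        out += String.fromCodePoint(codePoint);
        reset();
      }
    }
    if (!stream && bytesNeeded !== 0) {
      // End of input in the middle of a sequence.
      reset();
      out += "\uFFFD";
    }
    return out;
  };
}
```

Feeding this sketch the chunks [0xF0, 0x9F], [0x41] and [0x42] with { stream: true } returns "", "\uFFFDA" and "B": the replacement character comes out together with the byte that proves the sequence invalid, not one chunk later.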

This makes no difference for non-streaming input, or for streaming data coming from the network, but it does make TextDecoder return results that differ from the spec when used in streaming mode:

const decoder = new TextDecoder();
console.log(decoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true }));
console.log(decoder.decode(new Uint8Array([0x41]), { stream: true }));
console.log(decoder.decode(new Uint8Array([0x42]), { stream: true }));

As per the spec, and in Firefox and Chromium 98, this prints "", "�A", "B". In WebKit and previous versions of Chromium, it prints "", "", "�AB".
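The same check can be wrapped in a small function, for instance to use in a test or to detect the behavior at runtime. This is only a sketch: the function name is made up for this example, and the result depends on the engine it runs in.

```javascript
// Hypothetical helper: returns true if the engine's TextDecoder emits U+FFFD
// as soon as an incomplete UTF-8 sequence is followed by a byte that cannot
// continue it, and false if the replacement character is held back.
function emitsReplacementCharacterEagerly() {
  const decoder = new TextDecoder();
  // 0xF0 0x9F starts a four-byte sequence; nothing can be emitted yet.
  decoder.decode(new Uint8Array([0xF0, 0x9F]), { stream: true });
  // 0x41 cannot continue that sequence, so the spec expects "\uFFFDA" here.
  const second = decoder.decode(new Uint8Array([0x41]), { stream: true });
  return second === "\uFFFDA";
}

console.log(emitsReplacementCharacterEagerly());
```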
