[webkit-reviews] review denied: [Bug 10728] text encodings should work without a numeric ID : [Attachment 10396] patch, with all the trimmings

Tue Sep 5 07:33:49 PDT 2006

Alexey Proskuryakov <ap at nypop.com> has denied Alexey Proskuryakov
<ap at nypop.com>'s request for review:
Bug 10728: text encodings should work without a numeric ID
http://bugzilla.opendarwin.org/show_bug.cgi?id=10728

Attachment 10396: patch, with all the trimmings
http://bugzilla.opendarwin.org/attachment.cgi?id=10396&action=edit

------- Additional Comments from Alexey Proskuryakov <ap at nypop.com>
I have a few comments of a general kind, and I also think I spotted a couple of
bugs.

+	 (WebCore::Decoder::Decoder): Changed to use the functions above. Also
renamed
+	 m_reachedBody to m_checkedForHeadCharset.

I think that this name is slightly misleading, because it also accounts for XML
declaration, and non-default encoding being already set. The old name is no
longer correct either, since the logic has changed.

+    Copyright (C) 2006 Alexey Proskuryakov (ap at nypop.com)

Thank you!

+	 static void registerEncodingNames(EncodingNameRegistrar);
+	 static void registerCodecs(TextCodecRegistrar);

It looks like the registry is mostly useful to non-ICU platforms (with ICU, it
just makes codec creation slower, as it has to be looked up twice). But can
other codecs even provide such kind of information? Looking at Qt
documentation, QTextCodec provides customizeable heuristical approach, which
doesn't really map into a registry.

-    DeprecatedString result = decoder->decode(static_cast<const char *>([data
bytes]), [data length]);
+    String result = decoder->decode(static_cast<const char *>([data bytes]),
[data length]);

A pre-existing problem with spacing in "const char *"; there are more instances
ofworng "const char *" spacing the patch.

-    , m_decoder(new Decoder("text/css"))
+    , m_decoder(new Decoder("text/css", charset))

Oops, my bad! This bugfix needs a ChangeLog entry and a test case.

+    // FIXME: TextDecoder already checks for the BOM; do we really need this?.

Yes, we need to report a correct charset in document.characterSet and related
properties, and to use the main resource charset as a default for subresources.
I am not sure if this is already covered by test cases, it obviously needs to
be.

-  if (d->m_bFirstData) {
-      // determine the parse mode
-      d->m_doc->determineParseMode(decoded);
-      d->m_bFirstData = false;
+	 d->m_bFirstData = false;
+	 d->m_doc->determineParseMode(decoded);

Is this just cleanup? I'm wondering because you haven't changed the same
pattern in the next function.

-    static const TextEncoding utf8Encoding(UTF8Encoding);
+    static const TextEncoding utf8Encoding(UTF8Encoding());

Is this local static const even needed anymore?

+    registrar("ISO-8859-8-I", "ISO-8859-8-I");
+    registrar("csISO88598I", "ISO_8859-8-I");

The second registration looks like it has a typo - the alias is bound to an
unknown encoding.

-	 virtual DeprecatedCString fromUnicode(const DeprecatedString&, bool
allowEntities = false);
+	 virtual CString encode(const UChar*, size_t length, bool allowEntities
= false) const;

encode() relies on, and potentially changes the state of UConverter - seeing it
const is a bit strange to me.

+    String result = String::newUninitialized(len / 2, buffer);

If len is odd, and there's a byte buffered from the previous chunk, the result
may not fit.

+    if (m_littleEndian)
+	 for (size_t i = 0; i < len; i += 2) {
+	     UChar c = p[0] | (p[1] << 8);
+	     p += 2;
+	     if (c != BOM)
+		 *q++ = c;
+	 }

p[1] can be out of bounds if len is odd (same in big endian case).

+    // FIXME: CString is not a reasonable data structure for encoded UTF-16,
which will have
+    // null characters inside it. Perhaps the result of encode should not be a
CString?

Formally, it should be Vector<unsigned char>, as a char isn't even guaranteed
to be able to hold all bit patterns AFAIK.

Seriously, a Vector seems more appropriate to me, indeed.

+	 OwnPtr<TextCodec> m_decoder;

I stumbled reading code that used this variable for a moment, perhaps if should
be named m_codec?

+    // FIXME: Should normalization really be an automatic part of encoding?
+    // Normalization should probably be an explicit step separate from
encoding.

  As far as I understand <http://www.w3.org/TR/charmod-norm/>, normalization
should be an automatic part of every string operation, even append(). That's
hardly practical, but at least all text that is sent out from a browser needs
to be normalized.

 UChar TextEncoding::backslashAsCurrencySymbol() const
 {
-    if (m_flags & BackslashIsYen)
-	 return 0x00A5; // yen sign
- 
-    return '\\';
+    static const char* const a =
atomicCanonicalTextEncodingName("Shift_JIS_X0213-2000");
+    static const char* const b = atomicCanonicalTextEncodingName("EUC-JP");
+    return (m_name == a || m_name == b) ? 0x00A5 : '\\';
+}

  It's quite weird that this function doesn't include the normal Shift_JIS
encoding - Shift_JIS_X0213-2000 isn't even recognized by other browsers.
Nothing to do with this patch, of course, but I find it hard to believe that
the whole backslashAsCurrencySymbol concept works as expected. Korean encodings
have similar issues with backslash AFAIK.

Index: WebCore/platform/UChar.h

Shouldn't the use of ICU here be ifdef'ed?

+#include <Carbon/Carbon.h>

I think it's enough to include CoreServices for TEC.

-    m_response = DeprecatedString();
+    m_response = String();

This will need to be merged with a fix that I landed after you submitted this
patch, sorry (m_response is now initialized to an empty string).

I think I should r- this patch because of TextCodecUTF16 buffer overruns,
although it brings an huge amount of improvements otherwise.