[Webkit-unassigned] [Bug 22339] Support Indian web sites with EOT by on-the-fly transcoding to Unicode

Fri Jan 2 10:16:53 PST 2009

https://bugs.webkit.org/show_bug.cgi?id=22339

darin at apple.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #26367|review?                     |review-
               Flag|                            |

------- Comment #2 from darin at apple.com  2009-01-02 10:16 PDT -------
(From update of attachment 26367)
Thanks for taking this on! I think this feature will be good for WebKit.

I'm a little disappointed at how much code you're putting up for review all at
once. This includes both the "transcode based on font" infrastructure, but also
some new codecs for Tamil and TSCII.

I don't think it's good to have the transcoding be based on the origin of the
web page. We don't want the same content to work differently when it's on
different servers. We should use the font family names alone, and not involve
the security origin!

I think the concept we want here is that certain font names require re-decoding
the characters. So the interface to that would be a single function, something
like this:

    TextEncoding encodingForFont(const AtomicString& familyName);

Or perhaps like this:

    TextEncoding encodingForFont(const FontFamily&);

That function can go in platform/text or possibly in the rendering directory.

You'd give that map a font family name and ask it if there's a special
encoding. If there is, the encoding would be represented by a TextEncoding
object. To transcode, we'd first re-encode by using TextEncoding::encode on the
text, using the encoding of the web page the text was found in,
document()->decoder()->encoding(), to convert the text back from UTF-16 into
the original bytes. Then we'd re-decode those original bytes by using
TextEncoding::decode on the bytes, using the text encoding returned by
encodingForFont. This would give us new, corrected UTF-16. Something like this:

    CString reencodedBytes =
document()->decoder()->encoding().encode(characters, length,
QuestionMarksForUnencodables);
    bool sawError;
    String transcodedText = fontBasedEncoding.decode(reencodedBytes.data(),
reencodedBytes.length(), false, sawError);

If we do not already have suitable codecs for TSCII and Tamil -- and I'm
surprised that ICU does not include these -- then the right way to add those is
to add them via the TextCodec machinery, in the platform/text directory along
with the other codecs. TextEncodingRegistry.cpp is the starting point where the
various codecs register the encoding names they can handle.

We definitely don't need a new directory for transcoding functions in the
render tree code!

> +2009-01-02  prunthaban  <prunthaban at google.com>

This should have your full name, not just your login name.

> +        Test: fast/css/font-face-transcoder.html

The ChangeLog entry needs to describe what you've changed. It should include
the URL of the bug, title of the bug, and a description of the changes.

Ideally, each file and function should have comments about what changed.
Although for new files it's better to omit the list of functions, even though
the prepare-ChangeLog script generates it. So, for example, you'd want a
comment about RenderText::styleDidChange explaining the change to that
function, but you wouldn't want Encoder::maximumLookupLength.

> +#if ENABLE(TRANSCODER)

The global setting is called TRANSCODER, but that seems like far too vague a
name. Transcoding could mean many different things, and we need something
that's more specific. That's a problem with many of the names in your patch.
You're sharing the namespace with the rest of WebCore, so you need specific
enough names.

For example, you have Transcoder and Transformer. Not a clear distinction
between the two. All these classes need more specific names. But also I don't
think they need to exist (see my suggested design above).

In general, this design uses too many separate objects that each have to be
allocated separately and too many different function calls. I believe it's
going to be unnecessarily inefficient because of that.

> +    if (isTranscoderText()) {
> +        m_text = transcode(String(m_text->characters(), m_text->length()));
> +    }

Single line if statements should not have braces around them in the WebKit
coding style.

This code is going to run on every style change and risks slowing down browsing
on sites that don't involve the specially-named fonts. We have to design this
isTranscoderText function and how it's used in the render tree so that it is
very fast. I think that means that we need a design where the font family makes
the decision about whether transcoding is involved and that gets stored rather
than being recomputed each time styleDidChange on each RenderText object. I
think the right place to store this is probably the WebCore::Font object.

> +    String origin = document()->securityOrigin()->domain();
> +    Transformer* trans = transformer();

There's no need to abbreviate the name of the local variable like this, and we
try to avoid that. Our preferred style is to either use the transformer()
function directly without a local variable at all, or if there is a real
efficiency reason to use a local variable, we do this:

    Transformer* transformer = this->transformer();

> +PassRefPtr<StringImpl> RenderText::transcode(String text)

Functions should almost never take a String argument, because that
unnecessarily churns the reference count of the string. Instead the argument
should be const String&, or in some cases it should be a character pointer and
a length.

> +    String family = style()->font().family().family().string();
> +    String transformed = transformer()->convert(text, family);
> +    return StringImpl::create(transformed.characters(),transformed.length());

Need a space after the comma.

This is an unnecessarily inefficient idiom here. We already have the string in
a StringImpl, and then you create a new one by calling StringImpl::create; that
copies all the characters for no good reason. I think the return value should
just be a String, but if you really find it useful to have the return value be
a PassRefPtr<StringImpl>, you should return transformed.impl(). WebCore strings
are immutable, so there's almost never a need to copy them.

> +#if ENABLE(TRANSCODER)
> +    bool isTranscoderText() const;
> +    PassRefPtr<StringImpl> transcode(String);
> +#endif

These functions should be private, not public.

The function name isTranscoderText is not good. The text isn't "transcoder
text", so "is transcoder text" is not the right name for it. Perhaps
textNeedsTranscoding() is a good name. 

> +class Converter {
> +public:
> +    virtual String transcode(UChar) = 0;
> +};

This is much too vague a name for a class. A "converter" could convert anything
into anything else. It needs a more specific name, one that mentions
transcoding I think. But we shouldn't have this at all, so it's not worth
commenting on it in too much detail.

The interface to this class is unnecessarily inefficient. Typically we don't
want to make a separate function call for each character. Functions normally
work on runs of characters for efficiency's sake. Further, returning a String
for each character means doing multiple memory allocations per character. You
need an API that takes efficiency into account. For example, you could have an
API that takes both an input and output UChar buffer with some rules about the
size of the buffer that allow it to be allocated on the stack.

> +#include "config.h"
> +#include "DynamicParser.h"
> +#include "Syllable.h"
> +#include "Transcoder.h"
> +#include <wtf/PassRefPtr.h>
> +#include "StringImpl.h"

There should be a blank line after DynamicParser.h, the main header of the
file. Then the includes should be sorted with the "sort" tool. Thus,
StringImpl.h should be higher up the list.

> +DynamicParser::DynamicParser(String input, Encoder* encoder)

Should be const String&.

> +String DynamicParser::removeRedundantSymbols()
> +{
> +    String output = "";
> +    for (int i = 0; i < m_text.length(); i++) {
> +        UChar c = m_text[i];
> +        if (!m_encoder->isRedundant(c))
> +            output.append(c);
> +    }
> +    return output;
> +}

No function should append characters to a String object one character at a
time. That's extremely inefficient. Instead you should use a Vector<UChar> and
then convert the vector to a string at the end using the String::adopt
function.

> +        }
> +        else {

The close brace goes on the same line as the else in the WebKit coding style.

> \ No newline at end of file

Please put newlines at the end of the file.

I'm going to stop commenting on the details of the specific transcoder you
wrote at this point. The more important point is that since this is a
translation from one encoding to another, it should use the TextCodec API and
be in the platform/text directory, not in the WebCore/rendering directory.
There probably doesn't need to be a WebCore/rendering/transcoder directory at
all.
> +TSCIIEncoder::TSCIIEncoder()

Is TSCII an encoding that's not supported by ICU? Do we really need a custom
codec in WebKit for this? The heavy use of strings in these encoders is going
to make it unnecessarily slow.

review- for now; lets try a new patch that either adds the new codecs or adds
the new machinery for special decoding, but not both. We can test the decoding
machinery using some codec that's already supported and a special font name.

The main tricky part here is adding this machinery without slowing down
browsing that doesn't involve the special fonts.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.