[Webkit-unassigned] [Bug 27032] document.title does not replace or remove space characters

Thu Nov 19 12:39:24 PST 2009

https://bugs.webkit.org/show_bug.cgi?id=27032

--- Comment #7 from Christian Sejersen <christian.webkit at gmail.com>  2009-11-19 12:39:24 PST ---
(In reply to comment #6)
> (From update of attachment 43500 [details])
> Thanks for tackling this. This patch combines refactoring and at least three
> different behavior changes. We should ideally do these separately and each
> behavior change needs either a test or a justification about why it can't be
> tested.

I am not sure I agree about the behavior changes as I have indicated below.

> 
> > +	Unify the behaviour of stripping whitespace from the title of the 
> > +	window and when using document.title. So in cases where whitespace had 
> > +	to be removed isSpaceOrNewline am now taking into account Unicode separator 
> > +	characters. 
> 
> Can't land patches with tabs. There are quite a few tabs in the patch.
> 

Oops, will fix that.

> > +        * dom/Document.cpp:
> > +        (WebCore::canonicalizedTitle):
> > +        (WebCore::Document::updateTitle):
> > +        * dom/Document.h:
> > +        (WebCore::Document::title):
> > +        * loader/DocumentLoader.cpp:
> > +        (WebCore::DocumentLoader::setTitle):
> > +        * platform/text/StringImpl.h:
> > +        (WebCore::isSpaceOrNewline):
> 
> It's better to have per-function comments. It seems I am one of the few people
> still doing this. You can win points with me by doing it, though!
> 

I am a bit behind on points so I better do it then :)

> > +/*
> > + * Performs three operations:
> > + *  1. Trim leading and trailing whitespace
> > + *  2. Collapse internal whitespace.
> > +*/
> 
> We use "//" comments almost all the time, not "/* */" except in rare cases.
> 
> I'm also not sure this is an important comment. Comments should usually state
> "why", not "what". The "what" part should be expressed in the function names
> and code itself.
>

I actually just edited the existing comment to reflect what the function
actually does but I agree it isn't important, so I will remove it. 

> > +static inline String canonicalizedTitle(const String& title)
> > +{
> > +    //ASSERT(!title.isEmpty());
> 
> Assertion should either be removed or added, not added in commented-out.
> 

Doh, will remove the assertion.

> > +        
> > +    String canonicalTitle = title.stripWhiteSpace();
> > +    canonicalTitle = canonicalTitle.simplifyWhiteSpace();
> > +        
> > +    return canonicalTitle;
> > +}
> 
> Is there a a good reason to not write this one one line?
> 
>     return title.stripWhiteSpace().simplifyWhiteSpace();

Nope.

> 
> > +    m_canonicalizedTitle = canonicalizedTitle(m_title);
> 
> It seems to me that if we store the canonicalized title, then we should check
> if it changes, and avoid the setTitle call when the title has not changed.
> 

This already happens in Document.setTitle() with

    if (m_title == title)
        return;

> > -    String title() const { return m_title; }
> > +    String title() const { return m_canonicalizedTitle; }
> 
> This is a bad situation. We do not want a class where there is a data member
> named m_title and a function named title, and they are not the same thing.
> 
> We need to adjust the names for clarity. Do we even need an m_title data member
> at all? Maybe m_canonicalizedTitle should be the only thing we store. Who uses
> m_title?
> 
> Maybe m_title should stay and change purpose.
> 

I agree that this is not ideal, but this is to prevent the situation you
mention above, to not change the title if it hasn't changed. Do you have an
idea for another way of solving that situation?

> > -/*
> > - * Performs four operations:
> > - *  1. Convert backslashes to currency symbols
> > - *  2. Convert control characters to spaces
> > - *  3. Trim leading and trailing spaces
> > - *  4. Collapse internal whitespace.
> > - */
> 
> You eliminated conversion of of control characters to spaces. Why are you sure
> that's OK? I'd like to see a separate patch for that behavior change which does
> not combine that change with all the refactoring.
> 

Actually, isASCIISpace has the same check for control characters, so there
isn't any behavior change.

> > -static inline String canonicalizedTitle(const String& title, Frame* frame)
> > -{
> > -    ASSERT(!title.isEmpty());
> > -
> > -    const UChar* characters = title.characters();
> > -    unsigned length = title.length();
> > -    unsigned i;
> > -
> > -    StringBuffer buffer(length);
> > -    unsigned builderIndex = 0;
> > -
> > -    // Skip leading spaces and leading characters that would convert to spaces
> > -    for (i = 0; i < length; ++i) {
> > -        UChar c = characters[i];
> > -        if (!(c <= 0x20 || c == 0x7F))
> > -            break;
> > -    }
> > -
> > -    if (i == length)
> > -        return "";
> > -
> > -    // Replace control characters with spaces, and backslashes with currency symbols, and collapse whitespace.
> > -    bool previousCharWasWS = false;
> > -    for (; i < length; ++i) {
> > -        UChar c = characters[i];
> > -        if (c <= 0x20 || c == 0x7F || (WTF::Unicode::category(c) & (WTF::Unicode::Separator_Line | WTF::Unicode::Separator_Paragraph))) {
> > -            if (previousCharWasWS)
> > -                continue;
> > -            buffer[builderIndex++] = ' ';
> > -            previousCharWasWS = true;
> > -        } else {
> > -            buffer[builderIndex++] = c;
> > -            previousCharWasWS = false;
> > -        }
> > -    }
> > -
> > -    // Strip trailing spaces
> > -    while (builderIndex > 0) {
> > -        --builderIndex;
> > -        if (buffer[builderIndex] != ' ')
> > -            break;
> > -    }
> > -
> > -    if (!builderIndex && buffer[builderIndex] == ' ')
> > -        return "";
> > -
> > -    buffer.shrink(builderIndex + 1);
> > -    
> > -    // Replace the backslashes with currency symbols if the encoding requires it.
> > -    frame->document()->displayBufferModifiedByEncoding(buffer.characters(), buffer.length());
> > -
> > -    return String::adopt(buffer);
> > -}
> 
> This is a hot code path, and one of the reasons the function was written the
> way it was is for performance. Making your change is OK if we know the
> performance has not changed. Have you found a way to determine this does not
> slow down performance?

I haven't run a profiler on this to determine the difference in performance,
but there is a performance penalty just due to the fact that this in done in
two functions and the string buffer is copied two times instead of once. I can
revert the code back to the original if you think that is better?

Out of curiosity, why is this a hot code path and how can I find out if
something is considered a hot code path? E.g. in this case canonicalizedTitle
is only called once per document.

> 
> > -    String trimmed = canonicalizedTitle(title, m_frame);
> > -    if (!trimmed.isEmpty() && m_pageTitle != trimmed) {
> > +    if (!title.isEmpty() && m_pageTitle != title) {
> >          frameLoader()->willChangeTitle(this);
> > -        m_pageTitle = trimmed;
> > +        // Replace the backslashes with currency symbols if the encoding requires it.
> > +        m_pageTitle = m_frame->document()->displayStringModifiedByEncoding(title);
> >          frameLoader()->didChangeTitle(this);
> >      }
> 
> This changes behavior for titles that would be empty if made canonical. Is that
> OK? Did you test behavior in other browsers? I'd prefer to have the behavior
> change in a separate patch from the refactoring for clarity, where the change
> is clearly justified and tested.

The title passed to the DocumentLoader has already been canonicalized, so there
is no change in behavior here.

> 
> >  static inline bool isSpaceOrNewline(UChar c)
> >  {
> > -    // Use isASCIISpace() for basic Latin-1.
> > -    // This will include newlines, which aren't included in Unicode DirWS.
> > -    return c <= 0x7F ? WTF::isASCIISpace(c) : WTF::Unicode::direction(c) == WTF::Unicode::WhiteSpaceNeutral;
> > +    // Use isASCIISpace() for ASCII
> > +    if (c <= 0x7F)
> > +        return WTF::isASCIISpace(c);
> > +    
> > +    // Unicode characters in the space, line or paragraph separator
> > +    // category are treated as white space
> > +    return ((WTF::Unicode::category(c) & 
> > +             (WTF::Unicode::Separator_Space | 
> > +              WTF::Unicode::Separator_Line | 
> > +              WTF::Unicode::Separator_Paragraph)));
> >  }
> 
> What's the motivation for this change? Is this about correctness? If so,
> where's the test case that shows the old behavior was incorrect? Is it about
> efficiency?

AFAICT the code: WTF::Unicode::direction(c) == WTF::Unicode::WhiteSpaceNeutral
really doesn't do anything that has anything to do with determining whether the
character is a space or newline. I believe the right implementation is to check
for all the Unicode separator categories if the input character isn't ASCII.
The test case fast/dom/Document/document-title-get.html I added the one I
extended in fast/dom/HTMLOptionElement/option-text.html are testing this code.
It will exhibit the wrong behavior in the old code. 

> 
> The formatting seems inferior to the existing to me and doesn't follow our
> rules about where to put boolean operators in multiple lines.
> 
> One way to make these more readable is to use "using namespace" inside the
> function.

OK, thanks for the review, I will upload a new patch.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.