[Webkit-unassigned] [Bug 37765] REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø

Sun Apr 18 15:36:30 PDT 2010

https://bugs.webkit.org/show_bug.cgi?id=37765

--- Comment #19 from Eric Seidel <eric at webkit.org>  2010-04-18 15:36:29 PST ---
(In reply to comment #18)

Good questions!

> Can you clarify what you are trying to say here?  It sounds like you might be
> saying more than you mean to say -- e.g. that the caller never needs to use
> unicode string literals, which is obviously not right.

I was attempting to explain why I didn't add u"" to every string literal in the
project. :)  If you forget the u"" and the literal actually contains non-ASCII
code points, Python will yell at you.
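
A rough sketch of what the "yelling" looks like (not code from the patch; the
helper name try_mix is made up for this example).  Mixing a non-ASCII byte
string with a unicode string raises UnicodeDecodeError on Python 2, and
TypeError on Python 3, where the two types never mix implicitly:

```python
def try_mix(byte_string, text):
    """Return the name of the error raised when concatenating the two
    arguments, or None if the concatenation succeeded."""
    try:
        byte_string + text
    except (UnicodeDecodeError, TypeError) as error:
        return type(error).__name__
    return None


# UTF-8 bytes for "Vestbø"; on Python 2 this is what a plain "" literal holds.
name_bytes = b"Vestb\xc3\xb8"

# Hypothetical text to combine with the byte string.
result = try_mix(name_bytes, u" rocks")
print(result)
```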

> It seems like we shouldn't need to use StringIO as much as we do (except
> perhaps in unittest code to make certain tests easier).  It looks like you're
> removing a lot of the places it's being used which seems good.

StringIO is just confusingly named.  In Python 2.x it mostly gets used for a
stream of bytes, a.k.a. a "str" object.  In Python 3.0 "str" is actually what
2.x calls unicode(), and there is an explicit bytes() type; there io.StringIO
holds text and io.BytesIO is the byte-stream equivalent. :)
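
For what it's worth, the io module makes the split explicit: io.BytesIO for
byte streams and io.StringIO for text.  A minimal sketch, assuming a Python
new enough to have io (2.6 or later), with variable names invented for the
example:

```python
import io

text = u"Tor Arne Vestb\u00f8"   # unicode text, 15 code points
data = text.encode("utf-8")      # the same name as UTF-8 bytes

byte_stream = io.BytesIO(data)   # in-memory stream of bytes
text_stream = io.StringIO(text)  # in-memory stream of unicode text

raw = byte_stream.read()         # bytes come back out
decoded = text_stream.read()     # unicode text comes back out
```

The byte version is one element longer than the text version here, because
"ø" takes two bytes in UTF-8.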

There are few places we actually need StringIO, since we tend to follow the
model of just reading everything into memory.  The files we deal with tend to
be small, and the machines these scripts run on rather beefy.

> I understand what you're saying, but maybe the conclusion that should be drawn
> is that we shouldn't have parse_latest_entry_from_file() as part of our API? 
> In other words, the caller should be responsible.  That would also be more
> consistent with the rule of thumb to decode as early as possible.

Any API which takes a file-like object is rather confusing, because the read()
method is normally expected to return a byte stream.  However, you can use
codecs.open() to create a file-like object whose methods return unicode()
strings instead of str() byte streams.
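
To make the difference concrete, here is a small sketch (the temporary file
and its contents are made up for the example) that reads the same file once
as raw bytes and once through codecs.open():

```python
import codecs
import os
import tempfile

text = u"Tor Arne Vestb\u00f8\n"

# Create a scratch file holding the UTF-8 encoding of the text.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as raw_file:
        raw_file.write(text.encode("utf-8"))

    # A plain binary open() hands back a byte string.
    with open(path, "rb") as raw_file:
        raw = raw_file.read()

    # codecs.open() with an encoding hands back decoded unicode text.
    with codecs.open(path, "r", "utf-8") as decoded_file:
        decoded = decoded_file.read()
finally:
    os.remove(path)
```

A function that accepts "a file-like object" can receive either kind, which
is why such an API needs to say which one it expects.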

> Doesn't this also go against "unicode everywhere/encode late"?  Maybe this
> would be a good spot to do type-checking between unicode/str for backwards
> compatibility.

Nope.  .write() is expected to deal with byte streams.  Think of file.write().

> I'm a bit unclear on the extent to which, for example, we'll need to be passing
> unicode strings even for every log message we create.  You've probably thought
> about this more than I have.

Well, it depends on the logging infrastructure.  As far as I can tell, Python
2.5 "print" seems to be smart enough to handle unicode() objects.  file.write(),
not so much (unless you open the file with codecs.open and declare an
encoding).  So it depends on what sort of logging we're doing.  If we're
logging to the console, it seems Python is smart enough to handle unicode for
us.  If we're logging to a file, we may have to be careful to call
.encode("utf-8") to create the appropriate byte stream first.
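
A sketch of the two file-logging options just described, assuming UTF-8 is
the encoding we want.  The helper names, the log message, and the file names
are invented for illustration:

```python
import codecs
import os
import shutil
import tempfile

message = u"commit-queue: patch from Tor Arne Vestb\u00f8\n"


def write_encoded(path, text):
    """Plain binary file: the caller encodes to bytes before writing."""
    with open(path, "wb") as log_file:
        log_file.write(text.encode("utf-8"))


def write_via_codecs(path, text):
    """codecs.open() with a declared encoding: write unicode directly."""
    with codecs.open(path, "w", "utf-8") as log_file:
        log_file.write(text)


log_dir = tempfile.mkdtemp()
try:
    explicit_path = os.path.join(log_dir, "explicit.log")
    codecs_path = os.path.join(log_dir, "codecs.log")
    write_encoded(explicit_path, message)
    write_via_codecs(codecs_path, message)

    # Read both files back as raw bytes to compare what landed on disk.
    with open(explicit_path, "rb") as log_file:
        explicit_bytes = log_file.read()
    with open(codecs_path, "rb") as log_file:
        codecs_bytes = log_file.read()
finally:
    shutil.rmtree(log_dir)
```

Both routes should leave the same bytes on disk; the only difference is
whether the caller or the file object does the encoding.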

> Out of curiosity, I started looking to see how Python's logging package handles
> this.  I stopped when it started to look like the logging package has no
> preference between str and unicode.  I think it might only be the particular
> logging handlers that care, which is what the caller has control over (when
> they configure logging).

There is no difference between "str" and "unicode" as far as most of Python is
concerned.  I wouldn't expect the logging module to care, except when it's
writing to files or to the console (however, I suspect that when writing to the
console the unicode encoding is abstracted at a lower layer).
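
If it does turn out to be the handler that cares, the caller can settle it
when configuring logging.  A minimal sketch, assuming a file handler and
UTF-8; the logger name, message, and temporary file are made up for the
example:

```python
import logging
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)

logger = logging.getLogger("unicode_logging_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example away from the root logger

# The handler, not the logger, is told how to turn text into bytes.
handler = logging.FileHandler(path, encoding="utf-8")
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

logger.info(u"Reviewed by Tor Arne Vestb\u00f8")

handler.close()
logger.removeHandler(handler)

# Read the log back as raw bytes to see what was actually written.
with open(path, "rb") as log_file:
    logged_bytes = log_file.read()
os.remove(path)
```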
