[webkit-dev] HTML5 & Web Links (RFC 5988)

Fri Nov 12 00:29:58 PST 2010

On 11.11.2010 22:14, Alexey Proskuryakov wrote:
>
> On 11.11.2010, at 11:38, Julian Reschke wrote:
>
>> I don't think the IETF will ever approve a standard where the encoding depends on the recipient's locale, with no reliable way to find out upfront what that locale is.
>
> Yes, that makes good sense to me.
>
> Note that Safari doesn't rely on the OS locale (other than for picking its original default browser encoding, which can then be changed by the user). Surely, some people are allergic to the idea of a default browser encoding too, but it's unavoidable in practice - we can't interpret untagged content as Latin-1.

There's a difference between HTTP payloads and the actual HTTP header 
fields, for which the encoding has never been changeable.

>> I disagree that "raw bytes" are a de facto standard; they do not interoperate across UAs (see above)...
>
>
> I think that we agree about technical details and empirical data now, but describe them differently.
>
> Surely, there is no way (that I'm aware of) to guarantee a correct downloaded file name in all browsers for all users. A lot of server operators only care about users in their country, and can reasonably (i.e. with negligible cost to business) rely on the Windows locale being set. They can just send raw bytes in the language's default encoding in Content-Disposition, and that works for them and their clients. For all I know, that's what almost everyone does, and it's "interoperable" for them.

Yes. That was the case back in 2003, when I only had to worry about IE 
and Mozilla/Firefox, and it didn't improve until 2008, when I started 
collecting test cases *and* results, and started updating the specs.

Since then some of the implementations of the RFC 2231 encoding (now RFC 
5987) have improved a bit (Opera) / a lot (Konqueror), FF has a few 
pending patches, and Chrome nightlies have started implementing this 
(two weeks ago).

So we are making slow progress now.

> Global operators like Google or Yahoo obviously want to cover many languages at once, and they just send different HTTP headers to different browsers. That's not great, but that's unavoidable unless IE changes - whether changing interpretation of raw bytes or implementing RFC5987, IE would have to change.

The problem is not only operators but also software vendors that sell 
to customers that operate globally, for example vendors of Web Content 
Management systems that serve content to customers just "everywhere".

And the issue with UA sniffing is that it only works for existing UAs, 
and only if they do not change.

Back when I first encountered the problem (working on a component of 
SAM's content management system) I made the optimistic assumption that 
IE would be the only UA ignoring RFC 2231, so I sniffed for IE and added 
a workaround. Guess what? The workaround for IE doesn't work in a few 
locales (because of different encoding defaults), and Chrome and Safari 
came out and broke my optimistic assumption that new UAs would do the 
right thing (so the implementation I worked on doesn't "work" with 
Safari, even today).

>> The spec (RFC 2616) already says that raw bytes are ISO-8859-1, so UAs overriding this are in violation of the spec (IMHO).
>
>
> Yes, that's why I'd really welcome a spec that's closer to reality in this regard. No browser whose vendor cares about markets not covered by Latin-1 can actually treat raw bytes in Content-Disposition as ISO-8859-1. No server operator who wants to serve downloadable content in those markets can stick to ISO-8859-1.

IE and Opera do treat the octets as ISO-8859-1.

So the issue is that the problem space is *very* complex; many UAs came 
up with *different* workarounds. None of these interoperate. Some even 
fail for the same UA in different locales.
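
To make the locale problem concrete, here is a minimal Python sketch of 
two of the strategies discussed in this thread: treating the octets as 
ISO-8859-1, and trying UTF-8 first with a fallback to a locale-dependent 
encoding. The function names and the choice of windows-1251 / 
windows-1252 as fallback encodings are assumptions for illustration only; 
this is not the code of any actual UA.

```python
def decode_as_latin1(raw: bytes) -> str:
    """Interpret the raw header octets as ISO-8859-1 (never fails)."""
    return raw.decode("iso-8859-1")


def decode_utf8_then_fallback(raw: bytes, fallback_encoding: str) -> str:
    """Try UTF-8 first; on failure use a locale-dependent fallback.

    `fallback_encoding` stands in for the browser's default encoding,
    which the sender has no reliable way to know.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding, errors="replace")


# A server in a Russian locale sends the file name as raw windows-1251.
RAW_CP1251 = "Привет.txt".encode("cp1251")
# Another server sends the same name as raw UTF-8.
RAW_UTF8 = "Привет.txt".encode("utf-8")
```

With these definitions, `decode_utf8_then_fallback(RAW_CP1251, "cp1251")` 
recovers the name, but the same octets with a "cp1252" fallback, or with 
`decode_as_latin1`, yield mojibake. Only the UTF-8 octets decode the same 
way regardless of the fallback, and even those are garbled by a recipient 
that treats the field as ISO-8859-1.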

>> Introducing a separate parameter (filename*) that doesn't carry the legacy problems is in my opinion the best way to move forward.
>
>
> As a browser implementor, I don't have a strong opinion about filename*. The actual content I see on the Web uses raw bytes in Content-Disposition, so I mostly care about that being adequately specified, so that at least non-IE browsers could all work the same. Firefox and Safari are already pretty close. It's unfortunate if Chrome does not implement this fallback scheme.

When you say "I see", does that refer to Safari? In which case your 
perception may already be influenced by the fact that senders are forced 
to sniff for the User Agent.

> Generally speaking, having no custom encoding is better than having an opaque custom encoding. In my opinion, the ideal situation would be for servers to send raw UTF-8, and for clients to do what Safari and Firefox do (try UTF-8, then fall back to other encodings). This may be unachievable in practice, in which case interoperability via opaque RFC2231-style encoding is a lesser evil.

Optimally, we would just declare that all header fields use UTF-8. But 
that would be an incompatible change.

Realistically, if we don't want to break existing stuff, we need to move 
the I18Nized version of the filename into a place where it can co-exist 
with the legacy information, and that's what "filename*" does. And it 
also has the benefit of four independent implementations in UAs (five, 
if you count iCab).
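
As an illustration of that co-existence, the following Python sketch 
builds a Content-Disposition value that carries both a plain ASCII 
"filename" for legacy recipients and an RFC 5987 "filename*" for 
recipients that understand it. The helper name `content_disposition` and 
the policy of replacing unsafe characters with "_" in the fallback are 
assumptions made for this example, not something the specs prescribe.

```python
from urllib.parse import quote

# RFC 5987 attr-char allows ALPHA, DIGIT and "!#$&+-.^_`|~" unescaped.
# quote() already leaves letters, digits and "_.-~" alone, so only the
# remaining attr-chars need to be listed as safe.
ATTR_CHAR_EXTRA = "!#$&+^`|"


def content_disposition(filename: str) -> str:
    """Return an attachment header value with filename and filename*."""
    # Legacy parameter: printable ASCII only, with quote and backslash
    # replaced so the quoted-string needs no escaping.
    fallback = "".join(
        c if c.isascii() and c.isprintable() and c not in '"\\' else "_"
        for c in filename
    )
    # Extended parameter: charset, empty language tag, then the
    # percent-encoded UTF-8 octets of the real name.
    extended = quote(filename, safe=ATTR_CHAR_EXTRA, encoding="utf-8")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{extended}"
```

For the name "€ rates.txt" this yields 
`attachment; filename="_ rates.txt"; filename*=UTF-8''%E2%82%AC%20rates.txt`. 
A recipient that knows "filename*" can prefer it, while one that does not 
still gets a usable ASCII name.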

But we are getting off-topic; people interested in Content-Disposition 
really should come over to the HTTPbis WG's mailing list.

This thread originally was about the Link: header, which happens to use 
the same encoding for the "title" parameter. All I wanted to say is that 
there's an opportunity to share code with Chromium's implementation of 
C-D. And I really believe it would be great if we stopped 
re-implementing header field parsing/evaluation on a case-by-case basis.
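
To show what such shared code could look like, here is a small Python 
sketch of a single decoder for the RFC 5987 extended value syntax, which 
would serve both "filename*" in Content-Disposition and "title*" in Link. 
The function name `decode_ext_value` is made up for this example, and the 
sketch only handles the parameter value, not the parsing of the 
surrounding header field.

```python
from typing import Optional, Tuple
from urllib.parse import unquote_to_bytes

# Charsets this sketch accepts; RFC 5987 requires recipients to
# support at least these two.
SUPPORTED_CHARSETS = ("utf-8", "iso-8859-1")


def decode_ext_value(value: str) -> Tuple[str, Optional[str]]:
    """Decode charset'language'pct-encoded into (text, language).

    Raises ValueError for malformed input or an unsupported charset,
    and UnicodeDecodeError if the octets are invalid in the charset.
    """
    parts = value.split("'", 2)
    if len(parts) != 3:
        raise ValueError("expected charset'language'value")
    charset, language, encoded = parts
    if charset.lower() not in SUPPORTED_CHARSETS:
        raise ValueError(f"unsupported charset: {charset!r}")
    # Undo the percent-encoding to get octets, then decode them
    # with the charset the sender declared.
    text = unquote_to_bytes(encoded).decode(charset)
    return text, (language or None)
```

Given the value `UTF-8''%c2%a3%20and%20%e2%82%ac%20rates` (the example 
from RFC 5987), this returns the text "£ and € rates" with no language; 
given `iso-8859-1'en'%A3%20rates` it returns "£ rates" and "en". The same 
function can be called from whichever code handles Content-Disposition 
and whichever handles Link.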

Best regards, Julian