By the way, my colleague Juan has been doing an analysis of CFNetwork's content sniffing algorithm and it looks like CFNetwork doesn't have a heuristic for GIF. Based on our data, this is the second most important heuristic. Safari should see a noticeable compatibility gain from this convergence effort.

Adam

On Thu, Oct 9, 2008 at 10:38 AM, Adam Barth <abarth@webkit.org> wrote:
Currently, every WebKit port has to implement its own content sniffing algorithm. This is problematic for compatibility and security. We should implement a content sniffing algorithm in WebCore so that it can be used by every port.
Background
A number of web servers don't properly set the Content-Type header when they serve responses. Common misconfigurations are to send no Content-Type header at all or to send a bogus one (e.g., with a value like "(null)" or "application/unknown"). To render these sites correctly, all browsers employ content sniffing algorithms that look at the contents of the response to determine the type of the resource.
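To make the idea concrete, here is a minimal C++ sketch of sniffing a response whose Content-Type is missing or bogus. It is not the Chromium or CFNetwork code: the function names, the exact-match comparison of the declared type, and the "application/octet-stream" fallback are all assumptions made for illustration. The only heuristic shown is GIF, whose files begin with the ASCII signature "GIF87a" or "GIF89a".

```cpp
#include <string>

// Hypothetical helper (not WebKit or Chromium API): true when the declared
// Content-Type is absent or one of the bogus values mentioned above.
// Assumption: exact, case-sensitive match with no parameter parsing.
bool isMissingOrBogusType(const std::string& declared) {
    return declared.empty() || declared == "(null)" ||
           declared == "application/unknown";
}

// Hypothetical sniffer: keeps a usable declared type, otherwise inspects
// the leading bytes of the response body.
std::string sniffMimeType(const std::string& declared, const std::string& body) {
    if (!isMissingOrBogusType(declared))
        return declared;
    // GIF files start with the ASCII signature "GIF87a" or "GIF89a".
    if (body.compare(0, 6, "GIF87a") == 0 || body.compare(0, 6, "GIF89a") == 0)
        return "image/gif";
    // Fallback chosen for this sketch only.
    return "application/octet-stream";
}
```

A real sniffer would carry a table of many such signatures and would have to decide what to do with bodies shorter than the longest signature; this sketch simply treats them as unrecognized.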
Some browsers have very aggressive content sniffing algorithms that often change the type of a resource. This can be dangerous if a web server allows users to upload content, such as images, and the browser treats these resources as HTML, because this lets an attacker mount a cross-site scripting (XSS) attack against the site. Designing a content sniffing algorithm is a careful balancing act between compatibility and security.
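The security side of that balance can be sketched in C++ as well. The policy below is an assumption chosen for illustration, not a description of what any shipping browser does: a response is sniffed as HTML only when the server declared no type at all, so an uploaded "image" that actually contains markup keeps its declared image type. The function names and the two tag prefixes checked are hypothetical.

```cpp
#include <string>

// Hypothetical check: does the body begin with an HTML-looking tag?
// Only two lowercase prefixes are tested, to keep the sketch short.
bool looksLikeHtml(const std::string& body) {
    return body.compare(0, 5, "<html") == 0 ||
           body.compare(0, 7, "<script") == 0;
}

// Hypothetical conservative policy: a declared type is never overridden,
// so a response served as image/gif is never promoted to text/html.
std::string effectiveType(const std::string& declared, const std::string& body) {
    if (declared.empty() && looksLikeHtml(body))
        return "text/html";
    // Fallback for an undeclared, unrecognized body is an assumption.
    return declared.empty() ? "application/octet-stream" : declared;
}
```

Under this sketch, a body of "<script>alert(1)</script>" served as image/gif stays image/gif, while the same bytes served with no Content-Type become text/html. A more aggressive algorithm would also promote the first case, which is the XSS risk described above.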
WebKit
WebKit itself does not contain a content sniffing algorithm, leaving each port to design its own. For example, Safari and Chromium each implement their own content sniffing algorithm and I imagine (although I haven't tested) that other ports do so as well. This causes unnecessary compatibility issues between different WebKit ports and leaves each port to fend for itself in avoiding the security pitfalls.
I think it makes sense for WebCore itself to implement one content sniffing algorithm that every port can use. One starting point for this common implementation is the Chromium content sniffer, which is open source. A number of Chromium contributors, myself included, have spent a lot of effort tuning that content sniffer to maximize compatibility while minimizing attack surface, and we'd like everyone to benefit from our efforts.
Standardization
We've also been working with the HTML 5 working group on standardizing content sniffing algorithms across all browsers. Eventually, I'd like to see WebKit's content sniffer converge with the HTML 5 specification. This process will likely involve the WebKit content sniffer and the HTML 5 specification evolving over time towards convergence.
Feedback
I'm sending this email to the list to get buy-in from the rest of the WebKit community on the general direction of implementing a content sniffer. I'd also like specific feedback about which content sniffing heuristics you think are important to include. As a starting point for discussion, you can see the Chromium content sniffer here:
http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?vie...
The top of that file has some comments that explain some of the guiding design choices in the algorithm and a comparison with the behavior of some other browsers.
Adam