Currently, every WebKit port has to implement its own content sniffing algorithm. This is problematic for compatibility and security. We should implement a content sniffing algorithm in WebCore so that it can be used by every port.
Background
A number of web servers don't properly set the Content-Type header when they serve responses. One common misconfiguration is to not send a Content-Type header at all or to send a bogus Content-Type header (e.g., with a value like "(null)" or "application/unknown"). To render these sites correctly, all browsers employ content sniffing algorithms that look at the contents of the response to determine the type of the resource.
Some browsers have very aggressive content sniffing algorithms that often change the type of a resource. This can be dangerous if a web server allows users to upload content, such as images, and the browser treats these resources as HTML because this lets an attacker XSS the site. Designing a content sniffing algorithm is a careful balancing act between compatibility and security.
WebKit
WebKit itself does not contain a content sniffing algorithm, leaving each port to design their own. For example, Safari and Chromium each implement their own content sniffing algorithm and I imagine (although I haven't tested) that other ports do so as well. This causes unnecessary compatibility issues between different WebKit ports and leaves each port to fend for itself in avoiding the security pitfalls.
I think it makes sense for WebCore itself to implement one content sniffing algorithm that every port can use. One starting point for this common implementation is the Chromium content sniffer, which is open source. A number of Chromium contributors, myself included, have spent a lot of effort tuning that content sniffer to maximize compatibility while minimizing attack surface, and we'd like everyone to benefit from our efforts.
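To make the trade-off described above concrete, here is a minimal sketch in Python. It is not the Chromium sniffer and not a proposed WebCore algorithm: the function name `sniff_mime_type`, the `BOGUS_TYPES` set, and the short signature table are assumptions chosen for illustration. The sketch sniffs only when the Content-Type header is missing or bogus, and it never reinterprets a response that declares a real type (such as an image) as HTML.

```python
from typing import Optional

# Header values treated as "no useful type". The first two come from the
# examples in the text; the others are illustrative assumptions.
BOGUS_TYPES = {"(null)", "application/unknown", "unknown/unknown", "*/*"}

# A few well-known file signatures (magic numbers). A real sniffer would
# use a larger, carefully tuned table.
MAGIC_NUMBERS = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

# Lowercase prefixes that suggest an HTML document (illustrative subset).
HTML_PREFIXES = (b"<!doctype html", b"<html", b"<head", b"<body", b"<script")


def sniff_mime_type(content_type: Optional[str], body: bytes) -> str:
    """Return the type to use for a response (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    declared = (content_type or "").split(";", 1)[0].strip().lower()

    # Trust any header that names a real type. In particular, a response
    # declared as image/* is never upgraded to HTML, which is what would
    # let an uploaded "image" run script on the site.
    if declared and declared not in BOGUS_TYPES:
        return declared

    # Header missing or bogus: look at the first bytes of the body.
    for magic, mime in MAGIC_NUMBERS:
        if body.startswith(magic):
            return mime

    head = body[:512].lstrip().lower()
    if head.startswith(HTML_PREFIXES):
        return "text/html"

    # Nothing matched; fall back to a type that is not rendered as HTML.
    return "application/octet-stream"
```

With this sketch, a response with no header whose body starts with `GIF89a` is treated as `image/gif`, while a response declared as `image/png` stays `image/png` even if its body contains markup. How aggressive the HTML check should be is exactly the compatibility versus security question raised above.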
Standardization
We've also been working with the HTML 5 working group on standardizing content sniffing algorithms across all browsers. Eventually, I'd like to see WebKit's content sniffer converge with the HTML 5 specification. This process will likely involve the WebKit content sniffer and the HTML 5 specification evolving over time towards convergence.
Feedback
I'm sending this email to the list to get buy-in from the rest of the WebKit community on the general direction of implementing a content sniffer. I'd also like specific feedback about which content sniffing heuristics you think are important to include. As a starting point for discussion, you can see the Chromium content sniffer here:
http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?vie...
The top of that file has some comments that explain some of the guiding design choices in the algorithm and a comparison with the behavior of some other browsers.
Adam
On Oct 9, 2008, at 1:38 PM, Adam Barth wrote:
Currently, every WebKit port has to implement its own content sniffing algorithm. This is problematic for compatibility and security. We should implement a content sniffing algorithm in WebCore so that it can be used by every port.
Yah! I was a bit surprised myself when I discovered that the browsers would sniff to such an extent and that each browser was implementing this differently. This will be a good addition to WebKit.
For example, Safari and Chromium each implement their own content sniffing algorithm and I imagine (although I haven't tested) that other ports do so as well.
QtWebKit doesn't have any content sniffing and so Arora also has its own crude ContentType handling.
Feedback
I'm sending this email to the list to get buy-in from the rest of the WebKit community on the general direction of implementing a content sniffer.
I only speak for myself and not QtWebKit, but I think this is a good move for something that can be moved into WebKit.
http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?vie...
The top of that file has some comments that explain some of the guiding design choices in the algorithm and a comparison with the behavior of some other browsers.
For what it is worth, Konq with its KParts system will follow the content type and, for example, just load a text editor in the browser if there is no Content-Type.
-Benjamin Meyer
On Friday 10 October 2008, Benjamin Meyer wrote:
For what it is worth, Konq with its KParts system will follow the content type and, for example, just load a text editor in the browser if there is no Content-Type.
Not entirely. Konqueror itself does no content sniffing; it simply trusts the MIME types reported by the network stack. The content "detection" is in the KDE HTTP-daemon. Without content "detection" almost nothing would work. It is quite simple though, as it just replaces a few simple "unknown" types with the default type of the protocol (HTML). This is compatible with at least older versions of Firefox, but not with IE, which completely ignores the MIME type.
`Allan
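The replacement Allan describes could look roughly like the sketch below. This is not KDE's actual code: the function name `resolve_mime_type`, the set of "unknown" values, and the fallback for other protocols are hypothetical, chosen only to illustrate swapping a few unknown types for the protocol's default without ever inspecting the body.

```python
from typing import Optional

# Values treated as "unknown"; this list is an assumption for illustration.
UNKNOWN_TYPES = {"", "(null)", "application/unknown", "unknown/unknown"}

# Default type per protocol. The text only states that HTTP defaults to HTML.
PROTOCOL_DEFAULTS = {"http": "text/html"}

# Fallback for protocols without a known default (an assumption).
GENERIC_DEFAULT = "application/octet-stream"


def resolve_mime_type(protocol: str, content_type: Optional[str]) -> str:
    """Replace an unknown type with the protocol default; never read the body."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    declared = (content_type or "").split(";", 1)[0].strip().lower()
    if declared in UNKNOWN_TYPES:
        return PROTOCOL_DEFAULTS.get(protocol, GENERIC_DEFAULT)
    return declared
```

Because the body is never examined, a declared `image/gif` always stays `image/gif`; the cost is that a mislabeled or unlabeled image served over HTTP would be handled as HTML.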
By the way, my colleague Juan has been doing an analysis of CFNetwork's content sniffing algorithm and it looks like CFNetwork doesn't have a heuristic for GIF. Based on our data, this is the second most important heuristic. Safari should see a noticeable compatibility gain from this convergence effort.
Adam
On Thu, Oct 9, 2008 at 10:38 AM, Adam Barth <abarth@webkit.org> wrote:
Currently, every WebKit port has to implement its own content sniffing algorithm. This is problematic for compatibility and security. We should implement a content sniffing algorithm in WebCore so that it can be used by every port.
Background
A number of web servers don't properly set the Content-Type header when they serve responses. One common misconfiguration is to not send a Content-Type header at all or to send a bogus Content-Type header (e.g., with a value like "(null)" or "application/unknown"). To render these sites correctly, all browsers employ content sniffing algorithms that look at the contents of the response to determine the type of the resource.
Some browsers have very aggressive content sniffing algorithms that often change the type of a resource. This can be dangerous if a web server allows users to upload content, such as images, and the browser treats these resources as HTML because this lets an attacker XSS the site. Designing a content sniffing algorithm is a careful balancing act between compatibility and security.
WebKit
WebKit itself does not contain a content sniffing algorithm, leaving each port to design their own. For example, Safari and Chromium each implement their own content sniffing algorithm and I imagine (although I haven't tested) that other ports do so as well. This causes unnecessary compatibility issues between different WebKit ports and leaves each port to fend for itself in avoiding the security pitfalls.
I think it makes sense for WebCore itself to implement one content sniffing algorithm that every port can use. One starting point for this common implementation is the Chromium content sniffer, which is open source. A number of Chromium contributors, myself included, have spent a lot of effort tuning that content sniffer to maximize compatibility while minimizing attack surface, and we'd like everyone to benefit from our efforts.
Standardization
We've also been working with the HTML 5 working group on standardizing content sniffing algorithms across all browsers. Eventually, I'd like to see WebKit's content sniffer converge with the HTML 5 specification. This process will likely involve the WebKit content sniffer and the HTML 5 specification evolving over time towards convergence.
Feedback
I'm sending this email to the list to get buy-in from the rest of the WebKit community on the general direction of implementing a content sniffer. I'd also like specific feedback about which content sniffing heuristics you think are important to include. As a starting point for discussion, you can see the Chromium content sniffer here:
http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?vie...
The top of that file has some comments that explain some of the guiding design choices in the algorithm and a comparison with the behavior of some other browsers.
Adam
participants (3)
- Adam Barth
- Allan Sandfeld Jensen
- Benjamin Meyer