[webkit-dev] trac.webkit.org links via Google.com

Mark Rowe mrowe at apple.com
Tue Dec 1 00:53:16 PST 2009

On 2009-11-30, at 23:38, Yaar Schnitman wrote:

> A sitemap.xml file is a more modern way of telling Google how to crawl a site and the traffic can be throttled in Google's webmaster tools (http://www.google.com/webmasters/tools/).
> Creating a daily script that generates sitemap.xml for WebKit's SVN repo should be trivial. There are probably Trac plugins that do that already. If done right, Google's crawler shouldn't produce much more load than an average developer doing a daily svn sync.

Google isn't the only search engine we're concerned about.  We need to prevent all search engines from hammering the repository, even those that don't support this technology.  I can't find any information about the precedence of exclusions in robots.txt vs. a sitemap, so it's not clear whether that can be achieved without having to explicitly whitelist individual crawlers.

If it is possible to use a sitemap without having to whitelist individual crawlers then we should investigate doing so.  Calling it trivial is rather optimistic, though.  You'd need to dramatically restrict the set of content that is exposed for indexing to make it feasible.  For instance: allow indexing only of the content of files on trunk (no branches, tags, or non-HEAD revisions).  You'd also want to expose individual changesets to ensure that commit messages are indexed.
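
To make the restricted set concrete, here is a rough Python sketch of a generator that lists only trunk files at HEAD plus individual changesets. The base URL, the /browser/trunk/ and /changeset/ URL layout, and the function names are assumptions for illustration, not something we have set up, and the inputs would have to come from the repository itself (for example from svn list and svn log).

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Assumed base URL and Trac-style URL layout; adjust to the real site.
BASE = "http://trac.webkit.org"


def sitemap_urls(trunk_paths, revisions):
    """Return URLs for trunk files at HEAD and for individual changesets only."""
    urls = [f"{BASE}/browser/trunk/{quote(path)}" for path in trunk_paths]
    urls += [f"{BASE}/changeset/{rev}" for rev in revisions]
    return urls


def build_sitemap(urls):
    """Serialize a list of URLs as a sitemaps.org <urlset> document."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(build_sitemap(sitemap_urls(["WebCore/dom/Node.cpp"], [51500])))
```

Note that this only decides what gets advertised. Nothing here stops a crawler from following links out of those pages.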

But… from what I can see, a sitemap only points at content that is available; it doesn't restrict what can be indexed.  While we'd want individual changeset pages to be indexed, we'd certainly not want a crawler to follow every individual "view diff" link on such a page, nor would we want it to follow the numerous other links within the content back to previous revisions, other branches, tags, etc.
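
The restriction itself would have to come from robots.txt. As a rough check of what simple rules can and cannot express, the sketch below feeds a hypothetical robots.txt to Python's standard urllib.robotparser. The rules and URLs are assumptions for illustration, and Allow is an extension that not every crawler honours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: permit trunk browsing and changesets, block everything else.
ROBOTS_TXT = """\
User-agent: *
Allow: /browser/trunk/
Allow: /changeset/
Disallow: /
"""

BASE = "http://trac.webkit.org"  # assumed base URL


def make_parser(text=ROBOTS_TXT):
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, path, agent="ExampleBot"):
    """Report whether the given agent may fetch BASE + path under the rules."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    parser = make_parser()
    for path in (
        "/browser/trunk/WebCore/dom/Node.cpp",
        "/changeset/51500",
        "/browser/branches/safari-4/",
        "/browser/trunk/WebCore/dom/Node.cpp?rev=40000",
    ):
        print(path, allowed(parser, path))
```

With plain prefix matching like this, branches and tags are excluded, but a non-HEAD revision of a trunk file reached through a query string still falls under the allowed prefix, which is exactly the kind of link we'd want crawlers not to follow.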

Maybe there's something that I'm missing that makes sitemaps usable for this purpose though.

- Mark
