[webkit-dev] trac.webkit.org links via Google.com

Mark Rowe mrowe at apple.com
Tue Dec 1 11:33:11 PST 2009

On 2009-12-01, at 11:04, Yaar Schnitman wrote:

> Robots.txt can exclude most of the trac site, and then include the sitemap.xml. This way you block most of the junk and only give permission to the important file. All major search engines support sitemap.xml, and those that don't will be blocked by robots.txt.
> A script could generate sitemap.xml from a local svn checkout of trunk. It will produce one url for each source file (frequency=daily) and one url for every revision (frequency=year). That will cover most of the search requirements.

Forgive me, but this doesn't seem to address the issues that I raised in my previous message.

To reiterate: we need to allow only an explicit set of URLs to be crawled.  Sitemaps do not provide this ability.  They expose information about a set of URLs to a crawler; they do not limit the set of URLs that it can crawl.  A robots.txt file does provide the ability to limit the set of URLs that can be crawled.

However, the semantics of robots.txt seem to make it incredibly unwieldy to expose only the content of interest, if it is possible at all.  For instance, we want to expose <http://trac.webkit.org/changeset/#{revision}> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}> or <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}> from being crawled.  Similarly, we want to expose <http://trac.webkit.org/browser/#{path}> while preventing <http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being crawled.
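
To make the difficulty concrete, here is a rough sketch using Python's standard urllib.robotparser, which applies the basic prefix-matching rules. The rule set below is hypothetical (it is not the actual trac.webkit.org robots.txt), and the revision number and paths are made up for illustration. Crawlers that support wildcard extensions may behave differently; this only shows what plain prefix matching gives you.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: allow changeset and browser pages, block everything else.
RULES = """\
User-agent: *
Allow: /changeset/
Allow: /browser/
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)


def crawlable(url):
    """Return True if a generic crawler may fetch url under RULES."""
    return parser.can_fetch("*", url)


BASE = "http://trac.webkit.org"

# The page we actually want indexed.
wanted = crawlable(BASE + "/changeset/1234")

# Pages we do not want indexed, but which share the same prefix.
unwanted_path = crawlable(BASE + "/changeset/1234/trunk/WebCore")
unwanted_query = crawlable(BASE + "/browser/trunk/WebCore?rev=1234")

# A page outside the allowed prefixes.
timeline = crawlable(BASE + "/timeline")

print(wanted, unwanted_path, unwanted_query, timeline)
```

Under these assumed rules, the first three URLs are all treated as crawlable because each one starts with an allowed prefix, and only /timeline is blocked. Prefix rules alone cannot separate /changeset/#{revision} from /changeset/#{revision}/#{path}.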

Is there something that I'm missing?

- Mark
