[webkit-dev] trac.webkit.org links via Google.com

Mark Rowe mrowe at apple.com
Tue Dec 1 13:36:10 PST 2009


On 2009-12-01, at 12:31, Yaar Schnitman wrote:

> The URLs in sitemap.xml are not patterns - they are exact URLs the search engine will retrieve.

They are exact URLs that the crawler will retrieve, but they in no way restrict the set of URLs that the crawler can retrieve.

> So, you would blacklist most URLs with blanket rules in robots.txt and whitelist explicit URLs in sitemap.xml. E.g., in robots.txt, blacklist /changeset/*, and in sitemap.xml whitelist everything from http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000 (it's going to be a big file, alright).
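
For concreteness, the proposed setup would look roughly like this (a sketch only, with illustrative paths; whether a crawler would honour the sitemap over the robots.txt exclusions is exactly the question raised below):

    # robots.txt: blanket exclusion of changeset pages
    User-agent: *
    Disallow: /changeset/
    Sitemap: http://trac.webkit.org/sitemap.xml

together with a sitemap.xml along the lines of:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://trac.webkit.org/changeset/1</loc></url>
      <url><loc>http://trac.webkit.org/changeset/2</loc></url>
      <!-- ... one <url> entry per revision, up to the current revision ... -->
    </urlset>

Note that the sitemap protocol caps a single file at 50,000 URLs, so covering every revision would mean splitting the list across multiple files referenced from a sitemap index.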

This proposal relies on the sitemap being treated as a whitelist that is consulted prior to processing the exclusions listed in robots.txt.  As I have mentioned several times, I cannot find any information that states that a sitemap acts as a whitelist, nor what its precedence relative to robots.txt would be if it were to be treated as a whitelist.  Is there something I'm missing?

- Mark

> On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <mrowe at apple.com> wrote:
> 
> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
> 
>> Robots.txt can exclude most of the trac site, and then include the sitemap.xml. This way you block most of the junk and only give permission to the important files. All major search engines support sitemap.xml, and those that don't will be blocked by robots.txt.
>> 
>> A script could generate sitemap.xml from a local svn checkout of trunk. It would produce one URL for each source file (frequency=daily) and one URL for every revision (frequency=yearly). That would cover most of the search requirements.
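
For illustration, such a generator might look roughly like the following; the checkout path, revision count, and output location are assumptions rather than part of the proposal:

    #!/usr/bin/env python
    # Hypothetical sketch: write a sitemap.xml containing one Trac URL per
    # source file in a local trunk checkout and one per changeset revision.
    import os
    from xml.sax.saxutils import escape

    TRUNK = "/path/to/webkit/trunk"   # local svn checkout (assumed path)
    LATEST_REVISION = 51000           # illustrative; would come from the repository
    OUT = "sitemap.xml"

    def entry(loc, changefreq):
        return "  <url><loc>%s</loc><changefreq>%s</changefreq></url>\n" % (escape(loc), changefreq)

    out = open(OUT, "w")
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    # One URL per file under trunk, revisited frequently by the crawler.
    for dirpath, dirnames, filenames in os.walk(TRUNK):
        if ".svn" in dirnames:
            dirnames.remove(".svn")   # don't descend into svn metadata
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), TRUNK).replace(os.sep, "/")
            out.write(entry("http://trac.webkit.org/browser/trunk/" + rel, "daily"))
    # One URL per changeset, revisited rarely.
    for rev in range(1, LATEST_REVISION + 1):
        out.write(entry("http://trac.webkit.org/changeset/%d" % rev, "yearly"))
    out.write("</urlset>\n")
    out.close()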
> 
> Forgive me, but this doesn't seem to address the issues that I raised in my previous message.
> 
> To reiterate: We need to allow only an explicit set of URLs to be crawled.  
> Sitemaps do not provide this ability.  They expose information about a set of URLs to a crawler; they do not limit the set of URLs that it can crawl.  A robots.txt file does provide the ability to limit the set of URLs that can be crawled.
> 
> However, the semantics of robots.txt seem to make it incredibly unwieldy to expose only the content of interest, if it is possible at all.  For instance, consider exposing <http://trac.webkit.org/changeset/#{revision}> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}> or <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}> from being crawled.  Another example would be exposing <http://trac.webkit.org/browser/#{path}> while preventing <http://trac.webkit.org/browser/#{path}?rev=#{revision}> from being crawled.
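
For concreteness, expressing just those two cases with the Allow directive and * wildcard that the major crawlers support would require something along these lines (a sketch only; Allow and * are crawler-specific extensions rather than part of the original robots exclusion standard, and precedence between overlapping rules differs from crawler to crawler):

    User-agent: Googlebot
    # Start from "nothing is crawlable" and carve out the pages of interest.
    Disallow: /
    # Expose /changeset/<revision>, but not per-file views or zip downloads.
    Allow: /changeset/
    Disallow: /changeset/*/
    Disallow: /changeset/*?
    # Expose /browser/<path>, but not revision-specific views.
    Allow: /browser/
    Disallow: /browser/*?

Even if a given crawler resolves those overlapping rules in the intended way, every other URL form that Trac generates would need a rule of its own, which is the unwieldiness described above.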
> 
> Is there something that I'm missing?
> 
> - Mark
> 
> 
