[webkit-dev] trac.webkit.org links via Google.com

Yaar Schnitman yaar at chromium.org
Tue Dec 1 12:31:17 PST 2009


The URLs in sitemap.xml are not patterns; they are the exact URLs the search
engine will retrieve.

So, you would blacklist most URLs with blanket rules in robots.txt and
whitelist explicit URLs in sitemap.xml. E.g. in robots.txt, blacklist
/changeset/*, and in sitemap.xml whitelist everything from
http://trac.webkit.org/changeset/1 to http://trac.webkit.org/changeset/60000
(it's going to be a big file alright).
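A minimal sketch of what generating such a sitemap could look like; the
revision range, URL base, and output filename here are assumptions for
illustration, not an actual WebKit tool:

```python
# Sketch: emit a sitemap.xml that whitelists one exact URL per changeset.
# MAX_REVISION (60000) and the output filename are hypothetical.
MAX_REVISION = 60000

def build_sitemap(max_revision):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for rev in range(1, max_revision + 1):
        lines.append('  <url>')
        lines.append('    <loc>http://trac.webkit.org/changeset/%d</loc>' % rev)
        lines.append('    <changefreq>yearly</changefreq>')
        lines.append('  </url>')
    lines.append('</urlset>')
    return '\n'.join(lines)

if __name__ == '__main__':
    with open('sitemap.xml', 'w') as f:
        f.write(build_sitemap(MAX_REVISION))
```

At ~60,000 `<url>` entries this stays well under the sitemap protocol's
50,000-URLs-per-file limit only if split into two files, so a real version
would probably need a sitemap index.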

On Tue, Dec 1, 2009 at 11:33 AM, Mark Rowe <mrowe at apple.com> wrote:

>
> On 2009-12-01, at 11:04, Yaar Schnitman wrote:
>
> Robots.txt can exclude most of the trac site, and then include the
> sitemap.xml. This way you block most of the junk and only give permission to
> the important files. All major search engines support sitemap.xml, and those
> that don't will be blocked by robots.txt.
>
> A script could generate sitemap.xml from a local svn checkout of trunk. It
> would produce one URL for each source file (frequency=daily) and one URL for
> every revision (frequency=yearly). That would cover most of the search
> requirements.
>
>
> Forgive me, but this doesn't seem to address the issues that I raised in my
> previous message.
>
> To reiterate: We need to allow only an explicit set of URLs to be crawled.
>
>
> Sitemaps *do not* provide this ability.  They expose information about a set
> of URLs to a crawler; they do not limit the set of URLs that it can crawl.
> A robots.txt file *does* provide the ability to limit the set of URLs
> that can be crawled.
>
> However, the semantics of robots.txt seem to make it incredibly unwieldy to
> expose *only* the content of interest, if it is possible at all.  For
> instance, to expose <http://trac.webkit.org/changeset/#{revision}>
> while preventing <http://trac.webkit.org/changeset/#{revision}/#{path}> or
> <http://trac.webkit.org/changeset/#{revision}?format=zip&new=#{revision}>
> from being crawled.  Another example would be exposing
> <http://trac.webkit.org/browser/#{path}>
> while preventing <http://trac.webkit.org/browser/#{path}?rev=#{revision}>
> from being crawled.
>
> Is there something that I'm missing?
>
> - Mark
>
>
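For the record, the kind of rules Mark describes might be sketched like this.
This is a hypothetical fragment relying on the Allow directive and wildcard
extensions, which are not part of the original robots.txt standard, and the
precedence between overlapping Allow/Disallow rules is crawler-specific,
which is part of what makes this so unwieldy:

```
# Hypothetical robots.txt sketch: expose changeset and browser pages
# while blocking per-path diffs and query-string variants.
User-agent: *
Disallow: /
Allow: /changeset/
Disallow: /changeset/*/
Disallow: /changeset/*?
Allow: /browser/
Disallow: /browser/*?
```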

