[webkit-dev] Spam and indexing

Fri Mar 29 09:48:07 PDT 2019

29.03.2019, 19:16, "Alexey Proskuryakov" <ap at webkit.org>:
>> On March 28, 2019, at 14:10, Konstantin Tokarev <annulen at yandex.ru> wrote:
>>
>> 28.03.2019, 23:58, "Alexey Proskuryakov" <ap at webkit.org>:
>>> Hello,
>>>
>>> The robots.txt file that we have on bugs.webkit.org currently allows search engines access to individual bug pages, but not to any bug lists. As a result, search engines and the Internet Archive only index bugs that were filed before the robots.txt changes a few years ago, and bugs that are directly linked from webpages elsewhere. These bugs are where most spam content naturally ends up.
>>>
>>> This is quite wrong, as indexing just a subset of bugs is not beneficial to anyone other than spammers. So we can go in either direction:
>>>
>>> 1. Allow indexers to enumerate bugs, thus indexing all of them.
>>>
>>> Seems reasonable that people should be able to find bugs using search engines.
>>
>> Yes, and it may give even better results than searching Bugzilla directly
>>
>>> On the other hand, we'll need to do something to ensure that indexers don't destroy Bugzilla performance,
>>
>> This can be solved by caching
>
> Is this something that other Bugzilla instances do? I'm actually not sure how caching can be meaningfully applied to Bugzilla. One wants to always see the latest updates, and our automation in particular won't be OK with stale data.

I'm not sure whether HTTP-level caching can be used here, but a quick search turns up this:
https://www.bugzilla.org/releases/5.0.4/release-notes.html#feat_caching_performance

If we can update Bugzilla, it should at least be possible to reduce the number of database hits when pages
are rendered.
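
To illustrate why fewer database hits need not mean stale data, here is a minimal
sketch of a read-through cache that is invalidated on every write. It is not
Bugzilla's implementation; BugCache, load_bug and write_bug are hypothetical names,
and a plain dict stands in for the database.

```python
class BugCache:
    """Read-through cache in front of a (hypothetical) bug store."""

    def __init__(self, load_bug):
        self._load_bug = load_bug  # callable that actually hits the database
        self._entries = {}         # bug_id -> cached bug data
        self.db_hits = 0           # counts real database reads

    def get(self, bug_id):
        # Only go to the database when the bug is not cached yet.
        if bug_id not in self._entries:
            self.db_hits += 1
            self._entries[bug_id] = self._load_bug(bug_id)
        return self._entries[bug_id]

    def update(self, bug_id, write_bug, new_value):
        # Write through to the database, then drop the cached copy
        # so the next read sees the latest data.
        write_bug(bug_id, new_value)
        self._entries.pop(bug_id, None)


# A dict stands in for the database in this sketch.
database = {1: "original summary"}
cache = BugCache(database.__getitem__)

cache.get(1)
cache.get(1)  # served from the cache, no second database read
cache.update(1, database.__setitem__, "edited summary")
print(cache.get(1), cache.db_hits)
```

With this shape, repeated reads by crawlers are served from memory, while any
change to a bug forces the next read to go back to the database.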

> - Alexey
>
>>> and of course spammers will love having more flexibility.
>>
>> rel="nofollow" on all links in comments should be enough to make spamming useless
>>
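
A rough sketch of what I mean by nofollow on comment links. This is not how
Bugzilla renders comments; linkify_comment and URL_RE are hypothetical names, and
the URL pattern is deliberately simplistic.

```python
import html
import re

# Simplistic pattern for bare http(s) URLs in plain-text comments (an assumption).
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def linkify_comment(text):
    """Escape comment text and wrap bare URLs in rel="nofollow" links."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}" rel="nofollow">{url}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(linkify_comment("see http://example.com/x for details"))
```

Every link a commenter posts then carries rel="nofollow", which asks search
engines not to credit the target with the link.
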
>>> 2. Block indexing completely.
>>>
>>> Seems like no one was bothered by lack of indexing on new bugs so far.
>>
>> That's survivorship bias - if nobody can find relevant bugs, nobody will ever complain
>>
>>> Thoughts?
>>>
>>> For reference, here is the current robots.txt content:
>>>
>>> $ curl https://bugs.webkit.org/robots.txt
>>> User-agent: *
>>> Allow: /index.cgi
>>> Allow: /show_bug.cgi
>>> Disallow: /
>>> Crawl-delay: 20
>>>
>>> - Alexey
>>>
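
The effect of the quoted rules can be checked with Python's standard
urllib.robotparser. The sketch below is only an illustration: the two URLs are
made-up examples, and the extra "Allow: /buglist.cgi" line is one hypothetical
way to express option 1, not a proposed final robots.txt. Real crawlers may
resolve rule precedence differently from Python's parser, although for these
rules the outcome should be the same.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, verbatim.
CURRENT_RULES = """\
User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Disallow: /
Crawl-delay: 20
"""


def parse_rules(text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Example URLs (assumptions, chosen only for illustration).
BUG_PAGE = "https://bugs.webkit.org/show_bug.cgi?id=1"
BUG_LIST = "https://bugs.webkit.org/buglist.cgi?product=WebKit"

current = parse_rules(CURRENT_RULES)
print(current.can_fetch("*", BUG_PAGE))  # individual bug pages
print(current.can_fetch("*", BUG_LIST))  # bug lists
print(current.crawl_delay("*"))

# Hypothetical variant for option 1: also let crawlers enumerate bug lists.
option1 = parse_rules(
    CURRENT_RULES.replace("Disallow: /", "Allow: /buglist.cgi\nDisallow: /")
)
print(option1.can_fetch("*", BUG_LIST))
```
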
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev at lists.webkit.org
>>> https://lists.webkit.org/mailman/listinfo/webkit-dev
>>
>> --
>> Regards,
>> Konstantin

-- 
Regards,
Konstantin