Hello,

The robots.txt file that we have on bugs.webkit.org currently allows search engines access to individual bug pages, but not to any bug lists. As a result, search engines and the Internet Archive only index bugs that were filed before the robots.txt changes a few years ago, and bugs that are directly linked from webpages elsewhere. These bugs are where most spam content naturally ends up.

This is quite wrong, as indexing just a subset of bugs is not beneficial to anyone other than spammers. So we can go in either direction:

1. Allow indexers to enumerate bugs, thus indexing all of them. Seems reasonable that people should be able to find bugs using search engines. On the other hand, we'll need to do something to ensure that indexers don't destroy Bugzilla performance, and of course spammers will love having more flexibility.

2. Block indexing completely. Seems like no one was bothered by lack of indexing on new bugs so far.

Thoughts?

For reference, here is the current robots.txt content:

$ curl https://bugs.webkit.org/robots.txt
User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Disallow: /
Crawl-delay: 20

- Alexey
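For concreteness, here is a rough sketch of what each option could look like in robots.txt. This is illustrative only: buglist.cgi is assumed to be the standard Bugzilla bug-list script, and the Crawl-delay value is a placeholder that would need tuning against real crawler load, not a tested configuration.

Option 1, allow indexers to enumerate bugs via bug lists:

User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Allow: /buglist.cgi
Disallow: /
Crawl-delay: 60

Option 2, block indexing completely:

User-agent: *
Disallow: /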
28.03.2019, 23:58, "Alexey Proskuryakov" <ap@webkit.org>:
1. Allow indexers to enumerate bugs, thus indexing all of them.
Seems reasonable that people should be able to find bugs using search engines.
Yes, and it may give better results even than searching Bugzilla directly.
On the other hand, we'll need to do something to ensure that indexers don't destroy Bugzilla performance,
This can be solved by caching
and of course spammers will love having more flexibility.
rel="nofollow" on all links in comments should be enough to make spamming useless
2. Block indexing completely.
Seems like no one was bothered by lack of indexing on new bugs so far.
That's survival bias - if nobody can find relevant bugs, nobody will ever complain
-- Regards, Konstantin
On Mar 28, 2019, at 2:10 PM, Konstantin Tokarev <annulen@yandex.ru> wrote:
rel="nofollow" on all links in comments should be enough to make spamming useless
Theoretically yes… but a couple of Google searches say it doesn’t make a difference. Here is one of many: https://www.seroundtable.com/google-nofollow-link-attribute-failed-comments-26959.html

I expect that spammers don’t really care whether they get a nofollow or not; they are mostly unmanned scripts anyway. I’m not opposed to adding this, I just don’t expect it will solve the problem. We could measure and see.

Lucas
On Mar 28, 2019, at 2:10 PM, Konstantin Tokarev <annulen@yandex.ru> wrote:
This can be solved by caching
Is this something that other Bugzilla instances do? I'm actually not sure how caching can be meaningfully applied to Bugzilla. One wants to always see the latest updates, and our automation in particular won't be OK with stale data.

- Alexey
29.03.2019, 19:16, "Alexey Proskuryakov" <ap@webkit.org>:
Is this something that other Bugzilla instances do? I'm actually not sure how caching can be meaningfully applied to Bugzilla. One wants to always see the latest updates, and our automation in particular won't be OK with stale data.
I'm not sure whether HTTP-level caching can be used here, but a quick search brings up this: https://www.bugzilla.org/releases/5.0.4/release-notes.html#feat_caching_perf...

If we can update Bugzilla, it should be possible at least to reduce the number of database hits when pages are rendered.
-- Regards, Konstantin
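For reference, the caching feature described in those release notes is Bugzilla's optional Memcached support. A minimal sketch of what enabling it could involve, assuming a memcached instance running next to the web server and assuming the upstream parameter names apply to our install (values below are illustrative, not a tested configuration):

# Run memcached locally, then point Bugzilla at it via the Parameters
# admin page:
memcached_servers = 127.0.0.1:11211
memcached_namespace = bugzilla: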
On Thu, Mar 28, 2019 at 3:57 PM, Alexey Proskuryakov <ap@webkit.org> wrote:
2. Block indexing completely.
Seems like no one was bothered by lack of indexing on new bugs so far.
The spam problem seems worse than not being indexed. If you want to search for WebKit bugs, you can do that on WebKit Bugzilla, right?

Michael
29.03.2019, 19:30, "Michael Catanzaro" <mcatanzaro@igalia.com>:
Spam problem seems worse than not being indexed.
If you want to search for WebKit bugs, you can do that on WebKit Bugzilla, right?
1. If an issue is referenced from external web sites (e.g., Stack Overflow), it's placed higher in search engine results, so for issues affecting many people, using a search engine may allow finding the right issue faster.

2. Search engines allow searching for an answer across the whole web, which may be useful if one is not sure whether a bug is in WebKit or not.

-- Regards, Konstantin
Not indexing bugs.webkit.org will be sad for people who won't be able to find bugs they may be interested in via search engines... but those people are probably not WebKit developers working with WebKit on a daily basis. For us, it's just annoying to deal with the spam. I would turn off the indexing if we think it could make a difference.
22.04.2019, 18:58, "Michael Catanzaro" <mcatanzaro@igalia.com>:
Not indexing bugs.webkit.org will be sad for people who won't be able to find bugs they may be interested in via search engines... but those people are probably not WebKit developers working with WebKit on a daily basis. For us, it's just annoying to deal with the spam. I would turn off the indexing if we think it could make a difference.
Another possible way is to disable self-registration for new users, similarly to what the LLVM project did [1].

[1] https://bugs.llvm.org/

-- Regards, Konstantin
On Mon, Apr 22, 2019 at 11:06 AM, Konstantin Tokarev <annulen@yandex.ru> wrote:
Another possible way is to disable self-registration for new users, similarly to what LLVM project did [1].
GCC Bugzilla did this a long time ago. It will make it really hard to convince users to report bugs. I would try deindexing first, since it's a smaller hammer. Then we could try this if that fails.

Michael
One change that I'm going to make is to mark spam comments as private instead of simply tagging them. That way, bugs will look cleaner, and there will be no doubt about whether search engines index hidden comments or not. I'll also mark old spam comments as private.

I think that this will generate e-mail notifications, so apologies for the upcoming e-mail storm. These should be possible to delete all at once in most e-mail clients.

- Alexey
Should we post instructions somewhere for people dealing with spam? I believe the instructions are:

1) Look up the email address of the account that posted the spam and disable it first, so spammers don’t get email about other steps. Do this by clicking on Administration, Users, finding the user and putting the word “Spam” into the disable text.

2) Move any spam bugs into the Spam component.

3) Mark any spam comments as Private and also add the tag “spam”.

But maybe there’s more to it than that. For example, can someone without administration privileges do the right thing? Should we make a small tool to make this easier to do correctly?

I like the idea of having instructions so this isn’t oral tradition.

— Darin
I posted a tool that I used for this today to https://bugs.webkit.org/show_bug.cgi?id=197537. There's probably a lot to improve, but it works.

- Alexey
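For anyone who would rather script those steps than click through the admin UI, here is a minimal sketch against the Bugzilla REST API. This is not the tool attached to bug 197537, just an illustration: it assumes Python 3 with the "requests" package and an API key belonging to an administrator account, and the endpoint and field names should be checked against the WebService documentation for the Bugzilla version that bugs.webkit.org runs.

import requests

BUGZILLA = "https://bugs.webkit.org/rest"
API_KEY = "..."  # an API key for an administrator account (placeholder)

def disable_account(login):
    # Step 1: disable the spammer's account so it stops getting notifications.
    requests.put(f"{BUGZILLA}/user/{login}", params={"api_key": API_KEY},
                 json={"login_denied_text": "Spam", "email_enabled": False})

def move_bug_to_spam(bug_id):
    # Step 2: move a spam bug into the Spam component.
    requests.put(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY},
                 json={"component": "Spam"})

def hide_spam_comment(bug_id, comment_id):
    # Step 3: mark the comment as private and tag it "spam".
    requests.put(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY},
                 json={"comment_is_private": {str(comment_id): True}})
    requests.put(f"{BUGZILLA}/bug/comment/{comment_id}/tags",
                 params={"api_key": API_KEY}, json={"add": ["spam"]})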
participants (5)
- Alexey Proskuryakov
- Darin Adler
- Konstantin Tokarev
- Lucas Forschler
- Michael Catanzaro