Hello,

The robots.txt file that we have on bugs.webkit.org currently allows search engines access to individual bug pages, but not to any bug lists. As a result, search engines and the Internet Archive only index bugs that were filed before the robots.txt changes a few years ago, and bugs that are directly linked from webpages elsewhere. These bugs are where most spam content naturally ends up.

This is quite wrong, as indexing just a subset of bugs is not beneficial to anyone other than spammers. So we can go in either direction:

1. Allow indexers to enumerate bugs, thus indexing all of them. Seems reasonable that people should be able to find bugs using search engines. On the other hand, we'll need to do something to ensure that indexers don't destroy Bugzilla performance, and of course spammers will love having more flexibility.

2. Block indexing completely. Seems like no one was bothered by lack of indexing on new bugs so far.

Thoughts?

For reference, here is the current robots.txt content:

$ curl https://bugs.webkit.org/robots.txt
User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Disallow: /
Crawl-delay: 20

- Alexey
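For concreteness, here is a rough sketch of what each option could look like in robots.txt. This is illustrative only: buglist.cgi is assumed to be the standard Bugzilla bug-list script, and the Crawl-delay value is a placeholder that would need tuning against real crawler load, not a tested configuration.

Option 1, allow indexers to enumerate bugs via bug lists:

User-agent: *
Allow: /index.cgi
Allow: /show_bug.cgi
Allow: /buglist.cgi
Disallow: /
Crawl-delay: 60

Option 2, block indexing completely:

User-agent: *
Disallow: /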
28.03.2019, 23:58, "Alexey Proskuryakov" <ap@webkit.org>:
1. Allow indexers to enumerate bugs, thus indexing all of them.
Seems reasonable that people should be able to find bugs using search engines.
Yes, and it may give better results even than searching Bugzilla directly.
On the other hand, we'll need to do something to ensure that indexers don't destroy Bugzilla performance,
This can be solved by caching
and of course spammers will love having more flexibility.
rel="nofollow" on all links in comments should be enough to make spamming useless
2. Block indexing completely.
Seems like no one was bothered by lack of indexing on new bugs so far.
That's survival bias - if nobody can find relevant bugs, nobody will ever complain
-- Regards, Konstantin
On Mar 28, 2019, at 2:10 PM, Konstantin Tokarev <annulen@yandex.ru> wrote:
rel="nofollow" on all links in comments should be enough to make spamming useless
Theoretically yes… but a couple of Google searches say it doesn’t make a difference. Here is one of many: https://www.seroundtable.com/google-nofollow-link-attribute-failed-comments-26959.html

I expect that spammers don’t really care whether they get a nofollow or not; they are mostly unmanned scripts anyway. I’m not opposed to adding this, I just don’t expect it will solve the problem. We could measure and see.

Lucas
On Mar 28, 2019, at 2:10 PM, Konstantin Tokarev <annulen@yandex.ru> wrote:
This can be solved by caching
Is this something that other Bugzilla instances do? I'm actually not sure how caching can be meaningfully applied to Bugzilla. One wants to always see the latest updates, and our automation in particular won't be OK with stale data.

- Alexey
29.03.2019, 19:16, "Alexey Proskuryakov" <ap@webkit.org>:
Is this something that other Bugzilla instances do? I'm actually not sure how caching can be meaningfully applied to Bugzilla. One wants to always see the latest updates, and our automation in particular won't be OK with stale data.
I'm not sure whether HTTP-level caching can be used here, but a quick search brings up this: https://www.bugzilla.org/releases/5.0.4/release-notes.html#feat_caching_perf...

If we can update Bugzilla, it should be possible at least to reduce the number of database hits when pages are rendered.
-- Regards, Konstantin
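For reference, the caching feature described in those release notes is Bugzilla's optional Memcached support. A minimal sketch of what enabling it could involve, assuming a memcached instance running next to the web server and assuming the upstream parameter names apply to our install (values below are illustrative, not a tested configuration):

# Run memcached locally, then point Bugzilla at it via the Parameters
# admin page:
memcached_servers = 127.0.0.1:11211
memcached_namespace = bugzilla: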
On Thu, Mar 28, 2019 at 3:57 PM, Alexey Proskuryakov <ap@webkit.org> wrote:
2. Block indexing completely.
Seems like no one was bothered by lack of indexing on new bugs so far.
The spam problem seems worse than not being indexed. If you want to search for WebKit bugs, you can do that on WebKit Bugzilla, right?

Michael
29.03.2019, 19:30, "Michael Catanzaro" <mcatanzaro@igalia.com>:
Spam problem seems worse than not being indexed.
If you want to search for WebKit bugs, you can do that on WebKit Bugzilla, right?
1. If an issue is referenced from external web sites (e.g., Stack Overflow), it's placed higher in search engine results, so for issues affecting many people, using a search engine may allow finding the right issue faster.

2. Search engines allow searching for an answer across the whole web, which may be useful if one is not sure whether a bug is in WebKit or not.

-- Regards, Konstantin
Not indexing bugs.webkit.org will be sad for people who won't be able to find bugs they may be interested in via search engines... but those people are probably not WebKit developers working with WebKit on a daily basis. For us, it's just annoying to deal with the spam. I would turn off the indexing if we think it could make a difference.
22.04.2019, 18:58, "Michael Catanzaro" <mcatanzaro@igalia.com>:
Not indexing bugs.webkit.org will be sad for people who won't be able to find bugs they may be interested in via search engines... but those people are probably not WebKit developers working with WebKit on a daily basis. For us, it's just annoying to deal with the spam. I would turn off the indexing if we think it could make a difference.
Another possible way is to disable self-registration for new users, similarly to what the LLVM project did [1].

[1] https://bugs.llvm.org/

-- Regards, Konstantin
On Mon, Apr 22, 2019 at 11:06 AM, Konstantin Tokarev <annulen@yandex.ru> wrote:
Another possible way is to disable self-registration for new users, similarly to what LLVM project did [1].
GCC Bugzilla did this a long time ago. It will make it really hard to convince users to report bugs. I would try deindexing first, since it's a smaller hammer. Then we could try this if that fails.

Michael
One change that I'm going to make is to mark spam comments as private instead of simply tagging them. That way, bugs will look cleaner, and there will be no doubt about whether search engines index hidden comments or not. I'll also mark old spam comments as private.

I think that this will generate e-mail notifications, so apologies for the upcoming e-mail storm. These should be possible to delete all at once in most e-mail clients.

- Alexey
Should we post instructions somewhere for people dealing with spam? I believe the instructions are:

1) Look up the email address of the account that posted the spam and disable it first, so spammers don’t get email about other steps. Do this by clicking on Administration, Users, finding the user and putting the word “Spam” into the disable text.

2) Move any spam bugs into the Spam component.

3) Mark any spam comments as Private and also add the tag “spam”.

But maybe there’s more to it than that. For example, can someone without administration privileges do the right thing? Should we make a small tool to make this easier to do correctly?

I like the idea of having instructions so this isn’t oral tradition.

— Darin
I posted a tool that I used for this today to https://bugs.webkit.org/show_bug.cgi?id=197537. There's probably a lot to improve, but it works.

- Alexey
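For anyone who would rather script those steps than click through the admin UI, here is a minimal sketch against the Bugzilla REST API. This is not the tool attached to bug 197537, just an illustration: it assumes Python 3 with the "requests" package and an API key belonging to an administrator account, and the endpoint and field names should be checked against the WebService documentation for the Bugzilla version that bugs.webkit.org runs.

import requests

BUGZILLA = "https://bugs.webkit.org/rest"
API_KEY = "..."  # an API key for an administrator account (placeholder)

def disable_account(login):
    # Step 1: disable the spammer's account so it stops getting notifications.
    requests.put(f"{BUGZILLA}/user/{login}", params={"api_key": API_KEY},
                 json={"login_denied_text": "Spam", "email_enabled": False})

def move_bug_to_spam(bug_id):
    # Step 2: move a spam bug into the Spam component.
    requests.put(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY},
                 json={"component": "Spam"})

def hide_spam_comment(bug_id, comment_id):
    # Step 3: mark the comment as private and tag it "spam".
    requests.put(f"{BUGZILLA}/bug/{bug_id}", params={"api_key": API_KEY},
                 json={"comment_is_private": {str(comment_id): True}})
    requests.put(f"{BUGZILLA}/bug/comment/{comment_id}/tags",
                 params={"api_key": API_KEY}, json={"add": ["spam"]})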
participants (5)
- Alexey Proskuryakov
- Darin Adler
- Konstantin Tokarev
- Lucas Forschler
- Michael Catanzaro