Bots possibly lying about purpose of collection #104
Replies: 4 comments
-
I changed this to a discussion until something actionable has been agreed.
-
As I wrote in #86: we could easily get into scope creep. Options that occur to me are (a) a hard fork of this project with a larger scope (abusive crawlers?) and (b) maintaining "min" and "max" lists, with the min list being like the current list and the max list having a wider scope. The problem in either case is how to define the wider scope crisply, so it doesn't just end up being a list of all the crawlers someone objects to.
-
If scrapers want to avoid being blocked by lists like this, all they have to do is ignore robots.txt and spoof their UA to something generic. The overlap of bots rude enough to lie about their purpose but polite enough to still respect robots.txt has to be pretty small.
-
Don't forget bots which ignore robots.txt but are up-front about their UA. The README suggests several ways of blocking such bots. Bots that spoof their UA are going to slip through the net, but will hopefully earn their owners a bad reputation in the process.
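For context, the README's approaches to bots that ignore robots.txt but declare an honest UA generally amount to matching the User-Agent at the server level. A minimal sketch for nginx (the bot names here are illustrative examples, not the project's actual list):

```nginx
# Return 403 to any request whose User-Agent matches a listed AI crawler.
# Case-insensitive regex match; names below are examples only.
if ($http_user_agent ~* "(GPTBot|CCBot|Google-Extended)") {
    return 403;
}
```

The Apache equivalent uses a `RewriteCond` on `%{HTTP_USER_AGENT}`. Either way, this only catches bots that are honest about their UA, which is exactly the limitation discussed above.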
-
As anti-AI block lists become more common and easier to deploy via e.g. Cloudflare, it is inevitable that some bots will simply lie about their purpose. This already seems to be happening with Semrush, which has grown to send a volume of repetitive requests that is, to me, implausible for anything except LLM training.
This raises questions about which abusive bots should and shouldn't be included in ai-robots.txt in the future.
Originally posted by @NotAFile in #86 (comment)