Bots possibly lying about purpose of collection #104
Replies: 4 comments
-
I changed this to a discussion until something actionable has been agreed.
-
As I wrote in #86: we could easily get into scope creep. Options that occur to me are (a) a hard fork of this project with a larger scope (abusive crawlers?) and (b) maintaining "min" and "max" lists, with the min list being like the current list and the max list having a wider scope. The problem in either case is how to define the wider scope crisply, so it doesn't just end up being a list of all the crawlers someone objects to.
-
If scrapers want to avoid being blocked by lists like this, all they have to do is ignore robots.txt and spoof their UA to something generic. The overlap of bots rude enough to lie about their purpose but polite enough to still respect robots.txt has to be pretty small.
-
Don't forget bots which ignore robots.txt but are up-front about their UA. The README suggests several ways of blocking such bots. Bots that spoof their UA are going to slip through the net, but will hopefully earn their owners a bad reputation in the process.
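For context, the README's approaches to bots that ignore robots.txt but declare an honest UA generally amount to matching the User-Agent at the server level. A minimal sketch for nginx (the bot names here are illustrative examples, not the project's actual list):

```nginx
# Return 403 to any request whose User-Agent matches a listed AI crawler.
# Case-insensitive regex match; names below are examples only.
if ($http_user_agent ~* "(GPTBot|CCBot|Google-Extended)") {
    return 403;
}
```

The Apache equivalent uses a `RewriteCond` on `%{HTTP_USER_AGENT}`. Either way, this only catches bots that are honest about their UA, which is exactly the limitation discussed above.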
-
As anti-AI block lists become more common and easier to deploy via e.g. Cloudflare, it is inevitable that some bots will simply lie about their purpose. This already seems to be happening with Semrush, which has grown to send a volume of repetitive requests that is, to me, implausible for anything except LLM training.
This raises questions about which abusive bots should and shouldn't be included in ai-robots.txt in the future.
Originally posted by @NotAFile in #86 (comment)