Description
Bug report
The `urllib.robotparser` module implements an unofficial standard originally specified at http://www.robotstxt.org/orig.html, with some additions: it supports not only "Disallow" but also "Allow" rules, plus the extra fields "Crawl-delay", "Request-rate" and "Sitemap". Real-world use of robots.txt files has diverged significantly from that original specification. The new standard, RFC 9309, was published in 2022, but its drafts had already served as a de facto standard for many years before that.

There are several open issues about the module's inconsistency with current practice. These can be addressed separately, but to resolve the problem for good we need to implement support for RFC 9309. I consider this a bug fix rather than a feature request, because incorrect handling of robots.txt files can turn Python code that uses robotparser into an effectively malicious crawler (for example, one that fetches URLs the site operator has disallowed).
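For context, a minimal sketch of the fields the current module already exposes (the robots.txt content and the example.com URLs below are hypothetical; note that the module matches rules in file order, not by RFC 9309's longest-match precedence):

```python
import urllib.robotparser

# Hypothetical robots.txt exercising the fields mentioned above.
robots_txt = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
Crawl-delay: 10
Request-rate: 1/5
Sitemap: https://example.com/sitemap.xml
"""

parser = urllib.robotparser.RobotFileParser()
parser.modified()  # mark the file as "read"; without this, can_fetch() always returns False
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/private/secret.html"))       # False
print(parser.can_fetch("*", "https://example.com/private/public-page.html"))  # True
print(parser.crawl_delay("*"))   # 10
print(parser.request_rate("*"))  # RequestRate(requests=1, seconds=5)
print(parser.site_maps())        # ['https://example.com/sitemap.xml']
```

Note that the "Allow" line must precede the "Disallow" line here: the parser returns the first matching rule, whereas RFC 9309 requires the most specific (longest) matching path to win regardless of order, which is one of the divergences at issue.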
See also https://discuss.python.org/t/about-robotparser/103683