
Support RFC 9309 in robotparser #138907

@serhiy-storchaka

Description


Bug report

The urllib.robotparser module implements the unofficial standard originally specified at http://www.robotstxt.org/orig.html, with some additions: it supports not only "Disallow" but also "Allow" rules, plus the extra fields "Crawl-delay", "Request-rate" and "Sitemap". However, the practice of using robots.txt files has diverged significantly from the original specification. The new standard, RFC 9309, was published in 2022, and its drafts had served as a de facto standard for many years before that.

There are several open issues about the module's inconsistency with current practice. They can be addressed separately, but to resolve the problem for good we need to implement support for RFC 9309. I consider this not a feature request but a bug fix, because incorrect parsing of robots.txt files can make a crawler that relies on robotparser behave like a malicious bot, fetching pages that the site owner has explicitly disallowed.

See also https://discuss.python.org/t/about-robotparser/103683
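A minimal sketch of one such discrepancy (the robots.txt content and the user-agent name here are made up for illustration): the current parser applies the first rule whose path prefix matches, while RFC 9309 (section 2.2.2) requires that the most specific match, i.e. the one with the longest matching path, take precedence.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /folder/
Allow: /folder/page
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The current parser returns the allowance of the first matching rule,
# so "Disallow: /folder/" wins and the page is reported as off-limits.
# Under RFC 9309 longest-match precedence, "Allow: /folder/page" is the
# longer (more specific) match, so the page should be fetchable.
print(rp.can_fetch("mybot", "https://example.com/folder/page"))  # False

# Paths matched by no rule are allowed either way.
print(rp.can_fetch("mybot", "https://example.com/elsewhere"))  # True
```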

Metadata

Labels

    3.13 (bugs and security fixes)
    3.14 (bugs and security fixes)
    3.15 (new features, bugs and security fixes)
    stdlib (Standard Library Python modules in the Lib/ directory)
    type-bug (An unexpected behavior, bug, or error)
