Bug report
Bug description:
The example code below returns False in my test environments:
- Python 3.9.6 (macos default)
- Python 3.12.2 (installed using brew on macos)
Test code
from urllib import robotparser
lines = [
"User-agent: *",
"Disallow: /",
"Allow: /public",
]
rp = robotparser.RobotFileParser()
rp.parse(lines)
# It is supposed to return True because of the Allow rule above, but it returns False
print(rp.can_fetch("*", "https://example.org/public"))

It seems that RobotFileParser returns False as soon as it meets the first matching rule (Disallow: /), following http://www.robotstxt.org/norobots-rfc.txt section 3.2.2, "The first match found is used".
(FYI, that document was published as a non-standard Internet draft in Nov 1997.)
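For illustration only, here is a minimal sketch of that first-match behaviour (this is not the stdlib source; it assumes plain prefix matching, with no wildcards or percent-encoding handling):

def can_fetch_first_match(rules, path):
    """rules: list of (allow, path_prefix) tuples in file order."""
    for allow, prefix in rules:
        if path.startswith(prefix):
            return allow  # the first matching rule decides
    return True  # no rule matched: allowed by default

# "Disallow: /" matches every path before "Allow: /public" is ever reached.
print(can_fetch_first_match([(False, "/"), (True, "/public")], "/public"))  # False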
However, the latest IETF RFC, https://www.rfc-editor.org/rfc/rfc9309.html section 2.2.2, says: "If an "allow" rule and a "disallow" rule are equivalent, then the "allow" rule SHOULD be used".
The given example should return True, and I believe this is the behavior the industry now expects.
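Section 2.2.2 also requires the most specific (longest) matching path to take precedence. A minimal sketch of that longest-match rule with the allow preference on ties (again an illustration only, assuming plain prefix matching and no wildcards) yields True for the example:

def can_fetch_rfc9309(rules, path):
    """rules: list of (allow, path_prefix) tuples; order does not matter."""
    matches = [(len(prefix), allow) for allow, prefix in rules
               if path.startswith(prefix)]
    if not matches:
        return True  # no rule matched: allowed by default
    # Longest prefix wins; on equal length, allow (True) sorts after disallow (False).
    matches.sort()
    return matches[-1][1]

# "Allow: /public" (7 characters) is more specific than "Disallow: /" (1 character).
print(can_fetch_rfc9309([(False, "/"), (True, "/public")], "/public"))  # True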
Reference: the "Useful rules" entry "Disallow crawling of the whole site except a subdirectory" in https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt (How Google interprets the robots.txt specification).
CPython versions tested on:
3.12
Operating systems tested on:
macOS