
urllib.robotparser misbehaves in the case of "Disallow crawling of the whole site except a subdirectory" #116740

Description

@jooncheol

Bug report

Bug description:

The example code below returns False in my test environments:

  • Python 3.9.6 (macos default)
  • Python 3.12.2 (installed using brew on macos)

Test code

from urllib import robotparser

lines = [
    "User-agent: *",
    "Disallow: /",
    "Allow: /public",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# It is supposed to return True because of the last line above, but it returns False
print(rp.can_fetch("*", "https://example.org/public"))

It seems that False is returned immediately when the RobotFileParser meets the first matching rule (Disallow: /), following http://www.robotstxt.org/norobots-rfc.txt section 3.2.2, "The first match found is used".
(FYI, that document was written as a non-standard Internet-Draft in Nov 1997.)
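
As a quick illustration of that first-match behavior, reordering the rules so the more specific Allow line comes first should make the same query return True with the current parser, since the first rule that matches is then the Allow one:

from urllib import robotparser

# Same rules as above, but with the more specific Allow line first.
# With first-match semantics, the Allow rule is now hit before
# Disallow: /, so rule order (not specificity) decides the result.
lines = [
    "User-agent: *",
    "Allow: /public",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch("*", "https://example.org/public"))  # expected: True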

But the latest IETF RFC, https://www.rfc-editor.org/rfc/rfc9309.html section 2.2.2, says that the most specific match (the one matching the most octets) must be used, and that "If an "allow" rule and a "disallow" rule are equivalent, then the "allow" rule SHOULD be used". Under that resolution order, Allow: /public is the most specific match for /public and wins over Disallow: /.

The given example should return True, and I believe this is the behavior the industry expects now.
Reference: "Disallow crawling of the whole site except a subdirectory" under "Useful rules" in https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt (How Google interprets the robots.txt specification).
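
For reference, a minimal sketch of the RFC 9309 resolution order (the longest matching rule wins; on a tie, Allow wins). The rfc9309_can_fetch helper and the rule-tuple format are illustrative only, not part of urllib.robotparser, and the sketch ignores percent-encoding and the * / $ wildcards that RFC 9309 also covers:

def rfc9309_can_fetch(rules, path):
    # rules: list of ("allow" | "disallow", rule_path) in file order.
    # RFC 9309 section 2.2.2: the most specific (longest) matching rule
    # is used; if an allow and a disallow rule match equally, allow wins.
    best = None  # (match_length, allowed), compared lexicographically
    for kind, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            candidate = (len(rule_path), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is not restricted.
    return True if best is None else best[1]

rules = [("disallow", "/"), ("allow", "/public")]
print(rfc9309_can_fetch(rules, "/public"))   # True: Allow: /public is the longest match
print(rfc9309_can_fetch(rules, "/private"))  # False: only Disallow: / matches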

CPython versions tested on:

3.12

Operating systems tested on:

macOS


Labels

3.15 (new features, bugs and security fixes), type-bug (an unexpected behavior, bug, or error), type-feature (a feature request or enhancement)
