
robots.allowed returns false for other sites (and domains) #110

@gk1544

Description


Hi,

Let's take a look at the following example from Google:

robots.txt location:
http://example.com/robots.txt

Valid for:
http://example.com/
http://example.com/folder/file
Not valid for:
http://other.example.com/
https://example.com/
http://example.com:8181/

For instance, when asked whether a page on http://other.example.com/ is allowed, reppy returns False.

It should either return True or raise an exception, but certainly not False. The robots.txt fetched from example.com says nothing about other hosts, and robots.txt is not a whitelist, so answering False incorrectly denies access to pages the file does not govern.

Here is an example:

from reppy.robots import Robots

robots_content = 'Disallow: /abc'
robots = Robots.parse('http://example.com/robots.txt', robots_content)

print(robots.allowed('http://example.com/', '*'))
# True (**correct**)
print(robots.allowed('http://other.example.com/', '*'))
# False (**incorrect**: should be True)
print(robots.allowed('http://apple.com/', '*'))
# False (**incorrect**: should be True)
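In the meantime, a caller can guard against this by checking scope before consulting the parser. Here is a minimal sketch of Google's scheme/host/port rule quoted above; the `same_scope` helper is hypothetical and not part of reppy's API:

```python
from urllib.parse import urlparse

def same_scope(robots_url, page_url):
    """Return True if the robots.txt at robots_url governs page_url.

    Per the rules quoted above, a robots.txt file applies only to URLs
    with the same scheme, host, and port as the robots.txt location.
    (Hypothetical helper, not part of reppy.)
    """
    a, b = urlparse(robots_url), urlparse(page_url)
    # urlparse reports port as None when unspecified; fill in the
    # scheme's default so http://example.com/ matches http://example.com:80/
    defaults = {'http': 80, 'https': 443}
    port_a = a.port or defaults.get(a.scheme)
    port_b = b.port or defaults.get(b.scheme)
    return (a.scheme, a.hostname, port_a) == (b.scheme, b.hostname, port_b)

# Matches the valid/not-valid list above:
print(same_scope('http://example.com/robots.txt', 'http://example.com/folder/file'))
# True
print(same_scope('http://example.com/robots.txt', 'http://other.example.com/'))
# False
print(same_scope('http://example.com/robots.txt', 'https://example.com/'))
# False
print(same_scope('http://example.com/robots.txt', 'http://example.com:8181/'))
# False
```

Only when `same_scope` returns True is the parser's allowed/disallowed answer meaningful; otherwise the file simply does not apply.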
