Don't allow arbitrary prefixes to our paths #336

@PGijsbers

Description

I updated the robots.txt in #334. Unfortunately, we still see a sizable number of crawlers getting stuck because of two issues (see also #335). One issue is that URLs may contain arbitrary prefixes in their path: for example, http://openml.org/not-really-something-we-want/d/151 will gladly redirect to the dataset page instead of going to a 404 page. As I understand it, this means crawlers will happily keep crawling these pages (in any case, crawlers do visit pages with prefixes that don't do anything). I am hoping/assuming that disallowing these arbitrary prefixes will significantly reduce traffic, since there are fewer URLs to explore.
I am also not sure why crawlers try to crawl these pages in the first place; that is probably a separate issue to figure out.
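
For illustration, here is a minimal sketch of how an unanchored route pattern can produce this behaviour, and how anchoring the pattern would send prefixed paths to a 404 instead. The Express-style router, the regex route, and the `/datasets/:id` redirect target are assumptions made for the example, not the actual openml.org routing code.

```ts
import express from "express";

const app = express();

// Problematic (assumed) pattern: because it is not anchored to the start of
// the path, it also matches /not-really-something-we-want/d/151.
// app.get(/\/d\/(\d+)/, ...);

// Anchored pattern: only paths that start with /d/ match, so URLs with
// arbitrary prefixes fall through to the 404 handler below.
app.get(/^\/d\/(\d+)$/, (req, res) => {
  // Redirect target is hypothetical, just to show the short-link behaviour.
  res.redirect(`/datasets/${req.params[0]}`);
});

// Anything that did not match an explicit route returns a real 404,
// so crawlers get nothing further to follow.
app.use((_req, res) => {
  res.status(404).send("Not found");
});

app.listen(3000);
```

With the anchored pattern, the example URL above would return a 404 instead of redirecting, leaving crawlers nothing to index under arbitrary prefixes.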

Metadata

Labels: bug (Something isn't working)
