Detect infinite baseUrl explosions #42

@eaton

In some situations where dynamic pages are generated based on the incoming URL, and links can be entered without a proper protocol, it's easy for some CMSs to generate infinite exploding URL trees. For example:

  1. http://example.com/event/{event-id} is used as a URL route
  2. Editors may create arbitrary sub-pages for the event, so any number of sub-urls are theoretically valid.
  3. Event ID parsing is permissive, discarding anything after the ID but not redirecting to the shorter, valid URL. For example, '1234' and '1234/foo' both render the page for '1234' but no redirection is performed.
  4. If a malformed URL with no protocol is added to that page, browsers resolve it as a relative link against the current page's URL. For example, on a page served at 'http://example.com/event/1234/', 'www.cnn.com' becomes 'http://example.com/event/1234/www.cnn.com'.
  5. Following the malformed link, with the permissive ID parsing in effect, generates a new page that stacks the URL: 'http://example.com/event/1234/www.cnn.com/www.cnn.com'.
  6. Particularly when multiple protocol-less links are present, the size of the crawl explodes geometrically, quickly skewing page counts and bloating the dataset.
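
The steps above can be reproduced with a small simulation. This is only a sketch: `simulate_crawl` is a hypothetical helper, not part of any crawler, and it assumes every page is served with a trailing slash and repeats the same protocol-less links on every permissively parsed variant.

```python
from urllib.parse import urljoin


def simulate_crawl(start_url, hrefs, max_depth):
    """Return every URL reachable by following protocol-less hrefs.

    Assumption: each page (including the stacked variants) renders the
    same set of malformed links and is served with a trailing slash.
    """
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            # A trailing slash makes the relative link append a segment
            # instead of replacing the last one.
            base = page if page.endswith("/") else page + "/"
            for href in hrefs:
                url = urljoin(base, href)
                if url not in seen:
                    seen.add(url)
                    next_frontier.append(url)
        frontier = next_frontier
    return seen


urls = simulate_crawl(
    "http://example.com/event/1234/",
    ["www.cnn.com", "www.bbc.com"],
    max_depth=3,
)
print(len(urls))
```

With two malformed links, each level of depth doubles the number of new URLs (2, then 4, then 8), which is the geometric growth described in step 6.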

We need to figure out if there's a good way to detect these scenarios; even a brute force check is probably preferable to a hung crawl or (worse) a dataset that's trashed and has to be re-crawled.
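
One possible brute-force check, offered as a starting point rather than a settled design, is to flag any URL whose path contains the same segment more than a configurable number of times. The function name `looks_like_explosion` and the `max_repeats` threshold are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_explosion(url, max_repeats=1):
    """Heuristic: True if any path segment occurs more than max_repeats times.

    Assumption: legitimate URLs on the crawled site rarely repeat a path
    segment, so repetition suggests a stacked relative link.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


print(looks_like_explosion("http://example.com/event/1234/www.cnn.com"))
print(looks_like_explosion("http://example.com/event/1234/www.cnn.com/www.cnn.com"))
```

A crawler could skip or quarantine URLs that trip the check before enqueueing them. With k distinct malformed links on a page, some segment must repeat by depth k + 1, so the check bounds the explosion without stopping it at the first bad link. Sites that legitimately repeat path segments would need a higher `max_repeats`.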
