Open
Description
When dynamic pages are generated from the incoming URL and editors can enter links without a protocol, some CMSs can easily generate infinitely expanding URL trees. For example:
- http://example.com/event/{event-id} is used as a URL route
- Editors may create arbitrary sub-pages for the event, so any number of sub-URLs are theoretically valid.
- Event ID parsing is permissive, discarding anything after the ID but not redirecting to the shorter, valid URL. For example, '1234' and '1234/foo' both render the page for '1234' but no redirection is performed.
- If a malformed URL with no protocol is added to that page, browsers will treat it as a relative link and resolve it against the current page's URL. For example, 'www.cnn.com' becomes 'http://example.com/event/1234/www.cnn.com'.
- Following the malformed link, with the permissive ID parsing in effect, generates a new page that stacks the URL: 'http://example.com/event/1234/www.cnn.com/www.cnn.com'.
- Particularly when multiple protocol-less links are present, the size of the crawl explodes geometrically, quickly skewing page counts and bloating the dataset.
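To make the stacking concrete, here is a minimal sketch using Python's standard `urllib.parse.urljoin`, which follows the usual relative-reference resolution rules. The helper name `follow` is hypothetical, and the sketch assumes the page URL ends in a slash and the link was entered as `www.cnn.com/`; under standard resolution, a base without a trailing slash would have its last segment replaced rather than extended.

```python
from urllib.parse import urljoin


def follow(base: str, href: str, hops: int) -> list[str]:
    """Resolve the same protocol-less href repeatedly, as a crawler would
    when every generated page contains the same malformed link."""
    urls = []
    current = base
    for _ in range(hops):
        # With no scheme, the href is treated as a relative path and is
        # resolved against the URL of the page it was found on.
        current = urljoin(current, href)
        urls.append(current)
    return urls


# Assumption: trailing slashes on both the base URL and the link.
chain = follow("http://example.com/event/1234/", "www.cnn.com/", 3)
for url in chain:
    print(url)
```

Each hop yields a URL the crawler has not seen before, so ordinary de-duplication of visited URLs does not stop the growth.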
We need to figure out if there's a good way to detect these scenarios; even a brute-force check is probably preferable to a hung crawl or (worse) a dataset that's trashed and has to be re-crawled.
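One possible brute-force check, sketched here as a starting point rather than a proposed implementation: flag a URL when its path is unusually deep or when a run of path segments is immediately repeated. The function names and the `max_depth` default are hypothetical, and the heuristic will produce false positives on sites whose legitimate paths repeat a segment.

```python
from urllib.parse import urlsplit


def repeated_run(segments: list[str]) -> bool:
    """Return True if some run of one or more segments is immediately
    followed by an identical run (e.g. x/x or a/b/a/b)."""
    n = len(segments)
    for size in range(1, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            first = segments[start:start + size]
            second = segments[start + size:start + 2 * size]
            if first == second:
                return True
    return False


def looks_like_url_trap(url: str, max_depth: int = 12) -> bool:
    """Heuristic: the path is too deep or contains a repeated segment run.

    max_depth is an arbitrary illustrative threshold, not a tuned value.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) > max_depth or repeated_run(segments)
```

A crawler could call `looks_like_url_trap` before enqueueing a discovered link and either skip the URL or log it for review. Chains that alternate between several malformed links are only caught once a full cycle repeats, so the depth limit acts as a backstop.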