ignore / limit crawling of same-structure URL paths with variable IDs #1526
Closed
OlexTratisky started this conversation in Ideas
I’m using katana and running into an issue with URLs that share the same structure but differ only by a path parameter (ID).
For example:
/details/123
/details/456
/details/789
Katana treats each of these as a new URL, but in reality they all resolve to the same page template, just with a different ID. This causes the crawler to spend a huge amount of time crawling essentially identical pages.
I’m aware of options like -iqp / -ignore-query-params, but those only work for query parameters, not path parameters.
I also tried using:
-crawl-out-scope ".*/details/.*"
However, this excludes everything under /details/, which is not ideal because there may be useful pages under paths like:
/details/more_details/
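To make the over-matching concrete, here is a small Python check. It does not run katana; it only compares two regular expressions against the example paths. The broad pattern stands in for an out-of-scope rule covering all of /details/. The narrower numeric-only pattern is an assumption about what the IDs look like, and whether katana's regex engine would treat it the same way is untested.

```python
import re

# Broad pattern: matches anything under /details/ (stand-in for the rule above).
BROAD = re.compile(r"/details/")
# Narrower pattern (assumption): the final segment is purely numeric.
NUMERIC_ID = re.compile(r"/details/\d+/?$")

paths = ["/details/123", "/details/456", "/details/more_details/"]

# The broad rule catches every path, including the one worth keeping.
broad_hits = [p for p in paths if BROAD.search(p)]
# The numeric rule catches only the ID pages.
narrow_hits = [p for p in paths if NUMERIC_ID.search(p)]

print(broad_hits)
print(narrow_hits)
```

Even the narrower pattern is all-or-nothing: it would drop every ID page rather than keep a sample of them, which is why a per-structure limit is still needed.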
What I’m looking for
Is there a way (or could there be a feature) for katana to:
Detect URL paths that contain variable IDs (e.g. /details/{id})
Treat them as the same structure
Stop crawling after N occurrences (e.g. 10 unique IDs)
Then move on to URLs with a different structure, such as:
/details/more_details/
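The behaviour described above could be sketched roughly as follows. This is not katana code or an existing katana option, just a Python illustration of the idea. The names `path_shape` and `ShapeLimiter` are hypothetical, and the rule that an ID is either all digits or a UUID is an assumption.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Heuristic (assumption): a segment is an ID if it is all digits or a UUID.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def path_shape(url: str) -> str:
    """Replace ID-like path segments with the placeholder {id}."""
    segments = urlsplit(url).path.split("/")
    return "/".join("{id}" if ID_SEGMENT.match(s) else s for s in segments)


class ShapeLimiter:
    """Allow at most max_per_shape URLs for each path shape."""

    def __init__(self, max_per_shape: int = 10) -> None:
        self.max_per_shape = max_per_shape
        self.seen = Counter()

    def should_crawl(self, url: str) -> bool:
        shape = path_shape(url)
        if self.seen[shape] >= self.max_per_shape:
            return False  # this structure has been sampled enough
        self.seen[shape] += 1
        return True


# Hypothetical usage with a limit of 2 instead of 10, to keep the demo short.
limiter = ShapeLimiter(max_per_shape=2)
urls = ["/details/123", "/details/456", "/details/789", "/details/more_details/"]
kept = [u for u in urls if limiter.should_crawl(u)]
print(kept)
```

The sketch counts every URL it is shown, so it assumes the crawler has already deduplicated exact URLs. Paths with a different structure, such as /details/more_details/, get their own counter and are not affected by the /details/{id} limit.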
Conceptually, something similar to -ignore-query-params, but for path segments, or a way to define regex / templated path rules like:
/details/{id}
/details/more_details/{id}
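For the templated-rule variant, one possible reading is that each template compiles to a regular expression and every URL is grouped under the first template it matches. The Python sketch below is only an illustration of that reading, not a proposed katana syntax. `compile_template` and `match_template` are hypothetical names, and treating `{id}` as "one numeric segment" is an assumption (a looser "any segment" rule would make /details/{id} swallow /details/more_details/ as well).

```python
import re
from typing import Optional


def compile_template(template: str) -> re.Pattern:
    """Turn '/details/{id}' into a regex; {name} matches one numeric segment."""
    parts = re.split(r"(\{\w+\})", template)
    pattern = "".join(
        r"\d+" if part.startswith("{") and part.endswith("}") else re.escape(part)
        for part in parts
    )
    # Allow an optional trailing slash after the last segment.
    return re.compile("^" + pattern + "/?$")


TEMPLATES = ["/details/{id}", "/details/more_details/{id}"]
COMPILED = [(t, compile_template(t)) for t in TEMPLATES]


def match_template(path: str) -> Optional[str]:
    """Return the first template the path matches, or None."""
    for template, regex in COMPILED:
        if regex.match(path):
            return template
    return None


for path in ["/details/123", "/details/more_details/42", "/details/more_details/"]:
    print(path, "->", match_template(path))
```

A path that matches no template, like /details/more_details/ here, would simply be crawled as usual, while matched paths could be counted per template and cut off after N hits.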