Conversation

@fmigneault commented Aug 30, 2025

Purpose

Add a linkcheck_allow_forbidden option to let HTTP 403 responses be marked as "working".

This defines it as a properly configurable option, as requested in #9762 (comment).
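A minimal conf.py sketch of how this would be used; linkcheck_allow_unauthorized already exists in Sphinx, while linkcheck_allow_forbidden is only what this PR proposes:

```python
# conf.py

# Existing option: report HTTP 401 (Unauthorized) links as "working".
linkcheck_allow_unauthorized = True

# Proposed in this PR: report HTTP 403 (Forbidden) links as "working" too.
linkcheck_allow_forbidden = True
```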

References

@jayaddison (Contributor)

I don't think we should implement this. My understanding of HTTP 403 responses is that they don't indicate whether a resource exists or not, and even if a server is misconfigured in a way that does reveal existence, HTTP 4xx codes all represent errors.

Even so: if we do go ahead with something like this, I think the request was to make the list of accepted HTTP status codes configurable per URL pattern, and I don't think this PR does that yet.

@gastmaier

@fmigneault I think this is extremely nuanced, and the rules will diverge per user and even per domain. We would benefit more from examples of user-side extensions that modify the default behavior, plus adjusting the source code to allow that if needed, which IMO should be kept simple. A rough sketch of such an extension is below.
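For illustration only, a minimal monkeypatch sketch of that idea. It assumes Sphinx's private HyperlinkAvailabilityCheckWorker._check_uri helper and its (status, info, code) return shape; both are internal details that can change between releases, and flaky.example.org is a placeholder host:

```python
# conf.py -- hypothetical user-side override of the linkcheck worker.
# WARNING: _check_uri is a private Sphinx API and may change between releases.
from sphinx.builders import linkcheck

_original_check_uri = linkcheck.HyperlinkAvailabilityCheckWorker._check_uri

def _lenient_check_uri(self, uri, hyperlink):
    status, info, code = _original_check_uri(self, uri, hyperlink)
    # Example policy: accept 403 only from one known rate-limiting host.
    if status == 'broken' and code == 403 and uri.startswith('https://flaky.example.org/'):
        return 'working', info, code
    return status, info, code

linkcheck.HyperlinkAvailabilityCheckWorker._check_uri = _lenient_check_uri
```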

@fmigneault (Author)

> my understanding of HTTP 403 responses is that they don't indicate whether a resource exists or not

The same could be said about 401, yet that one is supported (via linkcheck_allow_unauthorized). Many servers don't respect the correct HTTP codes.

An increasing number of HTTP 403 responses are thrown back by rate limiting in an attempt to block checkers (even if that is not the correct code). This causes certain pipelines to fail sporadically, which is extremely annoying. The next option is to ignore the links completely, which is bad since real 404s would no longer be caught. The problem is not that certain sites always return 4xx, but that they sometimes do so for unrelated reasons. Unless an actual 404 is returned, I personally prefer to ignore these errors temporarily.

I agree that having per-site sets of HTTP codes to allow is better. Is there anything like this already in place? It seems that would be an entirely new feature, irrespective of the specific HTTP code.

@gastmaier commented Sep 3, 2025

How about merging linkcheck_allow_forbidden and linkcheck_allow_unauthorized into a single list, linkcheck_allow, that takes the HTTP error codes that should be treated as working? See the sketch below.
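In conf.py that could look like the following; note that linkcheck_allow is purely hypothetical at this point:

```python
# conf.py -- hypothetical merged option (does not exist in Sphinx yet)
linkcheck_allow = [401, 403]  # HTTP status codes to report as "working"
```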

On

> I agree that having per-site sets of HTTP codes to allow is better. Is there anything like this already in place? It seems that would be an entirely new feature, irrespective of the specific HTTP code.

and

> HTTP 403 gets thrown back by rate limiting in attempts to block checks

Not that I am aware of. If you own the resource, linkcheck_auth / linkcheck_request_headers should let you pass an API key to bypass the rate limit.
The source code behind those options can also give you an idea of how to implement linkcheck_allow per site.
If you don't own the resource, try linkcheck_rate_limit_timeout + linkcheck_retries + linkcheck_timeout.

Still, linkcheck_auth + linkcheck_request_headers + linkcheck_rate_limit_timeout + linkcheck_retries + linkcheck_timeout are already so many options that, if you still get rate-limited after tuning them, it may be a valid failure point.
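For reference, those existing options look roughly like this in conf.py; the URL patterns, credentials, and values are placeholders:

```python
# conf.py -- existing linkcheck options for protected or rate-limited hosts
linkcheck_auth = [
    # (URL regex, auth value forwarded to requests' `auth=` parameter)
    (r'https://api\.example\.org/.*', ('username', 'api-key')),
]
linkcheck_request_headers = {
    'https://api.example.org/': {'Authorization': 'Bearer placeholder-token'},
    '*': {'Accept': 'text/html,application/xhtml+xml'},
}
linkcheck_rate_limit_timeout = 600.0  # max seconds to wait out 429 responses
linkcheck_retries = 3                 # attempts per link before reporting broken
linkcheck_timeout = 30                # per-request timeout in seconds
```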

@fmigneault (Author)

> How about merging linkcheck_allow_forbidden and linkcheck_allow_unauthorized into a single list, linkcheck_allow, that takes the HTTP error codes that should be treated as working?

Yes. Sounds good.

> If you own the resource
> [...]
> linkcheck_rate_limit_timeout + linkcheck_retries + linkcheck_timeout

That's the thing: I don't own it, so auth doesn't apply (and it is open access anyway).

I am already using these options, but some servers just decide to misbehave anyway.
The rate-limit handling works when a 429 is returned, but not when a server outright responds with 403 immediately.
