Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
6aee5b8
linkcheck builder: update handling of HTTP 401 status code to conside…
jayaddison May 21, 2023
7f68206
linkcheck builder: introduce a 'linkcheck_allow_unauthorized' config …
jayaddison Sep 21, 2023
2d4f4bc
Update CHANGES.rst
jayaddison Sep 21, 2023
a958e86
Clarify phrasing in CHANGES.rst: the setting has to be configured to …
jayaddison Sep 21, 2023
712f060
CHANGES.rst: fixup: relocate the changelog entry to the correct locat…
jayaddison Sep 21, 2023
ee7348f
docs: add documentation for the 'linkcheck_allow_unauthorized' config…
jayaddison Sep 21, 2023
c19f7c3
CHANGES.rst: nitpick: undo accidental empty-line removal
jayaddison Sep 21, 2023
a290e3c
CHANGES.rst: nitpick: phrasing: 'handle...as broken' -> 'report...as …
jayaddison Sep 22, 2023
46d8206
CHANGES.rst: nitpick: brevity
jayaddison Sep 22, 2023
c895ca2
linkcheck builder: add a deprecation warning indicating that the 'lin…
jayaddison Sep 22, 2023
37b50ae
CHANGES.rst: fixup: add self-attribution
jayaddison Sep 22, 2023
bc3c390
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Sep 23, 2023
37634b1
ruff: linting fixups
jayaddison Sep 23, 2023
a81f283
Apply code review suggestion: filter warnings in test case using more…
jayaddison Sep 26, 2023
de6ac23
Apply code review suggestion: make the 'allow_unauthorized' worker va…
jayaddison Sep 26, 2023
9f87c41
docs: add removal note to 'linkcheck_allow_unauthorized' config optio…
jayaddison Sep 26, 2023
2b90b76
linkcheck builder: relocate 'linkcheck_allow_unauthorized' warning to…
jayaddison Sep 26, 2023
d9144cc
fixup: use pytest.mark.filterwarnings to filter warning _before_ app …
jayaddison Sep 26, 2023
fa9be8a
cleanup: remove unused imports
jayaddison Sep 26, 2023
4750345
Apply code review suggestion: set default value for 'allow_unauthoriz…
jayaddison Sep 27, 2023
6d450c2
Updated plan: instead of removing the setting entirely in Sphinx 8.0,…
jayaddison Sep 27, 2023
cdfa342
Apply code review suggestion: when an HTTP 401 response is encountere…
jayaddison Sep 27, 2023
daf4efb
Code behaviour consensus: the URI of unauthorized HTTP responses shou…
jayaddison Sep 30, 2023
e17581a
CHANGES.rst: prefer Pythonic representation of false value
jayaddison Oct 1, 2023
2070c88
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Oct 3, 2023
bc97be3
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Oct 11, 2023
a75abcf
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Dec 11, 2023
5d16bb3
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Dec 26, 2023
000af8b
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Dec 30, 2023
bfaf179
Merge branch 'master' into issue-11433/adjust-linkcheck-http-401-hand…
jayaddison Jan 9, 2024
16a3695
Update sphinx/builders/linkcheck.py
AA-Turner Jan 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,8 @@ Features added

Bugs fixed
----------
* #11433: The ``linkcheck`` builder has been updated to consider hyperlinks
that respond with an HTTP 401 (unauthorized) response to be broken.

* Restored the ``footnote-reference`` class that has been removed in
the latest (unreleased) version of Docutils.
Expand Down
3 changes: 1 addition & 2 deletions sphinx/builders/linkcheck.py
Original file line number Diff line number Diff line change
Expand Up @@ -437,9 +437,8 @@ def _check_uri(self, uri: str, hyperlink: Hyperlink) -> tuple[str, str, int]:
except HTTPError as err:
error_message = str(err)

# Unauthorised: the reference probably exists
if status_code == 401:
return 'working', 'unauthorized', 0
return 'broken', 'unauthorized', 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).

    As such, I think we could keep the 'working' but change a bit the way that linkcheck_auth is handled:

    • If we specify "fake" credentials by None, we keep 'working' and do not emit warnings.
    • If no fake credential for the corresponding URL is specified, we keep the 'working', but we emit a warning saying that the content couldn't be accessed and that users should specify credentials (fake or not).

That way, this also ensures that credentials are not exposed unintentionally and that the contents with 401 errors are detected correctly.

  • It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?

    In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

Copy link
Contributor Author

@jayaddison jayaddison Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the 401 still makes sense if people do not want to expose credentials in their configuration and only assume that something exists (and could return a 401).

Hmm.. point understood, although I think that even when an URL has no linkcheck credentials configured (the "no fake credential" case), an HTTP 401 response should be considered a broken link, because it means that the link is unchecked. This is probably the core of the disruptive/controversial angle I have on this.

If for some reason a user is unable to provide credentials for some hyperlinks included in a documentation set, but wants a success report from linkchecking despite that, I'd argue they should use linkcheck_anchors_ignore linkcheck_ignore to skip the relevant websites.

I'm still thinking about the first point -- using None as a special marker for intentionally-empty credentials. It seems a bit fragile and too-clever; I like simple and hard-to-misunderstand things, because I misunderstand a lot.

It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?

👍

In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

Mostly agree, although we still support Py3.9 at the moment - so this could be a future enhancement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me using linkcheck_anchors_ignore although it's a bit counterintuitive in this case (like I don't really want to ignore).

using None as a special marker for intentionally-empty credentials

It was an example. We could have another configuration value like linkcheck_auth_bypass and put what links we expect to have 401.

Concerning the match, I forgot it was introduced in 3.10.


Actually, when writing my comment, I thought about a much more elegant solution, namely, you could specify the HTTP response code to expect for specific links and treat them as "working".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me using linkcheck_anchors_ignore although it's a bit counterintuitive in this case (like I don't really want to ignore).

Let me try to make a convicing argument that it makes sense :) (I could be wrong! in which case that'll help too)

The HTTP 401 response is zero-information in terms of hyperlink validity. It does confirm that the client should use auth if it wants to gain more-than-zero information, but that's the only feedback it provides -- and we hide that from users at the moment.

So we're making network requests that fail, and we're not informing the user about it. I think we should start by informing the user -- and then they either choose to find auth to gain greater-than-zero info, or they ignore the links and reduce the outbound traffic (and linkchecking time).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 I think we should start by informing the user 

Agreed. Actually, by counterintuitive, I meant that using the linkcheck_anchors_ignore was weird to me since I would have used it to suppress errors related to anchors rather than to HTTP response codes.

So, we could keep what you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agh, sorry. I should have written linkcheck_ignore instead. I completely failed to notice/parse the word anchor in there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's ok for me !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) thank you!

Copy link
Contributor Author

@jayaddison jayaddison Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more of a follow-up, but what about using an enumeration (or at least mark the possible string statuses we returm using Literal) for the status instead of using 'working', 'broken', etc?
In process_result, we use if-case but using match instead + enumerations may be more elegant and clearer (also we wouldn't have an arbitrary status code being emitted).

I'm working on some refactoring for this at the moment; the CheckResult class relies upon being JSON-serializable at the moment, and I don't feel like introducing custom serialization code, so I've opted to use the StrEnum type -- Enum with string values isn't serializable by default.

Taking that approach adds a dependency on Py3.11, so I'm experimenting with adding match statements at the same time.

I did read up a little bit about using Literal - it looks like it's mostly a type-checking aid? I think I'd prefer the enum route, even if it may take a while before it could land in the codebase. Maybe I'll change my mind as I learn more, though..


# Rate limiting; back-off if allowed, or report failure otherwise
if status_code == 429:
Expand Down
12 changes: 7 additions & 5 deletions tests/test_build_linkcheck.py
Original file line number Diff line number Diff line change
Expand Up @@ -343,8 +343,12 @@ class CustomHandler(http.server.BaseHTTPRequestHandler):

def authenticated(method):
def method_if_authenticated(self):
if (expected_token is None
or self.headers["Authorization"] == f"Basic {expected_token}"):
if expected_token is None:
return method(self)
elif not self.headers["Authorization"]:
self.send_response(401, "Unauthorized")
self.end_headers()
elif self.headers["Authorization"] == f"Basic {expected_token}":
return method(self)
else:
self.send_response(403, "Forbidden")
Expand Down Expand Up @@ -397,9 +401,7 @@ def test_auth_header_no_match(app):
with open(app.outdir / "output.json", encoding="utf-8") as fp:
content = json.load(fp)

# TODO: should this test's webserver return HTTP 401 here?
# https://github.com/sphinx-doc/sphinx/issues/11433
assert content["info"] == "403 Client Error: Forbidden for url: http://localhost:7777/"
assert content["info"] == "unauthorized"
assert content["status"] == "broken"


Expand Down