refactor: unify input URL fetching with the link-checker's HostPool#2100
Conversation
Previously, the `UrlContentResolver` used its own bare `reqwest::Client` to fetch remote input documents (e.g. `lychee https://example.com`). This separate code path silently missed several important features compared to the link-checking path:

- No user-agent was set (#1886)
- Custom headers were forwarded, but per-host headers were not
- No rate limiting, retries, or backoff
- No cookie jar, TLS settings, or redirect policy

This commit replaces the bare `reqwest::Client` in `UrlContentResolver` with the same `Arc<HostPool>` used by the link checker. In `main.rs`, the lychee `Client` is now built before the `Collector` so its `HostPool` can be shared with the `Collector` via the new `.host_pool()` builder method. Both the `Collector` (for input fetching) and the `WebsiteChecker` (for link checking) now use the same `HostPool` instance, so all configuration is automatically applied to both paths.

As a side effect, fetching a remote input document now counts against the per-host rate limit bucket for that host. This is intentional: we want to be a good citizen of the web regardless of whether a request is for input fetching or link checking.

The `Collector::default()` and `Collector::new()` cases (used in tests and library code) fall back to `HostPool::default()`, which is a lightweight default-configured pool, no heavier than the previous bare `reqwest::Client::new()`.
cristiklein
left a comment
Hi @mre. Thanks for involving me.
Overall, I like the idea of lychee having a single, shared HostPool which controls all host-related parameters and applies them uniformly to both fetching input URLs and collected URLs.
I have two comments:
- I noticed that #2099 contains a few tests. It would be great to add them to this PR to show that something which was previously broken is now fixed.
- I'm surprised by the need to go from `pub(crate)` to `pub`. Is that really necessary?
Also make `CacheableResponse` and `execute_request` crate-private.
Oh, you're right. It was necessary, but thanks to some refactoring it's not necessary anymore. 😄 👍 Done.
Makes sense. Brought them over.
katrinafyi
left a comment
I'm quite happy with this. It is also a happy surprise how small the change is. I was definitely expecting something way bigger.
But we may not always get so lucky, and there's definitely a discussion to be had about refactors or big changes. There are now three-ish big-ticket items in my mind (recursion, base URL, the new status enum) which will need extensive changes. Maybe it's worth discussing in a dedicated issue.
Cool. My vote goes to merging this and in turn closing #2099. This will resolve a few issues and it's a relatively straightforward change. It shouldn't paint us into a corner when tackling the bigger architectural issues.
cristiklein
left a comment
With the test in place and seeing how many bugs are resolved by this PR, my vote also goes to merging this and closing #2099.
And I agree with @katrinafyi . It turned out rather small in comparison to the effect it has.
(Note that I'm not a lychee maintainer, so my approval doesn't really count. 😄)
thomas-zahner
left a comment
Thank you @mre, this is definitely the way to go. I've had something like this in the back of my mind too.
Force-pushed from 78c8497 to a54f81d.
Yes! Thanks for contributing @thomas-zahner. All great changes.
mre
left a comment
This seems fine to merge. Thanks everyone for the review and to @thomas-zahner for finalizing the PR. 👍
This is the follow-up to #2099, which took a conservative approach to fixing #1886.
Previously, the `reqwest::Client` used by `UrlContentResolver` (which fetches the body of remote CLI input URLs) was built without a user-agent, rate limiting, retries, TLS settings, or per-host configuration. This meant that passing a URL directly as a CLI argument silently diverged from how link checking works. For example, Wikipedia returns a `403` when no user-agent is set, so `lychee https://en.wikipedia.org/wiki/...` would find zero links and report success.

The fix in #2099 was intentionally minimal: store the configured user-agent on the `Collector` and use it when building the resolver's `reqwest::Client`. It fixes the immediate issue but treats the two code paths separately, which isn't great.

This PR takes the approach I described in #2099 as the "alternative": instead of the `Collector` maintaining its own `reqwest::Client`, it now shares the same `Arc<HostPool>` that the link checker uses. The `lychee_lib::Client` is built before the `Collector` in `main.rs`, and its pool is handed to the `Collector` via the new `.host_pool()` builder method. Both input fetching and link checking now go through the same pool, so all configuration (user-agent, custom headers, per-host headers, TLS, cookies, rate limiting, retries) is applied consistently to both paths.

As a side effect, fetching a remote input document now counts against the per-host rate limit bucket for that host. This is actually the correct behavior, since we want lychee to be a good web citizen regardless of whether a request is for input fetching or link checking. =)

One tradeoff worth noting: `Collector::default()` and `Collector::new()` (which are used in tests without a full `ClientBuilder` setup) now fall back to `HostPool::default()` instead of `reqwest::Client::new()`. `HostPool::default()` is equally lightweight because it just wraps a default `reqwest::Client` with lazy host creation, so this should not be a big deal in practice, but it's worth mentioning.

I now believe this is the superior approach to resolving the issue. Wdyt?
Fixes #1886
Fixes #1673
@katrinafyi @thomas-zahner @cristiklein feedback welcome!