refactor: unify input URL fetching with the link-checker's HostPool#2100
Conversation
Previously, the `UrlContentResolver` used its own bare `reqwest::Client` to fetch remote input documents (e.g. `lychee https://example.com`). This separate code path silently missed several important features compared to the link-checking path:

- No user-agent was set (#1886)
- Custom headers were forwarded, but per-host headers were not
- No rate limiting, retries, or backoff
- No cookie jar, TLS settings, or redirect policy

This commit replaces the bare `reqwest::Client` in `UrlContentResolver` with the same `Arc<HostPool>` used by the link checker. In `main.rs`, the lychee `Client` is now built before the `Collector` so its `HostPool` can be shared with the `Collector` via the new `.host_pool()` builder method. Both the `Collector` (for input fetching) and the `WebsiteChecker` (for link checking) now use the same `HostPool` instance, so all configuration is automatically applied to both paths.

As a side effect, fetching a remote input document now counts against the per-host rate limit bucket for that host. This is intentional: we want to be a good citizen of the web regardless of whether a request is for input fetching or link checking.

The `Collector::default()` and `Collector::new()` cases (used in tests and library code) fall back to `HostPool::default()`, which is a lightweight default-configured pool, no heavier than the previous bare `reqwest::Client::new()`.
cristiklein
left a comment
Hi @mre. Thanks for involving me.
Overall, I like the idea of lychee having a single, shared HostPool which controls all host-related parameters and applies them uniformly to both fetching input URLs and collected URLs.
I have two comments:
- I noticed that #2099 contains a few tests. It would be great to add them to this PR to show that something which was previously broken is now fixed.
- I'm surprised by the need to go from `pub(crate)` to `pub`. Is that really necessary?
Also make `CacheableResponse` and `execute_request` crate-private.
Oh, you're right. It was necessary, but thanks to some refactoring it's not necessary anymore. 😄 👍 Done.
Makes sense. Brought them over.
katrinafyi
left a comment
I'm quite happy with this. It is also a happy surprise how small the change is. I was definitely expecting something way bigger.
But we may not always get so lucky, and there's definitely a discussion to be had about refactors or big changes. There are now three-ish big-ticket items in my mind (recursion, base URL, the new status enum) which will need extensive changes. Maybe it's worth discussing in a dedicated issue.
Cool. My vote goes to merging this and in turn closing #2099. This will resolve a few issues and it's a relatively straightforward change. It shouldn't paint us into a corner when tackling the bigger architectural issues.
cristiklein
left a comment
With the test in place and seeing how many bugs are resolved by this PR, my vote also goes to merging this and closing #2099.
And I agree with @katrinafyi . It turned out rather small in comparison to the effect it has.
(Note that I'm not a lychee maintainer, so my approval doesn't really count. 😄)
thomas-zahner
left a comment
Thank you @mre, this is definitely the way to go. I've had something like this in the back of my mind too.
Force-pushed from 78c8497 to a54f81d.
Yes! Thanks for contributing @thomas-zahner. All great changes.
mre
left a comment
This seems fine to merge. Thanks everyone for the review and to @thomas-zahner for finalizing the PR. 👍
This is the follow-up to #2099, which took a conservative approach to fixing #1886.
Previously, the `reqwest::Client` used by `UrlContentResolver` (which fetches the body of remote CLI input URLs) was built without a user-agent, rate limiting, retries, TLS settings, or per-host configuration. This meant that passing a URL directly as a CLI argument silently diverged from how link checking works. For example, Wikipedia returns a `403` when no user-agent is set, so `lychee https://en.wikipedia.org/wiki/...` would find zero links and report success.

The fix in #2099 was intentionally minimal: store the configured user-agent on the `Collector` and use it when building the resolver's `reqwest::Client`. It fixes the immediate issue but treats the two code paths separately, which isn't great.

This PR takes the approach I described in #2099 as the "alternative": instead of the `Collector` maintaining its own `reqwest::Client`, it now shares the same `Arc<HostPool>` that the link checker uses. The `lychee_lib::Client` is built before the `Collector` in `main.rs`, and its pool is handed to the `Collector` via the new `.host_pool()` builder method. Both input fetching and link checking now go through the same pool, so all configuration (user-agent, custom headers, per-host headers, TLS, cookies, rate limiting, retries) is applied consistently to both paths.

As a side effect, fetching a remote input document now counts against the per-host rate limit bucket for that host. This is actually the correct behavior, since we want lychee to be a good web citizen regardless of whether a request is for input fetching or link checking. =)

One tradeoff worth noting: `Collector::default()` and `Collector::new()` (which are used in tests without a full `ClientBuilder` setup) now fall back to `HostPool::default()` instead of `reqwest::Client::new()`. `HostPool::default()` is equally lightweight because it just wraps a default `reqwest::Client` with lazy host creation, so this should not be a big deal in practice, but it's worth mentioning.

I now believe this is the superior approach to resolving the issue. Wdyt?
Fixes #1886
Fixes #1673
@katrinafyi @thomas-zahner @cristiklein feedback welcome!