Skip to content

fix: using GET for candidate checks - sitemaps#3464

Merged
nikitachapovskii-dev merged 5 commits intomasterfrom
fix/get-sitemaps
Mar 10, 2026
Merged

fix: using GET for candidate checks - sitemaps#3464
nikitachapovskii-dev merged 5 commits intomasterfrom
fix/get-sitemaps

Conversation

@nikitachapovskii-dev
Copy link
Contributor

@nikitachapovskii-dev nikitachapovskii-dev commented Mar 4, 2026

New case discovered.

This fix parallelizes sitemap candidate probing in discoverValidSitemaps for sitemaps.

Candidate checks are now executed concurrently and collected, so a slow/failing candidate no longer blocks others. This removes avoidable discovery delays while preserving sitemap detection semantics.

Closes #3463

@github-actions github-actions bot added this to the 135th sprint - Tooling team milestone Mar 4, 2026
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 4, 2026
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 4, 2026
Copy link
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo urlExists() should be a lightweight check, which is in-line with HEAD semantics. Fetching with GET will download the entire resource before resolving the call (and sitemaps can be hefty).

As per the HTTP Semantics RFC, general-purpose HTTP servers must support GET and HEAD methods. Imo the right solution for misbehaving servers is stricter timeouts or parallel processing, so the failing response doesn't fully block the ones that could pass.

@nikitachapovskii-dev
Copy link
Contributor Author

Imo urlExists() should be a lightweight check, which is in-line with HEAD semantics. Fetching with GET will download the entire resource before resolving the call (and sitemaps can be hefty).

As per the HTTP Semantics RFC, general-purpose HTTP servers must support GET and HEAD methods. Imo the right solution for misbehaving servers is stricter timeouts or parallel processing, so the failing response doesn't fully block the ones that could pass.

Agreed, thanks for the feedback @barjin

I reverted the GET. The implemented fix is parallel candidate probing with Promise.allSettled, so one slow/misbehaving candidate does not serialize discovery for the rest.

Note: because candidate probing is parallel, emission order between domains is no longer strictly deterministic; tests were updated accordingly.

Copy link
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you @nikitachapovskii-dev !

I'm wondering how we managed to make sitemap discovery so complicated, but that's not a problem of this PR (previous state wasn't any better 😅 )

return firstUrl.toString();
});
const candidateResults = await Promise.allSettled(
candidateSitemapUrls.map((candidateSitemapUrl) => urlExists(candidateSitemapUrl)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
candidateSitemapUrls.map((candidateSitemapUrl) => urlExists(candidateSitemapUrl)),
candidateSitemapUrls.map(urlExists),

Copy link
Contributor Author

@nikitachapovskii-dev nikitachapovskii-dev Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a bunch of issues where we solve problems which are already addressed in V4, so when we switch to it on most of actors the sitemap drama(/comedy?) should be over
Thanks for review!

nikitachapovskii-dev and others added 2 commits March 10, 2026 09:38
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
@nikitachapovskii-dev nikitachapovskii-dev merged commit f82eb04 into master Mar 10, 2026
9 checks passed
@nikitachapovskii-dev nikitachapovskii-dev deleted the fix/get-sitemaps branch March 10, 2026 08:54
@B4nan
Copy link
Member

B4nan commented Mar 12, 2026

Next time please adjust the PR title to match what the PR actually does before you merge it. Looking at the diff, we didn't really change the HEAD method in the end, but now using GET for candidate checks - sitemaps will end up in the changelog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sitemap discovery still waits for global timeout because HEAD candidate probes can stall

4 participants