fix: using GET for candidate checks - sitemaps#3464
nikitachapovskii-dev merged 5 commits into master
barjin
left a comment
Imo urlExists() should be a lightweight check, which is in line with HEAD semantics. Fetching with GET will download the entire resource before resolving the call (and sitemaps can be hefty).
As per the HTTP Semantics RFC, general-purpose HTTP servers must support GET and HEAD methods. Imo the right solution for misbehaving servers is stricter timeouts or parallel processing, so the failing response doesn't fully block the ones that could pass.
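For illustration, the "stricter timeouts" idea could look roughly like the sketch below. This is not the library's urlExists(); headExists, its timeoutMs default, and the injectable fetchImpl parameter are hypothetical names chosen for this example. It assumes a runtime with global fetch and AbortSignal.timeout (for example Node 18+).

```typescript
// Hypothetical helper, not the library's actual urlExists(): a HEAD probe
// that gives up after `timeoutMs` so a stalled server cannot block the caller.
async function headExists(
    url: string,
    timeoutMs = 5_000, // assumed default, tune per use case
    fetchImpl: typeof fetch = fetch, // injectable so the sketch can be exercised offline
): Promise<boolean> {
    try {
        const response = await fetchImpl(url, {
            method: 'HEAD', // ask for headers only, no response body
            signal: AbortSignal.timeout(timeoutMs), // aborts the request when the timer fires
        });
        return response.ok; // true for 2xx statuses
    } catch {
        // Network error, timeout, or abort: treat the candidate as missing.
        return false;
    }
}
```

With a cap like this, a misbehaving server costs at most timeoutMs per candidate instead of stalling discovery indefinitely.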
Agreed, thanks for the feedback @barjin. I reverted the GET. The implemented fix is parallel candidate probing.
Note: because candidate probing is parallel, emission order between domains is no longer strictly deterministic; tests were updated accordingly.
barjin
left a comment
lgtm, thank you @nikitachapovskii-dev !
I'm wondering how we managed to make sitemap discovery so complicated, but that's not a problem of this PR (previous state wasn't any better 😅 )
        return firstUrl.toString();
    });
    const candidateResults = await Promise.allSettled(
        candidateSitemapUrls.map((candidateSitemapUrl) => urlExists(candidateSitemapUrl)),
nit
-        candidateSitemapUrls.map((candidateSitemapUrl) => urlExists(candidateSitemapUrl)),
+        candidateSitemapUrls.map(urlExists),
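One thing worth checking before applying a point-free map like this: Array.prototype.map calls its callback with (element, index, array), so the two forms are equivalent only if the callback ignores extra arguments. The signature of urlExists() is not shown in this thread, so the sketch below uses a hypothetical stand-in named probe that simply counts the arguments it receives.

```typescript
// Hypothetical stand-in for urlExists(); the real signature is not shown here.
// It returns how many arguments it actually received.
function probe(url: string, ...extra: unknown[]): number {
    return 1 + extra.length;
}

const urls = ['https://a.example/sitemap.xml', 'https://b.example/sitemap.xml'];

// Point-free: map() forwards (element, index, array), so probe sees 3 arguments.
const pointFree = urls.map(probe);

// Explicit arrow: only the URL is forwarded, so probe sees 1 argument.
const explicit = urls.map((url) => probe(url));
```

If urlExists() takes only the URL, the shorter form is safe; if it has optional trailing parameters, the index and array would be passed into them.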
There were a bunch of issues where we solved problems that are already addressed in V4, so once we switch to it on most actors, the sitemap drama (/comedy?) should be over.
Thanks for review!
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
Next time please adjust the PR title to match what the PR actually does before you merge it. Looking at the diff, we didn't really change the HEAD method in the end, but now
New case discovered.
This fix parallelizes sitemap candidate probing in discoverValidSitemaps for sitemaps. Candidate checks are now executed concurrently and collected, so a slow/failing candidate no longer blocks others. This removes avoidable discovery delays while preserving sitemap detection semantics.
Closes #3463
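A minimal sketch of the concurrent probing described above, assuming a probe function that resolves to a boolean. The names findExisting and Probe are hypothetical and only stand in for the real discoverValidSitemaps internals, which are not fully shown in this thread.

```typescript
// Hypothetical probe type; stands in for urlExists(), whose signature is assumed.
type Probe = (url: string) => Promise<boolean>;

// Start every probe at once and keep the candidates whose probe resolved to true.
async function findExisting(candidates: string[], probe: Probe): Promise<string[]> {
    // allSettled never rejects, and its results follow the order of `candidates`.
    const results = await Promise.allSettled(
        candidates.map(async (url) => ({ url, exists: await probe(url) })),
    );
    const existing: string[] = [];
    for (const result of results) {
        // A rejected probe (network error, timeout) simply counts as "not found".
        if (result.status === 'fulfilled' && result.value.exists) {
            existing.push(result.value.url);
        }
    }
    return existing;
}
```

Because all probes are started before any is awaited, the total wait is bounded by the slowest candidate rather than the sum of all of them, and one failing probe does not discard the others.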