-
Notifications
You must be signed in to change notification settings - Fork 548
feat: Persist the SitemapRequestLoader state
#1347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 8 commits
Commits
Show all changes
26 commits
Select commit
Hold shift + click to select a range
0f15326
add persist state in sitemap loader
Mantisus d7c8c9e
add test
Mantisus 91118b5
update use state
Mantisus 9f8d7b4
up tests
Mantisus f7b0dfb
up tests
Mantisus c8cf0d5
check test
Mantisus c3f0f68
use fixture
Mantisus 7ee1ab9
up docs
Mantisus fa76aca
up docs
Mantisus 34008c9
add smaap wait in abort loading
Mantisus 1c30115
up default example
Mantisus 8cfec56
up basic guide
Mantisus 0bdec01
up guide
Mantisus ed7116c
remove extra sleep
Mantisus e28dccf
remove sleep from example
Mantisus 82ae318
up example
Mantisus 812f9fa
protect _get_state
Mantisus b34ad2b
fix flaky tests
Mantisus ccc9b68
move requests from in_progress to queue after restore
Mantisus 97570f3
add comment in example
Mantisus 0600583
resolve
Mantisus 24cf32b
Merge branch 'master' into persist-sitemap-loader
Mantisus c5e0609
await link in `fetch_next_request`
Mantisus c1bfd79
add processed_sitemap_urls
Mantisus 097ab25
resolve
Mantisus 0f71af2
add comment
Mantisus File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see this from the outside, it would be great if you could write down how the persistence mechanism works in the docblock of this class.
Also, I don't see processed sitemap URLs being tracked in a any way, is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I may be wrong, but I think that cyclic links are not expected in sitemaps. Thanks to this, we don't need to store links to processed sitemaps.
JS uses similar behavior - https://github.com/apify/crawlee/blob/master/packages/core/src/storages/sitemap_request_list.ts#L108
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't be surprised to encounter a cyclic sitemap somewhere, but I don't have a real-world example 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cyclic sitemaps aside, can you please briefly describe how is this state model used by the loader? Like "The crawler processes one sitemap at a time. The current one is kept in
in_progress_sitemap_url..."There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. 🙂
The crawler processes one sitemap at a time. The current sitemap is stored in
in_progress_sitemap_url. Theparse_sitemapfunction parses the sitemap and returns elements as an async iterator. Each element retrieved from the iterator is processed based on its type. If the element is aNestedSitemap, its URL is added topending_sitemap_urls. If the element is aSitemapUrl, the system checks whether it already exists incurrent_sitemap_processed_urls. If it exists, the loader was restarted from a saved state and the URL is skipped.If the URL is new, it is first added to
url_queue, then tocurrent_sitemap_processed_urls, andtotal_countis incremented by 1. When all elements from the current sitemap iterator have been processed,in_progress_sitemap_urlis set toNoneandcurrent_sitemap_processed_urlsis cleared. The next sitemap is retrieved frompending_sitemap_urls. Ifpending_sitemap_urlsis empty,completedis set toTrue.When
fetch_next_requestis called, a URL is extracted fromurl_queueand placed inin_progress. Whenmark_request_as_handledis called for the extracted URL, it is removed fromin_progressandhandled_countis incremented by 1.During initial startup or restart after persistence, state validation occurs in
_get_state. If bothpending_sitemap_urlsandin_progress_sitemap_urlare empty andcompletedis False, this indicates a fresh start. In this case,self._sitemap_urlsare moved topending_sitemap_urls. Otherwise, the system is restarting from a persisted state. Ifin_progresscontains any URLs, they are moved back tourl_queueandin_progressis cleared.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding cyclic sitemaps. According to sitemaps.org, only a sitemapindex should contain links to sitemaps, and a sitemapindex should not contain links to other sitemapindex files.
This means the hierarchy should be: a sitemapindex contains links to one or more sitemaps, where each sitemap contains one or more links to website pages.
If a website follows the recommended protocol, this should prevent circular sitemaps. However, a website might have multiple sitemapindex files that reference the same sitemap (I don't think search engines would like this, but it's possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Now put it inside the docblock please 😁
I could imagine a website implementing something like this as a scraper tarpit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 🙂
I added
processed_sitemap_urlsso that we can ensure we don't reprocess the same sitemap. 🙂