Commit 3241785

feat: Persist the SitemapRequestLoader state (apify#1347)
### Description

- Persist the `SitemapRequestLoader` state

### Issues

- Closes: apify#1269
1 parent 3f0bf8a commit 3241785

7 files changed: +414 / -49 lines changed
docs/guides/code_examples/request_loaders/rl_basic_example_with_persist.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
import asyncio
import logging

from crawlee import service_locator
from crawlee.request_loaders import RequestList

logging.basicConfig(level=logging.INFO, format='%(asctime)s-%(levelname)s-%(message)s')
logger = logging.getLogger(__name__)


# Disable clearing the `KeyValueStore` on each run.
# This is necessary so that the state keys are not cleared at startup.
# The recommended way to achieve this behavior is setting the environment variable
# `CRAWLEE_PURGE_ON_START=0`.
configuration = service_locator.get_configuration()
configuration.purge_on_start = False


async def main() -> None:
    # Open the request list; if it does not exist, it will be created.
    # Leave the name empty to use the default request list.
    request_list = RequestList(
        name='my-request-list',
        requests=[
            'https://apify.com/',
            'https://crawlee.dev/',
            'https://crawlee.dev/python/',
        ],
        # Enable persistence.
        persist_state_key='my-persist-state',
        persist_requests_key='my-persist-requests',
    )

    # We receive only one request.
    # Each run of the script yields a new request, until the `RequestList` is exhausted.
    request = await request_list.fetch_next_request()
    if request:
        logger.info(f'Processing request: {request.url}')
        # Do something with it...

        # And mark it as handled.
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
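
The example above flips `purge_on_start` in code, but its own comment notes that the recommended approach is the `CRAWLEE_PURGE_ON_START=0` environment variable. Below is a minimal sketch of that alternative; it is not part of this commit and assumes the variable is read when Crawlee first builds its configuration.

```python
import os

# Must be set before Crawlee creates its configuration, otherwise it has no effect.
# In practice you would usually export it in the shell instead:
#   CRAWLEE_PURGE_ON_START=0 python my_script.py
os.environ['CRAWLEE_PURGE_ON_START'] = '0'

from crawlee import service_locator  # noqa: E402  # imported after setting the variable

configuration = service_locator.get_configuration()
print(configuration.purge_on_start)  # expected: False
```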

docs/guides/code_examples/request_loaders/sitemap_basic_example.py

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ async def main() -> None:
         max_buffer_size=500,  # Keep up to 500 URLs in memory before processing.
     )

+    # We work with the loader until we process all relevant links from the sitemap.
     while request := await sitemap_loader.fetch_next_request():
         # Do something with it...
         print(f'Processing {request.url}')
docs/guides/code_examples/request_loaders/sitemap_example_with_persist.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
import asyncio
import logging

from crawlee import service_locator
from crawlee.http_clients import ImpitHttpClient
from crawlee.request_loaders import SitemapRequestLoader

logging.basicConfig(level=logging.INFO, format='%(asctime)s-%(levelname)s-%(message)s')
logger = logging.getLogger(__name__)


# Disable clearing the `KeyValueStore` on each run.
# This is necessary so that the state keys are not cleared at startup.
# The recommended way to achieve this behavior is setting the environment variable
# `CRAWLEE_PURGE_ON_START=0`.
configuration = service_locator.get_configuration()
configuration.purge_on_start = False


async def main() -> None:
    # Create an HTTP client for fetching sitemaps.
    # Use the context manager for `SitemapRequestLoader` to correctly save the state when
    # the work is completed.
    async with (
        ImpitHttpClient() as http_client,
        SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            # Enable persistence.
            persist_state_key='my-persist-state',
        ) as sitemap_loader,
    ):
        # We receive only one request.
        # Each run of the script yields a new request, until the sitemap is exhausted.
        request = await sitemap_loader.fetch_next_request()
        if request:
            logger.info(f'Processing request: {request.url}')
            # Do something with it...

            # And mark it as handled.
            await sitemap_loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())

docs/guides/request_loaders.mdx

Lines changed: 22 additions & 0 deletions
@@ -15,6 +15,8 @@ import RlTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loa
 import RlExplicitTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_tandem_example_explicit.py';
 import SitemapTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_tandem_example.py';
 import SitemapExplicitTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_tandem_example_explicit.py';
+import RlBasicPersistExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_basic_example_with_persist.py';
+import SitemapPersistExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_example_with_persist.py';

 The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/crawlee/request_loaders) sub-package extends the functionality of the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, providing additional tools for managing URLs and requests. If you are new to Crawlee and unfamiliar with the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, consider starting with the [Storages](https://crawlee.dev/python/docs/guides/storages) guide first. Request loaders define how requests are fetched and stored, enabling various use cases such as reading URLs from files, external APIs, or combining multiple sources together.

@@ -116,6 +118,16 @@ Here is a basic example of working with the <ApiLink to="class/RequestList">`Req
 {RlBasicExample}
 </RunnableCodeBlock>

+### Request list with persistence
+
+The <ApiLink to="class/RequestList">`RequestList`</ApiLink> supports state persistence, allowing it to resume from where it left off after interruption. This is particularly useful for long-running crawls or when you need to pause and resume crawling later.
+
+To enable persistence, provide `persist_state_key` and optionally `persist_requests_key` parameters, and disable automatic cleanup by setting `purge_on_start = False` in the configuration. The `persist_state_key` saves the loader's progress, while `persist_requests_key` ensures that the request data doesn't change between runs. For more details on resuming interrupted crawls, see the [Resuming a paused crawl](../examples/resuming-paused-crawl) example.
+
+<RunnableCodeBlock className="language-python" language="python">
+{RlBasicPersistExample}
+</RunnableCodeBlock>
+
 ### Sitemap request loader

 The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from XML sitemaps. It's particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
@@ -124,6 +136,16 @@ The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is
 {SitemapExample}
 </RunnableCodeBlock>

+### Sitemap request loader with persistence
+
+Similarly, the <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> supports state persistence to resume processing from where it left off. This is especially valuable when processing large sitemaps that may take considerable time to complete.
+
+<RunnableCodeBlock className="language-python" language="python">
+{SitemapPersistExample}
+</RunnableCodeBlock>
+
+When using persistence with `SitemapRequestLoader`, make sure to use the context manager (`async with`) to properly save the state when the work is completed.
+
 ## Request managers

 The <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> extends `RequestLoader` with write capabilities. In addition to reading requests, a request manager can add and reclaim them. This is essential for dynamic crawling projects where new URLs may emerge during the crawl process, or when certain requests fail and need to be retried. For more details, refer to the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> API reference.

src/crawlee/_utils/sitemap.py

Lines changed: 3 additions & 1 deletion
@@ -9,6 +9,7 @@
 from hashlib import sha256
 from logging import getLogger
 from typing import TYPE_CHECKING, Literal, TypedDict
+from xml.sax import SAXParseException
 from xml.sax.expatreader import ExpatParser
 from xml.sax.handler import ContentHandler

@@ -192,7 +193,8 @@ async def flush(self) -> AsyncGenerator[_SitemapItem, None]:

     def close(self) -> None:
         """Clean up resources."""
-        self._parser.close()
+        with suppress(SAXParseException):
+            self._parser.close()


 def _get_parser(content_type: str = '', url: str | None = None) -> _XmlSitemapParser | _TxtSitemapParser:
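
For context on the `close()` change above (illustration only, not part of the commit): closing an expat-based SAX parser that was fed an incomplete document raises `SAXParseException`, which is exactly what the new `suppress` guard absorbs. A minimal reproduction:

```python
from contextlib import suppress
from xml.sax import SAXParseException
from xml.sax.expatreader import ExpatParser

parser = ExpatParser()
# Feed a truncated sitemap; the document is never completed.
parser.feed('<urlset><url><loc>https://example.com/</loc>')

# Without the guard, close() raises SAXParseException ("no element found").
with suppress(SAXParseException):
    parser.close()
```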
