Commit d08d13d

feat: Add RedisStorageClient based on Redis v8.0+ (#1406)
### Description

This PR implements a storage client, `RedisStorageClient`, based on Redis v8+. Version 8 is the minimum because all the data structures used are available in Redis Open Source without additional extensions only from that version onward.

### Testing

* Added new unit tests.
* For testing without an actual Redis server, [`fakeredis`](https://fakeredis.readthedocs.io/en/latest/) is used.
1 parent 467872d commit d08d13d

23 files changed

+2917
-11
lines changed
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import RedisStorageClient

# Create a new instance of storage client using connection string.
# 'redis://localhost:6379' is just a placeholder; replace it with your actual
# connection string.
storage_client = RedisStorageClient(connection_string='redis://localhost:6379')

# And pass it to the crawler.
crawler = ParselCrawler(storage_client=storage_client)
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
from redis.asyncio import Redis

from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import RedisStorageClient

# Create a new instance of storage client using a Redis client with custom settings.
# Replace host and port with your actual Redis server configuration.
# Other Redis client settings can be adjusted as needed.
storage_client = RedisStorageClient(
    redis=Redis(
        host='localhost',
        port=6379,
        retry_on_timeout=True,
        socket_keepalive=True,
        socket_connect_timeout=10,
    )
)

# Create a configuration with custom settings.
configuration = Configuration(purge_on_start=False)

# And pass them to the crawler.
crawler = ParselCrawler(
    storage_client=storage_client,
    configuration=configuration,
)

docs/guides/storage_clients.mdx

Lines changed: 175 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,8 @@ import CustomStorageClientExample from '!!raw-loader!roa-loader!./code_examples/
1717
import RegisteringStorageClientsExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/registering_storage_clients_example.py';
1818
import SQLStorageClientBasicExample from '!!raw-loader!roa-loader!./code_examples/storage_clients/sql_storage_client_basic_example.py';
1919
import SQLStorageClientConfigurationExample from '!!raw-loader!./code_examples/storage_clients/sql_storage_client_configuration_example.py';
20+
import RedisStorageClientBasicExample from '!!raw-loader!./code_examples/storage_clients/redis_storage_client_basic_example.py';
21+
import RedisStorageClientConfigurationExample from '!!raw-loader!./code_examples/storage_clients/redis_storage_client_configuration_example.py';
2022

2123
Storage clients provide a unified interface for interacting with <ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups.
2224

@@ -26,7 +28,8 @@ Crawlee provides three main storage client implementations:
2628

2729
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> - Provides persistent file system storage with in-memory caching.
2830
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> - Stores data in memory with no persistence.
29-
- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> – Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: 'crawlee[sql_sqlite]' for SQLite or 'crawlee[sql_postgres]' for PostgreSQL.
31+
- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> - Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: `crawlee[sql_sqlite]` for SQLite or `crawlee[sql_postgres]` for PostgreSQL.
32+
- <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> - Provides persistent storage using a [Redis](https://redis.io/) database v8.0+. Requires installing the extra dependency `crawlee[redis]`.
3033
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient) - Manages storage on the [Apify platform](https://apify.com), implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).
3134

3235
```mermaid
@@ -56,6 +59,8 @@ class MemoryStorageClient
5659
5760
class SqlStorageClient
5861
62+
class RedisStorageClient
63+
5964
class ApifyStorageClient
6065
6166
%% ========================
@@ -65,6 +70,7 @@ class ApifyStorageClient
6570
StorageClient --|> FileSystemStorageClient
6671
StorageClient --|> MemoryStorageClient
6772
StorageClient --|> SqlStorageClient
73+
StorageClient --|> RedisStorageClient
6874
StorageClient --|> ApifyStorageClient
6975
```
7076

@@ -304,15 +310,181 @@ Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageCl
304310

305311
Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> can be set via constructor arguments:
306312

307-
- **`connection_string`** (default: SQLite in <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage dir) SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`.
308-
- **`engine`** Pre-configured SQLAlchemy AsyncEngine (optional).
313+
- **`connection_string`** (default: SQLite in <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage dir) - SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`.
314+
- **`engine`** - Pre-configured SQLAlchemy AsyncEngine (optional).
309315

310316
For advanced scenarios, you can configure <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> with a custom SQLAlchemy engine and additional options via the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling.
311317

312318
<CodeBlock className="language-python" language="python">
313319
{SQLStorageClientConfigurationExample}
314320
</CodeBlock>
315321

322+
### Redis storage client
323+
324+
:::warning Experimental feature
325+
The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> is experimental. Its API and behavior may change in future releases.
326+
:::
327+
328+
The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> provides persistent storage using a [Redis](https://redis.io/) database. It supports concurrent access from multiple independent clients or processes and uses native Redis data structures for efficient operations.
329+
330+
:::note dependencies
331+
The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> is not included in the core Crawlee package.
332+
To use it, you need to install Crawlee with the Redis extra dependency:
333+
334+
<code>pip install 'crawlee[redis]'</code>
335+
336+
Additionally, Redis version 8.0 or higher is required.
337+
:::
338+
339+
:::note Redis persistence
340+
Data persistence in Redis depends on your [database configuration](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/).
341+
:::
342+
343+
The client requires either a Redis connection string or a pre-configured Redis client instance. Use a pre-configured client when you need custom Redis settings such as connection pooling, timeouts, or SSL/TLS encryption.
344+
345+
<CodeBlock className="language-python" language="python">
346+
{RedisStorageClientBasicExample}
347+
</CodeBlock>
348+
349+
Data is organized using Redis key patterns. Below are the main data structures used for each storage type:
350+
351+
```mermaid
352+
---
353+
config:
354+
class:
355+
hideEmptyMembersBox: true
356+
---
357+
358+
classDiagram
359+
360+
%% ========================
361+
%% Storage Client
362+
%% ========================
363+
364+
class RedisDatasetClient {
365+
<<Dataset>>
366+
}
367+
368+
%% ========================
369+
%% Dataset Keys
370+
%% ========================
371+
372+
class DatasetKeys {
373+
datasets:[name]:items - JSON Array
374+
datasets:[name]:metadata - JSON Object
375+
}
376+
377+
class DatasetsIndexes {
378+
datasets:id_to_name - Hash
379+
datasets:name_to_id - Hash
380+
}
381+
382+
%% ========================
383+
%% Client to Keys arrows
384+
%% ========================
385+
386+
RedisDatasetClient --> DatasetKeys
387+
RedisDatasetClient --> DatasetsIndexes
388+
```
389+
390+
```mermaid
391+
---
392+
config:
393+
class:
394+
hideEmptyMembersBox: true
395+
---
396+
397+
classDiagram
398+
399+
%% ========================
400+
%% Storage Clients
401+
%% ========================
402+
403+
class RedisKeyValueStoreClient {
404+
<<Key-value store>>
405+
}
406+
407+
%% ========================
408+
%% Key-Value Store Keys
409+
%% ========================
410+
411+
class KeyValueStoreKeys {
412+
key_value_stores:[name]:items - Hash
413+
key_value_stores:[name]:metadata_items - Hash
414+
key_value_stores:[name]:metadata - JSON Object
415+
}
416+
417+
class KeyValueStoresIndexes {
418+
key_value_stores:id_to_name - Hash
419+
key_value_stores:name_to_id - Hash
420+
}
421+
422+
%% ========================
423+
%% Client to Keys arrows
424+
%% ========================
425+
426+
RedisKeyValueStoreClient --> KeyValueStoreKeys
427+
RedisKeyValueStoreClient --> KeyValueStoresIndexes
428+
```
429+
430+
```mermaid
431+
---
432+
config:
433+
class:
434+
hideEmptyMembersBox: true
435+
---
436+
437+
classDiagram
438+
439+
%% ========================
440+
%% Storage Clients
441+
%% ========================
442+
443+
class RedisRequestQueueClient {
444+
<<Request queue>>
445+
}
446+
447+
%% ========================
448+
%% Request Queue Keys
449+
%% ========================
450+
451+
class RequestQueueKeys{
452+
request_queues:[name]:queue - List
453+
request_queues:[name]:data - Hash
454+
request_queues:[name]:in_progress - Hash
455+
request_queues:[name]:added_bloom_filter - Bloom Filter | bloom queue_dedup_strategy
456+
request_queues:[name]:handled_bloom_filter - Bloom Filter | bloom queue_dedup_strategy
457+
request_queues:[name]:pending_set - Set | default queue_dedup_strategy
458+
request_queues:[name]:handled_set - Set | default queue_dedup_strategy
459+
request_queues:[name]:metadata - JSON Object
460+
}
461+
462+
class RequestQueuesIndexes {
463+
request_queues:id_to_name - Hash
464+
request_queues:name_to_id - Hash
465+
}
466+
467+
%% ========================
468+
%% Client to Keys arrows
469+
%% ========================
470+
471+
RedisRequestQueueClient --> RequestQueueKeys
472+
RedisRequestQueueClient --> RequestQueuesIndexes
473+
```
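
The key patterns in the three diagrams above can be summarized in a few lines of Python. This is only an illustrative sketch: the helper names `dataset_keys`, `key_value_store_keys`, and `request_queue_keys` are hypothetical and are not part of Crawlee, and the `'default'` and `'bloom'` strategy values are taken from the labels in the request queue diagram.

```python
def dataset_keys(name: str) -> dict[str, str]:
    """Map each dataset key (as drawn in the diagram) to its Redis data type."""
    prefix = f'datasets:{name}'
    return {
        f'{prefix}:items': 'JSON Array',
        f'{prefix}:metadata': 'JSON Object',
    }


def key_value_store_keys(name: str) -> dict[str, str]:
    """Map each key-value store key to its Redis data type."""
    prefix = f'key_value_stores:{name}'
    return {
        f'{prefix}:items': 'Hash',
        f'{prefix}:metadata_items': 'Hash',
        f'{prefix}:metadata': 'JSON Object',
    }


def request_queue_keys(name: str, dedup_strategy: str = 'default') -> dict[str, str]:
    """Map each request queue key to its Redis data type.

    The deduplication keys depend on the strategy shown in the diagram:
    sets for 'default', Bloom filters for 'bloom'.
    """
    prefix = f'request_queues:{name}'
    keys = {
        f'{prefix}:queue': 'List',
        f'{prefix}:data': 'Hash',
        f'{prefix}:in_progress': 'Hash',
        f'{prefix}:metadata': 'JSON Object',
    }
    if dedup_strategy == 'bloom':
        keys[f'{prefix}:added_bloom_filter'] = 'Bloom Filter'
        keys[f'{prefix}:handled_bloom_filter'] = 'Bloom Filter'
    elif dedup_strategy == 'default':
        keys[f'{prefix}:pending_set'] = 'Set'
        keys[f'{prefix}:handled_set'] = 'Set'
    else:
        raise ValueError(f'Unknown dedup strategy: {dedup_strategy!r}')
    return keys
```

A listing like this can help when inspecting a Redis instance by hand, for example to know which keys to expect for a queue named `default`.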
474+
475+
Configuration options for the <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> can be set through environment variables or the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class:
476+
477+
- **`purge_on_start`** (env: `CRAWLEE_PURGE_ON_START`, default: `True`) - Whether to purge default storages on start.
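
As a rough illustration of how an environment variable such as `CRAWLEE_PURGE_ON_START` can override a default of `True`, the sketch below reads a boolean flag from a mapping. The function `read_purge_on_start` and the accepted spellings of "true" are assumptions made for this example; Crawlee's own `Configuration` class handles this internally and may accept different values.

```python
import os
from collections.abc import Mapping


def read_purge_on_start(environ: Mapping[str, str] = os.environ) -> bool:
    """Return the purge flag from the environment, defaulting to True.

    Hypothetical helper: the set of truthy strings is an assumption,
    not Crawlee's documented parsing behavior.
    """
    raw = environ.get('CRAWLEE_PURGE_ON_START')
    if raw is None:
        return True  # Documented default.
    return raw.strip().lower() in {'1', 'true', 'yes', 'on'}
```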
478+
479+
Configuration options for the <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> can be set via constructor arguments:
480+
481+
- **`connection_string`** - Redis connection string, e.g. `redis://localhost:6379/0`.
482+
- **`redis`** - Pre-configured Redis client instance (optional).
483+
484+
<CodeBlock className="language-python" language="python">
485+
{RedisStorageClientConfigurationExample}
486+
</CodeBlock>
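
To make the parts of a connection string such as `redis://localhost:6379/0` explicit, the following sketch splits one with the standard library. The helper `describe_redis_url` is hypothetical and only illustrates the URL layout (host, port, database number, and the `rediss://` scheme for TLS); the Redis client does its own parsing.

```python
from urllib.parse import urlparse


def describe_redis_url(url: str) -> dict:
    """Split a redis:// or rediss:// URL into its main components."""
    parsed = urlparse(url)
    if parsed.scheme not in {'redis', 'rediss'}:
        raise ValueError(f'Unsupported scheme: {parsed.scheme!r}')
    # The path segment, if present, is the database number (e.g. "/0").
    db = parsed.path.lstrip('/')
    return {
        'host': parsed.hostname,
        'port': parsed.port or 6379,  # 6379 is the default Redis port.
        'db': int(db) if db else 0,
        'tls': parsed.scheme == 'rediss',
    }
```

Checking the string this way before passing it as `connection_string` can surface typos early, at the cost of duplicating validation the client already performs.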
487+
316488
## Creating a custom storage client
317489

318490
A storage client consists of two parts: the storage client factory and individual storage type clients. The <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> acts as a factory that creates specific clients (<ApiLink to="class/DatasetClient">`DatasetClient`</ApiLink>, <ApiLink to="class/KeyValueStoreClient">`KeyValueStoreClient`</ApiLink>, <ApiLink to="class/RequestQueueClient">`RequestQueueClient`</ApiLink>) where the actual storage logic is implemented.

pyproject.toml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -48,7 +48,7 @@ dependencies = [
4848
]
4949

5050
[project.optional-dependencies]
51-
all = ["crawlee[adaptive-crawler,beautifulsoup,cli,curl-impersonate,httpx,parsel,playwright,otel,sql_sqlite,sql_postgres]"]
51+
all = ["crawlee[adaptive-crawler,beautifulsoup,cli,curl-impersonate,httpx,parsel,playwright,otel,sql_sqlite,sql_postgres,redis]"]
5252
adaptive-crawler = [
5353
"jaro-winkler>=2.0.3",
5454
"playwright>=1.27.0",
@@ -79,6 +79,7 @@ sql_sqlite = [
7979
"sqlalchemy[asyncio]>=2.0.0,<3.0.0",
8080
"aiosqlite>=0.21.0",
8181
]
82+
redis = ["redis[hiredis] >= 7.0.0"]
8283

8384
[project.scripts]
8485
crawlee = "crawlee._cli:cli"
@@ -98,6 +99,7 @@ dev = [
9899
"apify_client", # For e2e tests.
99100
"build<2.0.0", # For e2e tests.
100101
"dycw-pytest-only<3.0.0",
102+
"fakeredis[probabilistic,json,lua]<3.0.0",
101103
"mypy~=1.18.0",
102104
"pre-commit<5.0.0",
103105
"proxy-py<3.0.0",

src/crawlee/storage_clients/__init__.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,13 @@
1313
with _try_import(__name__, 'SqlStorageClient'):
1414
from ._sql import SqlStorageClient
1515

16+
with _try_import(__name__, 'RedisStorageClient'):
17+
from ._redis import RedisStorageClient
18+
1619
__all__ = [
1720
'FileSystemStorageClient',
1821
'MemoryStorageClient',
22+
'RedisStorageClient',
1923
'SqlStorageClient',
2024
'StorageClient',
2125
]
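
Because `RedisStorageClient` is imported inside a `_try_import` guard, it presumably depends on the optional `redis` package being installed. A caller who wants to check for that package up front could use something like the sketch below. The helper `is_module_available` is hypothetical and is not how `_try_import` itself works.

```python
from importlib.util import find_spec


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


# Example: decide whether the Redis-backed client is worth trying.
if is_module_available('redis'):
    print("The 'redis' package is installed.")
else:
    print("Install it with: pip install 'crawlee[redis]'")
```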
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
from ._dataset_client import RedisDatasetClient
from ._key_value_store_client import RedisKeyValueStoreClient
from ._request_queue_client import RedisRequestQueueClient
from ._storage_client import RedisStorageClient

__all__ = ['RedisDatasetClient', 'RedisKeyValueStoreClient', 'RedisRequestQueueClient', 'RedisStorageClient']
