feat: Add RedisStorageClient based on Redis v8.0+ (#1406)
### Description
This PR implements a storage client `RedisStorageClient` based on Redis
v8+. The minimum version 8 requirement is due to the fact that all data
structures used are only available starting from Redis Open-Source
version 8, without any additional extensions.
### Testing
* Added new unit tests
* For testing without actual Redis usage,
[`fakeredis`](https://fakeredis.readthedocs.io/en/latest/) is used
Storage clients provide a unified interface for interacting with <ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, regardless of the underlying implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup. This abstraction makes it easy to switch between different environments, such as local development and cloud production setups.
@@ -26,7 +28,8 @@ Crawlee provides three main storage client implementations:
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> - Provides persistent file system storage with in-memory caching.
- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> - Stores data in memory with no persistence.
-- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> – Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: 'crawlee[sql_sqlite]' for SQLite or 'crawlee[sql_postgres]' for PostgreSQL.
+- <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> - Provides persistent storage using a SQL database ([SQLite](https://sqlite.org/) or [PostgreSQL](https://www.postgresql.org/)). Requires installing the extra dependency: `crawlee[sql_sqlite]` for SQLite or `crawlee[sql_postgres]` for PostgreSQL.
+- <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> - Provides persistent storage using a [Redis](https://redis.io/) database v8.0+. Requires installing the extra dependency `crawlee[redis]`.
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient) - Manages storage on the [Apify platform](https://apify.com), implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).

```mermaid
@@ -56,6 +59,8 @@ class MemoryStorageClient
class SqlStorageClient

+class RedisStorageClient
+
class ApifyStorageClient

%% ========================
@@ -65,6 +70,7 @@ class ApifyStorageClient
StorageClient --|> FileSystemStorageClient
StorageClient --|> MemoryStorageClient
StorageClient --|> SqlStorageClient
+StorageClient --|> RedisStorageClient
StorageClient --|> ApifyStorageClient
```
@@ -304,15 +310,181 @@ Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageCl
Configuration options for the <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> can be set via constructor arguments:

-- **`connection_string`** (default: SQLite in <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage dir) – SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`.
+- **`connection_string`** (default: SQLite in <ApiLink to="class/Configuration">`Configuration`</ApiLink> storage dir) - SQLAlchemy connection string, e.g. `sqlite+aiosqlite:///my.db` or `postgresql+asyncpg://user:pass@host/db`.
For advanced scenarios, you can configure <ApiLink to="class/SqlStorageClient">`SqlStorageClient`</ApiLink> with a custom SQLAlchemy engine and additional options via the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class. This is useful, for example, when connecting to an external PostgreSQL database or customizing connection pooling.
+The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> is experimental. Its API and behavior may change in future releases.
+:::
+
+The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> provides persistent storage using [Redis](https://redis.io/) database. It supports concurrent access from multiple independent clients or processes and uses Redis native data structures for efficient operations.
+
+:::note dependencies
+The <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> is not included in the core Crawlee package.
+To use it, you need to install Crawlee with the Redis extra dependency:
+
+<code>pip install 'crawlee[redis]'</code>
+
+Additionally, Redis version 8.0 or higher is required.
+:::
+
+:::note Redis persistence
+Data persistence in Redis depends on your [database configuration](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/).
+:::
+
+The client requires either a Redis connection string or a pre-configured Redis client instance. Use a pre-configured client when you need custom Redis settings such as connection pooling, timeouts, or SSL/TLS encryption.
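
For readers unfamiliar with Redis connection strings, the following standard-library sketch shows what a URL such as `redis://localhost:6379/0` encodes. The helper `describe_redis_url` is hypothetical and written only for illustration; it is not part of Crawlee and is not how `RedisStorageClient` processes the string. It assumes the common `redis://` (plain) and `rediss://` (TLS) URL schemes and the default Redis port 6379.

```python
from urllib.parse import urlsplit


def describe_redis_url(url: str) -> dict:
    """Split a Redis connection string into its parts (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme not in {"redis", "rediss"}:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # The URL path carries the logical database index; fall back to 0 if absent.
    db = int(parts.path.lstrip("/") or 0)
    return {
        "tls": parts.scheme == "rediss",  # 'rediss' means a TLS connection
        "host": parts.hostname,
        "port": parts.port or 6379,  # 6379 is the default Redis port
        "db": db,
    }


info = describe_redis_url("redis://localhost:6379/0")
print(info)
```

Anything a URL cannot express conveniently, such as pool sizing or fine-grained timeouts, is the case where the text above suggests passing a pre-configured client instead.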
+request_queues:[name]:pending_set - Set | default queue_dedup_strategy
+request_queues:[name]:handled_set - Set | default queue_dedup_strategy
+request_queues:[name]:metadata - JSON Object
+}
+
+class RequestQueuesIndexes {
+request_queues:id_to_name - Hash
+request_queues:name_to_id - Hash
+}
+
+%% ========================
+%% Client to Keys arrows
+%% ========================
+
+RedisRequestQueueClient --> RequestQueueKeys
+RedisRequestQueueClient --> RequestQueuesIndexes
```
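
The key naming visible in the diagram above can be sketched as a small helper. This is an illustration only: `request_queue_keys` and `INDEX_KEYS` are hypothetical names, not Crawlee APIs, and the sketch covers only the keys shown in this part of the diagram, assuming `[name]` is replaced by the queue name.

```python
def request_queue_keys(name: str) -> dict[str, str]:
    """Build the per-queue Redis key names shown in the diagram (hypothetical helper)."""
    prefix = f"request_queues:{name}"
    return {
        "pending_set": f"{prefix}:pending_set",  # Set of pending requests
        "handled_set": f"{prefix}:handled_set",  # Set of handled requests
        "metadata": f"{prefix}:metadata",  # JSON object with queue metadata
    }


# Index hashes shared by all request queues, mapping IDs to names and back.
INDEX_KEYS = ("request_queues:id_to_name", "request_queues:name_to_id")

keys = request_queue_keys("default")
print(keys["pending_set"])
```

Prefixing every key with the storage type and queue name keeps each queue's data separate while letting several independent clients share one Redis database.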
+
+Configuration options for the <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> can be set through environment variables or the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class:
+
+- **`purge_on_start`** (env: `CRAWLEE_PURGE_ON_START`, default: `True`) - Whether to purge default storages on start.
+
+Configuration options for the <ApiLink to="class/RedisStorageClient">`RedisStorageClient`</ApiLink> can be set via constructor arguments:
+
+- **`connection_string`** - Redis connection string, e.g. `redis://localhost:6379/0`.
A storage client consists of two parts: the storage client factory and individual storage type clients. The <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> acts as a factory that creates specific clients (<ApiLink to="class/DatasetClient">`DatasetClient`</ApiLink>, <ApiLink to="class/KeyValueStoreClient">`KeyValueStoreClient`</ApiLink>, <ApiLink to="class/RequestQueueClient">`RequestQueueClient`</ApiLink>) where the actual storage logic is implemented.