refactor!: Introduce new storage client system #1194
Merged
45 commits:
f285707 refactor!: Introduce new storage client system (vdusek)
dd9be6e Cleanup (vdusek)
89bfa5b Address feedback (vdusek)
4050c75 Add purge_if_needed method and improve some typing based on Pylance (vdusek)
26f46e2 Address more feedback (vdusek)
c83a36a RQ FS client improvements (vdusek)
c967fe5 Add caching to RQ FS client (vdusek)
7df046f RQ FS performance optimization in add_requests (vdusek)
3555565 RQ FS performance issues in fetch_next_request (vdusek)
946d1e2 RQ FS fetch performance for is_empty (vdusek)
9f10b95 rm code duplication for open methods (vdusek)
0864ff8 Request loaders use async getters for handled/total req cnt (vdusek)
af0d129 Add missing_ok when removing files (vdusek)
9998a58 Improve is_empty (vdusek)
fdee111 Optimize RQ memory storage client (vdusek)
79cdfc0 Add upgrading guide and skip problematic test (vdusek)
3d2fd73 Merge branch 'master' into new-storage-clients (vdusek)
e818585 chore: update `docusaurus-plugin-typedoc-api`, fix failing docs build (barjin)
65db9ac fix docs (vdusek)
2b786f7 add retries to atomic write (vdusek)
2cb04c5 chore(deps): update dependency pytest-cov to ~=6.2.0 (#1244) (renovate[bot])
0c8c4ec Fix atomic write on Windows (vdusek)
ce1eeb1 resolve write function during import time (vdusek)
4c05cee Merge branch 'master' into new-storage-clients (vdusek)
8c80513 Update file utils (vdusek)
70bc071 revert un-intentionally makefile changes (vdusek)
78efb4d Address Honza's comments (p1) (vdusek)
fa18d19 Introduce storage instance manager (vdusek)
c783dac Utilize recoverable state for the FS RQ state (vdusek)
437071e Details (vdusek)
df4bfa7 Rm default_"storage"_id options (were not used at all) (vdusek)
e133fcd Update storages guide and add storage clients guide (vdusek)
76f1ffb Docs guides - code examples (vdusek)
fa48644 Docs guides polishment (vdusek)
5c935af docs fix lint & type checks for py 3.9 (vdusek)
ac259ce Address Honza's feedback (vdusek)
1cbf15e SDK fixes (vdusek)
bc50990 Add KVS record_exists method (vdusek)
d1cf967 reduce test duplicities for storages & storage clients (vdusek)
aa9bfd3 Create locks in async context only (vdusek)
d6c9877 rm open methods from base storage clients (vdusek)
3b133ce update storage clients inits (vdusek)
43b9fe9 async metadata getter (vdusek)
b628fbb better typing in storage instance manager (vdusek)
9dfac4b update upgrading guide (vdusek)
20 changes: 20 additions & 0 deletions
20
docs/guides/code_examples/storages/cleaning_purge_explicitly_example.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open a named dataset.
    dataset = await Dataset.open(name='my-dataset')

    # Purge the dataset explicitly - purging removes all items from the dataset
    # but keeps the dataset itself and its metadata.
    await dataset.purge()

    # Or drop the dataset completely, which removes the dataset and all its items.
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())
```
New file (197 additions):
---
id: storage-clients
title: Storage clients
description: How to work with storage clients in Crawlee, including the built-in clients and how to create your own.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

Storage clients in Crawlee are subclasses of <ApiLink to="class/StorageClient">`StorageClient`</ApiLink>. They handle interactions with different storage backends. For instance:

- <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>: Stores data purely in memory with no persistence.
- <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink>: Provides persistent file system storage with in-memory caching for better performance.
- [`ApifyStorageClient`](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient): Manages storage on the [Apify platform](https://apify.com). The Apify storage client is implemented in the [Apify SDK](https://github.com/apify/apify-sdk-python).

Each storage client is responsible for maintaining the storages in a specific environment. This abstraction makes it easier to switch between different environments, e.g. between local development and a cloud production setup.

Storage clients provide a unified interface for interacting with <ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, and <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, regardless of the underlying storage implementation. They handle operations like creating, reading, updating, and deleting storage instances, as well as managing data persistence and cleanup.
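
A minimal sketch of this uniformity, assuming the standard `Dataset` API (`open`, `push_data`, `get_data`, `drop`); the dataset names are hypothetical, and `get_data()` returning a page object with an `.items` attribute is an assumption:

```python
import asyncio

from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    # The same Dataset calls work against either backend; only the client differs.
    for storage_client in (MemoryStorageClient(), FileSystemStorageClient()):
        dataset = await Dataset.open(
            name=f'demo-{type(storage_client).__name__.lower()}',  # Hypothetical names.
            storage_client=storage_client,
        )
        await dataset.push_data({'url': 'https://example.com'})
        data = await dataset.get_data()
        print(type(storage_client).__name__, data.items)
        # Drop the dataset so repeated runs start clean.
        await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())
```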

## Built-in storage clients

Crawlee Python currently provides two main storage client implementations:

### Memory storage client

The <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> stores all data in memory using Python data structures. It provides fast access but does not persist data between runs, meaning all data is lost when the program terminates.

```python
from crawlee.storage_clients import MemoryStorageClient
from crawlee.crawlers import ParselCrawler

# Create the memory storage client.
storage_client = MemoryStorageClient()

# Pass it directly to the crawler.
crawler = ParselCrawler(storage_client=storage_client)
```

The `MemoryStorageClient` is a good choice for testing, development, or short-lived operations where speed is more important than data persistence. It is not suitable for production use or long-running crawls, as all data is lost when the program exits.

### File system storage client

The <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> provides persistent storage by writing data directly to the file system. It uses smart caching and batch processing for better performance while storing data in a human-readable JSON format.

This storage client is ideal for large datasets and long-running operations where data persistence is required. Data can be easily inspected and shared with other tools.

```python
from crawlee.storage_clients import FileSystemStorageClient
from crawlee.crawlers import ParselCrawler

# Create the file system storage client.
storage_client = FileSystemStorageClient()

# Pass it directly to the crawler.
crawler = ParselCrawler(storage_client=storage_client)
```

Configuration options for the <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> can be set through environment variables or the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class:

- **`storage_dir`** (env: `CRAWLEE_STORAGE_DIR`, default: `'./storage'`): The root directory for all storage data.
- **`purge_on_start`** (env: `CRAWLEE_PURGE_ON_START`, default: `True`): Whether to purge default storages on start.

Data is stored using the following directory structure:
```text
{CRAWLEE_STORAGE_DIR}/
├── datasets/
│   └── {DATASET_NAME}/
│       ├── __metadata__.json
│       ├── 000000001.json
│       └── 000000002.json
├── key_value_stores/
│   └── {KVS_NAME}/
│       ├── __metadata__.json
│       ├── key1.json
│       ├── key2.txt
│       └── key3.json
└── request_queues/
    └── {RQ_NAME}/
        ├── __metadata__.json
        ├── {REQUEST_ID_1}.json
        └── {REQUEST_ID_2}.json
```

Where:

- `{CRAWLEE_STORAGE_DIR}`: The root directory for local storage.
- `{DATASET_NAME}`, `{KVS_NAME}`, `{RQ_NAME}`: The unique names for each storage instance (defaults to `"default"`).
- Files are stored directly, without additional metadata files, for a simpler structure.

For example, you can override both options through the <ApiLink to="class/Configuration">`Configuration`</ApiLink> class:
```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient
from crawlee.crawlers import ParselCrawler

configuration = Configuration(
    storage_dir='./my_storage',
    purge_on_start=False,
)
storage_client = FileSystemStorageClient(configuration=configuration)
crawler = ParselCrawler(storage_client=storage_client)
```
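
The same two options can come from the environment instead; a small sketch, assuming the variables are read when Crawlee's configuration is loaded (so they must be set before that happens):

```python
import os

# Assumption: these must be set before Crawlee reads its configuration.
os.environ['CRAWLEE_STORAGE_DIR'] = './my_storage'
os.environ['CRAWLEE_PURGE_ON_START'] = 'false'

from crawlee.storage_clients import FileSystemStorageClient

# The client should now pick up the values from the environment.
storage_client = FileSystemStorageClient()
```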

:::warning Concurrency limitation
The `FileSystemStorageClient` is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.
:::
## Creating a custom storage client

A custom storage client consists of two parts: the storage client factory and the individual storage type clients. The <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> acts as a factory that creates the specific clients (<ApiLink to="class/DatasetClient">`DatasetClient`</ApiLink>, <ApiLink to="class/KeyValueStoreClient">`KeyValueStoreClient`</ApiLink>, <ApiLink to="class/RequestQueueClient">`RequestQueueClient`</ApiLink>) where the actual storage logic is implemented.
```python
from __future__ import annotations

# First, implement the specific storage clients by subclassing the abstract base classes:

from crawlee.storage_clients._base import DatasetClient, KeyValueStoreClient, RequestQueueClient


class CustomDatasetClient(DatasetClient):
    # Implement all abstract methods for dataset operations.
    pass


class CustomKeyValueStoreClient(KeyValueStoreClient):
    # Implement all abstract methods for key-value store operations.
    pass


class CustomRequestQueueClient(RequestQueueClient):
    # Implement all abstract methods for request queue operations.
    pass


# Then implement the storage client that provides these specific clients:

from crawlee.storage_clients import StorageClient
from crawlee.configuration import Configuration


class CustomStorageClient(StorageClient):
    async def create_dataset_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomDatasetClient:
        # Create an instance of the custom dataset client and return it.
        pass

    async def create_kvs_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomKeyValueStoreClient:
        # Create an instance of the custom key-value store client and return it.
        pass

    async def create_rq_client(
        self,
        *,
        id: str | None = None,
        name: str | None = None,
        configuration: Configuration | None = None,
    ) -> CustomRequestQueueClient:
        # Create an instance of the custom request queue client and return it.
        pass
```

Custom storage clients can implement any storage logic, such as connecting to a database, using a cloud storage service, or integrating with other systems. They must implement the required methods for creating, reading, updating, and deleting data in the respective storages.
## Registering storage clients

Custom storage clients can be registered with the <ApiLink to="class/ServiceLocator">`ServiceLocator`</ApiLink> or passed directly to the crawler or a specific storage. This allows you to use your custom storage implementation seamlessly with Crawlee's abstractions.
```python
from crawlee.crawlers import ParselCrawler
from crawlee.service_locator import service_locator
from crawlee.storages import Dataset

# CustomStorageClient is the custom implementation from the previous example.
storage_client = CustomStorageClient()

# Register it with the service locator.
service_locator.set_storage_client(storage_client)

# Or pass it directly to the crawler.
crawler = ParselCrawler(storage_client=storage_client)

# Or provide it when opening a storage (e.g. a dataset).
dataset = await Dataset.open(
    name='my_dataset',
    storage_client=storage_client,
)
```
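
The same pattern should extend to the other storage types; a short sketch, assuming `KeyValueStore.open` and `RequestQueue.open` accept the same `storage_client` parameter as `Dataset.open`:

```python
from crawlee.storages import KeyValueStore, RequestQueue

# Assumption: these open() methods mirror Dataset.open's storage_client parameter.
kvs = await KeyValueStore.open(name='my_kvs', storage_client=storage_client)
rq = await RequestQueue.open(name='my_rq', storage_client=storage_client)
```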

## Conclusion

Storage clients in Crawlee provide different backends for storages. Use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> for testing and fast operations without persistence, or <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink> for environments where data needs to persist. You can also create custom storage clients for specialized backends by implementing the <ApiLink to="class/StorageClient">`StorageClient`</ApiLink> interface. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!