feat: Implement Scrapy HTTP cache backend #403

Merged

Commits (18 total; the diff below reflects changes from 17 of them)
30c1e97 feat(scrapy): add Scrapy cache using Apify KV store (honzajavorek)
96f1142 style: format code (honzajavorek)
40076f3 fix: make linter happy (honzajavorek)
eb5b73a fix: return back nested syntax (honzajavorek)
ece55b1 feat: introduce the extensions package (honzajavorek)
468d3d2 refactor: make linter happy (honzajavorek)
48ee026 fix: don't use public properties (honzajavorek)
f6701db fix: rename module to httpcache and fix tests location (honzajavorek)
21ce5fc docs: improve docstring of ApifyCacheStorage, describe usage (honzajavorek)
c19a93e refactor: rename kv to kvs per convention (honzajavorek)
d8adf62 feat: set HTTPCACHE_STORAGE in apply_apify_settings, document usage (honzajavorek)
1b02e45 docs: improve stylistics (honzajavorek)
b6969ba docs: document workaround for https://github.com/apify/actor-template… (honzajavorek)
b40d555 docs: improve stylistics (honzajavorek)
d07cbe1 fix: change public path of the ApifyCacheStorage class (honzajavorek)
41af4df Update run_code_checks.yaml (vdusek)
4ff56d7 feat: support wider variety of spider names (honzajavorek)
22af7ac fix: truncate too long kvs names (honzajavorek)
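Commit d8adf62 wires the new backend into `apply_apify_settings`, so Scrapy projects that already route their settings through that helper pick up `HTTPCACHE_STORAGE` automatically. A minimal sketch of such a call site, assuming the helper's import path and keyword signature from the Apify SDK (neither is shown in this PR):

from scrapy.utils.project import get_project_settings

from apify.scrapy.utils import apply_apify_settings  # assumed import path

# Merge the Scrapy project's settings with Apify-specific overrides;
# per commit d8adf62, this now also sets HTTPCACHE_STORAGE.
settings = apply_apify_settings(settings=get_project_settings())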
New file: the apify.scrapy.extensions package __init__.py
@@ -0,0 +1,3 @@
from apify.scrapy.extensions._httpcache import ApifyCacheStorage

__all__ = ['ApifyCacheStorage']
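Commit d07cbe1 makes this re-export the public path of the class, so downstream code imports it from the package rather than from the private module:

from apify.scrapy.extensions import ApifyCacheStorage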
New file: the apify.scrapy.extensions._httpcache module
@@ -0,0 +1,190 @@
from __future__ import annotations

import gzip
import io
import pickle
import re
import struct
from logging import getLogger
from time import time
from typing import TYPE_CHECKING

from scrapy.http.headers import Headers
from scrapy.responsetypes import responsetypes

from apify import Configuration
from apify.apify_storage_client import ApifyStorageClient
from apify.scrapy._async_thread import AsyncThread
from apify.storages import KeyValueStore

if TYPE_CHECKING:
    from scrapy import Request, Spider
    from scrapy.http.response import Response
    from scrapy.settings import BaseSettings
    from scrapy.utils.request import RequestFingerprinterProtocol

logger = getLogger(__name__)


class ApifyCacheStorage:
    """A Scrapy cache storage that uses the Apify `KeyValueStore` to store responses.

    It can be set as a storage for Scrapy's built-in `HttpCacheMiddleware`, which caches
    responses to requests. See HTTPCache middleware settings (prefixed with `HTTPCACHE_`)
    in the Scrapy documentation for more information. Requires the asyncio Twisted reactor
    to be installed.
    """

    def __init__(self, settings: BaseSettings) -> None:
        self._expiration_max_items = 100
        self._expiration_secs: int = settings.getint('HTTPCACHE_EXPIRATION_SECS')
        self._spider: Spider | None = None
        self._kvs: KeyValueStore | None = None
        self._fingerprinter: RequestFingerprinterProtocol | None = None
        self._async_thread: AsyncThread | None = None

    def open_spider(self, spider: Spider) -> None:
        """Open the cache storage for a spider."""
        logger.debug('Using Apify key value cache storage', extra={'spider': spider})
        self._spider = spider
        self._fingerprinter = spider.crawler.request_fingerprinter
        kvs_name = get_kvs_name(spider.name)

        async def open_kvs() -> KeyValueStore:
            config = Configuration.get_global_configuration()
            if config.is_at_home:
                storage_client = ApifyStorageClient.from_config(config)
                return await KeyValueStore.open(name=kvs_name, storage_client=storage_client)
            return await KeyValueStore.open(name=kvs_name)

        logger.debug("Starting background thread for cache storage's event loop")
        self._async_thread = AsyncThread()
        logger.debug(f"Opening cache storage's {kvs_name!r} key value store")
        self._kvs = self._async_thread.run_coro(open_kvs())

    def close_spider(self, _: Spider, current_time: int | None = None) -> None:
        """Close the cache storage for a spider."""
        if self._async_thread is None:
            raise ValueError('Async thread not initialized')

        logger.info(f'Cleaning up cache items (max {self._expiration_max_items})')
        if self._expiration_secs > 0:
            if current_time is None:
                current_time = int(time())

            async def expire_kvs() -> None:
                if self._kvs is None:
                    raise ValueError('Key value store not initialized')
                i = 0
                async for item in self._kvs.iterate_keys():
                    value = await self._kvs.get_value(item.key)
                    try:
                        gzip_time = read_gzip_time(value)
                    except Exception as e:
                        logger.warning(f'Malformed cache item {item.key}: {e}')
                        await self._kvs.set_value(item.key, None)
                    else:
                        if self._expiration_secs < current_time - gzip_time:
                            logger.debug(f'Expired cache item {item.key}')
                            await self._kvs.set_value(item.key, None)
                        else:
                            logger.debug(f'Valid cache item {item.key}')
                    if i == self._expiration_max_items:
                        break
                    i += 1

            self._async_thread.run_coro(expire_kvs())

        logger.debug('Closing cache storage')
        try:
            self._async_thread.close()
        except KeyboardInterrupt:
            logger.warning('Shutdown interrupted by KeyboardInterrupt!')
        except Exception:
            logger.exception('Exception occurred while shutting down cache storage')
        finally:
            logger.debug('Cache storage closed')

    def retrieve_response(self, _: Spider, request: Request, current_time: int | None = None) -> Response | None:
        """Retrieve a response from the cache storage."""
        if self._async_thread is None:
            raise ValueError('Async thread not initialized')
        if self._kvs is None:
            raise ValueError('Key value store not initialized')
        if self._fingerprinter is None:
            raise ValueError('Request fingerprinter not initialized')

        key = self._fingerprinter.fingerprint(request).hex()
        value = self._async_thread.run_coro(self._kvs.get_value(key))

        if value is None:
            logger.debug('Cache miss', extra={'request': request})
            return None

        if current_time is None:
            current_time = int(time())
        if 0 < self._expiration_secs < current_time - read_gzip_time(value):
            logger.debug('Cache expired', extra={'request': request})
            return None

        data = from_gzip(value)
        url = data['url']
        status = data['status']
        headers = Headers(data['headers'])
        body = data['body']
        respcls = responsetypes.from_args(headers=headers, url=url, body=body)

        logger.debug('Cache hit', extra={'request': request})
        return respcls(url=url, headers=headers, status=status, body=body)

    def store_response(self, _: Spider, request: Request, response: Response) -> None:
        """Store a response in the cache storage."""
        if self._async_thread is None:
            raise ValueError('Async thread not initialized')
        if self._kvs is None:
            raise ValueError('Key value store not initialized')
        if self._fingerprinter is None:
            raise ValueError('Request fingerprinter not initialized')

        key = self._fingerprinter.fingerprint(request).hex()
        data = {
            'status': response.status,
            'url': response.url,
            'headers': dict(response.headers),
            'body': response.body,
        }
        value = to_gzip(data)
        self._async_thread.run_coro(self._kvs.set_value(key, value))


def to_gzip(data: dict, mtime: int | None = None) -> bytes:
    """Dump a dictionary to a gzip-compressed byte stream."""
    with io.BytesIO() as byte_stream:
        with gzip.GzipFile(fileobj=byte_stream, mode='wb', mtime=mtime) as gzip_file:
            pickle.dump(data, gzip_file, protocol=4)
        return byte_stream.getvalue()


def from_gzip(gzip_bytes: bytes) -> dict:
    """Load a dictionary from a gzip-compressed byte stream."""
    with io.BytesIO(gzip_bytes) as byte_stream, gzip.GzipFile(fileobj=byte_stream, mode='rb') as gzip_file:
        data: dict = pickle.load(gzip_file)
        return data


def read_gzip_time(gzip_bytes: bytes) -> int:
    """Read the modification time from a gzip-compressed byte stream without decompressing the data."""
    header = gzip_bytes[:10]
    header_components = struct.unpack('<HBBI2B', header)
    mtime: int = header_components[3]
    return mtime


def get_kvs_name(spider_name: str) -> str:
    """Get the key value store name for a spider."""
    slug = re.sub(r'[^a-zA-Z0-9-]', '-', spider_name)
    slug = re.sub(r'-+', '-', slug)
    slug = slug.strip('-')
    if not slug:
        raise ValueError(f'Unsupported spider name: {spider_name!r} (slug: {slug!r})')
    return f'httpcache-{slug}'
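As the class docstring above describes, the backend plugs into Scrapy's stock `HttpCacheMiddleware`. A minimal sketch of the relevant project settings; `HTTPCACHE_*` and `TWISTED_REACTOR` are standard Scrapy setting names, and the expiration value is illustrative only:

# Sketch of settings.py entries for enabling the backend; values are illustrative.
HTTPCACHE_ENABLED = True  # turn on Scrapy's built-in HttpCacheMiddleware
HTTPCACHE_STORAGE = 'apify.scrapy.extensions.ApifyCacheStorage'  # public path from this PR
HTTPCACHE_EXPIRATION_SECS = 7 * 24 * 60 * 60  # 0 means cached responses never expire
# The backend runs its own event loop on a background thread and, per the
# docstring above, requires the asyncio Twisted reactor.
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Note the expiry design: each stored value carries its write time in the gzip header's mtime field, so `retrieve_response` checks freshness by parsing the first ten bytes of the stored value instead of keeping separate metadata, and `close_spider` sweeps at most `_expiration_max_items` (100) keys per run, deleting an entry by setting its value to `None`.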
Empty file.
New file: unit tests for the _httpcache helpers
@@ -0,0 +1,70 @@
from time import time

import pytest

from apify.scrapy.extensions._httpcache import from_gzip, get_kvs_name, read_gzip_time, to_gzip

FIXTURE_DICT = {'name': 'Alice'}

FIXTURE_BYTES = (
    b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xffk`\x99*\xcc\x00\x01\xb5SzX\xf2\x12s'
    b'S\xa7\xf4\xb0:\xe6d&\xa7N)\xd6\x03\x00\x1c\xe8U\x9c\x1e\x00\x00\x00'
)


def test_gzip() -> None:
    assert from_gzip(to_gzip(FIXTURE_DICT)) == FIXTURE_DICT


def test_to_gzip() -> None:
    data_bytes = to_gzip(FIXTURE_DICT, mtime=0)

    assert data_bytes == FIXTURE_BYTES


def test_from_gzip() -> None:
    data_dict = from_gzip(FIXTURE_BYTES)

    assert data_dict == FIXTURE_DICT


def test_read_gzip_time() -> None:
    assert read_gzip_time(FIXTURE_BYTES) == 0


def test_read_gzip_time_non_zero() -> None:
    current_time = int(time())
    data_bytes = to_gzip(FIXTURE_DICT, mtime=current_time)

    assert read_gzip_time(data_bytes) == current_time


@pytest.mark.parametrize(
    ('spider_name', 'expected'),
    [
        ('test', 'httpcache-test'),
        ('123', 'httpcache-123'),
        ('test-spider', 'httpcache-test-spider'),
        ('test_spider', 'httpcache-test-spider'),
        ('test spider', 'httpcache-test-spider'),
        ('test👻spider', 'httpcache-test-spider'),
        ('test@spider', 'httpcache-test-spider'),
        (' test spider ', 'httpcache-test-spider'),
        ('testspider.com', 'httpcache-testspider-com'),
    ],
)
def test_get_kvs_name(spider_name: str, expected: str) -> None:
    assert get_kvs_name(spider_name) == expected


@pytest.mark.parametrize(
    ('spider_name'),
    [
        '',
        '-',
        '-@-/-',
    ],
)
def test_get_kvs_name_raises(spider_name: str) -> None:
    with pytest.raises(ValueError, match='Unsupported spider name'):
        assert get_kvs_name(spider_name)
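FIXTURE_BYTES doubles as a worked example of the header trick `read_gzip_time` relies on: per RFC 1952, a gzip stream begins with a fixed 10-byte header (2-byte magic, compression method, flags, 4-byte little-endian mtime, extra flags, OS byte), so the timestamp is readable without decompression. A standalone sketch, independent of the code above:

import gzip
import struct
import time

# Compress with a known mtime, then recover it straight from the 10-byte header.
now = int(time.time())
payload = gzip.compress(b'hello', mtime=now)
magic, method, flags, mtime, xfl, os_byte = struct.unpack('<HBBI2B', payload[:10])
assert magic == 0x8B1F  # the \x1f\x8b gzip magic, read little-endian
assert method == 8  # DEFLATE
assert mtime == now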