Merged

60 commits
5c437c9
Rm old Apify storage clients
vdusek Apr 28, 2025
bf55338
Add init version of new Apify storage clients
vdusek May 9, 2025
6b2f82b
Move specific models from Crawlee to SDK
vdusek Jun 12, 2025
38bef68
Adapt to Crawlee v1
vdusek Jun 18, 2025
1f85430
Adapt to Crawlee v1 (p2)
vdusek Jun 23, 2025
a3d68a2
Fix default storage IDs
vdusek Jun 25, 2025
c77e8d5
Fix integration test and Not implemented exception in purge
vdusek Jun 26, 2025
8731aff
Fix unit tests
vdusek Jun 26, 2025
8dfaffb
fix lint
vdusek Jun 26, 2025
53fad07
add KVS record_exists not implemented
vdusek Jun 26, 2025
5869f8e
update to apify client 1.12 and implement record exists
vdusek Jun 26, 2025
82e65fc
Move default storage IDs to Configuration
vdusek Jun 27, 2025
8de950b
opening storages get default id from config
vdusek Jun 27, 2025
98b76c5
Addressing more feedback
vdusek Jun 27, 2025
7b5ee07
Fixing integration test test_push_large_data_chunks_over_9mb
vdusek Jun 27, 2025
afcb8c7
Abstract open method is removed from storage clients
vdusek Jun 30, 2025
3bacab7
fixing generate public url for KVS records
vdusek Jun 30, 2025
287a119
add async metadata getters
vdusek Jul 1, 2025
e45d65b
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 1, 2025
51178ca
better usage of apify config
vdusek Jul 1, 2025
3cd7dfe
renaming
vdusek Jul 2, 2025
6fe9eb3
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 3, 2025
1547cbd
fixes after merge commit
vdusek Jul 3, 2025
bb47efc
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 4, 2025
4e4fa93
Change from orphan commit to master in crawlee version
Pijukatel Jul 9, 2025
683cb31
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 9, 2025
e5b2bc4
fix encrypted secrets test
vdusek Jul 9, 2025
638756f
Add Apify's version of FS client that keeps the INPUT json
vdusek Jul 10, 2025
931b0ca
update metadata fixes
vdusek Jul 16, 2025
ad7c0d8
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 16, 2025
1f3c481
KVS metadata extended model
vdusek Jul 16, 2025
44d8e09
fix url signing secret key
vdusek Jul 16, 2025
ca72313
Apify storage client fixes and new docs groups
vdusek Jul 19, 2025
bc61fee
Add test for `RequestQueue.is_finished`
Pijukatel Jul 21, 2025
16b76dd
Check `_queue_has_locked_requests` in `is_empty`
Pijukatel Jul 21, 2025
b6e8a5f
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 22, 2025
a3f8c6e
Package structure update
vdusek Jul 22, 2025
594a8e5
Fix request list (HttpResponse.read is now async)
vdusek Jul 22, 2025
e1afe2d
init upgrading guide to v3
vdusek Jul 24, 2025
8ce6902
addres RQ feedback from Pepa
vdusek Jul 25, 2025
42810f0
minor RQ client update
vdusek Jul 25, 2025
9edac0f
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 28, 2025
ec2a9f0
Fix 2 tests in RQ Apify storage client
vdusek Jul 29, 2025
f82d110
Merge branch 'master' into new-apify-storage-clients
vdusek Jul 30, 2025
71ac38d
Update request queue to use manual request tracking
vdusek Aug 3, 2025
a8881dd
httpx vs impit
vdusek Aug 3, 2025
f5189c5
Merge branch 'master' into new-apify-storage-clients
vdusek Aug 5, 2025
89e572e
rm broken crawlers integration tests
vdusek Aug 5, 2025
ae3044e
Try to patch the integration tests for the crawlee branch
Pijukatel Aug 5, 2025
4bc5c91
Add deduplication and test
Pijukatel Aug 5, 2025
70908b3
Add logging for debug
Pijukatel Aug 6, 2025
91ff3fd
Format and type check
Pijukatel Aug 6, 2025
03dcb15
Keep only relevant log
Pijukatel Aug 7, 2025
65b297a
Update to handle parallel requests with same links
Pijukatel Aug 7, 2025
079f890
Merge remote-tracking branch 'origin/master' into add-deduplication
Pijukatel Aug 13, 2025
2c3d0ce
Handle unprocessed requests in deduplication cache correctly
Pijukatel Aug 13, 2025
329baed
Adress review comments
Pijukatel Aug 15, 2025
978d49e
Add deduplication test for `use_extended_unique_key` requests
Pijukatel Aug 15, 2025
1b92532
Do early response validation
Pijukatel Aug 15, 2025
cfdb1e2
Merge remote-tracking branch 'origin/master' into add-deduplication
Pijukatel Aug 15, 2025
69 changes: 59 additions & 10 deletions src/apify/storage_clients/_apify/_request_queue_client.py
@@ -242,17 +242,66 @@ async def add_batch_of_requests(
Returns:
Response containing information about the added requests.
"""
# Prepare requests for API by converting to dictionaries.
requests_dict = [
request.model_dump(
by_alias=True,
exclude={'id'}, # Exclude ID fields from requests since the API doesn't accept them.
)
for request in requests
]
# Do not try to add previously added requests to avoid pointless expensive calls to API

new_requests: list[Request] = []
already_present_requests: list[dict[str, str | bool]] = []

for request in requests:
if self._requests_cache.get(request.id):
Contributor:

Judging by apify/crawlee#3120, a day may come when we try to limit the size of _requests_cache somehow. Perhaps we should think ahead and come up with a more space-efficient way of tracking already added requests?

EDIT: hold up a minute, do you use the ID here for deduplication instead of the unique key?

Contributor Author (Pijukatel, Aug 15, 2025):

Since there is a deterministic transformation function, unique_key_to_request_id, which respects the Apify platform's way of creating IDs, this seems OK. If someone starts creating Requests with a custom id, deduplication will most likely stop working.

There are two issues I created based on the discussion about this.
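For illustration, a minimal sketch of the cache-based deduplication described above, assuming unique_key_to_request_id is a deterministic hash of the unique key (the real helper lives in Crawlee; the hashing details below are an assumption, not the platform's actual algorithm):

import hashlib

def unique_key_to_request_id(unique_key: str, *, length: int = 15) -> str:
    # Assumed behavior: a deterministic ID derived from the unique key.
    return hashlib.sha256(unique_key.encode('utf-8')).hexdigest()[:length]

requests_cache: dict[str, dict] = {}

def is_new(unique_key: str) -> bool:
    # Only requests whose derived ID is not cached yet should trigger an API call.
    request_id = unique_key_to_request_id(unique_key)
    if request_id in requests_cache:
        return False
    requests_cache[request_id] = {'uniqueKey': unique_key, 'wasAlreadyPresent': True}
    return True

assert is_new('https://example.com')      # first time: would be sent to the API
assert not is_new('https://example.com')  # second time: deduplicated locally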

# We are no sure if it was already handled at this point, and it is not worth calling API for it.
Contributor:

Did you mean "We are now sure that it was already handled..."? I'm not sure 😁

Contributor Author:

Yes, that was not very clear. Updated.

already_present_requests.append(
{
'id': request.id,
'uniqueKey': request.unique_key,
'wasAlreadyPresent': True,
'wasAlreadyHandled': request.was_already_handled,
}
)

else:
# Add new request to the cache.
processed_request = ProcessedRequest.model_validate(
{
'id': request.id,
'uniqueKey': request.unique_key,
'wasAlreadyPresent': True,
'wasAlreadyHandled': request.was_already_handled,
}
)
self._cache_request(
unique_key_to_request_id(request.unique_key),
processed_request,
)
new_requests.append(request)

if new_requests:
# Prepare requests for API by converting to dictionaries.
requests_dict = [
request.model_dump(
by_alias=True,
exclude={'id'}, # Exclude ID fields from requests since the API doesn't accept them.
)
for request in new_requests
]

# Send requests to API.
response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
Contributor:

It's probably out of the scope of the PR, but it might be worth it to validate the response with a Pydantic model.

Contributor Author:

That already happens in the original code a few lines down: api_response = AddRequestsResponse.model_validate(response)

Contributor:

I'm sorry, I meant validating the whole response object with the two lists, so that you wouldn't need to do response['unprocessedRequests'].

Contributor Author:

I see, added.
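A hedged sketch of the whole-response validation being suggested; the field aliases follow the payload shapes visible in this diff, while the concrete model definitions are illustrative rather than the SDK's actual ones:

from __future__ import annotations

from typing import Optional

from pydantic import BaseModel, ConfigDict, Field

class ProcessedRequest(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    id: str
    unique_key: str = Field(alias='uniqueKey')
    was_already_present: bool = Field(alias='wasAlreadyPresent')
    was_already_handled: bool = Field(alias='wasAlreadyHandled')

class UnprocessedRequest(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    unique_key: str = Field(alias='uniqueKey')
    url: str
    method: Optional[str] = None

class AddRequestsResponse(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    processed_requests: list[ProcessedRequest] = Field(alias='processedRequests')
    unprocessed_requests: list[UnprocessedRequest] = Field(alias='unprocessedRequests')

# Validating the whole payload up front lets the client work with typed attributes
# (api_response.unprocessed_requests) instead of raw keys like response['unprocessedRequests'].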

# Add the locally known already present processed requests based on the local cache.
response['processedRequests'].extend(already_present_requests)

# Send requests to API.
response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
# Remove unprocessed requests from the cache
for unprocessed in response['unprocessedRequests']:
self._requests_cache.pop(unique_key_to_request_id(unprocessed['uniqueKey']), None)

else:
response = {'unprocessedRequests': [], 'processedRequests': already_present_requests}

logger.debug(
f'Tried to add new requests: {len(new_requests)}, '
f'succeeded to add new requests: {len(response["processedRequests"])}, '
f'skipped already present requests: {len(already_present_requests)}'
)

# Update assumed total count for newly added requests.
api_response = AddRequestsResponse.model_validate(response)
220 changes: 178 additions & 42 deletions tests/integration/test_actor_request_queue.py
@@ -1,20 +1,40 @@
from __future__ import annotations

from typing import TYPE_CHECKING
import asyncio
import logging
from typing import TYPE_CHECKING, Any
from unittest import mock

import pytest

from apify_shared.consts import ApifyEnvVars

from ._utils import generate_unique_resource_name
from apify import Actor, Request

if TYPE_CHECKING:
import pytest
from collections.abc import AsyncGenerator

from apify_client import ApifyClientAsync
from crawlee.storages import RequestQueue

from .conftest import MakeActorFunction, RunActorFunction


@pytest.fixture
async def apify_named_rq(
apify_client_async: ApifyClientAsync, monkeypatch: pytest.MonkeyPatch
) -> AsyncGenerator[RequestQueue]:
assert apify_client_async.token
monkeypatch.setenv(ApifyEnvVars.TOKEN, apify_client_async.token)
request_queue_name = generate_unique_resource_name('request_queue')

async with Actor:
request_queue = await Actor.open_request_queue(name=request_queue_name, force_cloud=True)
yield request_queue
await request_queue.drop()


async def test_same_references_in_default_rq(
make_actor: MakeActorFunction,
run_actor: RunActorFunction,
@@ -61,55 +81,171 @@ async def main() -> None:

async def test_force_cloud(
apify_client_async: ApifyClientAsync,
monkeypatch: pytest.MonkeyPatch,
apify_named_rq: RequestQueue,
) -> None:
assert apify_client_async.token is not None
monkeypatch.setenv(ApifyEnvVars.TOKEN, apify_client_async.token)
request_queue_id = (await apify_named_rq.get_metadata()).id
request_info = await apify_named_rq.add_request(Request.from_url('http://example.com'))
request_queue_client = apify_client_async.request_queue(request_queue_id)

request_queue_name = generate_unique_resource_name('request_queue')
request_queue_details = await request_queue_client.get()
assert request_queue_details is not None
assert request_queue_details.get('name') == apify_named_rq.name

async with Actor:
request_queue = await Actor.open_request_queue(name=request_queue_name, force_cloud=True)
request_queue_id = (await request_queue.get_metadata()).id
request_queue_request = await request_queue_client.get_request(request_info.id)
assert request_queue_request is not None
assert request_queue_request['url'] == 'http://example.com'

request_info = await request_queue.add_request(Request.from_url('http://example.com'))

request_queue_client = apify_client_async.request_queue(request_queue_id)
async def test_request_queue_is_finished(
apify_named_rq: RequestQueue,
) -> None:
request_queue = await Actor.open_request_queue(name=apify_named_rq.name, force_cloud=True)
await request_queue.add_request(Request.from_url('http://example.com'))
assert not await request_queue.is_finished()

try:
request_queue_details = await request_queue_client.get()
assert request_queue_details is not None
assert request_queue_details.get('name') == request_queue_name
request = await request_queue.fetch_next_request()
assert request is not None
assert not await request_queue.is_finished(), (
'RequestQueue should not be finished unless the request is marked as handled.'
)

request_queue_request = await request_queue_client.get_request(request_info.id)
assert request_queue_request is not None
assert request_queue_request['url'] == 'http://example.com'
finally:
await request_queue_client.delete()
await request_queue.mark_request_as_handled(request)
assert await request_queue.is_finished()


async def test_request_queue_is_finished(
apify_client_async: ApifyClientAsync,
monkeypatch: pytest.MonkeyPatch,
async def test_request_queue_deduplication(
make_actor: MakeActorFunction,
run_actor: RunActorFunction,
) -> None:
assert apify_client_async.token is not None
monkeypatch.setenv(ApifyEnvVars.TOKEN, apify_client_async.token)
"""Test that the deduplication works correctly. Try to add 2 same requests, but it should call API just once.

request_queue_name = generate_unique_resource_name('request_queue')
This tests an internal optimization that does not change user-facing behavior.
The function's input/output behave the same way; it only makes fewer API calls.
"""

async with Actor:
try:
request_queue = await Actor.open_request_queue(name=request_queue_name, force_cloud=True)
await request_queue.add_request(Request.from_url('http://example.com'))
assert not await request_queue.is_finished()

request = await request_queue.fetch_next_request()
assert request is not None
assert not await request_queue.is_finished(), (
'RequestQueue should not be finished unless the request is marked as handled.'
)

await request_queue.mark_request_as_handled(request)
assert await request_queue.is_finished()
finally:
await request_queue.drop()
async def main() -> None:
import asyncio

from apify import Actor, Request

async with Actor:
request = Request.from_url('http://example.com')
rq = await Actor.open_request_queue()

await asyncio.sleep(10) # Wait to be sure that metadata are updated

# Get raw client, because stats are not exposed in `RequestQueue` class, but are available in raw client
rq_client = Actor.apify_client.request_queue(request_queue_id=rq.id)
_rq = await rq_client.get()
assert _rq
stats_before = _rq.get('stats', {})
Actor.log.info(stats_before)

# Add same request twice
await rq.add_request(request)
await rq.add_request(request)
Contributor:

Maybe you could make two distinct Request instances with the same uniqueKey here?

Contributor Author:

Yes, good point. I also added one more test to make it explicit that deduplication works based on unique_key only, and that unless we use the use_extended_unique_key argument, some attributes of the request might be ignored. Another test makes this behavior clearly intentional to avoid confusion in the future.
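A hedged sketch of the reviewer's suggestion as a fragment of the test's main() above, assuming Request.from_url accepts an explicit unique_key; two distinct instances sharing a unique_key should result in a single write:

# `rq` is the request queue opened earlier in main().
first = Request.from_url('http://example.com', unique_key='same-key')
second = Request.from_url('http://example.com/other-path', unique_key='same-key')  # distinct instance

await rq.add_request(first)
await rq.add_request(second)  # deduplicated by unique_key, so writeCount should grow by 1 only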


await asyncio.sleep(10) # Wait to be sure that metadata are updated
_rq = await rq_client.get()
assert _rq
stats_after = _rq.get('stats', {})
Actor.log.info(stats_after)

assert (stats_after['writeCount'] - stats_before['writeCount']) == 1

actor = await make_actor(label='rq-deduplication', main_func=main)
run_result = await run_actor(actor)

assert run_result.status == 'SUCCEEDED'


async def test_request_queue_parallel_deduplication(
make_actor: MakeActorFunction,
run_actor: RunActorFunction,
) -> None:
"""Test that the deduplication works correctly even with parallel attempts to add same links."""

async def main() -> None:
import asyncio
import logging

from apify import Actor, Request

async with Actor:
logging.getLogger('apify.storage_clients._apify._request_queue_client').setLevel(logging.DEBUG)

requests = [Request.from_url(f'http://example.com/{i}') for i in range(100)]
rq = await Actor.open_request_queue()

await asyncio.sleep(10) # Wait to be sure that metadata are updated

# Get raw client, because stats are not exposed in `RequestQueue` class, but are available in raw client
rq_client = Actor.apify_client.request_queue(request_queue_id=rq.id)
_rq = await rq_client.get()
assert _rq
stats_before = _rq.get('stats', {})
Actor.log.info(stats_before)

# Add same requests in 10 parallel workers
async def add_requests_worker() -> None:
await rq.add_requests(requests)

add_requests_workers = [asyncio.create_task(add_requests_worker()) for _ in range(10)]
await asyncio.gather(*add_requests_workers)
Contributor:

I guess you made sure that these do in fact run in parallel? To the naked eye, 100 requests doesn't seem like much; I'd expect that the event loop may run the tasks in sequence.

Maybe you could add the requests in each worker in smaller batches and add some random delays? Or just add a comment saying that you verified parallel execution empirically 😁

Contributor Author:

I wrote the test against an implementation that did not take parallel execution into account, and it was failing consistently. So from that perspective, I consider the test sufficient.

Anyway, I added some chunking to make the test slightly more challenging. The parallel execution can be verified in the logs, for example below: the add_batch_of_requests call that was started first did not finish first, as it was "taken over" during its await by another worker.

DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 10
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 20
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 90
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 80
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 0
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 40
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 50
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 60
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 30
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 70
INFO  {'readCount': 0, 'writeCount': 100, 'deleteCount': 0, 'headItemReadCount': 0, 'storageBytes': 7400}
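A hedged sketch of the chunked worker mentioned above, as a fragment of the test's main(); the exact batch size used in the merged test is an assumption:

# Each worker adds the shared `requests` list in small batches so the workers can interleave.
async def add_requests_worker() -> None:
    batch_size = 10  # assumed chunk size
    for i in range(0, len(requests), batch_size):
        await rq.add_requests(requests[i : i + batch_size])

# Ten workers race to add the same 100 requests; deduplication should keep writeCount at 100.
add_requests_workers = [asyncio.create_task(add_requests_worker()) for _ in range(10)]
await asyncio.gather(*add_requests_workers)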


await asyncio.sleep(10) # Wait to be sure that metadata are updated
_rq = await rq_client.get()
assert _rq
stats_after = _rq.get('stats', {})
Actor.log.info(stats_after)

assert (stats_after['writeCount'] - stats_before['writeCount']) == len(requests)

actor = await make_actor(label='rq-parallel-deduplication', main_func=main)
run_result = await run_actor(actor)

assert run_result.status == 'SUCCEEDED'


async def test_request_queue_deduplication_unprocessed_requests(
apify_named_rq: RequestQueue,
) -> None:
"""Test that the deduplication does not add unprocessed requests to the cache."""
logging.getLogger('apify.storage_clients._apify._request_queue_client').setLevel(logging.DEBUG)

await asyncio.sleep(10) # Wait to be sure that metadata are updated

# Get raw client, because stats are not exposed in `RequestQueue` class, but are available in raw client
rq_client = Actor.apify_client.request_queue(request_queue_id=apify_named_rq.id)
_rq = await rq_client.get()
assert _rq
stats_before = _rq.get('stats', {})
Actor.log.info(stats_before)

def return_unprocessed_requests(requests: list[dict], *_: Any, **__: Any) -> dict[str, list[dict]]:
"""Simulate API returning unprocessed requests."""
return {
'processedRequests': [],
'unprocessedRequests': [
{'url': request['url'], 'uniqueKey': request['uniqueKey'], 'method': request['method']}
for request in requests
],
}

with mock.patch(
'apify_client.clients.resource_clients.request_queue.RequestQueueClientAsync.batch_add_requests',
side_effect=return_unprocessed_requests,
):
# Simulate failed API call for adding requests. Request was not processed and should not be cached.
await apify_named_rq.add_requests(['http://example.com/1'])

# This will succeed.
await apify_named_rq.add_requests(['http://example.com/1'])
Comment on lines +303 to +311
Contributor:

Any chance we could verify that the request was actually not cached between the two add_requests calls?

Contributor Author:

This is checked implicitly in the last line, where it is asserted that there was exactly a writeCount difference of 1. The first call is "hardcoded" to fail, even on all retries, so it never even sends the API request and thus has no chance of increasing the writeCount.

The second call can make the write only if the request is not cached, as cached requests do not make the call (tested in other tests). So this means the request was not cached in between.

I could assert the state of the cache in between those calls, but since it is kind of an implementation detail, I would prefer not to.

Contributor:

Fair enough, can you explain this in a comment then?

Contributor Author:

Yes, added to the test description.
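One hedged guess at the explanatory note added to the test docstring (the exact wording in the merged test is not shown here):

"""Test that the deduplication does not add unprocessed requests to the cache.

The first add_requests call is mocked to fail on all retries, so it never reaches the API and
cannot increase writeCount. The second call produces a write only if the request was not cached
by the failed attempt, so asserting a writeCount delta of exactly 1 verifies that unprocessed
requests are not cached in between.
"""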


await asyncio.sleep(10) # Wait to be sure that metadata are updated
_rq = await rq_client.get()
assert _rq
stats_after = _rq.get('stats', {})
Actor.log.info(stats_after)

assert (stats_after['writeCount'] - stats_before['writeCount']) == 1
4 changes: 2 additions & 2 deletions uv.lock

Some generated files are not rendered by default.