feat: Add deduplication to add_batch_of_requests
#534
Merged
Commits (60)
5c437c9 Rm old Apify storage clients (vdusek)
bf55338 Add init version of new Apify storage clients (vdusek)
6b2f82b Move specific models from Crawlee to SDK (vdusek)
38bef68 Adapt to Crawlee v1 (vdusek)
1f85430 Adapt to Crawlee v1 (p2) (vdusek)
a3d68a2 Fix default storage IDs (vdusek)
c77e8d5 Fix integration test and Not implemented exception in purge (vdusek)
8731aff Fix unit tests (vdusek)
8dfaffb fix lint (vdusek)
53fad07 add KVS record_exists not implemented (vdusek)
5869f8e update to apify client 1.12 and implement record exists (vdusek)
82e65fc Move default storage IDs to Configuration (vdusek)
8de950b opening storages get default id from config (vdusek)
98b76c5 Addressing more feedback (vdusek)
7b5ee07 Fixing integration test test_push_large_data_chunks_over_9mb (vdusek)
afcb8c7 Abstract open method is removed from storage clients (vdusek)
3bacab7 fixing generate public url for KVS records (vdusek)
287a119 add async metadata getters (vdusek)
e45d65b Merge branch 'master' into new-apify-storage-clients (vdusek)
51178ca better usage of apify config (vdusek)
3cd7dfe renaming (vdusek)
6fe9eb3 Merge branch 'master' into new-apify-storage-clients (vdusek)
1547cbd fixes after merge commit (vdusek)
bb47efc Merge branch 'master' into new-apify-storage-clients (vdusek)
4e4fa93 Change from orphan commit to master in crawlee version (Pijukatel)
683cb31 Merge branch 'master' into new-apify-storage-clients (vdusek)
e5b2bc4 fix encrypted secrets test (vdusek)
638756f Add Apify's version of FS client that keeps the INPUT json (vdusek)
931b0ca update metadata fixes (vdusek)
ad7c0d8 Merge branch 'master' into new-apify-storage-clients (vdusek)
1f3c481 KVS metadata extended model (vdusek)
44d8e09 fix url signing secret key (vdusek)
ca72313 Apify storage client fixes and new docs groups (vdusek)
bc61fee Add test for `RequestQueue.is_finished` (Pijukatel)
16b76dd Check `_queue_has_locked_requests` in `is_empty` (Pijukatel)
b6e8a5f Merge branch 'master' into new-apify-storage-clients (vdusek)
a3f8c6e Package structure update (vdusek)
594a8e5 Fix request list (HttpResponse.read is now async) (vdusek)
e1afe2d init upgrading guide to v3 (vdusek)
8ce6902 addres RQ feedback from Pepa (vdusek)
42810f0 minor RQ client update (vdusek)
9edac0f Merge branch 'master' into new-apify-storage-clients (vdusek)
ec2a9f0 Fix 2 tests in RQ Apify storage client (vdusek)
f82d110 Merge branch 'master' into new-apify-storage-clients (vdusek)
71ac38d Update request queue to use manual request tracking (vdusek)
a8881dd httpx vs impit (vdusek)
f5189c5 Merge branch 'master' into new-apify-storage-clients (vdusek)
89e572e rm broken crawlers integration tests (vdusek)
ae3044e Try to patch the integration tests for the crawlee branch (Pijukatel)
4bc5c91 Add deduplication and test (Pijukatel)
70908b3 Add logging for debug (Pijukatel)
91ff3fd Format and type check (Pijukatel)
03dcb15 Keep only relevant log (Pijukatel)
65b297a Update to handle parallel requests with same links (Pijukatel)
079f890 Merge remote-tracking branch 'origin/master' into add-deduplication (Pijukatel)
2c3d0ce Handle unprocessed requests in deduplication cache correctly (Pijukatel)
329baed Adress review comments (Pijukatel)
978d49e Add deduplication test for `use_extended_unique_key` requests (Pijukatel)
1b92532 Do early response validation (Pijukatel)
cfdb1e2 Merge remote-tracking branch 'origin/master' into add-deduplication (Pijukatel)
Judging by apify/crawlee#3120, a day may come when we try to limit the size of `_requests_cache` somehow. Perhaps we should think ahead and come up with a more space-efficient way of tracking already added requests?

EDIT: hollup a minute, do you use the ID here for deduplication instead of unique key?
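
To make the size concern concrete, here is a minimal sketch of one way a tracker of already added requests could be bounded. It is not the PR's `_requests_cache`; the class name `BoundedSeenKeys` and its `max_size` parameter are hypothetical, and the sketch stores only keys rather than whole requests. Note the trade-off: once a key is evicted, it is no longer recognised locally as a duplicate.

```python
from collections import OrderedDict


class BoundedSeenKeys:
    """Hypothetical LRU-bounded tracker of already added request keys."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        # Keys only, no request payloads; insertion order doubles as recency order.
        self._keys = OrderedDict()

    def add(self, key):
        """Return True if the key was not tracked yet, False if it is a known duplicate."""
        if key in self._keys:
            # Refresh recency so frequently repeated keys are evicted last.
            self._keys.move_to_end(key)
            return False
        self._keys[key] = None
        if len(self._keys) > self._max_size:
            # Drop the least recently seen key to stay within the bound.
            self._keys.popitem(last=False)
        return True

    def __len__(self):
        return len(self._keys)
```

Whether eviction is acceptable depends on what happens to a duplicate that slips past the local check, which this sketch does not model.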
Since there is this deterministic transformation function, `unique_key_to_request_id`, which respects the Apify platform's way of creating IDs, this seems OK. If someone starts creating Requests with a custom ID, then deduplication will most likely stop working.

There are two issues I created based on the discussion about this:
- `ApifyRequestQueueClient` #550
- `request.id` crawlee-python#1358
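
The reasoning above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from this PR: `derive_request_id` is a hypothetical stand-in for `unique_key_to_request_id` (the hashing scheme and the 15-character length are assumed here), and `split_new_and_seen` is a made-up helper name. The point is only that when the ID is a pure function of the unique key, deduplicating by ID is equivalent to deduplicating by unique key.

```python
import base64
import hashlib
import re


def derive_request_id(unique_key, length=15):
    """Hypothetical stand-in: derive an ID deterministically from the unique key."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("utf-8")
    # Strip characters that are awkward in URLs, then truncate (assumed scheme).
    return re.sub(r"[+/=]", "", encoded)[:length]


def split_new_and_seen(unique_keys, seen_ids):
    """Split keys into those not yet added and those already tracked.

    `seen_ids` is a set of derived IDs and is updated in place.
    """
    new_keys, duplicate_keys = [], []
    for key in unique_keys:
        request_id = derive_request_id(key)
        if request_id in seen_ids:
            duplicate_keys.append(key)
        else:
            seen_ids.add(request_id)
            new_keys.append(key)
    return new_keys, duplicate_keys
```

A request carrying a custom ID that was not derived from its unique key would never match an entry in `seen_ids`, which is the failure mode the comment warns about.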