
Conversation

@ikreymer ikreymer commented Sep 30, 2025

Fixes #2867

Base is set to the feature branch.

The backend implementation involves:

  • A new CollIndex CRD type
  • An operator that manages the new CRD type, creating a new Redis instance when the index should exist
  • The operator starts the crawler in 'indexer' mode (will be available from browsertrix-crawler#884, "Deduplication (Initial Support)")
  • Collections have a new hasDedupIndex field
  • Workflows have a new dedupCollId field for dedup while crawling; the dedupCollId must also be a collection that the crawl is auto-added to
  • A new waiting state, waiting_for_dedup_index, is entered if a crawl is starting but the index is not yet ready

Indexing depends on a crawler version (1.9.0 beta 0 or higher) that supports indexing mode; the sketch below illustrates the new fields.
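To make the new fields concrete, here is a minimal sketch of how they might hang off the existing models. This is illustrative only: apart from hasDedupIndex, dedupCollId, autoAddCollections, and the waiting_for_dedup_index state named in this PR, the class and field names are assumptions, not the actual Browsertrix code.

```python
# Minimal sketch, not the actual models -- only hasDedupIndex, dedupCollId,
# autoAddCollections, and the waiting state are named in this PR.
from typing import Optional
from uuid import UUID

from pydantic import BaseModel

# New waiting state, entered when a crawl starts before its index is ready
WAITING_FOR_DEDUP_INDEX = "waiting_for_dedup_index"


class Collection(BaseModel):
    """Collection with the new dedup index flag."""

    id: UUID
    name: str
    # True while a CollIndex (and its Redis instance) exists for this collection
    hasDedupIndex: bool = False


class CrawlConfig(BaseModel):
    """Workflow with the new dedup collection reference."""

    id: UUID
    # Collection whose index is consulted for dedup while crawling; it must
    # also appear in autoAddCollections so finished crawls are added to it
    dedupCollId: Optional[UUID] = None
    autoAddCollections: list[UUID] = []
```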

Testing

This is ready for initial frontend work and testing:

  • dedupCollId can be set on the workflow to enable dedup for future crawls.
  • The collection has a hasDedupIndex field to indicate whether an index is enabled for it.
  • kubectl get pods -n crawlers collindex should work.
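For anyone starting on the frontend, a rough smoke test against the API might look like the following. The endpoint paths and auth header are assumptions based on the usual Browsertrix API layout and have not been verified against this branch.

```python
# Rough smoke test -- endpoint paths and auth header are assumptions.
import requests

API = "https://app.example.com/api"
HEADERS = {"Authorization": "Bearer <token>"}
ORG = "<org-uuid>"
WORKFLOW = "<workflow-uuid>"
COLL = "<collection-uuid>"

# Enable dedup for future crawls of this workflow
resp = requests.patch(
    f"{API}/orgs/{ORG}/crawlconfigs/{WORKFLOW}",
    json={"dedupCollId": COLL},
    headers=HEADERS,
)
resp.raise_for_status()

# The collection should now report hasDedupIndex
coll = requests.get(f"{API}/orgs/{ORG}/collections/{COLL}", headers=HEADERS).json()
print(coll.get("hasDedupIndex"))
```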

@ikreymer ikreymer requested a review from tw4l September 30, 2025 03:18
@tw4l tw4l left a comment

A promising start!

I left comments/suggestions throughout where I noticed things. We should also add tests for setting dedupCollId on collection and crawlconfig add and update; I don't think it's quite right as-is (at least for collections).

@SuaYoo SuaYoo left a comment

The collection PATCH endpoint will need to accept an empty string for dedupCollId so that it can be unset. Currently, the API returns an error: "Input should be a valid UUID, invalid length: expected length 32 for simple format, found 0".
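That error is Pydantic rejecting "" as a UUID before the handler runs. One common fix is a before-mode validator on the update model that normalizes the empty string to None; a sketch assuming Pydantic v2 and a hypothetical update model:

```python
# Sketch assuming Pydantic v2; UpdateColl is a hypothetical update model,
# not the actual PR code.
from typing import Optional
from uuid import UUID

from pydantic import BaseModel, field_validator


class UpdateColl(BaseModel):
    dedupCollId: Optional[UUID] = None

    @field_validator("dedupCollId", mode="before")
    @classmethod
    def empty_str_to_none(cls, v):
        # Treat "" from the frontend as an explicit unset
        return None if v == "" else v
```

Because model_fields_set still records that the field was supplied, the PATCH handler can distinguish an explicit unset ("" normalized to None) from a field that was simply omitted.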

SuaYoo commented Oct 13, 2025

Also, updating dedupCollId should update autoAddCollections on the backend.
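The sync rule described here could look something like the sketch below; the helper and its placement are hypothetical, and only the field names come from this PR.

```python
# Hypothetical helper for keeping autoAddCollections in step with dedupCollId.
from typing import Optional
from uuid import UUID


def sync_auto_add_collections(
    auto_add: list[UUID],
    old_dedup: Optional[UUID],
    new_dedup: Optional[UUID],
) -> list[UUID]:
    """Return autoAddCollections updated for a dedupCollId change."""
    updated = list(auto_add)
    # Assumes the old dedup collection was only auto-added for dedup purposes
    if old_dedup and old_dedup != new_dedup and old_dedup in updated:
        updated.remove(old_dedup)
    # The dedup collection must always be auto-added
    if new_dedup and new_dedup not in updated:
        updated.append(new_dedup)
    return updated
```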

- add CollIndex crd
- add new operator
- added as btrix-crds 0.2.0

operator collindex:
- redis_dedupe_memory and redis_dedupe_storage configurable
- dedupe_importer_channel can configure crawler channel for index imports

operator crawl:
- add 'waiting_for_dedupe_index' state to indicate crawl is awaiting dedup index
- on crawl success, get list of required crawls deduped against and set 'requiredCrawls' and 'requiredByCrawls'

collection:
- supports creating, updating, deleting CollIndex via CrawlManager
- tracks via hasDedupeIndex
- mark CollIndex object for updates by updating collItemsUpdatedAt when items are added or removed
- CollIndex deleted when its collection is deleted

workflow:
- has dedupeCollId if dedupe is enabled
- dedupeCollId included in autoAdd collections
- updating dedupeCollId to an empty string removes the dedupe collection from the workflow
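For reference, "creating CollIndex via CrawlManager" presumably boils down to a custom-object call along these lines. The group/version, plural, and spec fields here are guesses at what btrix-crds 0.2.0 defines, not the actual schema.

```python
# Guesswork sketch of creating a CollIndex custom object; the group/version,
# plural, and spec fields are assumptions about btrix-crds 0.2.0.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

coll_index = {
    "apiVersion": "btrix.cloud/v1",  # assumed group/version
    "kind": "CollIndex",
    "metadata": {"name": "collindex-<coll-id>", "namespace": "crawlers"},
    "spec": {
        "collId": "<coll-uuid>",
        # bumped whenever collection items change so the operator re-imports
        "collItemsUpdatedAt": "2025-10-13T00:00:00Z",
    },
}

api.create_namespaced_custom_object(
    group="btrix.cloud",
    version="v1",
    namespace="crawlers",
    plural="collindexes",
    body=coll_index,
)
```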
@ikreymer ikreymer changed the base branch from feature-dedup to main October 25, 2025 22:31
@ikreymer ikreymer changed the base branch from main to feature-dedup October 25, 2025 22:31