
Conversation

@ikreymer ikreymer commented Sep 30, 2025

Fixes #2867

Base is set to the feature branch.

The backend implementation involves:

  • A new CollIndex CRD type
  • An operator that manages the new CRD type, creating a new Redis instance when the index should exist
  • The operator starts the crawler in 'indexer' mode (will be available from browsertrix-crawler#884, "Deduplication (Initial Support)")
  • Collections have a new hasDedupIndex field
  • Workflows have a new dedupCollId field for dedup while crawling; the dedupCollId must also be a collection that the crawl is auto-added to
  • A new waiting state, waiting_for_dedup_index, is entered if a crawl is starting but the index is not yet ready

Indexing depends on a crawler version (1.9.0 beta 0 or higher) that supports indexing mode; the sketch below illustrates the new fields.
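To make the new fields concrete, here is a minimal sketch of how they might hang off the existing models. This is illustrative only: apart from hasDedupIndex, dedupCollId, autoAddCollections, and the waiting_for_dedup_index state named in this PR, the class and field names are assumptions, not the actual Browsertrix code.

```python
# Minimal sketch, not the actual models -- only hasDedupIndex, dedupCollId,
# autoAddCollections, and the waiting state are named in this PR.
from typing import Optional
from uuid import UUID

from pydantic import BaseModel

# New waiting state, entered when a crawl starts before its index is ready
WAITING_FOR_DEDUP_INDEX = "waiting_for_dedup_index"


class Collection(BaseModel):
    """Collection with the new dedup index flag."""

    id: UUID
    name: str
    # True while a CollIndex (and its Redis instance) exists for this collection
    hasDedupIndex: bool = False


class CrawlConfig(BaseModel):
    """Workflow with the new dedup collection reference."""

    id: UUID
    # Collection whose index is consulted for dedup while crawling; it must
    # also appear in autoAddCollections so finished crawls are added to it
    dedupCollId: Optional[UUID] = None
    autoAddCollections: list[UUID] = []
```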

Testing

This is ready for initial frontend work and testing:

  • dedupCollId can be set on the workflow to enable dedup for future crawls.
  • The collection has a hasDedupIndex field to indicate whether an index is enabled for it.
  • kubectl get pods -n crawlers collindex should work.
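For anyone starting on the frontend, a rough smoke test against the API might look like the following. The endpoint paths and auth header are assumptions based on the usual Browsertrix API layout and have not been verified against this branch.

```python
# Rough smoke test -- endpoint paths and auth header are assumptions.
import requests

API = "https://app.example.com/api"
HEADERS = {"Authorization": "Bearer <token>"}
ORG = "<org-uuid>"
WORKFLOW = "<workflow-uuid>"
COLL = "<collection-uuid>"

# Enable dedup for future crawls of this workflow
resp = requests.patch(
    f"{API}/orgs/{ORG}/crawlconfigs/{WORKFLOW}",
    json={"dedupCollId": COLL},
    headers=HEADERS,
)
resp.raise_for_status()

# The collection should now report hasDedupIndex
coll = requests.get(f"{API}/orgs/{ORG}/collections/{COLL}", headers=HEADERS).json()
print(coll.get("hasDedupIndex"))
```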

@ikreymer ikreymer requested a review from tw4l September 30, 2025 03:18
@tw4l tw4l left a comment

A promising start!

I left comments/suggestions throughout where I noticed things. We should also add tests for setting dedupCollId on collection and crawlconfig add and update; I don't think it's quite right as-is (at least for collections).

@SuaYoo SuaYoo left a comment

The collection PATCH endpoint will need to accept an empty string for dedupCollId so that it can be unset. Currently, the API returns an error: "Input should be a valid UUID, invalid length: expected length 32 for simple format, found 0".
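That error is Pydantic rejecting "" as a UUID before the handler runs. One common fix is a before-mode validator on the update model that normalizes the empty string to None; a sketch assuming Pydantic v2 and a hypothetical update model:

```python
# Sketch assuming Pydantic v2; UpdateColl is a hypothetical update model,
# not the actual PR code.
from typing import Optional
from uuid import UUID

from pydantic import BaseModel, field_validator


class UpdateColl(BaseModel):
    dedupCollId: Optional[UUID] = None

    @field_validator("dedupCollId", mode="before")
    @classmethod
    def empty_str_to_none(cls, v):
        # Treat "" from the frontend as an explicit unset
        return None if v == "" else v
```

Because model_fields_set still records that the field was supplied, the PATCH handler can distinguish an explicit unset ("" normalized to None) from a field that was simply omitted.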

SuaYoo commented Oct 13, 2025

Also, updating dedupCollId should update autoAddCollections on the backend.
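The sync rule described here could look something like the sketch below; the helper and its placement are hypothetical, and only the field names come from this PR.

```python
# Hypothetical helper for keeping autoAddCollections in step with dedupCollId.
from typing import Optional
from uuid import UUID


def sync_auto_add_collections(
    auto_add: list[UUID],
    old_dedup: Optional[UUID],
    new_dedup: Optional[UUID],
) -> list[UUID]:
    """Return autoAddCollections updated for a dedupCollId change."""
    updated = list(auto_add)
    # Assumes the old dedup collection was only auto-added for dedup purposes
    if old_dedup and old_dedup != new_dedup and old_dedup in updated:
        updated.remove(old_dedup)
    # The dedup collection must always be auto-added
    if new_dedup and new_dedup not in updated:
        updated.append(new_dedup)
    return updated
```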

- add CollIndex crd
- add new operator
- added as btrix-crds 0.2.0

operator collindex:
- redis_dedupe_memory and redis_dedupe_storage configurable
- dedupe_importer_channel can configure crawler channel for index imports

operator crawl:
- add 'waiting_for_dedupe_index' state to indicate crawl is awaiting dedup index
- on crawl success, get list of required crawls deduped against and set 'requiredCrawls' and 'requiredByCrawls'

collection:
- supports creating, updating, deleting CollIndex via CrawlManager
- tracks via hasDedupeIndex
- mark CollIndex object for updates by updating collItemsUpdatedAt when items are added or removed
- CollIndex deleted when its collection is deleted

workflow:
- has dedupeCollId if dedupe is enabled
- dedupeCollId included in autoAdd collections
- updating dedupeCollId to an empty string removes the dedupe collection from the workflow
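For reference, "creating CollIndex via CrawlManager" presumably boils down to a custom-object call along these lines. The group/version, plural, and spec fields here are guesses at what btrix-crds 0.2.0 defines, not the actual schema.

```python
# Guesswork sketch of creating a CollIndex custom object; the group/version,
# plural, and spec fields are assumptions about btrix-crds 0.2.0.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

coll_index = {
    "apiVersion": "btrix.cloud/v1",  # assumed group/version
    "kind": "CollIndex",
    "metadata": {"name": "collindex-<coll-id>", "namespace": "crawlers"},
    "spec": {
        "collId": "<coll-uuid>",
        # bumped whenever collection items change so the operator re-imports
        "collItemsUpdatedAt": "2025-10-13T00:00:00Z",
    },
}

api.create_namespaced_custom_object(
    group="btrix.cloud",
    version="v1",
    namespace="crawlers",
    plural="collindexes",
    body=coll_index,
)
```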
@ikreymer ikreymer changed the base branch from feature-dedup to main October 25, 2025 22:31
@ikreymer ikreymer changed the base branch from main to feature-dedup October 25, 2025 22:31