Dedup Backend Initial Implementation #2868
base: feature-dedup
A promising start!
I left comments/suggestions throughout where I noticed things. We should also add tests for setting dedupCollId on collection and crawlconfig add and update; I don't think it's quite right as-is (at least for collections).
fdbcaa1 to d8c3a05
The collection PATCH endpoint will need to accept an empty string for dedupCollId so that it can be unset. Currently, the API returns an error with the detail "Input should be a valid UUID, invalid length: expected length 32 for simple format, found 0".
Also, updating |
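A minimal sketch of one way the empty-string case could be handled, not the code in this PR. The helper name resolve_dedup_coll_id and the plain-dict update payload are assumptions for illustration; the actual backend validates the request through its own models. The idea is to treat a missing key as "leave unchanged", an empty string as "unset", and anything else as a UUID to parse.

```python
from typing import Any, Dict, Optional
from uuid import UUID


def resolve_dedup_coll_id(
    current: Optional[UUID], update: Dict[str, Any]
) -> Optional[UUID]:
    """Hypothetical helper: compute the new dedupCollId for a PATCH body.

    - key absent      -> keep the current value
    - empty string    -> unset (None)
    - any other value -> parse as a UUID (raises ValueError if invalid)
    """
    if "dedupCollId" not in update:
        return current

    raw = update["dedupCollId"]
    if raw == "":
        # Empty string is the explicit "remove dedup collection" signal.
        return None

    return UUID(str(raw))
```

With this shape, an empty string never reaches UUID parsing, so the "invalid length" error quoted above would not be raised for an unset request.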
- add CollIndex crd
- add new operator
- added as btrix-crds 0.2.0

operator collindex:
- redis_dedupe_memory and redis_dedupe_storage configurable
- dedupe_importer_channel can configure crawler channel for index imports

operator crawl:
- add 'waiting_for_dedupe_index' state to indicate crawl is awaiting dedup index
- on crawl success, get list of required crawls deduped against and set 'requiredCrawls' and 'requiredByCrawls'

collection:
- supports creating, updating, deleting CollIndex via CrawlManager
- tracks via hasDedupeIndex
- mark CollIndex object for updates by updating collItemsUpdatedAt on items add or remove
- CollIndex deleted on collection deleted

workflow:
- has dedupeCollId if dedupe is enabled
- dedupeCollId included in autoAdd collections
- update empty string dedupeCollId to remove dedupe collection from workflow
1bccfad to d6e7f3e
Co-authored-by: Tessa Walsh <[email protected]>
Fixes #2867
Set to feature branch as base.
The backend implementation involves:
- hasDedupIndex field
- dedupCollId must also be a collection that the crawl is auto-added to.
- A waiting_for_dedup_index state that is entered if a crawl is starting, but the index is not yet ready.

Indexing depends on a crawler version (1.9.0 beta 0 or higher) that supports indexing mode.
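The two workflow rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names, the plain list of collection IDs, the index_ready flag, and the "starting" state name are all hypothetical. Only waiting_for_dedup_index and the auto-add requirement come from the description.

```python
from typing import List, Optional
from uuid import UUID


def with_dedup_collection(
    auto_add_collections: List[UUID], dedup_coll_id: Optional[UUID]
) -> List[UUID]:
    """Hypothetical helper: make sure the dedup collection is also an
    auto-add collection, appending it if it is missing."""
    if dedup_coll_id is None or dedup_coll_id in auto_add_collections:
        return list(auto_add_collections)
    return [*auto_add_collections, dedup_coll_id]


def initial_crawl_state(dedup_coll_id: Optional[UUID], index_ready: bool) -> str:
    """Hypothetical helper: pick the state a starting crawl enters.

    "starting" is a placeholder name for the normal path.
    """
    if dedup_coll_id is not None and not index_ready:
        # Dedup is enabled but the index is not ready yet, so the crawl waits.
        return "waiting_for_dedup_index"
    return "starting"
```

Whether the backend appends the collection automatically or rejects the request is a design choice; the sketch shows the append variant.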
Testing
This is ready for initial frontend work and testing:
- dedupCollId can be set on the workflow to enable dedup for future crawls.
- hasDedupIndex to indicate if an index is enabled for it.
- kubectl get pods -n crawlers collindex should work