Simplify the encoding and deduplication process#268

Open
datamik wants to merge 3 commits into develop from improve-deduplication-process

Conversation

datamik (Contributor) commented Feb 26, 2026

Simplify Deduplication Architecture
Replaces Celery chord-based parallel task orchestration with a single-task approach. The find_duplicates task now executes encoding and deduplication sequentially, leveraging vectorized matrix operations for efficient duplicate detection across large datasets.
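A minimal sketch of what "vectorized matrix operations" for duplicate detection can look like, assuming encodings are fixed-length float vectors compared by cosine similarity. The function name `find_duplicate_pairs`, the threshold value, and the choice of cosine similarity are illustrative assumptions, not taken from this PR's diff.

```python
from __future__ import annotations

import numpy as np


def find_duplicate_pairs(
    encodings: np.ndarray, threshold: float = 0.95
) -> list[tuple[int, int, float]]:
    """Return (i, j, similarity) for each pair i < j at or above the threshold.

    Hypothetical helper: the real task's metric and threshold may differ.
    """
    # Scale every row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(encodings, axis=1, keepdims=True)
    unit = encodings / np.clip(norms, 1e-12, None)

    # A single matrix product compares every encoding with every other one.
    similarity = unit @ unit.T

    # The strict upper triangle reports each pair once and skips self-matches.
    rows, cols = np.nonzero(np.triu(similarity >= threshold, k=1))
    return [(int(i), int(j), float(similarity[i, j])) for i, j in zip(rows, cols)]


# Three toy encodings: the first two point in nearly the same direction.
sample = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
pairs = find_duplicate_pairs(sample, threshold=0.95)
```

Because the full similarity matrix grows quadratically with the number of encodings, a production version would likely process it in row blocks; that detail is omitted here.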

No breaking changes to external API.

We need to set CELERY_BROKER_VISIBILITY_VAR=36000
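How the project consumes `CELERY_BROKER_VISIBILITY_VAR` is not shown in this conversation. The sketch below assumes it feeds Celery's `visibility_timeout` broker transport option, which is expressed in seconds (36000 s is 10 hours). With transports such as Redis, a task that is not acknowledged within that window can be redelivered to another worker, which matters more once encoding and deduplication run as one long task. The helper name and the fallback default are hypothetical.

```python
from __future__ import annotations

import os


def broker_transport_options(default_seconds: int = 3600) -> dict[str, int]:
    """Build a broker_transport_options dict from the environment.

    Assumption: CELERY_BROKER_VISIBILITY_VAR holds a timeout in seconds.
    """
    raw = os.environ.get("CELERY_BROKER_VISIBILITY_VAR", str(default_seconds))
    return {"visibility_timeout": int(raw)}


# In a Celery app this dict would typically be assigned to
# app.conf.broker_transport_options; Celery itself is not imported here.
options = broker_transport_options()
```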

Remove chords from encoding and deduplication
codecov bot commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.71%. Comparing base (e566a43) to head (5a0bc09).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #268      +/-   ##
===========================================
+ Coverage    96.52%   96.71%   +0.18%     
===========================================
  Files           85       85              
  Lines         2127     2067      -60     
  Branches       138      139       +1     
===========================================
- Hits          2053     1999      -54     
+ Misses          54       49       -5     
+ Partials        20       19       -1     
Flag         Coverage Δ
unittests    96.71% <100.00%> (+0.18%) ⬆️

datamik changed the title from "Simplify the whole deduplication flow" to "Simplify the encoding and deduplication process" Feb 26, 2026
domdinicola requested a review from saxix February 26, 2026 10:04
domdinicola requested a review from arsen-vs February 26, 2026 11:01
datamik requested a review from srugano February 26, 2026 11:19
datamik force-pushed the improve-deduplication-process branch from 36711ce to 9068e78 February 26, 2026 16:42

4 participants