Users post identical deals in separate threads, causing multiple Discord alerts. We need a robust deduplication mechanism that merges identical deals securely. The user raised excellent points: URL tokenization should be included, link popularities can change over time requiring re-ordering, and duplicate deals might not appear in the same scraping batch.
Instead of treating DealInfo as a 1:1 mapping to a single RedFlagDeals thread, we elevate DealInfo to represent the Abstract Deal Idea, which can contain multiple Threads.
- Add a new struct:

```go
type ThreadContext struct {
	FirestoreID  string `firestore:"firestoreID"`
	PostURL      string `firestore:"postURL"`
	LikeCount    int    `firestore:"likeCount"`
	CommentCount int    `firestore:"commentCount"`
	ViewCount    int    `firestore:"viewCount"`
}
```
- Modify DealInfo:
  - Replace `PostURL`, `LikeCount`, `CommentCount`, `ViewCount` with `Threads []ThreadContext`.
  - The getters for those stats become aggregate functions:
    - Primary PostURL = `Threads[0].PostURL`
    - Aggregate Likes = `Round(Sum(Threads.Likes) / len(Threads))`
  - Add `SearchTokens []string` to store tokenized words/numbers.
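The aggregate getters described above can be sketched as methods on the new `DealInfo` shape. This is a minimal illustration, assuming `Threads` has already been sorted most-popular-first; field names follow the struct proposed earlier.

```go
package main

import (
	"fmt"
	"math"
)

// ThreadContext mirrors the struct proposed above (tags omitted for brevity).
type ThreadContext struct {
	FirestoreID  string
	PostURL      string
	LikeCount    int
	CommentCount int
	ViewCount    int
}

// DealInfo is the Abstract Deal Idea: one logical deal, many threads.
type DealInfo struct {
	Threads      []ThreadContext
	SearchTokens []string
}

// PrimaryPostURL returns the URL of Threads[0], the most popular thread
// once the slice is kept sorted by LikeCount descending.
func (d *DealInfo) PrimaryPostURL() string {
	if len(d.Threads) == 0 {
		return ""
	}
	return d.Threads[0].PostURL
}

// AggregateLikes implements Round(Sum(Threads.Likes) / len(Threads)).
func (d *DealInfo) AggregateLikes() int {
	if len(d.Threads) == 0 {
		return 0
	}
	sum := 0
	for _, t := range d.Threads {
		sum += t.LikeCount
	}
	return int(math.Round(float64(sum) / float64(len(d.Threads))))
}

func main() {
	d := DealInfo{Threads: []ThreadContext{
		{PostURL: "https://forums.redflagdeals.com/a", LikeCount: 10},
		{PostURL: "https://forums.redflagdeals.com/b", LikeCount: 3},
	}}
	// Prints the first thread's URL and the rounded average (13/2 -> 7).
	fmt.Println(d.PrimaryPostURL(), d.AggregateLikes())
}
```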
- Tokenization:
  - `CleanTitle` tokens: lowercase, remove punctuation, extract numbers/units/brands.
  - `ActualDealURL` tokens: extract the slug, split by hyphens/slashes, and append high-value English words/brands to the token list.
- Cross-Batch Deduplication Flow:
  - Scrape: We scrape a batch of deals.
  - Convert: Convert them into DealInfo objects where `Threads` has exactly one entry.
  - Fetch Existing: Instead of a direct Firestore ID lookup, query Firestore for recent deals (e.g., the last 48 hours).
  - Fuzzy Match Engine: For each scraped deal:
    - Check if it matches an existing Firestore deal (via exact `ActualDealURL` or a high fuzzy match score on `SearchTokens`).
    - Check if it matches another deal in the current scrape batch.
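One way to implement the fuzzy match score is Jaccard similarity over the two `SearchTokens` sets. The 0.6 threshold below is an illustrative assumption, not a value fixed by this design:

```go
package main

import "fmt"

// jaccard computes |A ∩ B| / |A ∪ B| over two token lists: an
// order-insensitive similarity in [0, 1].
func jaccard(a, b []string) float64 {
	setA := map[string]bool{}
	for _, t := range a {
		setA[t] = true
	}
	setB := map[string]bool{}
	inter := 0
	for _, t := range b {
		if setB[t] {
			continue
		}
		setB[t] = true
		if setA[t] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// isDuplicate applies the two checks from the flow above: exact
// ActualDealURL match first, then the fuzzy token-overlap score.
func isDuplicate(urlA, urlB string, tokensA, tokensB []string) bool {
	if urlA != "" && urlA == urlB {
		return true
	}
	return jaccard(tokensA, tokensB) >= 0.6 // tunable threshold
}

func main() {
	a := []string{"lenovo", "thinkpad", "x1", "499"}
	b := []string{"thinkpad", "x1", "carbon", "499", "lenovo"}
	fmt.Println(isDuplicate("", "", a, b)) // high overlap -> duplicate
}
```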
- Merge: If Deal A (scraped) matches Deal B (existing or also scraped):
  - We don't save Deal A as a new document.
  - We push Deal A's thread stats into Deal B's `Threads` array (or update the entry if that specific thread is already in the array).
  - We map Deal A's FirestoreID pointer to Deal B's FirestoreID so any updates go to the parent entity.
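The append-or-update step can be sketched as follows, keyed on `PostURL` to decide whether a thread is already known. A minimal sketch; the real version would also carry the FirestoreID remapping:

```go
package main

import "fmt"

type ThreadContext struct {
	PostURL   string
	LikeCount int
}

type DealInfo struct {
	Threads []ThreadContext
}

// mergeInto pushes the threads of a freshly scraped deal (src) into an
// existing deal (dst). A thread already present (same PostURL) has its
// stats refreshed in place; a new thread is appended.
func mergeInto(dst *DealInfo, src DealInfo) {
	for _, st := range src.Threads {
		updated := false
		for i := range dst.Threads {
			if dst.Threads[i].PostURL == st.PostURL {
				dst.Threads[i] = st // refresh stats for a known thread
				updated = true
				break
			}
		}
		if !updated {
			dst.Threads = append(dst.Threads, st)
		}
	}
}

func main() {
	b := DealInfo{Threads: []ThreadContext{{PostURL: "u1", LikeCount: 5}}}
	a := DealInfo{Threads: []ThreadContext{{PostURL: "u2", LikeCount: 9}}}
	mergeInto(&b, a)
	fmt.Println(len(b.Threads)) // 2
}
```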
- Re-ordering & Aggregation: Before saving or sending to Discord, we sort `Threads` by `LikeCount` descending. This ensures the most popular thread is always `Threads[0]`.
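The re-ordering step is a one-liner with the standard library; a stable sort keeps existing relative order when like counts tie:

```go
package main

import (
	"fmt"
	"sort"
)

type ThreadContext struct {
	PostURL   string
	LikeCount int
}

// sortThreads orders threads by LikeCount descending so the most
// popular one is always index 0 (the primary thread).
func sortThreads(threads []ThreadContext) {
	sort.SliceStable(threads, func(i, j int) bool {
		return threads[i].LikeCount > threads[j].LikeCount
	})
}

func main() {
	ts := []ThreadContext{
		{PostURL: "old-thread", LikeCount: 3},
		{PostURL: "new-thread", LikeCount: 12},
	}
	sortThreads(ts)
	fmt.Println(ts[0].PostURL) // new-thread
}
```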
- Add `GetRecentDeals(ctx, duration)` to fetch deals from the last 24-48 hours. This is required because we can't look up duplicates by ID alone anymore; we must bring recent deals into memory to run the fuzzy matcher against them.
- Update `UpdateDeal` to save the `Threads` array instead of individual stats.
- Update embed logic to use `deal.Threads`.
- Display logic:
  - Loop through `deal.Threads`.
  - For `Threads[0]`, the hyperlinked text is `[RFD]`.
  - For `Threads[1:]`, append `[RFD]` right next to it.
  - Because `Threads` is sorted by `LikeCount` in the processor, the links will automatically re-order themselves in the embed if a newer thread suddenly becomes more popular.
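The link-rendering loop above might look like this, emitting one `[RFD](url)` markdown link per thread, primary first (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

type ThreadContext struct {
	PostURL   string
	LikeCount int
}

// buildThreadLinks renders one "[RFD]" link per thread, space-separated.
// Threads is assumed to already be sorted by LikeCount descending, so
// the primary thread's link comes first.
func buildThreadLinks(threads []ThreadContext) string {
	links := make([]string, 0, len(threads))
	for _, t := range threads {
		links = append(links, fmt.Sprintf("[RFD](%s)", t.PostURL))
	}
	return strings.Join(links, " ")
}

func main() {
	ts := []ThreadContext{{PostURL: "https://a"}, {PostURL: "https://b"}}
	fmt.Println(buildThreadLinks(ts)) // [RFD](https://a) [RFD](https://b)
}
```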
- Add unit tests for the tokenization (including URL slug tokenization).
- Add tests to ensure that a newly scraped deal successfully merges into an "older" existing deal context, updates the `Threads` array, recalculates stats, and re-sorts the array.
- Run `go test ./internal/...`, focusing on `ThreadContext` merging and sorting.
- Seed Firestore with Deal A.
- Scrape Deal B (a duplicate).
- Verify no new Discord message is sent, but Deal A's Discord message is updated to contain both Deal A's and Deal B's links, sorted by whichever has more likes.