
Duplicate content normalization#602

Merged
kentcdodds merged 8 commits into main from cursor/duplicate-content-normalization-a9d6
Feb 19, 2026

Conversation

@kentcdodds
Owner

@kentcdodds kentcdodds commented Feb 19, 2026

Normalize semantic search results to prevent multiple entries for the same content by de-duplicating chunk-level matches into single canonical documents.

The semantic search often returned multiple entries for the same content because the indexer stores many "chunk" vectors per document. This change collapses these chunk-level hits into a single canonical document result, keyed by type:slug (falling back to type:url), while overfetching raw matches to ensure topK unique documents are still returned.
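The collapse described above can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual implementation; `ChunkMatch` and `dedupeByCanonicalKey` are hypothetical names:

```typescript
// Sketch (assumed shapes): collapse chunk-level matches into one entry per
// document, keyed by `type:slug` with a `type:url` fallback.
type ChunkMatch = {
	type: string
	slug?: string
	url?: string
	score: number
	snippet: string
}

function dedupeByCanonicalKey(
	matches: Array<ChunkMatch>,
	topK: number,
): Array<ChunkMatch> {
	const byKey = new Map<string, ChunkMatch>()
	for (const match of matches) {
		const key = match.slug
			? `${match.type}:${match.slug}`
			: `${match.type}:${match.url ?? ''}`
		const existing = byKey.get(key)
		// keep the best-scoring chunk as the document's representative
		if (!existing || match.score > existing.score) byKey.set(key, match)
	}
	return [...byKey.values()]
		.sort((a, b) => b.score - a.score)
		.slice(0, topK)
}
```

Because the raw matches are overfetched, slicing to `topK` after deduplication still yields the requested number of unique documents in the common case.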




Note

Medium Risk
Changes affect semantic search ranking/identity semantics by rewriting result IDs and deduping/merging matches, which could alter ordering and downstream consumers’ expectations. Test coverage reduces risk, and the transcription change is a low-impact typing/compatibility fix.

Overview
Semantic search results are now normalized and de-duplicated so multiple Vectorize chunk hits collapse into a single doc-level entry, using a canonical id (preferring type:slug, then falling back to parsed vector IDs, type:url, URL-only, or type:title). The query path now clamps topK, overfetches raw matches, merges duplicates by choosing the best score and snippet, and returns the requested number of unique docs sorted by score.
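The fallback chain can be sketched roughly like this; `canonicalId`, `ResultMetadata`, and the chunk-suffix regex are illustrative assumptions, not the exact code in `semantic-search.server.ts`:

```typescript
// Sketch of the canonical-ID fallback order described above (assumed names).
type ResultMetadata = {
	type?: string
	slug?: string
	url?: string
	title?: string
}

function canonicalId(vectorId: string, meta: ResultMetadata): string {
	const { type, slug, url, title } = meta
	// 1. preferred: type + slug
	if (type && slug) return `${type}:${slug}`
	// 2. parse document identity back out of a chunked vector ID
	//    (assumes IDs shaped like "type:slug:chunk-N")
	const parsed = /^([^:]+):(.+?)(?::chunk-\d+)?$/.exec(vectorId)
	if (parsed) return `${parsed[1]}:${parsed[2]}`
	// 3. type + url, 4. URL only, 5. type + title
	if (type && url) return `${type}:${url}`
	if (url) return url
	if (type && title) return `${type}:${title}`
	// last resort: the raw vector ID itself
	return vectorId
}
```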

Tests were expanded to run against the MSW Cloudflare mock and verify chunk-dedupe behavior (including snippet selection and non-collapsing of same-URL different-slug docs). Separately, Workers AI transcription now sends an ArrayBuffer-backed Uint8Array body to satisfy stricter TS fetch typings without unnecessary copies.
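The transcription typing fix can be illustrated with a small sketch; `toArrayBufferBacked` is a hypothetical helper name, and the check assumes the stricter fetch typings reject `SharedArrayBuffer`-backed views as request bodies:

```typescript
// Sketch: reuse the existing buffer when it is already an ArrayBuffer,
// copying only when the view is backed by something else.
function toArrayBufferBacked(bytes: Uint8Array): Uint8Array {
	if (bytes.buffer instanceof ArrayBuffer) {
		// already ArrayBuffer-backed: reuse without copying
		return bytes
	}
	// e.g. SharedArrayBuffer-backed: copy into a fresh ArrayBuffer
	const copy = new Uint8Array(bytes.byteLength)
	copy.set(bytes)
	return copy
}
```

A hedged usage sketch: `fetch(url, { method: 'POST', body: toArrayBufferBacked(mp3Bytes) })`, where `mp3Bytes` stands in for the transcription payload.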

Written by Cursor Bugbot for commit 3a2a210. This will update automatically on new commits. Configure here.

Summary by CodeRabbit

  • Bug Fixes

    • Semantic search now deduplicates document-level results, improves ranking and snippet selection, and handles incomplete metadata more robustly.
    • Audio transcription uploads use an ArrayBuffer-backed request body for more reliable, efficient uploads.
  • Tests

    • New tests validate semantic search normalization, deduplication, top-K behavior, ranking, and snippet selection.

@cursor

cursor bot commented Feb 19, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@coderabbitai

coderabbitai bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds canonical ID computation and document-level deduplication to semantic search (overfetch then collapse chunk-level matches into unique documents), adjusts top-K handling and result shaping, adds a unit test for normalization, and sends an ArrayBuffer-backed Uint8Array as fetch body for Cloudflare transcription.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Semantic search implementation**<br>`app/utils/semantic-search.server.ts` | Adds helpers (`asNonEmptyString`, `normalizeUrlForKey`, `normalizeTitleForKey`, `normalizeSlugForKey`) and `getCanonicalResultId`; implements overfetch (`rawTopK`) and deduplication/merge by `canonicalId` (choosing the best score/snippet and preserving credits); sorts by score then rank, and returns deduplicated results capped to `safeTopK`. |
| **Tests**<br>`app/utils/__tests__/semantic-search.server.test.ts` | Adds a semantic search result normalization suite, imports `vi` from vitest, mocks fetch/vectorize responses to simulate chunk-level duplicates, verifies document-level collapse with preserved distinct credits/snippets/ids, and restores env/fetch in `finally`. |
| **Cloudflare transcription**<br>`app/utils/cloudflare-ai-transcription.server.ts` | Creates `mp3Body` as an ArrayBuffer-backed `Uint8Array` (avoiding an extra copy when possible) and uses it as the fetch request body; adds explanatory comments; no public API changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Server as SemanticSearchKCD
    participant VectorizeAPI as Vectorize Service
    Client->>Server: request semantic search (query, topK)
    Server->>VectorizeAPI: fetch vectorize results (rawTopK / overfetch)
    VectorizeAPI-->>Server: chunk-level matches (vectorId, metadata, score, snippet)
    Server->>Server: compute canonicalId, merge by canonicalId, pick best snippet/score, sort & slice to safeTopK
    Server-->>Client: deduplicated, sorted SemanticSearchResult[]
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Cloudflare API mocks #603 — touches the same Cloudflare transcription mp3 request-body/typing path (app/utils/cloudflare-ai-transcription.server.ts), likely related.

Poem

🐰 I hopped through chunks and stitched them tight,

Canonical keys to guide the flight.
Duplicates folded, snippets kept near,
Scores sorted true — the results appear,
A carrot-coded cheer for search made right. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title 'Duplicate content normalization' directly and clearly summarizes the main objective of the changeset: deduplicating semantic search results to prevent multiple entries for the same content. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@kentcdodds kentcdodds marked this pull request as ready for review February 19, 2026 01:00

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
app/utils/semantic-search.server.ts (1)

21-32: Minor inconsistency: query strings are stripped for absolute URLs but not for relative paths.

For absolute URLs, new URL(url).pathname discards query strings and fragments. For relative paths (Line 31), only trailing slashes are stripped. If a relative path like /blog/foo?page=1 ever appears in metadata, two chunks of the same doc could fail to canonicalize. Likely not an issue in practice given the data, but worth a note.

Optional: strip query/fragment from relative paths too

```diff
-	return url && url !== '/' ? url.replace(/\/+$/, '') : url
+	const cleaned = url.split(/[?#]/)[0] ?? url
+	return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/semantic-search.server.ts` around lines 21 - 32, The
normalizeUrlForKey function inconsistently strips query strings/fragments for
absolute URLs but not for relative paths; update normalizeUrlForKey to remove
query (?...) and fragment (#...) from relative paths as well (in the fallback
branch that currently only trims trailing slashes) so that inputs like
"/blog/foo?page=1" canonicalize to "/blog/foo"; locate normalizeUrlForKey and
implement removing anything after the first ? or # (or use URL parsing with a
base) before trimming trailing slashes and handling the root "/" case.
app/utils/__tests__/semantic-search.server.test.ts (1)

176-182: The if (originalFetch) guard can leak the mock when fetch is not natively available.

If the test runs in an environment where globalThis.fetch is undefined before the test (e.g., older Node without native fetch), originalFetch would be falsy, and the if guard prevents restoring/unsetting the stubbed mock — leaking it to subsequent tests.

Consider using vi.unstubAllGlobals() which unconditionally cleans up all stubs registered by vi.stubGlobal, or drop the if guard.

Simpler cleanup using Vitest's built-in unstub

```diff
 		} finally {
 			for (const [key, value] of Object.entries(originalEnv)) {
 				if (typeof value === 'string') process.env[key] = value
 				else delete process.env[key]
 			}
-			if (originalFetch) vi.stubGlobal('fetch', originalFetch)
+			vi.unstubAllGlobals()
 		}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/__tests__/semantic-search.server.test.ts` around lines 176 - 182,
The finally block currently checks if (originalFetch) before restoring the fetch
stub, which leaks a mocked fetch when globalThis.fetch was initially undefined;
in the finally block replace that conditional restore of vi.stubGlobal('fetch',
originalFetch) with an unconditional cleanup using vi.unstubAllGlobals() (or
call vi.unstubAllGlobals() alongside removing the if-guard) so any
vi.stubGlobal('fetch', ...) created in the test is always removed; update/remove
the originalFetch handling if no longer needed.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A Cloud Agent has been kicked off to fix the reported issue.

@cursor

cursor bot commented Feb 19, 2026

Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.

  • ✅ Fixed: Overfetch cap exceeds Vectorize topK API limit
    • Capped the overfetch topK at 20 to match the Vectorize metadata query limit, preventing invalid requests.
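The cap described in the fix note can be sketched like this; `VECTORIZE_METADATA_TOPK_LIMIT` and `computeRawTopK` are assumed names for illustration, and the 20-item limit is the metadata query cap referenced above:

```typescript
// Sketch: Vectorize limits topK when metadata is returned with matches,
// so the raw (pre-dedup) fetch size must respect that cap.
const VECTORIZE_METADATA_TOPK_LIMIT = 20

function computeRawTopK(requestedTopK: number): number {
	// clamp the caller's topK to a sane positive value first
	const safeTopK = Math.max(
		1,
		Math.min(requestedTopK, VECTORIZE_METADATA_TOPK_LIMIT),
	)
	// overfetch 5x to leave headroom for chunk-level duplicates, but never
	// exceed the metadata query limit
	return Math.min(VECTORIZE_METADATA_TOPK_LIMIT, safeTopK * 5)
}
```

Note the trade-off the review thread debates: with this shape, any `safeTopK` of 5 or more gets at most 20 raw matches, which limits the dedupe headroom.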


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
app/utils/semantic-search.server.ts (1)

34-54: title-only canonical ID fallback is not case-normalized.

The title branch (line 52) is the last resort when type, slug, and url are all absent from metadata. asNonEmptyString trims whitespace but does not normalize case, so two chunks of the same document with diverging title casing ("My Post" vs "my post") would produce distinct canonical IDs and not be de-duplicated. Additionally, two genuinely different documents sharing the same type + title (e.g., multiple drafts or stub posts without slugs/URLs) would be incorrectly collapsed.

Low practical risk given the fallback depth, but cheap to harden:

🛡️ Proposed fix

```diff
-  if (type && title) return `${type}:${title}`
+  if (type && title) return `${type}:${title.toLowerCase()}`
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/semantic-search.server.ts` around lines 34 - 54, In
getCanonicalResultId, normalize the title for case-insensitive deduplication:
when returning `${type}:${title}` or the title-only fallback, first run the
title through a small normalizer (trim and toLowerCase, e.g.,
normalizeTitle(title)) so `"My Post"` and `"my post"` collapse to the same
canonical ID; update both the `if (type && title)` branch and the `if
(url)`/final title-only return to use the normalized title.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
app/utils/__tests__/semantic-search.server.test.ts (1)

168-169: Optional: strengthen blogResult assertions to improve failure diagnostics.

Using optional chaining in blogResult?.snippet causes a misleading failure message when blogResult is undefined — the test reports the wrong assertion (value mismatch) rather than the actual root cause (result not found). Asserting the score is also missing.

🔍 Proposed assertion hardening

```diff
-      const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls')
-      expect(blogResult?.snippet).toBe('snippet-0')
+      const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls')
+      expect(blogResult).toBeDefined()
+      // Best-chunk score (0.99) and its snippet are preserved after dedup.
+      expect(blogResult!.score).toBe(0.99)
+      expect(blogResult!.snippet).toBe('snippet-0')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/__tests__/semantic-search.server.test.ts` around lines 168 - 169,
The test currently uses optional chaining on blogResult (const blogResult =
results.find((r) => r.id === 'blog:react-hooks-pitfalls')) which masks a
missing-result failure; update the assertions to first assert the result exists
(e.g., expect(blogResult).toBeDefined() or not.toBeUndefined()), then assert
blogResult.snippet equals 'snippet-0' and add an explicit assertion for the
expected score (e.g., expect(blogResult.score).toBe(…)); reference the
blogResult variable and the results.find(...) expression to locate where to add
these checks.
app/utils/semantic-search.server.ts (1)

24-26: Nit: u.pathname && is always truthy — the guard is redundant.

URL.pathname is always at least '/' for any valid URL, so the leading u.pathname && never evaluates to false. The meaningful condition is u.pathname !== '/' alone.

✏️ Proposed simplification

```diff
-      return u.pathname && u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
+      return u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/semantic-search.server.ts` around lines 24 - 26, The conditional
checking u.pathname with "u.pathname && u.pathname !== '/'" is redundant because
URL.pathname is always at least '/' — simplify the guard inside the block that
handles /^https?:\/\//i.test(url) to only check u.pathname !== '/' and then
return u.pathname.replace(/\/+$/, '') when true, otherwise return u.pathname;
update the branch using variables url and u and keep the trailing-slash trimming
logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 219-224: The overfetch is being clamped too early by rawTopK =
Math.min(20, safeTopK * 5) so the default safeTopK (15) cannot overfetch; change
rawTopK to compute the full overfetch (rawTopK = safeTopK * 5) and move any
20-item cap into the actual Vectorize call or the code path that requests
metadata (i.e., clamp the parameter passed to Vectorize to Math.min(20, rawTopK)
only when metadata is requested) so safeTopK/rawTopK can overfetch correctly
while preserving Vectorize's metadata cap.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
app/utils/semantic-search.server.ts (2)

57-77: Slug is not lowercased unlike title — consider normalizing for parity.

normalizeTitleForKey lowercases to avoid casing-only duplicates, but slug passes through verbatim. If two chunks of the same document are indexed with different slug casing (e.g., "My-Post" vs "my-post"), they'll produce different canonical IDs and fail to collapse. Worth lowercasing slug for consistency.

♻️ Proposed normalization

```diff
-  if (type && slug) return `${type}:${slug}`
+  if (type && slug) return `${type}:${slug.toLowerCase()}`
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/semantic-search.server.ts` around lines 57 - 77,
getCanonicalResultId currently uses slug verbatim which can cause casing-only
duplicates; update it to normalize the slug (lowercase and any other
normalization used for titles) before building the canonical id. Use the same
normalization strategy as normalizeTitleForKey (or add a normalizeSlugForKey
helper and call it) so cases like "My-Post" vs "my-post" collapse; ensure
branches that return `${type}:${slug}` instead use the normalized slug and keep
existing uses of normalizeUrlForKey and normalizeTitleForKey unchanged.

29-41: Two dead-code branches and an empty-string edge case worth cleaning up.

  1. Line 34 — u.pathname && is always truthy. URL.pathname always returns a non-empty string (at minimum "/"), so the truthy guard is dead code. The effective condition is just u.pathname !== '/'.

  2. Line 39 — ?? url is unreachable. String.prototype.split() always returns an array with at least one element, so [0] is never undefined.

  3. Line 39–40 — Query-only relative URL produces an empty canonical key. If url is "?foo=bar", split(/[?#]/)[0] is "", cleaned is "", and the function returns "". Since getCanonicalResultId checks truthiness of the raw url parameter (not the normalized result), a non-empty raw URL like "?foo" still passes the if (type && url) guard, producing a canonical ID of "type:" or bare "". Highly unlikely in practice for a content index, but easy to guard against.

♻️ Proposed clean-up

```diff
 function normalizeUrlForKey(url: string): string {
   try {
     if (/^https?:\/\//i.test(url)) {
       const u = new URL(url)
-      return u.pathname && u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
+      return u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
     }
   } catch {
     // ignore
   }
-  const cleaned = (url.split(/[?#]/)[0] ?? url).trim()
+  const cleaned = (url.split(/[?#]/)[0] ?? '').trim()
-  return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned
+  return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned || url
 }
```

And in getCanonicalResultId, guard the normalized result:

```diff
-  if (type && url) return `${type}:${normalizeUrlForKey(url)}`
-  if (url) return normalizeUrlForKey(url)
+  const normUrl = normalizeUrlForKey(url)
+  if (type && url && normUrl) return `${type}:${normUrl}`
+  if (url && normUrl) return normUrl
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/utils/semantic-search.server.ts` around lines 29 - 41, normalizeUrlForKey
contains dead guards and can return an empty string for query-only inputs;
simplify the URL branch to check only u.pathname !== '/' and remove the
unreachable "?? url" fallback, and ensure the final normalized value never
returns empty by returning '/' (or another non-empty canonical path) when
cleaned === '' so query-only inputs produce a stable key; also update
getCanonicalResultId to validate the normalized result (from normalizeUrlForKey)
and fall back to a safe default if it's empty before composing the canonical ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 236-241: The overfetch calculation for rawTopK is too constrained
by Math.min(20, safeTopK * 5) so at safeTopK=15 you only request 20 hits and
lose the intended dedupe headroom; change rawTopK to scale with safeTopK (e.g.,
rawTopK = Math.max(20, safeTopK * 5) or use a configurable MAX_RAW_K) so you
actually overfetch for larger topK, and if Vectorize enforces a 20-item cap when
metadata is requested, avoid that cap by requesting embeddings/results without
metadata and then fetching metadata separately for the unique doc ids (adjust
the Vectorize call and any metadataRequested handling around rawTopK/safeTopK to
implement this).

cursoragent and others added 6 commits February 19, 2026 02:20
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
@cursor cursor bot force-pushed the cursor/duplicate-content-normalization-a9d6 branch from 0215747 to 6d2d281 Compare February 19, 2026 02:22

@coderabbitai coderabbitai bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 247-254: The current rawTopK calculation (const rawTopK =
Math.min(20, safeTopK * 5)) collapses to 20 for any safeTopK ≥ 5, giving
insufficient overfetch; change rawTopK so it preserves a 5× overfetch for larger
safeTopK by increasing/removing the hard cap—e.g., replace Math.min(20, safeTopK
* 5) with a larger cap like Math.min(100, safeTopK * 5) or remove the cap
entirely so rawTopK = safeTopK * 5 (keeping safeTopK clamping as-is), and ensure
callers that pass rawTopK into Vectorize still respect any service metadata
limits.

cursoragent and others added 2 commits February 19, 2026 02:27
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
Co-authored-by: Kent C. Dodds <me+github@kentcdodds.com>
@kentcdodds kentcdodds merged commit 951b068 into main Feb 19, 2026
7 checks passed
@kentcdodds kentcdodds deleted the cursor/duplicate-content-normalization-a9d6 branch February 19, 2026 03:39