Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds canonical ID computation and document-level deduplication to semantic search (overfetch, then collapse chunk-level matches into unique documents), adjusts top-K handling and result shaping, adds a unit test for normalization, and sends an ArrayBuffer-backed Uint8Array as the fetch body for Cloudflare transcription.
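For orientation, here is a minimal TypeScript sketch of the overfetch-and-collapse flow the walkthrough describes; the type shapes and the canonical-ID derivation are illustrative, not the PR's exact code:

```ts
type RawMatch = {
	id: string
	score: number
	snippet?: string
	metadata?: { type?: string; slug?: string; url?: string }
}
type SearchResult = { id: string; score: number; snippet?: string }

// Illustrative canonical-ID derivation: prefer type:slug, then type:url, else the raw chunk ID.
function toCanonicalId(match: RawMatch): string {
	const { type, slug, url } = match.metadata ?? {}
	if (type && slug) return `${type}:${slug}`
	if (type && url) return `${type}:${url}`
	return match.id
}

// Collapse chunk-level matches into unique documents, keeping the best-scoring
// chunk's score and snippet per canonical ID, then return the top `topK` docs.
function collapseToDocuments(matches: Array<RawMatch>, topK: number): Array<SearchResult> {
	const byDoc = new Map<string, SearchResult>()
	for (const match of matches) {
		const id = toCanonicalId(match)
		const existing = byDoc.get(id)
		if (!existing || match.score > existing.score) {
			byDoc.set(id, { id, score: match.score, snippet: match.snippet })
		}
	}
	return [...byDoc.values()].sort((a, b) => b.score - a.score).slice(0, topK)
}
```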
Sequence Diagram(s): (mermaid diagram not captured here)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (2)
app/utils/semantic-search.server.ts (1)
21-32: Minor inconsistency: query strings are stripped for absolute URLs but not for relative paths.

For absolute URLs, `new URL(url).pathname` discards query strings and fragments. For relative paths (Line 31), only trailing slashes are stripped. If a relative path like `/blog/foo?page=1` ever appears in metadata, two chunks of the same doc could fail to canonicalize. Likely not an issue in practice given the data, but worth a note.

Optional: strip query/fragment from relative paths too

```diff
- return url && url !== '/' ? url.replace(/\/+$/, '') : url
+ const cleaned = url.split(/[?#]/)[0] ?? url
+ return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/semantic-search.server.ts` around lines 21 - 32, The normalizeUrlForKey function inconsistently strips query strings/fragments for absolute URLs but not for relative paths; update normalizeUrlForKey to remove query (?...) and fragment (#...) from relative paths as well (in the fallback branch that currently only trims trailing slashes) so that inputs like "/blog/foo?page=1" canonicalize to "/blog/foo"; locate normalizeUrlForKey and implement removing anything after the first ? or # (or use URL parsing with a base) before trimming trailing slashes and handling the root "/" case.
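For illustration, a hedged sketch of the whole helper with the suggested query/fragment stripping applied to the relative-path branch; the function shape mirrors what the comments in this thread describe, not necessarily the file's exact code:

```ts
// Normalize a URL to a stable dedup key: keep only the pathname, strip query,
// fragment, and trailing slashes (except for the bare root "/").
function normalizeUrlForKey(url: string): string {
	try {
		if (/^https?:\/\//i.test(url)) {
			const u = new URL(url)
			return u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
		}
	} catch {
		// not an absolute URL; fall through to the relative-path handling below
	}
	const cleaned = url.split(/[?#]/)[0] ?? url
	return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned
}
```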
app/utils/__tests__/semantic-search.server.test.ts (1)

176-182: The `if (originalFetch)` guard can leak the mock when `fetch` is not natively available.

If the test runs in an environment where `globalThis.fetch` is `undefined` before the test (e.g., older Node without native fetch), `originalFetch` would be falsy, and the `if` guard prevents restoring/unsetting the stubbed mock — leaking it to subsequent tests.

Consider using `vi.unstubAllGlobals()`, which unconditionally cleans up all stubs registered by `vi.stubGlobal`, or drop the `if` guard.

Simpler cleanup using Vitest's built-in unstub

```diff
 } finally {
 	for (const [key, value] of Object.entries(originalEnv)) {
 		if (typeof value === 'string') process.env[key] = value
 		else delete process.env[key]
 	}
-	if (originalFetch) vi.stubGlobal('fetch', originalFetch)
+	vi.unstubAllGlobals()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/__tests__/semantic-search.server.test.ts` around lines 176 - 182, The finally block currently checks if (originalFetch) before restoring the fetch stub, which leaks a mocked fetch when globalThis.fetch was initially undefined; in the finally block replace that conditional restore of vi.stubGlobal('fetch', originalFetch) with an unconditional cleanup using vi.unstubAllGlobals() (or call vi.unstubAllGlobals() alongside removing the if-guard) so any vi.stubGlobal('fetch', ...) created in the test is always removed; update/remove the originalFetch handling if no longer needed.
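As a concrete illustration of that cleanup pattern, a small Vitest sketch (the env-restore loop and the mocked response shape are assumptions, not the test file's exact contents):

```ts
import { vi } from 'vitest'

// Stub fetch for the duration of the test, then always clean up, even when
// globalThis.fetch was undefined before the stub was installed.
const fetchMock = vi.fn(async () => new Response(JSON.stringify({ result: [] })))
vi.stubGlobal('fetch', fetchMock)
try {
	// ... exercise the code under test that calls fetch ...
} finally {
	vi.unstubAllGlobals() // removes every vi.stubGlobal stub unconditionally
}
```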
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@app/utils/__tests__/semantic-search.server.test.ts`:
- Around line 176-182: The finally block currently checks if (originalFetch)
before restoring the fetch stub, which leaks a mocked fetch when
globalThis.fetch was initially undefined; in the finally block replace that
conditional restore of vi.stubGlobal('fetch', originalFetch) with an
unconditional cleanup using vi.unstubAllGlobals() (or call vi.unstubAllGlobals()
alongside removing the if-guard) so any vi.stubGlobal('fetch', ...) created in
the test is always removed; update/remove the originalFetch handling if no
longer needed.
In `@app/utils/semantic-search.server.ts`:
- Around line 21-32: The normalizeUrlForKey function inconsistently strips query
strings/fragments for absolute URLs but not for relative paths; update
normalizeUrlForKey to remove query (?...) and fragment (#...) from relative
paths as well (in the fallback branch that currently only trims trailing
slashes) so that inputs like "/blog/foo?page=1" canonicalize to "/blog/foo";
locate normalizeUrlForKey and implement removing anything after the first ? or #
(or use URL parsing with a base) before trimming trailing slashes and handling
the root "/" case.
Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.
Actionable comments posted: 1
🧹 Nitpick comments (1)
app/utils/semantic-search.server.ts (1)
34-54: `title`-only canonical ID fallback is not case-normalized.

The `title` branch (line 52) is the last resort when `type`, `slug`, and `url` are all absent from metadata. `asNonEmptyString` trims whitespace but does not normalize case, so two chunks of the same document with diverging title casing (`"My Post"` vs `"my post"`) would produce distinct canonical IDs and not be de-duplicated. Additionally, two genuinely different documents sharing the same `type` + `title` (e.g., multiple drafts or stub posts without slugs/URLs) would be incorrectly collapsed.

Low practical risk given the fallback depth, but cheap to harden:
🛡️ Proposed fix
```diff
- if (type && title) return `${type}:${title}`
+ if (type && title) return `${type}:${title.toLowerCase()}`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/semantic-search.server.ts` around lines 34 - 54, In getCanonicalResultId, normalize the title for case-insensitive deduplication: when returning `${type}:${title}` or the title-only fallback, first run the title through a small normalizer (trim and toLowerCase, e.g., normalizeTitle(title)) so `"My Post"` and `"my post"` collapse to the same canonical ID; update both the `if (type && title)` branch and the `if (url)`/final title-only return to use the normalized title.
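A tiny sketch of the normalization that prompt describes, using a hypothetical normalizeTitleForKey helper (the PR may implement this differently):

```ts
// Trim and lowercase so casing-only title variants ("My Post" vs "my post")
// collapse to the same canonical ID.
function normalizeTitleForKey(title: string): string {
	return title.trim().toLowerCase()
}

// Applied in the title fallback branches of getCanonicalResultId:
//   if (type && title) return `${type}:${normalizeTitleForKey(title)}`
//   if (title) return normalizeTitleForKey(title)
```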
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 34-54: In getCanonicalResultId, normalize the title for
case-insensitive deduplication: when returning `${type}:${title}` or the
title-only fallback, first run the title through a small normalizer (trim and
toLowerCase, e.g., normalizeTitle(title)) so `"My Post"` and `"my post"`
collapse to the same canonical ID; update both the `if (type && title)` branch
and the `if (url)`/final title-only return to use the normalized title.
🧹 Nitpick comments (2)
app/utils/__tests__/semantic-search.server.test.ts (1)
168-169: Optional: strengthen `blogResult` assertions to improve failure diagnostics.

Using optional chaining in `blogResult?.snippet` causes a misleading failure message when `blogResult` is `undefined` — the test reports the wrong assertion (value mismatch) rather than the actual root cause (result not found). Asserting the score is also missing.

🔍 Proposed assertion hardening
```diff
- const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls')
- expect(blogResult?.snippet).toBe('snippet-0')
+ const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls')
+ expect(blogResult).toBeDefined()
+ // Best-chunk score (0.99) and its snippet are preserved after dedup.
+ expect(blogResult!.score).toBe(0.99)
+ expect(blogResult!.snippet).toBe('snippet-0')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/__tests__/semantic-search.server.test.ts` around lines 168 - 169, The test currently uses optional chaining on blogResult (const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls')) which masks a missing-result failure; update the assertions to first assert the result exists (e.g., expect(blogResult).toBeDefined() or not.toBeUndefined()), then assert blogResult.snippet equals 'snippet-0' and add an explicit assertion for the expected score (e.g., expect(blogResult.score).toBe(…)); reference the blogResult variable and the results.find(...) expression to locate where to add these checks.

app/utils/semantic-search.server.ts (1)
24-26: Nit: `u.pathname &&` is always truthy — the guard is redundant.

`URL.pathname` is always at least `'/'` for any valid URL, so the leading `u.pathname &&` never evaluates to `false`. The meaningful condition is `u.pathname !== '/'` alone.

✏️ Proposed simplification
```diff
- return u.pathname && u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
+ return u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/semantic-search.server.ts` around lines 24 - 26, The conditional checking u.pathname with "u.pathname && u.pathname !== '/'" is redundant because URL.pathname is always at least '/' — simplify the guard inside the block that handles /^https?:\/\//i.test(url) to only check u.pathname !== '/' and then return u.pathname.replace(/\/+$/, '') when true, otherwise return u.pathname; update the branch using variables url and u and keep the trailing-slash trimming logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 219-224: The overfetch is being clamped too early by rawTopK =
Math.min(20, safeTopK * 5) so the default safeTopK (15) cannot overfetch; change
rawTopK to compute the full overfetch (rawTopK = safeTopK * 5) and move any
20-item cap into the actual Vectorize call or the code path that requests
metadata (i.e., clamp the parameter passed to Vectorize to Math.min(20, rawTopK)
only when metadata is requested) so safeTopK/rawTopK can overfetch correctly
while preserving Vectorize's metadata cap.
---
Nitpick comments:
In `@app/utils/__tests__/semantic-search.server.test.ts`:
- Around line 168-169: The test currently uses optional chaining on blogResult
(const blogResult = results.find((r) => r.id === 'blog:react-hooks-pitfalls'))
which masks a missing-result failure; update the assertions to first assert the
result exists (e.g., expect(blogResult).toBeDefined() or not.toBeUndefined()),
then assert blogResult.snippet equals 'snippet-0' and add an explicit assertion
for the expected score (e.g., expect(blogResult.score).toBe(…)); reference the
blogResult variable and the results.find(...) expression to locate where to add
these checks.
In `@app/utils/semantic-search.server.ts`:
- Around line 24-26: The conditional checking u.pathname with "u.pathname &&
u.pathname !== '/'" is redundant because URL.pathname is always at least '/' —
simplify the guard inside the block that handles /^https?:\/\//i.test(url) to
only check u.pathname !== '/' and then return u.pathname.replace(/\/+$/, '')
when true, otherwise return u.pathname; update the branch using variables url
and u and keep the trailing-slash trimming logic intact.
🧹 Nitpick comments (2)
app/utils/semantic-search.server.ts (2)
57-77: Slug is not lowercased unlike title — consider normalizing for parity.
`normalizeTitleForKey` lowercases to avoid casing-only duplicates, but slug passes through verbatim. If two chunks of the same document are indexed with different slug casing (e.g., `"My-Post"` vs `"my-post"`), they'll produce different canonical IDs and fail to collapse. Worth lowercasing slug for consistency.

♻️ Proposed normalization
```diff
- if (type && slug) return `${type}:${slug}`
+ if (type && slug) return `${type}:${slug.toLowerCase()}`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/semantic-search.server.ts` around lines 57 - 77, getCanonicalResultId currently uses slug verbatim which can cause casing-only duplicates; update it to normalize the slug (lowercase and any other normalization used for titles) before building the canonical id. Use the same normalization strategy as normalizeTitleForKey (or add a normalizeSlugForKey helper and call it) so cases like "My-Post" vs "my-post" collapse; ensure branches that return `${type}:${slug}` instead use the normalized slug and keep existing uses of normalizeUrlForKey and normalizeTitleForKey unchanged.
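Pulling the slug, URL, and title normalization points together, a hedged sketch of a fully normalized canonical-ID chain; the fallback order follows the PR overview (though the PR also parses vector IDs, which this sketch omits), and the helper bodies are minimal stand-ins:

```ts
type ChunkMetadata = { type?: string; slug?: string; url?: string; title?: string }

// Minimal stand-ins for the helpers discussed above.
const normalizeTitleForKey = (title: string) => title.trim().toLowerCase()
const normalizeUrlForKey = (url: string) =>
	(url.split(/[?#]/)[0] ?? '').replace(/\/+$/, '') || '/'

function getCanonicalResultId(chunkId: string, metadata: ChunkMetadata): string {
	const { type, slug, url, title } = metadata
	if (type && slug) return `${type}:${slug.toLowerCase()}` // casing-only variants collapse
	if (type && url) return `${type}:${normalizeUrlForKey(url)}`
	if (url) return normalizeUrlForKey(url)
	if (type && title) return `${type}:${normalizeTitleForKey(title)}`
	if (title) return normalizeTitleForKey(title)
	return chunkId // nothing usable in metadata: fall back to the raw chunk/vector ID
}
```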
29-41: Two dead-code branches and an empty-string edge case worth cleaning up.
Line 34 — `u.pathname &&` is always truthy. `URL.pathname` always returns a non-empty string (at minimum `"/"`), so the truthy guard is dead code. The effective condition is just `u.pathname !== '/'`.

Line 39 — `?? url` is unreachable. `String.prototype.split()` always returns an array with at least one element, so `[0]` is never `undefined`.

Line 39–40 — Query-only relative URL produces an empty canonical key. If `url` is `"?foo=bar"`, `split(/[?#]/)[0]` is `""`, `cleaned` is `""`, and the function returns `""`. Since `getCanonicalResultId` checks truthiness of the raw `url` parameter (not the normalized result), a non-empty raw URL like `"?foo"` still passes the `if (type && url)` guard, producing a canonical ID of `"type:"` or bare `""`. Highly unlikely in practice for a content index, but easy to guard against.

♻️ Proposed clean-up
```diff
 function normalizeUrlForKey(url: string): string {
 	try {
 		if (/^https?:\/\//i.test(url)) {
 			const u = new URL(url)
-			return u.pathname && u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
+			return u.pathname !== '/' ? u.pathname.replace(/\/+$/, '') : u.pathname
 		}
 	} catch {
 		// ignore
 	}
-	const cleaned = (url.split(/[?#]/)[0] ?? url).trim()
+	const cleaned = (url.split(/[?#]/)[0] ?? '').trim()
-	return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned
+	return cleaned && cleaned !== '/' ? cleaned.replace(/\/+$/, '') : cleaned || url
 }
```
getCanonicalResultId, guard the normalized result:- if (type && url) return `${type}:${normalizeUrlForKey(url)}` - if (url) return normalizeUrlForKey(url) + const normUrl = normalizeUrlForKey(url) + if (type && url && normUrl) return `${type}:${normUrl}` + if (url && normUrl) return normUrl🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/utils/semantic-search.server.ts` around lines 29 - 41, normalizeUrlForKey contains dead guards and can return an empty string for query-only inputs; simplify the URL branch to check only u.pathname !== '/' and remove the unreachable "?? url" fallback, and ensure the final normalized value never returns empty by returning '/' (or another non-empty canonical path) when cleaned === '' so query-only inputs produce a stable key; also update getCanonicalResultId to validate the normalized result (from normalizeUrlForKey) and fall back to a safe default if it's empty before composing the canonical ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 236-241: The overfetch calculation for rawTopK is too constrained
by Math.min(20, safeTopK * 5) so at safeTopK=15 you only request 20 hits and
lose the intended dedupe headroom; change rawTopK to scale with safeTopK (e.g.,
rawTopK = Math.max(20, safeTopK * 5) or use a configurable MAX_RAW_K) so you
actually overfetch for larger topK, and if Vectorize enforces a 20-item cap when
metadata is requested, avoid that cap by requesting embeddings/results without
metadata and then fetching metadata separately for the unique doc ids (adjust
the Vectorize call and any metadataRequested handling around rawTopK/safeTopK to
implement this).
---
Nitpick comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 57-77: getCanonicalResultId currently uses slug verbatim which can
cause casing-only duplicates; update it to normalize the slug (lowercase and any
other normalization used for titles) before building the canonical id. Use the
same normalization strategy as normalizeTitleForKey (or add a
normalizeSlugForKey helper and call it) so cases like "My-Post" vs "my-post"
collapse; ensure branches that return `${type}:${slug}` instead use the
normalized slug and keep existing uses of normalizeUrlForKey and
normalizeTitleForKey unchanged.
- Around line 29-41: normalizeUrlForKey contains dead guards and can return an
empty string for query-only inputs; simplify the URL branch to check only
u.pathname !== '/' and remove the unreachable "?? url" fallback, and ensure the
final normalized value never returns empty by returning '/' (or another
non-empty canonical path) when cleaned === '' so query-only inputs produce a
stable key; also update getCanonicalResultId to validate the normalized result
(from normalizeUrlForKey) and fall back to a safe default if it's empty before
composing the canonical ID.
Force-pushed from 0215747 to 6d2d281.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@app/utils/semantic-search.server.ts`:
- Around line 247-254: The current rawTopK calculation (const rawTopK =
Math.min(20, safeTopK * 5)) collapses to 20 for any safeTopK ≥ 5, giving
insufficient overfetch; change rawTopK so it preserves a 5× overfetch for larger
safeTopK by increasing/removing the hard cap—e.g., replace Math.min(20, safeTopK
* 5) with a larger cap like Math.min(100, safeTopK * 5) or remove the cap
entirely so rawTopK = safeTopK * 5 (keeping safeTopK clamping as-is), and ensure
callers that pass rawTopK into Vectorize still respect any service metadata
limits.
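For context, a small sketch of the overfetch sizing that prompt suggests, with an illustrative MAX_RAW_K cap in place of the hard-coded 20 (the PR's final numbers and clamping rules may differ):

```ts
// Clamp the caller's topK, then overfetch 5x so chunk-level duplicates can be
// collapsed while still leaving topK unique documents, up to a service-friendly cap.
const MAX_TOP_K = 20
const MAX_RAW_K = 100 // illustrative; Vectorize limits depend on whether metadata is returned

function computeTopKs(requestedTopK: number) {
	const safeTopK = Math.min(Math.max(1, Math.floor(requestedTopK)), MAX_TOP_K)
	const rawTopK = Math.min(MAX_RAW_K, safeTopK * 5)
	return { safeTopK, rawTopK }
}

// Example: computeTopKs(15) -> { safeTopK: 15, rawTopK: 75 }
```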
Normalize semantic search results to prevent multiple entries for the same content by de-duplicating chunk-level matches into single canonical documents.
The semantic search often returned multiple entries for the same content because the indexer stores many "chunk" vectors per document. This change collapses these chunk-level hits into a single canonical document result, keyed by `type:slug` (falling back to `type:url`), while overfetching raw matches to ensure `topK` unique documents are still returned.

Note
Medium Risk
Changes affect semantic search ranking/identity semantics by rewriting result IDs and deduping/merging matches, which could alter ordering and downstream consumers’ expectations. Test coverage reduces risk, and the transcription change is a low-impact typing/compatibility fix.
Overview
Semantic search results are now normalized and de-duplicated so multiple Vectorize chunk hits collapse into a single doc-level entry, using a canonical `id` (preferring `type:slug`, then falling back to parsed vector IDs, `type:url`, URL-only, or `type:title`). The query path now clamps `topK`, overfetches raw matches, merges duplicates by choosing the best score and snippet, and returns the requested number of unique docs sorted by score.

Tests were expanded to run against the MSW Cloudflare mock and verify chunk-dedupe behavior (including snippet selection and non-collapsing of same-URL different-slug docs). Separately, Workers AI transcription now sends an `ArrayBuffer`-backed `Uint8Array` body to satisfy stricter TS `fetch` typings without unnecessary copies (see the sketch below).

Written by Cursor Bugbot for commit 3a2a210. This will update automatically on new commits.
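A hedged sketch of the kind of body construction this refers to: wrapping an ArrayBuffer in a Uint8Array gives a copy-free view that strict TypeScript fetch typings accept as a body. The function, endpoint, and model name below are illustrative, not the PR's actual transcription code:

```ts
async function transcribe(audio: ArrayBuffer, accountId: string, apiToken: string) {
	// A view over the same memory (no copy), typed as Uint8Array<ArrayBuffer>,
	// which newer TS lib typings accept as a fetch body.
	const body = new Uint8Array(audio)
	const response = await fetch(
		`https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/@cf/openai/whisper`,
		{
			method: 'POST',
			headers: { Authorization: `Bearer ${apiToken}` },
			body,
		},
	)
	return response.json()
}
```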
Summary by CodeRabbit
Bug Fixes

- Semantic search now de-duplicates chunk-level matches so each document appears only once in results.
- Cloudflare transcription requests now send an ArrayBuffer-backed body compatible with stricter fetch typings.

Tests

- Added unit tests covering result normalization and chunk-level deduplication.