Skip to content

fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases#401

Closed
ssyn0813 wants to merge 6 commits intoCortexReach:masterfrom
ssyn0813:fix/adaptive-retrieval-cjk-accuracy
Closed

fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases#401
ssyn0813 wants to merge 6 commits intoCortexReach:masterfrom
ssyn0813:fix/adaptive-retrieval-cjk-accuracy

Conversation

@ssyn0813
Copy link
Copy Markdown
Contributor

Summary

Four bugs in the filtering/retrieval pipeline silently drop valid content:

  1. \p{Emoji} regex matches digits"12345", "8080" are treated as pure emoji and skipped
  2. Slash command regex too broad — file paths like /usr/bin/node are misidentified as slash commands
  3. CJK short text killed in adaptive-retrievallength < 5 hard threshold fires before CJK-aware branch; Chinese queries like "他喜欢猫" (4 chars) are skipped
  4. CJK short text killed in noise-filter — same length < 5 threshold with no CJK awareness; Chinese memories are rejected at write time

Bug 3+4 together create a double loss: short CJK content can neither be stored nor retrieved.

Root Cause

  • Unicode \p{Emoji} includes ASCII digits 0-9, #, * as keycap bases
  • ^\// matches any /-prefixed text, not just /command format
  • Both adaptive-retrieval.ts:78 and noise-filter.ts:76 use length < 5 without considering that CJK characters carry far more semantic information per character than Latin characters

Changes

  • src/adaptive-retrieval.ts: Replace \p{Emoji} with \p{Extended_Pictographic}, narrow slash regex to ^\/[a-z][\w-]*\s*$/i, add CJK-aware hard threshold, add digit exemption for port/issue numbers, add slash command guard on FORCE_RETRIEVE to prevent /recall bypassing skip logic
  • src/noise-filter.ts: Add CJK-aware hard threshold (CJK: length < 2, non-CJK: length < 5), tighten boilerplate regex to avoid false positives on longer sentences starting with greetings
  • test/adaptive-retrieval.test.mjs: New — 20 test cases covering emoji, slash, CJK, and existing behavior preservation
  • test/noise-filter.test.mjs: New — 15 test cases covering CJK, English, patterns, options, and filterNoise generic
  • package.json: Register both new test files in npm test script

Test Plan

  • node --test test/adaptive-retrieval.test.mjs — 20/20 pass
  • node --test test/noise-filter.test.mjs — 15/15 pass
  • CI passes (note: reflection-bypass-hook.test.mjs has pre-existing failures on master, unrelated to this PR)

Related: #127

ssyn0813 and others added 6 commits March 29, 2026 06:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use \p{Extended_Pictographic} instead of \p{Emoji} to avoid matching digits
- Narrow slash command regex to /word format, no longer matches file paths
- Add CJK-aware hard threshold (length < 2 for CJK, < 5 for non-CJK)
- Exempt digit-containing strings (port numbers, issue IDs) from length thresholds
- Lower CJK defaultMinLength from 6 to 3 for short meaningful CJK queries
- Lower non-CJK defaultMinLength from 15 to 13 to allow file path queries
- Prevent FORCE_RETRIEVE from hijacking slash commands like /recall
Short CJK text (2+ chars) is no longer falsely marked as noise.
Uses same CJK detection pattern as adaptive-retrieval.

Also tightens boilerplate greeting regex to only match standalone
greetings (≤1 trailing word), so real memories starting with "hello"
are not incorrectly filtered.
Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean fixes with solid test coverage. The three changes are all well-targeted:

  1. \p{Emoji}\p{Extended_Pictographic} — correct fix for the digits-matching-as-emoji bug
  2. Slash command regex tightened to avoid matching file paths
  3. CJK thresholds lowered + digit exemption for port/issue numbers

196 lines of new tests covering edge cases. LGTM.

@rwmjhb
Copy link
Copy Markdown
Collaborator

rwmjhb commented Mar 30, 2026

Review: REQUEST-CHANGES

The four bugs you're fixing are real — \p{Emoji} matching digits, ^/ catching file paths, and CJK short text being double-dropped by both shouldSkipRetrieval and isNoise hard length gates. Good test coverage added.

Must fix:

  1. Tests were never executedtargeted_status=SKIP and full_suite_run=false in our verification run. The PR adds test/adaptive-retrieval.test.mjs and test/noise-filter.test.mjs to package.json, but the verifier couldn't run them. Please confirm these tests pass locally and ideally in CI.

  2. Slash commands with arguments regressed — the regex changed from ^/ to ^\/[a-z][\w-]*\s*$, so /recall my name is no longer recognized as a slash command. Since FORCE_RETRIEVE_PATTERNS runs before the skip check, these now trigger embedding/retrieval instead of being skipped. Consider ^\/[a-z][\w-]*(\s|$) or similar to match slash commands with arguments.

Worth considering (not blocking):

  • Keycap emoji like 1️⃣ contain a digit character, so they bypass both length gates via the hasDigit exemption AND don't match the new Extended_Pictographic pure-emoji regex. Result: shouldSkipRetrieval('1️⃣') === false. Minor, but a concrete edge case.
  • The defaultMinLength changes (15→13 English, 6→3 CJK) go beyond the stated length<5 fix — were these thresholds empirically validated?
  • Branch is behind main (stale_base=true). Please rebase and confirm no conflicts in the modified files.

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — changes are clean, on-topic, and well-tested. Approving.

Copy link
Copy Markdown

app3apps commented Apr 3, 2026

Closing this PR for now due to inactivity.

The core CJK / emoji / file-path fixes make sense, but the current head still regresses argument-bearing slash commands in shouldSkipRetrieval(). With the narrowed ^\/[a-z][\w-]*\s*$ check, inputs like /remember <content>, /lesson <content>, and /recall my name no longer count as slash commands, so they can fall through the force-retrieve path and trigger auto-recall work that the current ^/ behavior skips.

That is a real behavior change relative to the documented custom-command flow in the README, so I don’t want to merge it as-is.

If you want to revive this later, please rebase and update the slash-command handling so it still skips commands with arguments while continuing to avoid false positives for file paths like /usr/bin/node. A matching regression test for argument-bearing slash commands should come with that change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants