fix: Improve text normalization in getSimilarity to reduce false mismatches in multi-block diff strategy #1820
Before normalization:
- Only collapsed multiple whitespace characters into a single space
- Only trimmed leading/trailing whitespace
- Treated different line endings (\r\n vs \n) as different
- Treated tabs and spaces as different characters
- Preserved zero-width spaces and other invisible characters

After normalization:
- Standardizes all line endings to \n
- Converts tabs to spaces for consistent comparison
- Still collapses multiple whitespace into single space
- Removes zero-width spaces and other invisible Unicode characters
- Still trims leading/trailing whitespace

This fix applies to both:
- Single search-replace functionality (search-replace.ts)
- Multi-block diff functionality (multi-search-replace.ts)

Users will encounter fewer "No sufficiently similar match found" errors across all diff operations when the content is semantically the same but has minor formatting differences.
…ation

Updated the string normalization process in both search-replace and multi-search-replace strategies to handle invisible whitespace more effectively. The changes include:
- Standardizing line endings to \n
- Converting tabs to spaces
- Removing zero-width spaces and other invisible characters
- Trimming trailing spaces from each line
- Collapsing multiple spaces into a single space

These improvements aim to reduce false negatives in similarity matching, allowing for better handling of semantically similar content with minor formatting differences.
// 6) Collapse multiple spaces into a single space
// (You can do this per line or across the whole string)
str = str.replace(/\s+/g, " ")
I thought this \s also matched spaces, tabs, \r, \n and collapsed them all into one space? If so I don't think you need 2, 4, 5
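The question above can be checked directly. The following is a small probe, not code from this PR: the names `probes` and `matchesWhitespaceClass` are hypothetical and exist only for illustration. It shows that JavaScript's `\s` class covers spaces, tabs, `\r`, `\n`, and the non-breaking space, but not the zero-width space or zero-width joiner, so the whitespace collapse alone does not remove those.

```typescript
// Hypothetical probe (not part of the PR): which characters does \s match?
function matchesWhitespaceClass(ch: string): boolean {
  return /^\s$/.test(ch);
}

const probes: Array<[string, string]> = [
  ["space", " "],
  ["tab", "\t"],
  ["carriage return", "\r"],
  ["line feed", "\n"],
  ["non-breaking space (U+00A0)", "\u00A0"],
  ["byte order mark (U+FEFF)", "\uFEFF"],
  ["zero-width space (U+200B)", "\u200B"],
  ["zero-width joiner (U+200D)", "\u200D"],
];

probes.forEach(([name, ch]) => {
  console.log(name + ": " + matchesWhitespaceClass(ch));
});

// A single \s+ collapse already folds tabs, CRLF, and U+00A0 into one space...
const collapsed = "a\t\tb\r\nc\u00A0 d".replace(/\s+/g, " ");

// ...but leaves a zero-width space in place.
const stillHidden = "a\u200Bb".replace(/\s+/g, " ");
```

Under this reading, the line-ending, tab, and trailing-space steps are redundant when the final collapse uses `\s+`, while an explicit zero-width removal step is still needed.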
Hmm, this still isn't working well enough. It helps in some cases, but not in enough of them.
I think \s already matches most of these cases
Summary of Changes:
• Unicode Normalization: We now call .normalize("NFKC") to unify compatibility-equivalent characters (e.g., fullwidth forms, ligatures).
• Standardized Line Endings: Convert all \r\n to \n to ensure consistent comparisons across different OS/platforms.
• Remove Hidden Chars: Strip zero-width spaces (\u200B) and non-breaking spaces (\u00A0) that often cause subtle mismatches.
• Trim Trailing Spaces: Remove leftover whitespace at the end of each line to reduce accidental differences.
• Unify Whitespace: Convert tabs to spaces and collapse multiple spaces into a single space, reducing minor formatting inconsistencies.
• Optional Steps: Provide placeholders to remove triple backticks (for code fences) or further unify punctuation if needed.
• Distance Computation: Retain the Levenshtein distance check from fastest-levenshtein, but now with more aggressively normalized inputs.
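The normalization steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name `normalizeForComparison` is hypothetical, and the set of zero-width code points removed is a guess at what "hidden characters" covers. Unlike the `\s+` collapse shown in the review diff, this sketch collapses only runs of plain spaces, so line breaks are preserved.

```typescript
// Hypothetical sketch of the listed steps; not the PR's actual implementation.
function normalizeForComparison(input: string): string {
  return (
    input
      // Unicode compatibility normalization (also maps U+00A0 to a plain space)
      .normalize("NFKC")
      // Standardize line endings: \r\n and bare \r both become \n
      .replace(/\r\n?/g, "\n")
      // Remove zero-width characters (assumed set: U+200B..U+200D, U+2060, U+FEFF)
      .replace(/[\u200B-\u200D\u2060\uFEFF]/g, "")
      // Convert tabs to spaces
      .replace(/\t/g, " ")
      // Trim trailing spaces at the end of each line
      .replace(/ +$/gm, "")
      // Collapse runs of spaces (assumption: newlines are kept, unlike /\s+/g)
      .replace(/ +/g, " ")
      // Trim leading/trailing whitespace of the whole string
      .trim()
  );
}

// Fullwidth "foo", a tab, trailing spaces, CRLF, a zero-width space, and U+00A0.
const messy = "\uFF46\uFF4F\uFF4F\t= 1;  \r\n\u200Bbar\u00A0baz  ";
const clean = normalizeForComparison(messy);
console.log(JSON.stringify(clean));
```

Applying the same function to both the search text and the original text is what makes the comparison symmetric; normalizing only one side would introduce new mismatches.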
Why This Matters:
Previously, small invisible differences—like trailing spaces, zero-width characters, or inconsistent line endings—were causing the text similarity to drop below our threshold, often generating false mismatch errors. This new approach ensures both the “search” text and the “original” text undergo the same thorough cleanup, substantially reducing those false mismatches.
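To make the threshold effect concrete, here is a minimal sketch of a distance-based similarity score. Several parts are assumptions: the PR uses the `distance` function from fastest-levenshtein, whereas this sketch inlines a simple Levenshtein implementation so it runs on its own; the formula `1 - distance / maxLength` is an assumed way of turning a distance into a score; and the threshold value is hypothetical.

```typescript
// Plain dynamic-programming Levenshtein distance (stand-in for fastest-levenshtein).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed scoring: 1 means identical, 0 means entirely different.
function similarity(a: string, b: string): number {
  const maxLength = Math.max(a.length, b.length);
  if (maxLength === 0) return 1;
  return 1 - levenshtein(a, b) / maxLength;
}

const HYPOTHETICAL_THRESHOLD = 0.9; // illustrative value, not taken from the PR

// One invisible zero-width space in a two-letter string costs a third of the score.
const withHiddenChar = similarity("ab", "a\u200Bb");
const identical = similarity("ab", "ab");

console.log(withHiddenChar >= HYPOTHETICAL_THRESHOLD); // false
console.log(identical >= HYPOTHETICAL_THRESHOLD); // true
```

The shorter the block being matched, the more a single invisible character weighs on the score, which is consistent with the false mismatches described above.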
Impact:
• Higher accuracy in comparing text blocks.
• Fewer false negatives when applying diffs or matching code snippets.
• Simplified debugging because normalization is more explicit and predictable.