fix: Improve text normalization in getSimilarity to reduce false mismatches in multi-block diff strategy #1820

hannesrudolph · 2025-03-19T21:54:24Z

Summary of Changes:
• Unicode Normalization: We now call .normalize("NFKC") to fold compatibility characters into a common form (e.g., fullwidth forms, ligatures).
• Standardized Line Endings: Convert all \r\n to \n to ensure consistent comparisons across different OS/platforms.
• Remove Hidden Chars: Strip zero-width spaces (\u200B) and non-breaking spaces (\u00A0) that often cause subtle mismatches.
• Trim Trailing Spaces: Remove leftover whitespace at the end of each line to reduce accidental differences.
• Unify Whitespace: Convert tabs to spaces and collapse multiple spaces into a single space, reducing minor formatting inconsistencies.
• Optional Steps: Provide placeholders to remove triple backticks (for code fences) or further unify punctuation if needed.
• Distance Computation: Retain the Levenshtein distance check from fastest-levenshtein, but now with more aggressively normalized inputs.
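
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `normalizeForComparison` and the exact step order are assumptions, and the optional code-fence and punctuation steps are left out.

```typescript
// Hypothetical helper illustrating the normalization steps from the summary.
// The name and the ordering of steps are assumptions made for this sketch.
function normalizeForComparison(input: string): string {
	return (
		input
			// 1) Fold compatibility characters (fullwidth "ｆ" becomes "f").
			//    NFKC also maps U+00A0 (non-breaking space) to a regular space.
			.normalize("NFKC")
			// 2) Standardize line endings: CRLF becomes LF.
			.replace(/\r\n/g, "\n")
			// 3) Strip zero-width spaces, plus any non-breaking space still present.
			.replace(/[\u200B\u00A0]/g, "")
			// 4) Trim trailing spaces and tabs at the end of each line.
			.replace(/[ \t]+$/gm, "")
			// 5) Convert tabs to spaces.
			.replace(/\t/g, " ")
			// 6) Collapse runs of spaces into one space (newlines are kept here).
			.replace(/ {2,}/g, " ")
	)
}

// Fullwidth "foo", a zero-width space, a tab, trailing spaces, CRLF, and a
// non-breaking space all disappear or become plain ASCII equivalents.
const messy = "\uFF46\uFF4F\uFF4F\u200B(\tbar)  \r\nbaz\u00A0qux  "
console.log(JSON.stringify(normalizeForComparison(messy))) // "foo( bar)\nbaz qux"
```

Step 6 in this sketch only collapses literal spaces so that line structure survives; a pattern such as `/\s+/g` would also fold newlines and tabs into the same single space.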

Why This Matters:
Previously, small invisible differences—like trailing spaces, zero-width characters, or inconsistent line endings—were causing the text similarity to drop below our threshold, often generating false mismatch errors. This new approach ensures both the “search” text and the “original” text undergo the same thorough cleanup, substantially reducing those false mismatches.
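
To make the effect concrete, here is a small sketch of how invisible differences can pull a Levenshtein-based score below a threshold. Everything in it is an assumption for illustration: the scoring formula (1 minus distance over the longer length), the 0.9 threshold, the `cleanup` helper, and the hand-written `levenshtein` function, which stands in for the `distance` export of fastest-levenshtein so the example has no dependencies.

```typescript
// Plain dynamic-programming Levenshtein distance (stand-in for fastest-levenshtein).
function levenshtein(a: string, b: string): number {
	let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
	for (let i = 1; i <= a.length; i++) {
		const curr: number[] = [i]
		for (let j = 1; j <= b.length; j++) {
			const cost = a[i - 1] === b[j - 1] ? 0 : 1
			curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
		}
		prev = curr
	}
	return prev[b.length]
}

// Assumed scoring: 1 means identical, lower means more edits per character.
function similarity(a: string, b: string): number {
	const maxLength = Math.max(a.length, b.length)
	return maxLength === 0 ? 1 : 1 - levenshtein(a, b) / maxLength
}

// Deliberately small cleanup: CRLF to LF, drop zero-width spaces, trim line ends.
function cleanup(s: string): string {
	return s.replace(/\r\n/g, "\n").replace(/\u200B/g, "").replace(/[ \t]+$/gm, "")
}

const THRESHOLD = 0.9 // illustrative value only, not taken from the PR

const original = "a = 1;\r\nb = 2;\r\n" // CRLF line endings
const search = "a = 1;  \nb\u200B = 2;\n" // trailing spaces and a zero-width space

const rawScore = similarity(original, search)
const cleanScore = similarity(cleanup(original), cleanup(search))

console.log(rawScore >= THRESHOLD) // false: the invisible differences cost too much
console.log(cleanScore >= THRESHOLD) // true: both sides clean up to the same text
```

The important design point is symmetry: the same cleanup runs on both the search text and the original text, so the score only reflects differences that survive normalization.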

Impact:
• Higher accuracy in comparing text blocks.
• Fewer false negatives when applying diffs or matching code snippets.
• Simplified debugging because normalization is more explicit and predictable.

Before normalization:
- Only collapsed multiple whitespace characters into a single space
- Only trimmed leading/trailing whitespace
- Treated different line endings (\r\n vs \n) as different
- Treated tabs and spaces as different characters
- Zero-width spaces and other invisible characters were preserved

After normalization:
- Standardizes all line endings to \n
- Converts tabs to spaces for consistent comparison
- Still collapses multiple whitespace into single space
- Removes zero-width spaces and other invisible Unicode characters
- Still trims leading/trailing whitespace

This fix applies to both:
- Single search-replace functionality (search-replace.ts)
- Multi-block diff functionality (multi-search-replace.ts)

Users will encounter fewer "No sufficiently similar match found" errors across all diff operations when the content is semantically the same but has minor formatting differences.

…ation

Updated the string normalization process in both search-replace and multi-search-replace strategies to handle invisible whitespace more effectively. The changes include:
- Standardizing line endings to \n
- Converting tabs to spaces
- Removing zero-width spaces and other invisible characters
- Trimming trailing spaces from each line
- Collapsing multiple spaces into a single space

These improvements aim to reduce false negatives in similarity matching, allowing for better handling of semantically similar content with minor formatting differences.

changeset-bot · 2025-03-20T00:13:30Z

⚠️ No Changeset found

Latest commit: 75202eb

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

mrubens · 2025-03-20T01:35:48Z

src/core/diff/strategies/multi-search-replace.ts

+
+		// 6) Collapse multiple spaces into a single space
+		//    (You can do this per line or across the whole string)
+		str = str.replace(/\s+/g, " ")


I thought \s also matches spaces, tabs, \r, and \n, so this would collapse them all into one space? If so, I don't think you need steps 2, 4, and 5.

hannesrudolph · 2025-03-20

Hmm, this still isn't working well enough. It helps in some cases, but not in enough of them.

mrubens · 2025-03-20T13:13:02Z

I think \s already matches most of these cases
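
For reference, the point about `\s` can be checked directly. The sketch below probes JavaScript's `\s` class against the characters discussed in this thread; the helper name and the sample list are made up for the illustration. In JavaScript, `\s` covers spaces, tabs, \r, \n, and the non-breaking space (U+00A0), but not the zero-width space (U+200B), so that one still needs its own step.

```typescript
// Hypothetical probe: does a single character belong to the regex class \s?
function matchesWhitespaceClass(ch: string): boolean {
	return /^\s$/.test(ch)
}

// Characters mentioned in the PR description and review comments.
const samples: Array<[string, string]> = [
	["space", " "],
	["tab", "\t"],
	["carriage return", "\r"],
	["line feed", "\n"],
	["non-breaking space U+00A0", "\u00A0"],
	["zero-width space U+200B", "\u200B"],
]

for (const [name, ch] of samples) {
	console.log(name + ": " + matchesWhitespaceClass(ch))
}
// Every entry prints true except the zero-width space, which prints false.

// A single \s+ pass therefore folds tabs, CRLF, and runs of spaces together...
const collapsed = "a\t\tb\r\n  c".replace(/\s+/g, " ") // "a b c"

// ...but leaves a zero-width space in place.
const untouched = "a\u200Bb".replace(/\s+/g, " ") // still contains U+200B

console.log(JSON.stringify(collapsed), untouched.length) // "a b c" 3
```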

github-project-automation bot added this to Roo Code Roadmap Mar 19, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Mar 19, 2025

hannesrudolph changed the title ~~fix: improve diff matching by handling invisible whitespace~~ fix: Improve text normalization in getSimilarity to reduce false mismatches Mar 20, 2025

hannesrudolph changed the title ~~fix: Improve text normalization in getSimilarity to reduce false mismatches~~ fix: Improve text normalization in getSimilarity to reduce false mismatches in multi-block diff strategy Mar 20, 2025

mrubens reviewed Mar 20, 2025

mrubens closed this Mar 20, 2025

github-project-automation bot moved this from New to Done in Roo Code Roadmap Mar 20, 2025

hannesrudolph deleted the fix-multi-block-dif branch May 12, 2025 15:26

3 participants
