hannesrudolph commented Mar 19, 2025

Summary of Changes:
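• Unicode Normalization: We now call .normalize("NFKC") to unify compatibility-equivalent characters (e.g., fullwidth forms, ligatures). Note that NFKC also maps non-breaking spaces (\u00A0) to regular spaces but leaves zero-width spaces (\u200B) untouched.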
• Standardized Line Endings: Convert all \r\n to \n to ensure consistent comparisons across different OS/platforms.
• Remove Hidden Chars: Strip zero-width spaces (\u200B) and non-breaking spaces (\u00A0) that often cause subtle mismatches.
• Trim Trailing Spaces: Remove leftover whitespace at the end of each line to reduce accidental differences.
• Unify Whitespace: Convert tabs to spaces and collapse multiple spaces into a single space, reducing minor formatting inconsistencies.
• Optional Steps: Provide placeholders to remove triple backticks (for code fences) or further unify punctuation if needed.
• Distance Computation: Retain the Levenshtein distance check from fastest-levenshtein, but now with more aggressively normalized inputs.
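
For illustration, the steps listed above could be sketched roughly as follows. This is not the PR's code: the function name `normalizeForComparison` is hypothetical, the step order is an assumption, and the last step collapses only runs of spaces (so newlines survive) instead of using a whole-string `\s+` replacement.

```typescript
// Hypothetical helper: the name, the step order, and the per-line collapse
// are assumptions made for this sketch, not the code from the PR.
function normalizeForComparison(input: string): string {
  // 1) Fold compatibility forms (e.g. fullwidth letters) into canonical ones.
  //    NFKC also turns U+00A0 (NBSP) into a regular space.
  let str = input.normalize("NFKC")

  // 2) Standardize line endings: CRLF -> LF.
  str = str.replace(/\r\n/g, "\n")

  // 3) Remove zero-width spaces. Any NBSP has already been mapped to a
  //    space by step 1, so including it here is only a safeguard.
  str = str.replace(/[\u200B\u00A0]/g, "")

  // 4) Trim trailing spaces and tabs at the end of each line.
  str = str.replace(/[ \t]+$/gm, "")

  // 5) Convert tabs to spaces.
  str = str.replace(/\t/g, " ")

  // 6) Collapse runs of spaces into one space, keeping newlines intact.
  str = str.replace(/ {2,}/g, " ")

  return str
}

// Usage: a fullwidth "Ａ", a tab, double spaces, a trailing space before a
// CRLF, and a zero-width space all disappear from the comparison input.
const cleaned = normalizeForComparison("\uFF21\tb  c \r\nd\u200Be")
console.log(JSON.stringify(cleaned)) // "A b c\nde"
```

Because step 1 runs first, a non-breaking space ends up as a regular space rather than being deleted; running the strip step before NFKC would delete it instead, so the order changes the result.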

Why This Matters:
Previously, small invisible differences—like trailing spaces, zero-width characters, or inconsistent line endings—were causing the text similarity to drop below our threshold, often generating false mismatch errors. This new approach ensures both the “search” text and the “original” text undergo the same thorough cleanup, substantially reducing those false mismatches.
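
To make the effect concrete, here is a minimal sketch of how a single invisible character can pull a similarity score below 1. The similarity formula (`1 - distance / maxLength`) and the function names are assumptions for illustration, and the small inline Levenshtein routine is only a stand-in for the `distance` function from fastest-levenshtein so that the example runs without dependencies.

```typescript
// Stand-in for fastest-levenshtein's distance(); a plain two-row
// dynamic-programming implementation, used here only to stay dependency-free.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      )
    }
    prev = curr
  }
  return prev[b.length]
}

// Assumed similarity definition: 1 means identical, 0 means nothing in common.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length)
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen
}

const original = "const x = 1;\r\n" // file saved with CRLF line endings
const search = "const x = 1;\n" // search block produced with LF

// One stray "\r" is one edit, so the raw score is below 1.
console.log(similarity(original, search) < 1) // true

// After unifying line endings the two strings are identical.
console.log(similarity(original.replace(/\r\n/g, "\n"), search)) // 1
```

On short blocks a single edit costs a large share of the score, which is why small invisible differences can be enough to fall under a strict threshold.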

Impact:
• Higher accuracy in comparing text blocks.
• Fewer false negatives when applying diffs or matching code snippets.
• Simplified debugging because normalization is more explicit and predictable.

Before normalization:
- Only collapsed multiple whitespace characters into a single space
- Only trimmed leading/trailing whitespace
- Treated different line endings (\r\n vs \n) as different
- Treated tabs and spaces as different characters
- Zero-width spaces and other invisible characters were preserved

After normalization:
- Standardizes all line endings to \n
- Converts tabs to spaces for consistent comparison
- Still collapses multiple whitespace into single space
- Removes zero-width spaces and other invisible Unicode characters
- Still trims leading/trailing whitespace

This fix applies to both:
- Single search-replace functionality (search-replace.ts)
- Multi-block diff functionality (multi-search-replace.ts)

Users will encounter fewer "No sufficiently similar match found" errors across all diff operations when the content is semantically the same but has minor formatting differences.
…ation

Updated the string normalization process in both search-replace and multi-search-replace strategies to handle invisible whitespace more effectively. The changes include:
- Standardizing line endings to \n
- Converting tabs to spaces
- Removing zero-width spaces and other invisible characters
- Trimming trailing spaces from each line
- Collapsing multiple spaces into a single space

These improvements aim to reduce false negatives in similarity matching, allowing for better handling of semantically similar content with minor formatting differences.

changeset-bot bot commented Mar 20, 2025

⚠️ No Changeset found

Latest commit: 75202eb

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

hannesrudolph changed the title from "fix: improve diff matching by handling invisible whitespace" to "fix: Improve text normalization in getSimilarity to reduce false mismatches" Mar 20, 2025
hannesrudolph changed the title from "fix: Improve text normalization in getSimilarity to reduce false mismatches" to "fix: Improve text normalization in getSimilarity to reduce false mismatches in multi-block diff strategy" Mar 20, 2025

// 6) Collapse multiple spaces into a single space
// (You can do this per line or across the whole string)
str = str.replace(/\s+/g, " ")

@mrubens mrubens Mar 20, 2025

I thought this \s also matched spaces, tabs, \r, \n and collapsed them all into one space? If so I don't think you need 2, 4, 5

@hannesrudolph (Collaborator, Author)

Hmm, this still isn't working well enough. It helps in some cases, but not in enough of them.

mrubens commented Mar 20, 2025

I think \s already matches most of these cases
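
A quick way to check the `\s` point raised in this thread is to test individual characters against the class. The sketch below relies only on standard JavaScript regex behaviour; the helper name `matchedByWhitespaceClass` is made up for the example. In JavaScript, `\s` covers tabs, line terminators, regular spaces and non-breaking spaces, but a zero-width space (U+200B) is not part of the class, so a `\s+` collapse alone leaves it in place.

```typescript
// Hypothetical helper for this example: does the JavaScript \s class match ch?
const matchedByWhitespaceClass = (ch: string): boolean => /\s/.test(ch)

const results = {
  space: matchedByWhitespaceClass(" "),
  tab: matchedByWhitespaceClass("\t"),
  carriageReturn: matchedByWhitespaceClass("\r"),
  lineFeed: matchedByWhitespaceClass("\n"),
  nonBreakingSpace: matchedByWhitespaceClass("\u00A0"),
  zeroWidthSpace: matchedByWhitespaceClass("\u200B"),
}
console.log(results)
// Everything is true except zeroWidthSpace, which is false.

// A single \s+ collapse therefore already unifies tabs, CRLF and spaces...
const collapsed = "a\t\r\n b".replace(/\s+/g, " ")
console.log(JSON.stringify(collapsed)) // "a b"

// ...but a zero-width space passes through unchanged and needs its own step.
const withZwsp = "a\u200Bb".replace(/\s+/g, " ")
console.log(withZwsp.length) // 3
```

Under that reading, the separate line-ending, trailing-space and tab steps would be redundant before a whole-string `\s+` collapse, while the zero-width-space removal would still do work that `\s` does not.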

@mrubens mrubens closed this Mar 20, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Mar 20, 2025
@hannesrudolph hannesrudolph deleted the fix-multi-block-dif branch May 12, 2025 15:26