
Conversation


@aheizi aheizi commented Aug 8, 2025

Related GitHub Issue

#3555

Roo Code Task Context (Optional)

Description

Roo Code currently reads files as UTF-8, which mishandles files in other encodings such as ISO-8859-1, GBK, and Shift-JIS. This PR adds encoding detection so such files are read correctly and their original encoding is preserved on write.

Test Procedure

  • Read files in encodings other than UTF-8 (ISO-8859-1, Shift-JIS, etc.) and verify the content is displayed correctly.
  • Write files and verify that the original encoding is preserved.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable).
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

before: (screenshot)

after: (screenshot)

Documentation Updates

Does this PR necessitate updates to user-facing documentation?

  • No documentation updates are required.
  • Yes, documentation updates are required. (Please describe what needs to be updated or link to a PR in the docs repository).

Additional Notes

Get in Touch

aheizi


Important

Adds file encoding detection and handling for reading and writing files, ensuring correct processing of various encodings and preserving original encoding on write.

  • Behavior:
    • Introduces readFileWithEncodingDetection and writeFileWithEncodingPreservation in encoding.ts to handle file reading and writing with encoding detection and preservation.
    • Replaces isBinaryFile with isBinaryFileWithEncodingDetection in readFileTool.ts, searchAndReplaceTool.ts, and writeToFileTool.ts for binary file detection with encoding consideration.
    • Updates DiffViewProvider.ts to use new encoding functions for reading and writing files.
  • Tests:
    • Adds tests for encoding detection and preservation in encoding.spec.ts.
    • Updates existing tests in DiffViewProvider.spec.ts, extract-text-large-files.spec.ts, and read-lines.spec.ts to mock new encoding functions.
  • Dependencies:
    • Adds iconv-lite and jschardet to package.json for encoding detection and conversion.
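For orientation, the first step in a detection pipeline like this is usually BOM sniffing, before statistical detection (jschardet) is consulted. The helper below is an illustrative sketch with a hypothetical name (`sniffBom`), not code from this PR:

```typescript
// Hypothetical helper (not from this PR): detect an encoding from a
// byte-order mark, returning null when no BOM is present so the caller
// can fall through to statistical detection.
function sniffBom(buf: Buffer): string | null {
	if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return "utf8"
	if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return "utf16le"
	if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return "utf16be"
	return null
}
```

A BOM match is unambiguous, so it should take precedence over any statistical guess.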

This description was created by Ellipsis for 898b437.

@aheizi aheizi requested review from cte, jr and mrubens as code owners August 8, 2025 09:15
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Aug 8, 2025
@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 8, 2025
Contributor

@roomote roomote bot left a comment

Thank you for your contribution! I've reviewed the changes and found several issues that need attention. The encoding detection feature is a valuable addition, but there are some critical issues around testing and error handling that should be addressed.


const encoding = await detectEncoding(buffer, fileExtension)
return iconv.decode(buffer, encoding)
}
Contributor

This new utility file needs dedicated unit tests. Could you add comprehensive test coverage for both new functions? This is critical functionality that should be thoroughly tested with various encoding scenarios.

encoding = detected
} else if (detected && detected.encoding) {
// Check confidence level, use default encoding if too low
if (detected.confidence < 0.7) {
Contributor

Is the 0.7 confidence threshold intentional? Could you either make this configurable or add a comment explaining why this specific value was chosen? Different use cases might benefit from different thresholds.

Contributor Author

Yes, the 0.7 confidence threshold was intentional — it’s a conservative value we’ve found to work well for our current use case where UTF-8 is the dominant encoding, and we prefer to fall back rather than risk mis-decoding.

Different thresholds have different trade-offs:

  • ≥0.9 – very strict, only accept when the detection is almost certain; reduces false positives but may fall back too often.
  • ~0.7 – balanced, common for general text processing where UTF-8 fallback is safe.
  • <0.5 – more permissive, higher coverage but greater risk of mis-decoding.
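The trade-off can be captured in a small pure function; this is an illustrative sketch (hypothetical `chooseEncoding`, with the threshold as a parameter so it could be made configurable), not the PR's actual code:

```typescript
interface DetectionResult {
	encoding: string | null
	confidence: number
}

// Sketch: accept the detected encoding only when confidence clears the
// threshold; otherwise fall back (UTF-8 by default).
function chooseEncoding(detected: DetectionResult | null, threshold = 0.7, fallback = "utf8"): string {
	if (!detected || !detected.encoding) return fallback
	return detected.confidence >= threshold ? detected.encoding : fallback
}
```

Passing the threshold explicitly keeps the 0.7 default in one place while letting callers with different risk tolerances tune it.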

@@ -22,7 +25,7 @@ const outOfRangeError = (filepath: string, n: number) => {
* @throws {RangeError} If line numbers are invalid or out of range
*/
export function readLines(filepath: string, endLine?: number, startLine?: number): Promise<string> {
return new Promise((resolve, reject) => {
return new Promise(async (resolve, reject) => {
Contributor

Using an async function as a Promise executor is an anti-pattern that can lead to unhandled promise rejections. Consider refactoring this to avoid the anti-pattern:
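The suggested snippet was not captured in this thread. One common shape for the refactor (a sketch under the assumption that the decoding work can be awaited directly, not the PR's final code) is to make the function async and drop the Promise constructor entirely:

```typescript
import { promises as fsp } from "node:fs"

// Pure helper: select an inclusive line range (illustrative).
function sliceLines(content: string, startLine = 0, endLine?: number): string {
	const lines = content.split("\n")
	return lines.slice(startLine, endLine === undefined ? undefined : endLine + 1).join("\n")
}

// Sketch of the refactor: an async function instead of an async executor
// inside `new Promise(...)`, so thrown errors reject the returned promise
// automatically. The real readLines streams and decodes; readFile here is
// a stand-in.
export async function readLinesSketch(filepath: string, endLine?: number, startLine?: number): Promise<string> {
	const content = await fsp.readFile(filepath, "utf8")
	return sliceLines(content, startLine ?? 0, endLine)
}
```

With this shape there is no executor to reject manually; any error from the awaited calls propagates to the caller.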

Contributor Author

ok

const fileHandle = await open(filepath, 'r');
const sampleBuffer = Buffer.alloc(65536);
await fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0);
await fileHandle.close();
Contributor

The file handle should be closed in a finally block to ensure it's always closed even if an error occurs. Consider:
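A sketch of the try/finally shape (illustrative helper name; it also slices the sample to bytesRead, which a later review comment asks for separately):

```typescript
import { open } from "node:fs/promises"

// Sketch: the handle is closed even if read() throws, and the sample is
// sliced to the bytes actually read so short files are not zero-padded.
async function readSample(filepath: string, size = 65536): Promise<Buffer> {
	const fileHandle = await open(filepath, "r")
	try {
		const sampleBuffer = Buffer.alloc(size)
		const { bytesRead } = await fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0)
		return sampleBuffer.subarray(0, bytesRead)
	} finally {
		await fileHandle.close()
	}
}
```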

Contributor Author

ok

@@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""}
if (isPreventFocusDisruptionEnabled) {
// Direct file write without diff view or opening the file
cline.diffViewProvider.editType = "modify"
cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8")
cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath)
Contributor

I noticed readFileWithEncodingDetection is called multiple times for the same file (here and line 560). Could we optimize this by reading the file once and reusing the content? This would improve performance, especially for large files.

Contributor Author

Good review. Let me make some revisions.

Contributor Author

Let's prioritize consistency with the existing behavior first. The repeated file read can be addressed as a follow-up optimization in a separate issue.


// 3. Verify if the encoding is supported by iconv-lite
if (!iconv.encodingExists(encoding)) {
console.warn(`Unsupported encoding detected: ${encoding}, falling back to utf8`)
Contributor

When encoding detection fails or falls back to UTF-8, it would be helpful to include the originally detected encoding in the warning message. This would aid in debugging encoding issues.

Contributor Author

done

@aheizi aheizi force-pushed the fix/file-encoding-detection branch from f05c733 to 44b75e6 Compare August 8, 2025 09:40
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 11, 2025
Member

@daniel-lxs daniel-lxs left a comment

Thank you @aheizi for addressing the previous review. I've reviewed the changes and noticed that files are being read with encoding detection but still written back in UTF-8. I think we should ensure the encoding is preserved during write operations to avoid corrupting non-UTF-8 files. The DiffViewProvider and other components that write files need to be updated to use the detected encoding. Also, some of the previous review comments about tests and the Promise pattern are still pending.

} else if (detected && detected.encoding) {
originalEncoding = detected.encoding
// Check confidence level, use default encoding if too low
if (detected.confidence < 0.7) {
Member

Could we add a comment explaining why 0.7 was chosen as the threshold? Based on your explanation in the PR comments, perhaps something like:

Suggested change
if (detected.confidence < 0.7) {
// Check confidence level, use default encoding if too low
// 0.7 is a conservative threshold that works well when UTF-8 is the dominant encoding
// and we prefer to fall back rather than risk mis-decoding
if (detected.confidence < 0.7) {

Contributor Author

done


const encoding = await detectEncoding(buffer, fileExtension)
return iconv.decode(buffer, encoding)
}
Member

Since this is core functionality, would it be helpful to add comprehensive test coverage? We could test:

  • Various encodings (UTF-8, GBK, ISO-8859-1, Shift-JIS)
  • Binary file detection
  • Confidence threshold behavior
  • Fallback scenarios
  • Edge cases (empty files, very small files)

Perhaps create a test file at src/utils/__tests__/encoding.spec.ts?
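For encodings Node supports natively, a round-trip check needs no extra dependencies; GBK or Shift-JIS cases would go through iconv-lite instead. A minimal sketch of the kind of assertions such a spec could make (illustrative, not the PR's actual test file):

```typescript
// Round-trip through latin1 (ISO-8859-1), which Node supports natively.
function roundTripLatin1(text: string): string {
	return Buffer.from(text, "latin1").toString("latin1")
}

// Decoding latin1 bytes as UTF-8 mangles accented characters; this is
// exactly the corruption the PR is meant to prevent.
function decodeLatin1BytesAsUtf8(text: string): string {
	return Buffer.from(text, "latin1").toString("utf8")
}
```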

Contributor Author

done

* @param filePath Path to the file
* @returns File content as string
*/
export async function readFileWithEncodingDetection(filePath: string): Promise<string> {
Member

I think we need a corresponding writeFileWithEncodingPreservation function that stores and uses the detected encoding when writing files back. Otherwise, files will be corrupted when written in UTF-8 regardless of their original encoding.

Contributor Author

done

@@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""}
if (isPreventFocusDisruptionEnabled) {
// Direct file write without diff view or opening the file
cline.diffViewProvider.editType = "modify"
cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8")
cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath)
Member

Should we also ensure the encoding is preserved when the file is written back? The DiffViewProvider might need to track and use the original encoding when saving files.

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 12, 2025
@hannesrudolph hannesrudolph added PR - Changes Requested and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Aug 12, 2025
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Aug 12, 2025

aheizi commented Aug 12, 2025

@daniel-lxs Thanks for your comment! About file write encoding, here are my thoughts:

  • For most file writes, we use the VSCode API (vscode.workspace.fs.writeFile), which automatically handles encoding detection and preservation. So we don't need to do anything extra in these cases.
  • Only the "Directly save content to a file without showing diff view" scenario (used in the preventFocusDisruption experiment) requires us to handle encoding manually. For this, we use our own writeFileWithEncodingPreservation function to ensure the original file encoding is preserved.
  • Configuration files and new files are always written in UTF-8 (which is standard).

@daniel-lxs daniel-lxs moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Aug 12, 2025
@daniel-lxs
Member

Hi @aheizi, thanks for working on this important feature! I've been testing the PR with various encodings to make sure it covers different use cases.

I tried creating test files in ISO-8859-1, Shift-JIS, GBK, EUC-KR, Windows-1252, and UTF-16. Some worked great (Windows-1252 and UTF-16), but others showed as "Binary file - content not displayed". Also, when I modified a Windows-1252 file, it seemed to get converted to UTF-8.

Could you take a look? My tests might be wrong, or I might be missing something in how the feature works. I just want to make sure this works correctly for all the encodings mentioned in the issue.

If you need help reproducing my tests or want me to try something specific, let me know! Really appreciate your work on fixing this encoding issue - it's been a pain point for many users.

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 14, 2025

aheizi commented Aug 15, 2025

Hi @daniel-lxs, I've found the cause of the issue. The current isBinaryFile check doesn't handle these encodings well; for example, Windows-1252 files are identified as binary. I'll update the logic to detect the encoding first and then determine whether the file is binary.
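As a sketch of why encoding-aware detection matters here: a naive NUL-byte heuristic flags UTF-16 text as binary, while single-byte encodings like Windows-1252 can trip other heuristics. The helper below is illustrative (hypothetical name), not the PR's implementation:

```typescript
// Illustrative heuristic: UTF-16 text legitimately contains NUL bytes,
// so check for a UTF-16 BOM before treating NULs as evidence of binary.
function looksBinary(buf: Buffer): boolean {
	const hasUtf16Bom =
		buf.length >= 2 && ((buf[0] === 0xff && buf[1] === 0xfe) || (buf[0] === 0xfe && buf[1] === 0xff))
	if (hasUtf16Bom) return false
	return buf.includes(0)
}
```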


aheizi commented Aug 15, 2025

@hannesrudolph hannesrudolph moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Sep 23, 2025
@hannesrudolph
Collaborator

Let's fix these merge issues and move this to needs review.

@daniel-lxs
Member

There are a lot of failing checks here; these must be fixed before we can decide whether this should be merged.

@daniel-lxs daniel-lxs marked this pull request as draft September 24, 2025 19:18
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Sep 24, 2025
Member

@daniel-lxs daniel-lxs left a comment

Found issues that need attention.

open(filepath, 'r')
.then(fileHandle => {
const sampleBuffer = Buffer.alloc(65536);
return fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0)
Member

@daniel-lxs daniel-lxs Sep 24, 2025

Encoding sample ignores bytesRead. If fewer than 64 KB are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.

Contributor Author

done

// Choose decoding method based on native support
let input: NodeJS.ReadableStream;
if (nodeEncodings.includes(encoding.toLowerCase())) {
input = createReadStream(filepath, { encoding: encoding as BufferEncoding });
Member

@daniel-lxs daniel-lxs Sep 24, 2025

Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.

* @returns Promise<void>
*/
export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> {
// Detect original file encoding
Member

@daniel-lxs daniel-lxs Sep 24, 2025

BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.
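A sketch of BOM preservation with plain Buffer operations (hypothetical helper names; the real write path would apply this around iconv.encode):

```typescript
// Known byte-order marks.
const BOMS: Buffer[] = [
	Buffer.from([0xef, 0xbb, 0xbf]), // UTF-8
	Buffer.from([0xff, 0xfe]), // UTF-16 LE
	Buffer.from([0xfe, 0xff]), // UTF-16 BE
]

// Return the original file's BOM, or null when it has none.
function extractBom(buf: Buffer): Buffer | null {
	for (const bom of BOMS) {
		if (buf.length >= bom.length && buf.subarray(0, bom.length).equals(bom)) return bom
	}
	return null
}

// Re-attach the original BOM to freshly re-encoded content, unless the
// encoder already emitted one.
function withOriginalBom(originalFile: Buffer, reencoded: Buffer): Buffer {
	const bom = extractBom(originalFile)
	if (!bom || extractBom(reencoded)) return reencoded
	return Buffer.concat([bom, reencoded])
}
```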

}
}
console.warn(`No encoding detected, falling back to utf8`)
encoding = "utf8"
Member

@daniel-lxs daniel-lxs Sep 24, 2025

Fallback binary check calls isBinaryFile with a Buffer. Some versions expect (buffer, size). To avoid false negatives, pass the buffer length explicitly.

* @param filePath Path to the file
* @returns Promise<boolean> true if file is binary, false if it's text
*/
export async function isBinaryFileWithEncodingDetection(filePath: string): Promise<boolean> {
Member

@daniel-lxs daniel-lxs Sep 24, 2025

Returning true on read error causes callers like mentions/index.ts to treat files as binary and skip them. Prefer a conservative false here and let callers handle read errors explicitly.

mockFs.readFile.mockResolvedValue(buffer)
mockPath.extname.mockReturnValue(".txt")
mockJschardet.detect.mockReturnValue({
encoding: "utf8",
Member

Add test coverage for BOM preservation (UTF-8 BOM and UTF-16 LE/BE) and for read-lines sampling (bytesRead slicing). This prevents regressions in the critical paths introduced here.

@aheizi aheizi marked this pull request as ready for review September 26, 2025 16:20
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 26, 2025
@kendev-11

Just checking in on this PR. Is there an estimated timeline for when it might be ready to merge? I’m planning to rely on this feature soon, so I’d appreciate it if you could let me know. Thanks for your great work on this!

…mpt encoding detection and then determine the binary file
- Add jschardet dependency for better encoding detection
- Improve error handling in readLines with explicit stream error handling
- Update isBinaryFile calls to include buffer length parameter
- Enhance encoding tests with BOM preservation and error cases
- Fix binary file detection to return false on read errors
@aheizi aheizi force-pushed the fix/file-encoding-detection branch from 898b437 to ef0a5fa Compare November 13, 2025 12:52

roomote bot commented Nov 13, 2025


Good progress! The latest commit fixed 2 of 3 issues. One issue remains:

  • Fix bytesRead handling in read-lines.ts encoding detection sampling
  • Add explicit error handlers for both streams in read-lines.ts pipe operation
  • Preserve BOM (Byte Order Mark) when writing files with encoding preservation

Comment on lines 60 to 66
.then((fileHandle) => {
const sampleBuffer = Buffer.alloc(65536)
return fileHandle
.read(sampleBuffer, 0, sampleBuffer.length, 0)
.then(() => sampleBuffer)
.finally(() => fileHandle.close())
})
Contributor

Encoding sample ignores bytesRead. If fewer than 64 KB are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.

Suggested change
.then((fileHandle) => {
const sampleBuffer = Buffer.alloc(65536)
return fileHandle
.read(sampleBuffer, 0, sampleBuffer.length, 0)
.then(({ bytesRead }) => sampleBuffer.subarray(0, bytesRead))
.finally(() => fileHandle.close())
})


Comment on lines +82 to 93
} else {
// For non-native encodings, create streams and handle errors explicitly
const sourceStream = createReadStream(filepath)
const decodeStream = iconv.decodeStream(encoding)

// Handle errors from both streams
sourceStream.on("error", reject)
decodeStream.on("error", reject)

// Use pipe but with explicit error handling
input = sourceStream.pipe(decodeStream)
}
Contributor

Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.

Suggested change
} else {
// For non-native encodings, create streams and handle errors explicitly
const sourceStream = createReadStream(filepath)
const decodeStream = iconv.decodeStream(encoding)
// Handle errors from both streams
sourceStream.on("error", reject)
decodeStream.on("error", reject)
// Use pipe but with explicit error handling
input = sourceStream.pipe(decodeStream)
}


Comment on lines +120 to +133
export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> {
// Detect original file encoding
const originalEncoding = await detectFileEncoding(filePath)

// If original file is UTF-8 or does not exist, write directly
if (originalEncoding === "utf8") {
await fs.writeFile(filePath, content, "utf8")
return
}

// Convert UTF-8 content to original file encoding
const encodedBuffer = iconv.encode(content, originalEncoding)
await fs.writeFile(filePath, encodedBuffer)
}
Contributor

BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.


Comment on lines +107 to +110
} catch (error) {
// File read error, return false to let callers handle read errors explicitly
return false
}
Contributor

Returning false on read error causes callers like mentions/index.ts to treat unreadable files as text and skip them silently. Prefer throwing the error or returning true to let callers handle read errors explicitly.

Suggested change
} catch (error) {
// File read error, return false to let callers handle read errors explicitly
return false
}
} catch (error) {
// File read error, throw to let callers handle explicitly
throw error
}


…ead was not properly handled when reading files

Labels

enhancement New feature or request PR - Draft / In Progress size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: PR [Draft / In Progress]

Development

Successfully merging this pull request may close these issues.

4 participants