feat: Added file encoding detection and reading functions #6841

aheizi · 2025-08-08T09:15:11Z

Related GitHub Issue

Roo Code Task Context (Optional)

Description

Roo Code currently reads every file as UTF-8, which handles files in other encodings (ISO-8859-1, GBK, etc.) poorly. This PR fixes that.

Test Procedure

Read files in encodings other than UTF-8 (ISO-8859-1, Shift-JIS, etc.).
Write to those files and verify that the original encoding is preserved.

Pre-Submission Checklist

Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
Scope: My changes are focused on the linked issue (one major feature/fix per PR).
Self-Review: I have performed a thorough self-review of my code.
Testing: New and/or updated tests have been added to cover my changes (if applicable).
Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

before:

after:

Documentation Updates

Does this PR necessitate updates to user-facing documentation?

No documentation updates are required.
Yes, documentation updates are required. (Please describe what needs to be updated or link to a PR in the docs repository).

Additional Notes

Get in Touch

aheizi

Important

Adds file encoding detection and handling for reading and writing files, ensuring correct processing of various encodings and preserving original encoding on write.

Behavior:
- Introduces readFileWithEncodingDetection and writeFileWithEncodingPreservation in encoding.ts to handle file reading and writing with encoding detection and preservation.
- Replaces isBinaryFile with isBinaryFileWithEncodingDetection in readFileTool.ts, searchAndReplaceTool.ts, and writeToFileTool.ts for binary file detection with encoding consideration.
- Updates DiffViewProvider.ts to use new encoding functions for reading and writing files.
Tests:
- Adds tests for encoding detection and preservation in encoding.spec.ts.
- Updates existing tests in DiffViewProvider.spec.ts, extract-text-large-files.spec.ts, and read-lines.spec.ts to mock new encoding functions.
Dependencies:
- Adds iconv-lite and jschardet to package.json for encoding detection and conversion.

This description was created for 898b437. You can customize this summary. It will automatically update as commits are pushed.

roomote

Thank you for your contribution! I've reviewed the changes and found several issues that need attention. The encoding detection feature is a valuable addition, but there are some critical issues around testing and error handling that should be addressed.

roomote · 2025-08-08T09:19:46Z

src/utils/encoding.ts

+
+	const encoding = await detectEncoding(buffer, fileExtension)
+	return iconv.decode(buffer, encoding)
+}


This new utility file needs dedicated unit tests. Could you add comprehensive test coverage for both functions? This is critical functionality that should be thoroughly tested with various encoding scenarios.

roomote · 2025-08-08T09:19:47Z

src/utils/encoding.ts

+		encoding = detected
+	} else if (detected && detected.encoding) {
+		// Check confidence level, use default encoding if too low
+		if (detected.confidence < 0.7) {


Is the 0.7 confidence threshold intentional? Could you either make this configurable or add a comment explaining why this specific value was chosen? Different use cases might benefit from different thresholds.

Yes, the 0.7 confidence threshold was intentional — it’s a conservative value we’ve found to work well for our current use case where UTF-8 is the dominant encoding, and we prefer to fall back rather than risk mis-decoding.

Different thresholds have different trade-offs:

≥0.9 – very strict, only accept when the detection is almost certain; reduces false positives but may fall back too often.

~0.7 – balanced, common for general text processing where UTF-8 fallback is safe.

<0.5 – more permissive, higher coverage but greater risk of mis-decoding.
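For illustration, a minimal sketch of how the threshold could be made configurable while keeping 0.7 as the default. The names `DetectionResult`, `DEFAULT_MIN_CONFIDENCE`, and `chooseEncoding` are hypothetical and not part of this PR; only the `encoding` and `confidence` fields come from the diff above.

```typescript
// Hypothetical shape of a detector result; it mirrors the `encoding` and
// `confidence` fields used in the diff above.
interface DetectionResult {
	encoding: string | null
	confidence: number
}

// Assumed default, taken from the value discussed in this thread.
const DEFAULT_MIN_CONFIDENCE = 0.7

// Returns the detected encoding only when the detector is confident enough;
// otherwise returns the fallback encoding.
function chooseEncoding(
	detected: DetectionResult | null,
	minConfidence: number = DEFAULT_MIN_CONFIDENCE,
	fallback: string = "utf8",
): string {
	if (!detected || !detected.encoding) {
		return fallback
	}
	return detected.confidence < minConfidence ? fallback : detected.encoding
}
```

With the default, a GBK guess at confidence 0.65 falls back to UTF-8; a caller that passes `0.5` as the second argument would accept the same guess.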

roomote · 2025-08-08T09:19:47Z

src/integrations/misc/read-lines.ts

@@ -22,7 +25,7 @@ const outOfRangeError = (filepath: string, n: number) => {
 * @throws {RangeError} If line numbers are invalid or out of range
 */
 export function readLines(filepath: string, endLine?: number, startLine?: number): Promise<string> {
-	return new Promise((resolve, reject) => {
+	return new Promise(async (resolve, reject) => {


Using an async function as the Promise executor is an anti-pattern that can lead to unhandled promise rejections. Consider refactoring this to avoid it.
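One possible shape for that refactor, as a sketch only: make the outer function `async` and keep `new Promise` for the callback-style part alone, with a synchronous executor. `detectEncodingOf`, `waitForStream`, and `readLinesSketch` are hypothetical stand-ins, not the PR's functions.

```typescript
// Hypothetical stand-in for the asynchronous encoding detection step.
async function detectEncodingOf(filepath: string): Promise<string> {
	if (filepath === "") {
		throw new RangeError("empty path")
	}
	return "utf8"
}

// Callback-style work stays in a small Promise whose executor is synchronous.
function waitForStream(encoding: string): Promise<string> {
	return new Promise((resolve) => {
		setTimeout(() => resolve(`decoded as ${encoding}`), 0)
	})
}

// Because this function is async, an error thrown by any awaited step rejects
// the returned promise instead of being lost inside an async executor.
async function readLinesSketch(filepath: string): Promise<string> {
	const encoding = await detectEncodingOf(filepath)
	return waitForStream(encoding)
}
```

The point of the split is that nothing is awaited inside a `new Promise(...)` executor, so every failure has exactly one path to the caller.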

roomote · 2025-08-08T09:19:47Z

src/integrations/misc/read-lines.ts

+			const fileHandle = await open(filepath, 'r');
+			const sampleBuffer = Buffer.alloc(65536);
+			await fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0);
+			await fileHandle.close();


The file handle should be closed in a finally block so that it is always closed, even if an error occurs.
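A minimal sketch of that pattern, assuming the sampling is pulled into a helper. `readSample` is a hypothetical name; the 65536-byte sample size matches the diff above. The sketch also returns only the bytes actually read.

```typescript
import { open } from "node:fs/promises"

// Reads up to `sampleSize` bytes from the start of a file for encoding detection.
async function readSample(filepath: string, sampleSize: number = 65536): Promise<Buffer> {
	const fileHandle = await open(filepath, "r")
	try {
		const sampleBuffer = Buffer.alloc(sampleSize)
		const { bytesRead } = await fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0)
		// Keep only the bytes that were read so zero padding is not sampled.
		return sampleBuffer.subarray(0, bytesRead)
	} finally {
		// Runs whether the read succeeded or threw.
		await fileHandle.close()
	}
}
```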

roomote · 2025-08-08T09:19:47Z

src/core/tools/multiApplyDiffTool.ts

@@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""}
 					if (isPreventFocusDisruptionEnabled) {
 						// Direct file write without diff view or opening the file
 						cline.diffViewProvider.editType = "modify"
-						cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8")
+						cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath)


I noticed the file is read multiple times (here and line 560). Could we optimize this by reading the file once and reusing the content? This would improve performance, especially for large files.

Good review. Let me make some revisions.

Let's prioritize consistency with the existing behavior first. Reading the file repeatedly can be addressed as an optimization in a follow-up issue.

roomote · 2025-08-08T09:19:47Z

src/utils/encoding.ts

+
+	// 3. Verify if the encoding is supported by iconv-lite
+	if (!iconv.encodingExists(encoding)) {
+		console.warn(`Unsupported encoding detected: ${encoding}, falling back to utf8`)


When encoding detection fails or falls back to UTF-8, it would be helpful to include the originally detected encoding in the warning message. This would aid in debugging encoding issues.

daniel-lxs

Thank you @aheizi for addressing the previous review. I've reviewed the changes and noticed that files are being read with encoding detection but still written back in UTF-8. I think we should ensure the encoding is preserved during write operations to avoid corrupting non-UTF-8 files. The DiffViewProvider and other components that write files need to be updated to use the detected encoding. Also, some of the previous review comments about tests and the Promise pattern are still pending.

daniel-lxs · 2025-08-12T00:39:09Z

src/utils/encoding.ts

+	} else if (detected && detected.encoding) {
+		originalEncoding = detected.encoding
+		// Check confidence level, use default encoding if too low
+		if (detected.confidence < 0.7) {


Could we add a comment explaining why 0.7 was chosen as the threshold? Based on your explanation in the PR comments, perhaps something like:

Suggested change

-		if (detected.confidence < 0.7) {
+		// Check confidence level, use default encoding if too low
+		// 0.7 is a conservative threshold that works well when UTF-8 is the dominant encoding
+		// and we prefer to fall back rather than risk mis-decoding
+		if (detected.confidence < 0.7) {

daniel-lxs · 2025-08-12T00:39:09Z

src/utils/encoding.ts

+
+	const encoding = await detectEncoding(buffer, fileExtension)
+	return iconv.decode(buffer, encoding)
+}


Since this is core functionality, would it be helpful to add comprehensive test coverage? We could test:

Various encodings (UTF-8, GBK, ISO-8859-1, Shift-JIS)

Binary file detection

Confidence threshold behavior

Fallback scenarios

Edge cases (empty files, very small files)

Perhaps create a test file at src/utils/tests/encoding.spec.ts?

daniel-lxs · 2025-08-12T00:39:09Z

src/utils/encoding.ts

+ * @param filePath Path to the file
+ * @returns File content as string
+ */
+export async function readFileWithEncodingDetection(filePath: string): Promise<string> {


I think we need a corresponding writeFileWithEncodingPreservation function that stores and uses the detected encoding when writing files back. Otherwise, files will be corrupted when written in UTF-8 regardless of their original encoding.

daniel-lxs · 2025-08-12T00:39:09Z

src/core/tools/multiApplyDiffTool.ts

@@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""}
 					if (isPreventFocusDisruptionEnabled) {
 						// Direct file write without diff view or opening the file
 						cline.diffViewProvider.editType = "modify"
-						cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8")
+						cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath)


Should we also ensure the encoding is preserved when the file is written back? The DiffViewProvider might need to track and use the original encoding when saving files.

aheizi · 2025-08-12T07:59:26Z

@daniel-lxs Thanks for your comment! About file write encoding, here are my thoughts:

For most file writes, we use the VSCode API (vscode.workspace.fs.writeFile), which automatically handles encoding detection and preservation. So we don't need to do anything extra in these cases.
Only the "Directly save content to a file without showing diff view" scenario (used in the preventFocusDisruption experiment) requires us to handle encoding manually. For this, we use our own writeFileWithEncodingPreservation function to ensure the original file encoding is preserved.
Configuration files and new files are always written in UTF-8 (which is standard).

daniel-lxs · 2025-08-14T17:34:48Z

Hi @aheizi, thanks for working on this important feature! I've been testing the PR with various encodings to make sure it covers different use cases.

I tried creating test files in ISO-8859-1, Shift-JIS, GBK, EUC-KR, Windows-1252, and UTF-16. Some worked great (Windows-1252 and UTF-16), but others showed as "Binary file - content not displayed". Also, when I modified a Windows-1252 file, it seemed to get converted to UTF-8.

Could you take a look? My tests might be wrong, or I might be missing something in how the feature works. I just want to make sure this works correctly for all the encodings mentioned in the issue.

If you need help reproducing my tests or want me to try something specific, let me know! Really appreciate your work on fixing this encoding issue - it's been a pain point for many users.

aheizi · 2025-08-15T11:48:14Z

Hi @daniel-lxs, I’ve found the cause of the issue. The current isBinaryFile check misclassifies some text encodings; for example, Windows-1252 is being identified as a binary file. I’ll update the logic to first detect the encoding and then determine whether the file is binary.
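The ordering described here can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the detector is passed in as a hypothetical `Detector` callback, and a plain NUL-byte check stands in for the isBinaryFile fallback.

```typescript
// Hypothetical detector result; the real detector's return type is not assumed.
interface Detection {
	encoding: string | null
	confidence: number
}
type Detector = (bytes: Uint8Array) => Detection

function looksBinary(bytes: Uint8Array, detect: Detector, minConfidence: number = 0.7): boolean {
	// Empty files are treated as text.
	if (bytes.length === 0) {
		return false
	}
	// Step 1: a confident encoding match means the file is text.
	const result = detect(bytes)
	if (result.encoding !== null && result.confidence >= minConfidence) {
		return false
	}
	// Step 2: only then fall back to a byte-level heuristic.
	// Assumption: a NUL byte marks binary content in this stand-in check.
	return bytes.indexOf(0) !== -1
}
```

Running detection first matters for encodings such as UTF-16, whose text legitimately contains NUL bytes and would fail a byte-level check on its own.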

aheizi · 2025-08-15T11:53:37Z

Besides, I found that cline did the same: https://github.com/cline/cline/blob/f4828d3344f6a8d8b76d19e214edd583ca946882/src/integrations/misc/extract-text.ts#L11-L26

hannesrudolph · 2025-09-23T04:05:24Z

Let's fix these merge issues and move this to needs review.

daniel-lxs · 2025-09-24T19:18:39Z

There are a lot of failing checks here; these must be fixed before we can decide if this should be merged.

daniel-lxs

Found issues that need attention.

daniel-lxs · 2025-09-24T19:26:18Z

src/integrations/misc/read-lines.ts

+		open(filepath, 'r')
+			.then(fileHandle => {
+				const sampleBuffer = Buffer.alloc(65536);
+				return fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0)


Encoding sample ignores bytesRead. If fewer than 64KB bytes are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.

daniel-lxs · 2025-09-24T19:26:19Z

src/integrations/misc/read-lines.ts

+				// Choose decoding method based on native support
+				let input: NodeJS.ReadableStream;
+				if (nodeEncodings.includes(encoding.toLowerCase())) {
+					input = createReadStream(filepath, { encoding: encoding as BufferEncoding });


Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.
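A sketch of the pipeline option, using the callback form of `pipeline` from `node:stream` so the decoded stream can still be handed to a line reader. `openDecodedStream` and its `onError` parameter are hypothetical; any `Transform` can stand in for the stream returned by `iconv.decodeStream`.

```typescript
import { createReadStream } from "node:fs"
import { pipeline, Transform } from "node:stream"

// Connects a file read stream to a decoding Transform and reports a failure
// from either side through a single callback.
function openDecodedStream(
	filepath: string,
	decodeStream: Transform,
	onError: (error: Error) => void,
): Transform {
	const sourceStream = createReadStream(filepath)
	// pipeline() forwards an error from any stream in the chain to the callback
	// and destroys the remaining streams, which a bare pipe() does not do.
	pipeline(sourceStream, decodeStream, (error) => {
		if (error) {
			onError(error)
		}
	})
	return decodeStream
}
```

In `readLines`, `onError` would be the Promise's `reject`, so a missing or unreadable file surfaces as a rejection rather than a stalled stream.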

daniel-lxs · 2025-09-24T19:26:19Z

src/utils/encoding.ts

+ * @returns Promise<void>
+ */
+export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> {
+	// Detect original file encoding


BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.

daniel-lxs · 2025-09-24T19:26:19Z

src/utils/encoding.ts

+			}
+		}
+		console.warn(`No encoding detected, falling back to utf8`)
+		encoding = "utf8"


Fallback binary check calls isBinaryFile with a Buffer. Some versions expect (buffer, size). To avoid false negatives, pass the buffer length explicitly.

daniel-lxs · 2025-09-24T19:26:19Z

src/utils/encoding.ts

+ * @param filePath Path to the file
+ * @returns Promise<boolean> true if file is binary, false if it's text
+ */
+export async function isBinaryFileWithEncodingDetection(filePath: string): Promise<boolean> {


Returning true on read error causes callers like mentions/index.ts to treat files as binary and skip them. Prefer a conservative false here and let callers handle read errors explicitly.

daniel-lxs · 2025-09-24T19:26:19Z

src/utils/__tests__/encoding.spec.ts

+			mockFs.readFile.mockResolvedValue(buffer)
+			mockPath.extname.mockReturnValue(".txt")
+			mockJschardet.detect.mockReturnValue({
+				encoding: "utf8",


Add test coverage for BOM preservation (UTF-8 BOM and UTF-16 LE/BE) and for read-lines sampling (bytesRead slicing). This prevents regressions in the critical paths introduced here.

kendev-11 · 2025-11-12T00:58:59Z

Just checking in on this PR. Is there an estimated timeline for when it might be ready to merge? I’m planning to rely on this feature soon, so I’d appreciate it if you could let me know. Thanks for your great work on this!

- Add jschardet dependency for better encoding detection
- Improve error handling in readLines with explicit stream error handling
- Update isBinaryFile calls to include buffer length parameter
- Enhance encoding tests with BOM preservation and error cases
- Fix binary file detection to return false on read errors

roomote · 2025-11-13T12:53:15Z

Good progress! The latest commit fixed 2 of 3 issues. One issue remains:

Fix bytesRead handling in read-lines.ts encoding detection sampling
Add explicit error handlers for both streams in read-lines.ts pipe operation
Preserve BOM (Byte Order Mark) when writing files with encoding preservation

Previous reviews

ef0a5fa: Review #1

roomote · 2025-11-13T12:55:42Z

src/integrations/misc/read-lines.ts

+			.then((fileHandle) => {
+				const sampleBuffer = Buffer.alloc(65536)
+				return fileHandle
+					.read(sampleBuffer, 0, sampleBuffer.length, 0)
+					.then(() => sampleBuffer)
+					.finally(() => fileHandle.close())
+			})


Encoding sample ignores bytesRead. If fewer than 64KB bytes are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.

Suggested change

 			.then((fileHandle) => {
 				const sampleBuffer = Buffer.alloc(65536)
 				return fileHandle
 					.read(sampleBuffer, 0, sampleBuffer.length, 0)
-					.then(() => sampleBuffer)
+					.then(({ bytesRead }) => sampleBuffer.subarray(0, bytesRead))
 					.finally(() => fileHandle.close())
 			})

roomote · 2025-11-13T12:56:13Z

src/integrations/misc/read-lines.ts

+				} else {
+					// For non-native encodings, create streams and handle errors explicitly
+					const sourceStream = createReadStream(filepath)
+					const decodeStream = iconv.decodeStream(encoding)
+
+					// Handle errors from both streams
+					sourceStream.on("error", reject)
+					decodeStream.on("error", reject)
+
+					// Use pipe but with explicit error handling
+					input = sourceStream.pipe(decodeStream)
 				}


Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.

Suggested change

 				} else {
 					// For non-native encodings, create streams and handle errors explicitly
 					const sourceStream = createReadStream(filepath)
 					const decodeStream = iconv.decodeStream(encoding)

 					// Handle errors from both streams
 					sourceStream.on("error", reject)
 					decodeStream.on("error", reject)

 					// Use pipe but with explicit error handling
 					input = sourceStream.pipe(decodeStream)
 				}

roomote · 2025-11-13T12:56:48Z

src/utils/encoding.ts

+export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> {
+	// Detect original file encoding
+	const originalEncoding = await detectFileEncoding(filePath)
+
+	// If original file is UTF-8 or does not exist, write directly
+	if (originalEncoding === "utf8") {
+		await fs.writeFile(filePath, content, "utf8")
+		return
+	}
+
+	// Convert UTF-8 content to original file encoding
+	const encodedBuffer = iconv.encode(content, originalEncoding)
+	await fs.writeFile(filePath, encodedBuffer)
+}


BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.
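One way BOM preservation could look, sketched on raw bytes so it does not depend on how the content is encoded. `detectBom` and `withPreservedBom` are hypothetical helpers; only the UTF-8 and UTF-16 LE/BE marks are handled, and UTF-32 is deliberately left out.

```typescript
// Byte order marks recognized by this sketch.
const BOMS: ReadonlyArray<{ name: string; bytes: number[] }> = [
	{ name: "utf8", bytes: [0xef, 0xbb, 0xbf] },
	{ name: "utf16le", bytes: [0xff, 0xfe] },
	{ name: "utf16be", bytes: [0xfe, 0xff] },
]

// Returns the BOM found at the start of `original`, or an empty array.
function detectBom(original: Uint8Array): Uint8Array {
	for (const bom of BOMS) {
		const matches =
			original.length >= bom.bytes.length && bom.bytes.every((byte, i) => original[i] === byte)
		if (matches) {
			return original.subarray(0, bom.bytes.length)
		}
	}
	return new Uint8Array(0)
}

// Prepends the original file's BOM to newly encoded content, unless the
// encoder already emitted the same BOM.
function withPreservedBom(original: Uint8Array, encoded: Uint8Array): Uint8Array {
	const bom = detectBom(original)
	const alreadyPresent =
		bom.length > 0 && encoded.length >= bom.length && bom.every((byte, i) => encoded[i] === byte)
	if (bom.length === 0 || alreadyPresent) {
		return encoded
	}
	const output = new Uint8Array(bom.length + encoded.length)
	output.set(bom, 0)
	output.set(encoded, bom.length)
	return output
}
```

In `writeFileWithEncodingPreservation` this would mean reading the first bytes of the existing file before overwriting it and passing the encoded buffer through `withPreservedBom` before the write.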

roomote · 2025-11-13T12:57:27Z

src/utils/encoding.ts

+	} catch (error) {
+		// File read error, return false to let callers handle read errors explicitly
+		return false
+	}


Returning false on read error causes callers like mentions/index.ts to treat unreadable files as text and skip them silently. Prefer throwing the error or returning true to let callers handle read errors explicitly.

Suggested change

 	} catch (error) {
-		// File read error, return false to let callers handle read errors explicitly
-		return false
+		// File read error, throw to let callers handle explicitly
+		throw error
 	}

aheizi requested review from cte, jr and mrubens as code owners August 8, 2025 09:15

github-project-automation bot added this to Roo Code Roadmap and Roo Code Roadmap Aug 8, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Aug 8, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Aug 8, 2025

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Aug 8, 2025

hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 8, 2025

roomote bot reviewed Aug 8, 2025

aheizi force-pushed the fix/file-encoding-detection branch from f05c733 to 44b75e6 Compare August 8, 2025 09:40

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 11, 2025

daniel-lxs reviewed Aug 12, 2025

daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 12, 2025

hannesrudolph added PR - Changes Requested and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Aug 12, 2025

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Aug 12, 2025

daniel-lxs moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Aug 12, 2025

hannesrudolph added PR - Needs Preliminary Review and removed PR - Changes Requested labels Aug 12, 2025

daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Aug 14, 2025

hannesrudolph added PR - Changes Requested and removed PR - Needs Preliminary Review labels Aug 14, 2025

hannesrudolph moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Sep 23, 2025

hannesrudolph added PR - Needs Preliminary Review and removed PR - Changes Requested labels Sep 23, 2025

aheizi force-pushed the fix/file-encoding-detection branch from 2fc7033 to ca03a90 Compare September 23, 2025 16:34

daniel-lxs marked this pull request as draft September 24, 2025 19:18

daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Sep 24, 2025

daniel-lxs reviewed Sep 24, 2025

hannesrudolph added PR - Draft / In Progress and removed PR - Needs Preliminary Review labels Sep 24, 2025

aheizi marked this pull request as ready for review September 26, 2025 16:20

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 26, 2025

aheizi added 4 commits November 12, 2025 17:30

feat: Added file encoding detection and reading functions

f5ef6b0

Refactor: use writeFileWithEncodingPreservation save the file to save…

c7c0018

… encoding

refactor: Reconstruct the file encoding detection logic to first atte…

cd41603

…mpt encoding detection and then determine the binary file

aheizi force-pushed the fix/file-encoding-detection branch from 898b437 to ef0a5fa Compare November 13, 2025 12:52

roomote bot reviewed Nov 13, 2025

fix(File Reading): Fixed the issue where the actual number of bytes r…

48a1c62

…ead was not properly handled when reading files
