-
Notifications
You must be signed in to change notification settings - Fork 2.6k
feat: Added file encoding detection and reading functions #6841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution! I've reviewed the changes and found several issues that need attention. The encoding detection feature is a valuable addition, but there are some critical issues around testing and error handling that should be addressed.
src/utils/encoding.ts
Outdated
|
|
||
| const encoding = await detectEncoding(buffer, fileExtension) | ||
| return iconv.decode(buffer, encoding) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new utility file needs dedicated unit tests. Could you add comprehensive test coverage for both and functions? This is critical functionality that should be thoroughly tested with various encoding scenarios.
| encoding = detected | ||
| } else if (detected && detected.encoding) { | ||
| // Check confidence level, use default encoding if too low | ||
| if (detected.confidence < 0.7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the 0.7 confidence threshold intentional? Could you either make this configurable or add a comment explaining why this specific value was chosen? Different use cases might benefit from different thresholds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the 0.7 confidence threshold was intentional — it’s a conservative value we’ve found to work well for our current use case where UTF-8 is the dominant encoding, and we prefer to fall back rather than risk mis-decoding.
Different thresholds have different trade-offs:
- ≥0.9 – very strict, only accept when the detection is almost certain; reduces false positives but may fall back too often.
- ~0.7 – balanced, common for general text processing where UTF-8 fallback is safe.
- <0.5 – more permissive, higher coverage but greater risk of mis-decoding.
src/integrations/misc/read-lines.ts
Outdated
| @@ -22,7 +25,7 @@ const outOfRangeError = (filepath: string, n: number) => { | |||
| * @throws {RangeError} If line numbers are invalid or out of range | |||
| */ | |||
| export function readLines(filepath: string, endLine?: number, startLine?: number): Promise<string> { | |||
| return new Promise((resolve, reject) => { | |||
| return new Promise(async (resolve, reject) => { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using in a Promise constructor is an anti-pattern that can lead to unhandled promise rejections. Consider refactoring this to avoid the anti-pattern:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
src/integrations/misc/read-lines.ts
Outdated
| const fileHandle = await open(filepath, 'r'); | ||
| const sampleBuffer = Buffer.alloc(65536); | ||
| await fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0); | ||
| await fileHandle.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file handle should be closed in a finally block to ensure it's always closed even if an error occurs. Consider:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| @@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""} | |||
| if (isPreventFocusDisruptionEnabled) { | |||
| // Direct file write without diff view or opening the file | |||
| cline.diffViewProvider.editType = "modify" | |||
| cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8") | |||
| cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed is called multiple times for the same file (here and line 560). Could we optimize this by reading the file once and reusing the content? This would improve performance, especially for large files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Review. Let me make some revisions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's prioritize consistency with the past first. The issue of reading the file repeatedly can be addressed as an optimization for the next issue.
src/utils/encoding.ts
Outdated
|
|
||
| // 3. Verify if the encoding is supported by iconv-lite | ||
| if (!iconv.encodingExists(encoding)) { | ||
| console.warn(`Unsupported encoding detected: ${encoding}, falling back to utf8`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When encoding detection fails or falls back to UTF-8, it would be helpful to include the originally detected encoding in the warning message. This would aid in debugging encoding issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
f05c733 to
44b75e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @aheizi for addressing the previous review. I've reviewed the changes and noticed that files are being read with encoding detection but still written back in UTF-8. I think we should ensure the encoding is preserved during write operations to avoid corrupting non-UTF-8 files. The DiffViewProvider and other components that write files need to be updated to use the detected encoding. Also, some of the previous review comments about tests and the Promise pattern are still pending.
| } else if (detected && detected.encoding) { | ||
| originalEncoding = detected.encoding | ||
| // Check confidence level, use default encoding if too low | ||
| if (detected.confidence < 0.7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment explaining why 0.7 was chosen as the threshold? Based on your explanation in the PR comments, perhaps something like:
| if (detected.confidence < 0.7) { | |
| // Check confidence level, use default encoding if too low | |
| // 0.7 is a conservative threshold that works well when UTF-8 is the dominant encoding | |
| // and we prefer to fall back rather than risk mis-decoding | |
| if (detected.confidence < 0.7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/utils/encoding.ts
Outdated
|
|
||
| const encoding = await detectEncoding(buffer, fileExtension) | ||
| return iconv.decode(buffer, encoding) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is core functionality, would it be helpful to add comprehensive test coverage? We could test:
- Various encodings (UTF-8, GBK, ISO-8859-1, Shift-JIS)
- Binary file detection
- Confidence threshold behavior
- Fallback scenarios
- Edge cases (empty files, very small files)
Perhaps create a test file at src/utils/tests/encoding.spec.ts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| * @param filePath Path to the file | ||
| * @returns File content as string | ||
| */ | ||
| export async function readFileWithEncodingDetection(filePath: string): Promise<string> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a corresponding writeFileWithEncodingPreservation function that stores and uses the detected encoding when writing files back. Otherwise, files will be corrupted when written in UTF-8 regardless of their original encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| @@ -591,7 +592,7 @@ ${errorDetails ? `\nTechnical details:\n${errorDetails}\n` : ""} | |||
| if (isPreventFocusDisruptionEnabled) { | |||
| // Direct file write without diff view or opening the file | |||
| cline.diffViewProvider.editType = "modify" | |||
| cline.diffViewProvider.originalContent = await fs.readFile(absolutePath, "utf-8") | |||
| cline.diffViewProvider.originalContent = await readFileWithEncodingDetection(absolutePath) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also ensure the encoding is preserved when the file is written back? The DiffViewProvider might need to track and use the original encoding when saving files.
|
@daniel-lxs Thanks for your comment! About file write encoding, here are my thoughts:
|
|
Hi @aheizi, thanks for working on this important feature! I've been testing the PR with various encodings to make sure it covers different use cases. I tried creating test files in ISO-8859-1, Shift-JIS, GBK, EUC-KR, Windows-1252, and UTF-16. Some worked great (Windows-1252 and UTF-16), but others showed as "Binary file - content not displayed". Also, when I modified a Windows-1252 file, it seemed to get converted to UTF-8. Could you take a look? My tests might be wrong, or I might be missing something in how the feature works. I just want to make sure this works correctly for all the encodings mentioned in the issue. If you need help reproducing my tests or want me to try something specific, let me know! Really appreciate your work on fixing this encoding issue - it's been a pain point for many users. |
|
Hi @daniel-lxs, I’ve found the cause of the issue. The current |
|
Besides, I found that cline did the same: https://github.com/cline/cline/blob/f4828d3344f6a8d8b76d19e214edd583ca946882/src/integrations/misc/extract-text.ts#L11-L26 |
|
Let's fix these merge issue an move this to needs review |
2fc7033 to
ca03a90
Compare
|
There's a lot of failing checks here, these must be fixed before we can decide if this should be merged. |
daniel-lxs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found issues that need attention.
src/integrations/misc/read-lines.ts
Outdated
| open(filepath, 'r') | ||
| .then(fileHandle => { | ||
| const sampleBuffer = Buffer.alloc(65536); | ||
| return fileHandle.read(sampleBuffer, 0, sampleBuffer.length, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encoding sample ignores bytesRead. If fewer than 64KB bytes are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/integrations/misc/read-lines.ts
Outdated
| // Choose decoding method based on native support | ||
| let input: NodeJS.ReadableStream; | ||
| if (nodeEncodings.includes(encoding.toLowerCase())) { | ||
| input = createReadStream(filepath, { encoding: encoding as BufferEncoding }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.
| * @returns Promise<void> | ||
| */ | ||
| export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> { | ||
| // Detect original file encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.
| } | ||
| } | ||
| console.warn(`No encoding detected, falling back to utf8`) | ||
| encoding = "utf8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fallback binary check calls isBinaryFile with a Buffer. Some versions expect (buffer, size). To avoid false negatives, pass the buffer length explicitly.
| * @param filePath Path to the file | ||
| * @returns Promise<boolean> true if file is binary, false if it's text | ||
| */ | ||
| export async function isBinaryFileWithEncodingDetection(filePath: string): Promise<boolean> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning true on read error causes callers like mentions/index.ts to treat files as binary and skip them. Prefer a conservative false here and let callers handle read errors explicitly.
| mockFs.readFile.mockResolvedValue(buffer) | ||
| mockPath.extname.mockReturnValue(".txt") | ||
| mockJschardet.detect.mockReturnValue({ | ||
| encoding: "utf8", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add test coverage for BOM preservation (UTF-8 BOM and UTF-16 LE/BE) and for read-lines sampling (bytesRead slicing). This prevents regressions in the critical paths introduced here.
|
Just checking in on this PR. Is there an estimated timeline for when it might be ready to merge? I’m planning to rely on this feature soon, so I’d appreciate it if you could let me know. Thanks for your great work on this! |
…mpt encoding detection and then determine the binary file
- Add jschardet dependency for better encoding detection - Improve error handling in readLines with explicit stream error handling - Update isBinaryFile calls to include buffer length parameter - Enhance encoding tests with BOM preservation and error cases - Fix binary file detection to return false on read errors
898b437 to
ef0a5fa
Compare
Good progress! The latest commit fixed 2 of 3 issues. One issue remains:
Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues. |
| .then((fileHandle) => { | ||
| const sampleBuffer = Buffer.alloc(65536) | ||
| return fileHandle | ||
| .read(sampleBuffer, 0, sampleBuffer.length, 0) | ||
| .then(() => sampleBuffer) | ||
| .finally(() => fileHandle.close()) | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encoding sample ignores bytesRead. If fewer than 64KB bytes are read, the tail of the buffer remains zeroed, skewing detection for small files. Use the actual bytesRead slice.
| .then((fileHandle) => { | |
| const sampleBuffer = Buffer.alloc(65536) | |
| return fileHandle | |
| .read(sampleBuffer, 0, sampleBuffer.length, 0) | |
| .then(() => sampleBuffer) | |
| .finally(() => fileHandle.close()) | |
| }) | |
| .then((fileHandle) => { | |
| const sampleBuffer = Buffer.alloc(65536) | |
| return fileHandle | |
| .read(sampleBuffer, 0, sampleBuffer.length, 0) | |
| .then(({ bytesRead }) => sampleBuffer.subarray(0, bytesRead)) | |
| .finally(() => fileHandle.close()) | |
| }) |
Fix it with Roo Code or mention @roomote and request a fix.
| } else { | ||
| // For non-native encodings, create streams and handle errors explicitly | ||
| const sourceStream = createReadStream(filepath) | ||
| const decodeStream = iconv.decodeStream(encoding) | ||
|
|
||
| // Handle errors from both streams | ||
| sourceStream.on("error", reject) | ||
| decodeStream.on("error", reject) | ||
|
|
||
| // Use pipe but with explicit error handling | ||
| input = sourceStream.pipe(decodeStream) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential dropped stream errors. When piping to iconv.decodeStream, upstream createReadStream errors are not guaranteed to emit on the decoded stream with pipe(). Use stream/promises.pipeline or attach error handlers to both streams.
| } else { | |
| // For non-native encodings, create streams and handle errors explicitly | |
| const sourceStream = createReadStream(filepath) | |
| const decodeStream = iconv.decodeStream(encoding) | |
| // Handle errors from both streams | |
| sourceStream.on("error", reject) | |
| decodeStream.on("error", reject) | |
| // Use pipe but with explicit error handling | |
| input = sourceStream.pipe(decodeStream) | |
| } | |
| } else { | |
| // For non-native encodings, create streams and handle errors explicitly | |
| const sourceStream = createReadStream(filepath) | |
| const decodeStream = iconv.decodeStream(encoding) | |
| // Handle errors from both streams | |
| sourceStream.on("error", reject) | |
| decodeStream.on("error", reject) | |
| // Use pipe but with explicit error handling | |
| input = sourceStream.pipe(decodeStream) | |
| } |
Fix it with Roo Code or mention @roomote and request a fix.
| export async function writeFileWithEncodingPreservation(filePath: string, content: string): Promise<void> { | ||
| // Detect original file encoding | ||
| const originalEncoding = await detectFileEncoding(filePath) | ||
|
|
||
| // If original file is UTF-8 or does not exist, write directly | ||
| if (originalEncoding === "utf8") { | ||
| await fs.writeFile(filePath, content, "utf8") | ||
| return | ||
| } | ||
|
|
||
| // Convert UTF-8 content to original file encoding | ||
| const encodedBuffer = iconv.encode(content, originalEncoding) | ||
| await fs.writeFile(filePath, encodedBuffer) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BOM not preserved. Files with UTF-8/UTF-16 BOM will lose it after writeFileWithEncodingPreservation, altering file semantics for some tools. Preserve original BOM when re-encoding.
Fix it with Roo Code or mention @roomote and request a fix.
| } catch (error) { | ||
| // File read error, return false to let callers handle read errors explicitly | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning false on read error causes callers like mentions/index.ts to treat unreadable files as text and skip them silently. Prefer throwing the error or returning true to let callers handle read errors explicitly.
| } catch (error) { | |
| // File read error, return false to let callers handle read errors explicitly | |
| return false | |
| } | |
| } catch (error) { | |
| // File read error, throw to let callers handle explicitly | |
| throw error | |
| } |
Fix it with Roo Code or mention @roomote and request a fix.
…ead was not properly handled when reading files
Related GitHub Issue
#3555
Roo Code Task Context (Optional)
Description
Roo-code currently reads files directly in UTF-8 encoding, which is not very friendly to files with other encodings such as ISO-8859-1, GBK, etc. This PR fixes this issue.
Test Procedure
Pre-Submission Checklist
Screenshots / Videos
before:

after:

Documentation Updates
Does this PR necessitate updates to user-facing documentation?
Additional Notes
Get in Touch
aheizi
Important
Adds file encoding detection and handling for reading and writing files, ensuring correct processing of various encodings and preserving original encoding on write.
readFileWithEncodingDetectionandwriteFileWithEncodingPreservationinencoding.tsto handle file reading and writing with encoding detection and preservation.isBinaryFilewithisBinaryFileWithEncodingDetectioninreadFileTool.ts,searchAndReplaceTool.ts, andwriteToFileTool.tsfor binary file detection with encoding consideration.DiffViewProvider.tsto use new encoding functions for reading and writing files.encoding.spec.ts.DiffViewProvider.spec.ts,extract-text-large-files.spec.ts, andread-lines.spec.tsto mock new encoding functions.iconv-liteandjschardettopackage.jsonfor encoding detection and conversion.This description was created by
for 898b437. You can customize this summary. It will automatically update as commits are pushed.