file encoding issue, files get many diffs just after saving. #7580

masriomarm · 2023-07-02T14:34:04Z

masriomarm
Jul 2, 2023

Summary

When reading a file that contains Western European characters such as 'Función', Helix reads it as 'Funci�n'. It also overwrites those characters when saving, which generates many diffs, since there are many more words like that. Vim/NeoVim/Emacs don't cause this issue, but, vscode does.

Reverting the file with hx doesn't help since hx can't write the right characters. Once the file has been broken by Helix, I have to revert it with version control, or, if I have the same file open in nvim, undo the hx save from nvim.

All editors' encoding is set to utf-8. I can't provide the file since it isn't my property.

Reproduction Steps

I tried this:

Run hx, then pick a file to work with.

I expected this to happen:
When saving the file with no changes made, the file shouldn't contain any diffs.
Instead, this happened:
Since Helix can't read the characters correctly, it produces many diffs.
For example, the word también is converted to 'tambi�n'. It means "also" in Spanish.
Likewise, Helix reads Función as Funci�n.

The word tambi�n is actually t a m b i & # 6 5 5 3 3 ; n without the spaces (written with spaces here because otherwise it gets rendered). Maybe that could be a clue.
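
For reference, & # 6 5 5 3 3 ; is the decimal form of U+FFFD, the Unicode replacement character. The Python sketch below shows one way such a character can arise and why a save with no edits still changes the file. It assumes, purely for illustration, that the file is stored as windows-1252 and decoded as UTF-8; the thread has not confirmed the real encoding at this point, and Python's codecs are not Helix's decoder.

```python
# Hypothetical illustration: windows-1252 bytes decoded as UTF-8.
original_bytes = "también".encode("cp1252")  # b'tambi\xe9n'

# 0xE9 followed by 'n' is not a valid UTF-8 sequence, so a lossy decoder
# substitutes U+FFFD (decimal 65533) for the offending byte.
decoded = original_bytes.decode("utf-8", errors="replace")

# Writing the buffer back as UTF-8 stores the replacement character,
# so the bytes on disk no longer match the original file.
saved_bytes = decoded.encode("utf-8")

print(repr(decoded))
print(ord(decoded[5]))
print(saved_bytes != original_bytes)
```

Once the replacement character has been written, the original byte cannot be recovered from the saved file, which matches the observation that reverting inside the editor does not help.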

Helix log

~/.cache/helix/helix.log

please provide a copy of `~/.cache/helix/helix.log` here if possible, you may need to redact some of the lines

Platform

Windows

Terminal Emulator

Windows Terminal

Helix Version

helix-term 23.03 (3cf0372) Blaž Hrastnik [email protected] A post-modern text editor.

pascalkuthe · 2023-07-02T15:04:48Z

pascalkuthe
Jul 2, 2023
Maintainer

I can't reproduce this with a new file, so this is unactionable unless you can provide an existing file where it is reproducible.

masriomarm · 2023-07-02T15:57:15Z

masriomarm
Jul 2, 2023
Author

I understand, that's why I'm trying to add as much info as I can.

Opening the file in vanilla Notepad, the encoding seems to be ANSI. When I saved the file with utf-8 encoding, Helix could read the file normally.

But in Helix I can't change the encoding to ANSI: when trying :encoding ANSI or :encoding ansi it reports an unknown encoding.
Changing the encoding to windows-1252 - which is ascii - doesn't solve the issue.

How could I read the ANSI encoding correctly with helix?

pascalkuthe · 2023-07-02T16:24:18Z

pascalkuthe
Jul 2, 2023
Maintainer

We probably don't detect windows-1252 correctly. There is no way to manually set the encoding while reading a file with helix; it's always auto-detected. Once the file has been read into memory it's always converted to UTF-8, so at that point the encoding information is already lost. When you change the encoding with :encoding ... you are just changing the output encoding, so that doesn't help.

I guess the autodetection fails somehow here. However, again saving a file containing también as windows-1252 and then opening that with helix works just fine. So I would need a reproducible example to test.

kirawi · 2023-07-02T16:54:21Z

kirawi
Jul 2, 2023
Collaborator

I think following up with :reload will correctly parse the file under the encoding.

masriomarm · 2023-07-02T17:02:01Z

masriomarm
Jul 2, 2023
Author

again saving a file containing también as windows-1252 and then opening that with helix works just fine

ANSI encoding is the issue, not Windows-1252.

You should try saving the file as ANSI and then opening it with helix.

pascalkuthe · 2023-07-02T17:21:20Z

pascalkuthe
Jul 2, 2023
Maintainer

There is no such thing as ANSI encoding. ANSI is a colloquial name that usually refers to windows-1252 or windows-1254 for historic reasons.

masriomarm · 2023-07-02T18:07:17Z

masriomarm
Jul 2, 2023
Author

Thanks for correcting me. Windows Notepad didn't provide any details beyond `ANSI` regarding the file encoding. I will check whether I can provide more details about the text causing the issue. Regards,

masriomarm · 2023-07-05T13:22:58Z

masriomarm
Jul 5, 2023
Author

I think following up with :reload will correctly parse the file under the encoding.

Doesn't help. Same corrupted characters.

pascalkuthe · 2023-07-05T13:28:47Z

pascalkuthe
Jul 5, 2023
Maintainer

Could you try to produce a minimal file which contains no confidential information that you can upload? At this point we are just guessing. Without a reproduction case there isn't much we can do here.

masriomarm · 2023-07-07T08:34:15Z

masriomarm
Jul 7, 2023
Author

@pascalkuthe I tried to share the file, but here is what happened.

The file is a .c file, which can't be attached through GitHub.
I tried to rename the file to .txt, but then helix could read it normally. It didn't reproduce.

I tried to zip the file, but the file extracted from the archive doesn't seem to reproduce the issue.

I ran xxd on both files, the original and the renamed one, and the contents seem to match.

What do you suggest for sharing the file untouched by Windows?
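
As an alternative to comparing xxd dumps by eye, a byte-for-byte check can report the first offset at which two files differ. This is only a sketch; the function names `first_difference` and `compare_files` are made up for illustration, and the file paths passed in are whatever copies are being compared.

```python
from pathlib import Path
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_files(path_a: str, path_b: str) -> Optional[int]:
    """Compare two files as raw bytes, with no decoding involved."""
    return first_difference(Path(path_a).read_bytes(), Path(path_b).read_bytes())
```

If `compare_files` returns None for the original and the renamed copy, the two files hold identical bytes, so any difference in how an editor displays them would have to come from something other than the file contents.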

pascalkuthe · 2023-07-07T16:44:42Z

pascalkuthe
Jul 7, 2023
Maintainer

You can upload it somewhere else, Google Drive or whatever. Renaming a file should, however, never change its contents. How did you change the file extension? You should just do mv foo.c foo.txt. Don't save it under a different name with some kind of editor. I actually have a guess about what is causing the problems now. Saving a file as utf-16be in helix and then opening it causes the weird spaces to appear (since the encoding is detected as windows-1252):

and reloading that file while forcing utf-8 (:encoding utf-8, :reload) causes the invalid characters to appear:

Vim/NeoVim/Emacs don't cause this issue, but, vscode does.

What exactly do you mean with this? Does VSCode also display these weird characters, or do you mean that creating a file with it and then opening it with helix causes these issues? It's possible that vscode is saving as utf-16 by default, while those other editors definitely won't.
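
The two misreadings described in this comment can be approximated in Python. This is a sketch using Python's codecs, not Helix's detector, so it only illustrates the byte-level effect: UTF-16BE text read as windows-1252 shows a NUL between letters, and the same bytes forced through UTF-8 yield the replacement character.

```python
text = "también"

# The explicit "-be" codec writes big-endian code units and no BOM.
utf16be_bytes = text.encode("utf-16-be")

# Misread as windows-1252: every other byte is NUL, which terminals
# tend to show as odd gaps between the letters.
as_cp1252 = utf16be_bytes.decode("cp1252")

# Misread as UTF-8: the lone 0xE9 byte is invalid and becomes U+FFFD.
as_utf8 = utf16be_bytes.decode("utf-8", errors="replace")

print(repr(as_cp1252))
print(repr(as_utf8))
```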

masriomarm · 2023-07-07T20:11:32Z

masriomarm
Jul 7, 2023
Author

Glad you could reproduce. These characters are what I am getting when attempting to read the file in helix.

How did you change the file extension?

I used nvim, :save newfile.txt.
I also tried to copy the file and paste it.
I have no mv utility on Windows, and I didn't try to rename the file either.

What exactly do you mean with this?

The file can be read normally in Vim, Neovim and Emacs. But VScode behaves like helix, reading those awkward chars and writing them back when saving, which corrupts the file and generates many diffs.

Does VSCode also display these weird characters

Yes, exactly. I stopped using VScode long ago since it was generating this issue with many files.

It's possible that vscode is saving as utf-16 by default

Could be, since I think it auto detects encoding and I think it was windows-1252. Not really sure though.

pascalkuthe · 2023-07-07T20:32:05Z

pascalkuthe
Jul 7, 2023
Maintainer

I used nvim, :save newfile.txt.
I also tried to copy the file and paste it.
I have no mv utility on Windows, and I didn't try to rename the file either.

this is not the same thing, try to rename the file in windows explorer. Nvim defaults to utf8 for new files.

Your problem is that your files are encoded using utf-16be; I have no idea why. Basically everybody uses utf-8 (or maybe some variant). Even worse, you are using utf-16 files without a BOM. Both helix and VSCode use encoding detection that is compliant with web standards (the encoding detection used in helix is also more or less used in firefox, vscode is based on chrome). Utf-16 is only detected if it has a proper BOM; there is no other autodetection for it.

Vim and emacs have that kind of autodetection for historical reasons, but it's a pain to implement, and considering that utf-16 is basically dead as a format for plaintext files (and when it's used it normally has a BOM), it's not worth implementing.

I would advise you to convert all your files to utf8 or add a BOM marker. You can also manually set the encoding to :encoding utf-16be and then :reload your file
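
A minimal sketch of the conversion advice, written in Python rather than taken from Helix. The helpers `sniff_bom` and `convert_to_utf8` are hypothetical names; the fallback codec is an assumption the caller must supply, because a file without a BOM does not say what its encoding is. UTF-32 BOMs are not handled here. Working on a copy or under version control is advisable, since the file is rewritten in place.

```python
import codecs
from pathlib import Path
from typing import Optional

# BOM prefixes mapped to Python codecs that strip the BOM while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec
    return None


def convert_to_utf8(path: Path, fallback: str) -> None:
    """Rewrite a file as BOM-less UTF-8.

    `fallback` is the codec assumed when no BOM is present, for example
    "cp1252" or "utf-16-be". Decoding is strict, so an unsuitable guess
    can raise UnicodeDecodeError instead of silently corrupting text.
    """
    data = path.read_bytes()
    codec = sniff_bom(data) or fallback
    text = data.decode(codec)
    # Bytes in, bytes out: line endings are left exactly as they were.
    path.write_bytes(text.encode("utf-8"))
```

For example, `convert_to_utf8(Path("foo.c"), "cp1252")` would treat a BOM-less foo.c as windows-1252. A strict decode is not a guarantee that the guess is right: windows-1252 accepts almost any byte sequence, so the result should still be checked by eye.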

masriomarm · 2023-07-07T21:28:08Z

masriomarm
Jul 7, 2023
Author

Will consider your recommendations and report back in 2 days. Thanks. Regards,

masriomarm · 2023-07-09T07:38:41Z

masriomarm
Jul 9, 2023
Author

this is not the same thing, try to rename the file in windows explorer

Renaming the file by simply pressing F2 on Windows didn't reproduce the issue. I can read the chars normally after renaming.

You can also manually set the encoding to :encoding utf-16be and then :reload your file

That resulted in Chinese-like characters.

I would advise you to convert all your files to utf8

This solves the issue; the chars can be read normally now. But I need a solution from inside helix.

masriomarm · 2023-07-09T08:43:32Z

masriomarm
Jul 9, 2023
Author

You can also manually set the encoding to :encoding utf-16be and then :reload your file

I tried windows-1252 then reload instead of utf-16be. It displays the chars normally now and doesn't generate diffs when saving.
I noticed that helix auto-detects some files as windows-1252 and others not. The files that aren't auto-detected as windows-1252 cause the same issue of corrupted character display. I guess the issue is that there are some files whose encoding isn't detected correctly.

Since you said earlier that there's no way to manually set the encoding before reading files, is there a method to automate setting the encoding to windows-1252 and reloading the file? Some sort of hook, like Vim's BufRead. @pascalkuthe Thanks for your support.

pascalkuthe · 2023-07-09T09:01:16Z

pascalkuthe
Jul 9, 2023
Maintainer

this is not the same thing, try to rename the file in windows explorer

Renaming the file by simply pressing F2 on Windows didn't reproduce the issue. I can read the chars normally after renaming.

It's impossible for the file name to affect encoding detection. The encoding detection doesn't even know the file name. If you rename the file back to .c afterwards, does it cause problems? If it does, you can just upload it as .txt and I can rename it after downloading. If it doesn't, then renaming the file outside of helix changes the encoding.

You can also manually set the encoding to :encoding utf-16be and then :reload your file

I tried windows-1252 then reload instead of utf-16be. It displays the chars normally now and doesn't generate diffs when saving. I noticed that helix auto-detects some files as windows-1252 and others not. The files that aren't auto-detected as windows-1252 cause the same issue of corrupted character display. I guess the issue is that there are some files whose encoding isn't detected correctly.

Since you said earlier that there's no way to manually set the encoding before reading files, is there a method to automate setting the encoding to windows-1252 and reloading the file? Some sort of hook, like Vim's BufRead. @pascalkuthe Thanks for your support.

No, there is no way to set the encoding manually and there are no plans to add that. It's a niche use case that isn't worth spending effort on. Autodetection should work; if you can provide a reproducible case where it doesn't, we can look into fixing that, since it doesn't complicate the codebase.

masriomarm · 2023-07-09T09:19:37Z

masriomarm
Jul 9, 2023
Author

DataMan.txt

@pascalkuthe Here's the file. Changing its name seems to change the encoding; changing only the extension preserves the issue.
This file was a .c file. It reproduces the issue now. Hope you can work with it.

Also, here's a screenshot showing the difference between helix on the left and neovim on the right.
