Skip to content

Conversation

rodjuncode
Copy link

Hello! I was trying to train my own vector model, but got decoding errors during the linebreaks removal. I figured out that forcing the encoding (encoding="utf-8") during the file opening would fix it. Added to both input and output files. Also, added "ensure_ascii=False" to the json dump, so the "\u" characters are dumped as human readable chars.

Hope this helps, and thanks for the great work!

Forcing utf 8 guarantees compatibility with non-english languages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant