at least one file's utf-8 encoding is wrong, presumably more? #7

@mlc

Hi, thanks for this excellent work!

I suspect it's not an isolated incident, but don't presently have anything beyond a single anecdote:

  • 夢溪筆談 is valid UTF-8 Chinese text on the Project Gutenberg website.
  • But file 073/07317.txt in the gutenberg-dammit corpus is valid UTF-8 gibberish.
  • If you take the gutenberg-dammit file and convert it from UTF-8 to Latin-1, you end up with a file that chardet identifies as Big5-encoded text. That detection appears to be mostly correct, but the file contains some garbage, so none of the few tools I tried could recode it successfully.
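
The round trip described in the list above can be sketched in Python. This is only an illustration under an assumption: that the corruption came from Big5 bytes being misread as Latin-1 and then saved as UTF-8. I have not confirmed that this is what the corpus build actually did. The function names are hypothetical, and the sample string is synthetic rather than read from 073/07317.txt.

```python
def make_mojibake(text: str) -> bytes:
    """Simulate the suspected corruption (an assumption, not a confirmed cause):
    Big5 bytes misread as Latin-1, then written out as UTF-8."""
    return text.encode("big5").decode("latin-1").encode("utf-8")


def repair_mojibake(data: bytes, errors: str = "replace") -> str:
    """Undo the suspected corruption.

    Step 1: decode the valid-but-gibberish UTF-8.
    Step 2: encode to Latin-1 to recover the raw bytes. This raises
            UnicodeEncodeError if any character is outside Latin-1.
    Step 3: decode those bytes as Big5. With errors="replace", byte
            sequences that are not valid Big5 become U+FFFD instead of
            aborting the whole conversion.
    """
    as_text = data.decode("utf-8")
    raw = as_text.encode("latin-1")
    return raw.decode("big5", errors=errors)


original = "夢溪筆談"
corrupted = make_mojibake(original)
repaired = repair_mojibake(corrupted)
print(repaired == original)
```

Passing errors="replace" is a deliberate choice here: the garbage in the real file would make a strict decode fail, whereas replacement characters mark where the damage is and keep the rest of the text readable.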

Anyway that's the data I have for now…