at least one file's utf-8 encoding is wrong, presumably more?

Hi, thanks for this excellent work!

I suspect the problem described below is not an isolated incident, but so far I have only a single example:

* [夢溪筆談](https://www.gutenberg.org/ebooks/7317) is valid UTF-8 Chinese text on the Project Gutenberg website.
* But file `073/07317.txt` in the `gutenberg-dammit` corpus is valid UTF-8 gibberish.
* If you take the `gutenberg-dammit` file and convert it from UTF-8 to Latin-1, you end up with a file that `chardet` identifies as [Big5](https://en.wikipedia.org/wiki/Big5)-encoded text. That detection appears to be mostly correct, but the file contains some garbage, so none of the few tools I tried could recode it successfully.
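
To make the steps above concrete, here is a minimal Python sketch of the round trip. It assumes the corruption came from Big5 bytes being misread as Latin-1 and then saved as UTF-8, which is only my guess from the symptoms. The function names `make_mojibake` and `repair_mojibake` are hypothetical helpers for illustration, not part of `gutenberg-dammit`.

```python
def make_mojibake(text: str) -> str:
    """Simulate the suspected corruption: Big5 bytes misread as Latin-1."""
    return text.encode("big5").decode("latin-1")


def repair_mojibake(garbled: str, errors: str = "replace") -> str:
    """Undo the suspected corruption.

    Re-encoding as Latin-1 recovers the original bytes, which are then
    decoded as Big5. With errors="replace", byte sequences that are not
    valid Big5 become U+FFFD instead of raising UnicodeDecodeError.
    Note: encode("latin-1") raises UnicodeEncodeError if the garbled text
    contains characters above U+00FF, which would suggest the misreading
    used some other codec (an assumption this sketch does not handle).
    """
    return garbled.encode("latin-1").decode("big5", errors=errors)


title = "夢溪筆談"
garbled = make_mojibake(title)   # looks like gibberish, yet is valid Unicode
restored = repair_mojibake(garbled)
print(restored == title)
```

For a real file, the same idea would be to read it as UTF-8, pass the text through `repair_mojibake`, and look for U+FFFD in the result to locate the garbage that stops other tools from recoding it.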

Anyway, that's the data I have for now…

at least one file's utf-8 encoding is wrong, presumably more? #7