Skip to content

detect_encoding modifies the string #1

@gerrywastaken

Description

@gerrywastaken

Hi so firstly really nice work. In some simple testing I was shocked to see this running 140 times faster than charlock_holmes on a 36MB file. I also love the idea of being able to remove a build step and external dependencies from my docker container. :)

Looking at the code I noticed that detect_encoding will actually modify the string in two ways which is probably not what people would expect.

It uses force encoding but then does not set the encoding back to the original state. This makes sense for the punch_encoding method, but I would expect detect_encoding to record the original encoding and then set it back to that before exiting.

Secondly it removes any UTF8 BOM from the string. Again this makes sense for the punch method but probably doesn't for detect. Is it necessary to remove this BOM before running the force_encoding().valid_encoding? checks? I get that you wouldn't want to clone the string so as to avoid an excess memory overhead, but maybe detect encoding should add it back if it has been removed.

Great work btw, and I'm surprised this doesn't have more stars.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions