[Feature Request] UTF-8 text filter #19

@ljdarj


I'm currently working on a Java prototype for tukaani-project/xz#50, and so far the results look pretty good. I chose to convert the text to SCSU because that way I can be sure the conversion is reversible and doesn't require bringing in something like ICU. Here are the results of my tests, using Українська кухня. Підручник from the C library's issue, after first stripping the HTML prologue and epilogue that the Internet Archive stuck in there:

  • UTF-8 file size: 2729604 bytes
  • SCSU file size: 1566123 bytes
  • KOI8-U file size: 1540610 bytes (non-reversible)
  • UTF-8 compressed file size: 425436 bytes
  • SCSU compressed file size: 399020 bytes
  • KOI8-U compressed file size: 394820 bytes (non-reversible)
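
To put the figures above in relative terms, here is a small sketch that computes the percentage saved by each encoding against the UTF-8 baseline. The class and method names are made up for illustration and are not part of the prototype; the byte counts are copied from the list.

```java
final class SizeComparison {
    /** Percentage by which candidate is smaller than baseline. */
    static double savingPercent(long baseline, long candidate) {
        return 100.0 * (baseline - candidate) / baseline;
    }

    public static void main(String[] args) {
        // Sizes in bytes, copied from the list above.
        long utf8Raw = 2_729_604, scsuRaw = 1_566_123, koi8uRaw = 1_540_610;
        long utf8Compressed = 425_436, scsuCompressed = 399_020, koi8uCompressed = 394_820;

        System.out.printf("SCSU vs UTF-8, uncompressed:  %.1f%% smaller%n",
                savingPercent(utf8Raw, scsuRaw));
        System.out.printf("KOI8-U vs UTF-8, uncompressed: %.1f%% smaller%n",
                savingPercent(utf8Raw, koi8uRaw));
        System.out.printf("SCSU vs UTF-8, compressed:    %.1f%% smaller%n",
                savingPercent(utf8Compressed, scsuCompressed));
        System.out.printf("KOI8-U vs UTF-8, compressed:  %.1f%% smaller%n",
                savingPercent(utf8Compressed, koi8uCompressed));
    }
}
```

On these numbers the SCSU conversion makes the compressed output roughly 6.2% smaller than compressing the UTF-8 file directly, and lands within about 1% of the KOI8-U result while staying reversible.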

I still need to polish the code before even considering a draft pull request but so far so good.
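
For anyone wondering about the "(non-reversible)" note on the KOI8-U figures: my assumption is that it refers to a legacy 8-bit charset not being able to represent every character a UTF-8 input may contain. A rough way to check that for any charset is an encode/decode round trip, sketched below. The helper name `roundTrips` is hypothetical and this is not code from the prototype; whether `KOI8-U` is available depends on the JDK, so the demo guards the lookup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripCheck {
    /** True if encoding text with cs and decoding it again yields the same string. */
    static boolean roundTrips(String text, Charset cs) {
        // String.getBytes substitutes the charset's replacement byte for
        // characters it cannot map, so lossy conversions show up as a mismatch.
        byte[] encoded = text.getBytes(cs);
        return new String(encoded, cs).equals(text);
    }

    public static void main(String[] args) {
        // Sample mixing Cyrillic with a Latin accented letter (illustrative only).
        String sample = "Українська кухня: crème";

        System.out.println("UTF-8 round-trips:  " + roundTrips(sample, StandardCharsets.UTF_8));
        if (Charset.isSupported("KOI8-U")) {
            System.out.println("KOI8-U round-trips: "
                    + roundTrips(sample, Charset.forName("KOI8-U")));
        }
    }
}
```

SCSU is not shipped with the JDK, so it is left out of this check; it is defined over the whole Unicode range, which is what makes it usable as a reversible filter here.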
