The need for a practical Unicode code point encoder-decoder and validator module #85

### Scope
Unicode is a very broad and often confusing topic, so it is no wonder that it scares away many developers. Even in some great projects, Unicode support is postponed or given lower priority.

But from experience, Unicode encoding and decoding is actually simple. What makes Unicode really complicated is glyph rendering, and that is not our concern. We should stick to a well-defined encoder-decoder and validator for `unicode codepoints`. Nothing else, period.

### Why we need it in nim-stew
- As our libraries mature, we cannot neglect a recurring issue: "Better support for full range unicode codepoints".
- During development of nim-websock, we discovered a flaw in the Nim stdlib unicode module: its UTF-8 validator is _incorrect_.
- The Nim stdlib unicode module operates on Nim's `string`. In theory that is correct, because Unicode text is text and should be represented by a string. In practice, we often deal with raw bytes coming from the network or from some input stream, and those bytes need to be parsed before we know whether to treat them as bytes or as a string. That is why we need a **practical** module.
- We have collected several faster and more efficient UTF-8/16 converters, decoders, and validators, but they are scattered across many repos.
- Libraries that combine text with encryption and compression, such as a PDF library or a PNG library, need a flexible yet efficient Unicode code point encoder-decoder.
- Among our numerous Nim repos, it is hard to find a Unicode-aware library. So far I can only find:
   - https://github.com/status-im/nim-toml-serialization, which supports the full range of UTF-8 because the spec mandates it.
   - https://github.com/status-im/nim-websock, which has a full-range UTF-8 validator because the test suite we use, Autobahn, has extensive test cases for UTF-8.
   - https://github.com/status-im/nim-json-serialization, which has only partial support in the reader and no support in the writer. That support is limited to escaped code points and does not cover binary encoding.
   - https://github.com/status-im/nim-graphql, which is aware of Unicode, but the official spec is messy and not yet finished regarding Unicode code points.
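
To make the "validator over raw bytes" point concrete, here is a minimal sketch of a strict UTF-8 validator that takes bytes rather than a string. It is written in Python purely for illustration and is not a proposed nim-stew API; the function name `is_valid_utf8` is hypothetical. The issue does not say which inputs the stdlib validator mishandles, so the checks below simply cover the usual trouble spots from RFC 3629: overlong forms, surrogates, values above U+10FFFF, and truncated sequences.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 (RFC 3629).

    Hypothetical helper for illustration only.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # Single byte: U+0000..U+007F.
            i += 1
            continue
        # Allowed range for the FIRST continuation byte depends on the lead.
        lo, hi = 0x80, 0xBF
        if 0xC2 <= b0 <= 0xDF:
            # Two bytes. Leads 0xC0 and 0xC1 are excluded: always overlong.
            need = 1
        elif 0xE0 <= b0 <= 0xEF:
            need = 2
            if b0 == 0xE0:
                lo = 0xA0  # reject overlong three-byte forms
            elif b0 == 0xED:
                hi = 0x9F  # reject surrogates U+D800..U+DFFF
        elif 0xF0 <= b0 <= 0xF4:
            need = 3
            if b0 == 0xF0:
                lo = 0x90  # reject overlong four-byte forms
            elif b0 == 0xF4:
                hi = 0x8F  # reject code points above U+10FFFF
        else:
            # Stray continuation byte or invalid lead (0xC0, 0xC1, 0xF5..0xFF).
            return False
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

Because the input is a plain byte sequence, a function like this can run on a network buffer before anyone has decided whether the payload is text at all.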

For these reasons, and with inspiration from other modules in nim-stew, we can definitely craft a better `unicode codepoints` module. This will greatly improve Unicode support in our codebase.

### Remaining obstacle
What is the appropriate name for this module? `unicode` is too broad: we are not dealing with every aspect of Unicode, only the `unicode codepoints` encoding, decoding, and validation often encountered when parsing text and raw bytes.

Candidate: `utf`, because we are dealing with the UTF-8/16/32 codecs.
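
As a rough illustration of how small that scope is, the sketch below encodes a single code point into UTF-8, UTF-16, and UTF-32 code units. It is in Python for illustration only; the names `is_scalar`, `to_utf8`, `to_utf16`, and `to_utf32` are hypothetical and not a proposed API for the module.

```python
def is_scalar(cp: int) -> bool:
    """True for a Unicode scalar value: U+0000..U+10FFFF minus surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def to_utf8(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    if not is_scalar(cp):
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def to_utf16(cp: int) -> list:
    """Encode one code point as 1 or 2 UTF-16 code units."""
    if not is_scalar(cp):
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def to_utf32(cp: int) -> list:
    """Encode one code point as a single UTF-32 code unit."""
    if not is_scalar(cp):
        raise ValueError("not a Unicode scalar value")
    return [cp]
```

All three encoders share the same validity check and differ only in bit shuffling, which is part of why a single codec-focused module name such as `utf` seems to fit.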
