|
1 | 1 | # Unicode Transforms |
2 | | -This is a lightweight library supporting a limited set of unicode |
| 2 | +This is a lightweight Haskell library supporting commonly used unicode |
3 | 3 | transformations (currently only normalizations) on `ByteStrings` (UTF-8) and |
4 | | -`Text` without requiring any other system libraries. |
| 4 | +`Text`. |
5 | 5 |
|
6 | | -This package aims to fill the gap for some common unicode operations not |
7 | | -supported by `text` or any other packages without requiring the heavyweight |
8 | | -`text-icu` package. It aims to provide an API similar to `text-icu`. |
9 | | - |
10 | | -This package is based on the `utf8proc` C utility. The utf8proc version bundled |
11 | | -with this package is taken from the |
12 | | -[xqilla project](http://xqilla.sourceforge.net/HomePage) |
13 | | -(xqilla version 2.3.2). It should not be too difficult to translate this into a |
14 | | -native Haskell package. |
| 6 | +The Haskell package `text-icu` provides a comprehensive set of unicode transforms. |
| 7 | +The drawback of `text-icu` is that it requires you to install the ICU library |
| 8 | +OS packages first. This package is self-contained and aims to provide an API |
| 9 | +similar to `text-icu` so that it can be used as a drop-in replacement for the |
| 10 | +features it supports. |
15 | 11 |
|
16 | 12 | ## Features |
| 13 | +Unicode normalization in the NFC, NFKC, NFD and NFKD forms is supported. This version |
| 14 | +of the library supports unicode versions up to 5.1.0. |
17 | 15 |
|
18 | | -### Normalization |
19 | | -Normalization in NFC, NFKC, NFD, NFKD forms is fully supported and exposed via |
20 | | -an API. |
21 | | - |
22 | | -### Casemapping and Casefolding |
23 | | -The `text` package already provides proper unicode casemapping and casefolding |
24 | | -operations. This package does not aim to expose these though the implementation |
25 | | -is available. |
26 | | - |
27 | | -### Available but not exposed |
28 | | -The following additional features are available but not exposed via an API. If |
29 | | -you need any of those they can be exposed quickly, please raise an issue or |
30 | | -send a pull request. |
31 | | - |
32 | | -* Boundary Analysis (No locale specific handling) |
33 | | -* NLF sequence conversion |
34 | | -* Stripping certain character classes |
35 | | -* Lumping certain characters |
36 | | - |
37 | | -### Missing features |
38 | | - |
39 | | -`text-icu` is a full featured implementation of unicode operations via bindings |
40 | | -to the `icu` libraries. If you do not mind a dependency on the `icu` libraries |
41 | | -(separately installed) or need a comprehensive set of unicode operations then |
42 | | -`text-icu` will be a better choice. |
43 | | - |
44 | | -The following features provided by `text-icu` are missing in this package: |
45 | | -* Normalization checks |
46 | | -* FCD normalization for collation |
47 | | -* String collation |
48 | | -* Iteration |
49 | | -* Regular expressions |
50 | | - |
51 | | -# Haskell Unicode Landscape |
52 | | - |
53 | | -Unicode functionality in Haskell is fragmented across various packages. The |
54 | | -most comprehensive functionality is provided by `text-icu` which is based on |
55 | | -the `icu` C++ libraries. |
56 | | - |
57 | | -* [text-icu](https://stackage.org/lts/package/text-icu) |
58 | | - |
59 | | -## Basic |
60 | | - |
61 | | -* [base](https://www.stackage.org/lts/package/base) Data.Char module |
62 | | -* [charset](https://www.stackage.org/lts/package/charset) Fast unicode character sets |
63 | | - |
64 | | -## Unicode Character Database |
65 | | -* [unicode-properties](https://hackage.haskell.org/package/unicode-properties) Unicode 3.2.0 character properties |
66 | | -* [hxt-charproperties](http://www.stackage.org/lts/package/hxt-charproperties) Character properties and classes for XML and Unicode |
67 | | -* [unicode-names](http://hackage.haskell.org/package/unicode-names) Unicode 3.2.0 character names |
68 | | -* [unicode](https://hackage.haskell.org/package/unicode) Construct and transform unicode characters |
69 | | - |
70 | | -## Unicode Strings |
71 | | -### ByteStrings (UTF8) |
72 | | -* [utf8-string](https://www.stackage.org/lts/package/utf8-string) Support for reading and writing UTF8 Strings |
73 | | -* [utf8-light](https://www.stackage.org/lts/package/utf8-light) Lightweight UTF8 handling |
74 | | -* [hxt-unicode](https://www.stackage.org/lts/package/hxt-unicode) Unicode en-/decoding functions for utf8, iso-latin-\* and other encodings |
75 | | -### Text (UTF16) |
76 | | -* [text](https://www.stackage.org/lts/package/text) An efficient packed Unicode text type |
77 | | -* [text-normal](https://hackage.haskell.org/package/text-normal) Data types for Unicode-normalized text - depends on text-icu |
| 16 | +## Documentation |
| 17 | +Please see the haddock documentation available with the package. |
78 | 18 |
|
79 | | -# Thoughts on package structuring |
| 19 | +## Implementation |
80 | 20 |
|
81 | | -In my opinion, it will be good to consolidate all native haskell packages into |
82 | | -a standard module structure under a minimum number of packages and evolve |
83 | | -those. The following structure in three layers should be enough to cover |
84 | | -unicode handling: |
| 21 | +This package is implemented as bindings to the `utf8proc` C utility. The |
| 22 | +utf8proc version bundled with this package is taken from the [xqilla |
| 23 | +project](http://xqilla.sourceforge.net/HomePage) (xqilla version 2.3.2). |
85 | 24 |
|
86 | | -1. **_unicode-properties_**: A single package for character database with |
87 | | -scripts to update it based on unicode standard database updates. |
88 | | -2. **_unicode-transforms_**: A lightweight native Haskell package for basic unicode |
89 | | -string transforms (normalization, case folding etc.) based on unicode-chars. |
90 | | -Not a replacement for text-icu. |
91 | | -3. **_utf8-string_**: A single UTF8 bytestring package including a normalized |
92 | | -string representation like text-normal |
93 | | -4. **_text_**: Existing text package (UTF16 representation). Include normalized |
94 | | -text (text-normal) in the text package based on the native Haskell |
95 | | -unicode-transforms package |
| 25 | +In the future, the underlying `utf8proc` implementation will be replaced by a |
| 26 | +Haskell implementation supporting the latest unicode versions. |
96 | 27 |
|
97 | | -# Unicode resources |
| 28 | +## Related stuff |
| 29 | +Please see the NOTES.md file shipped with this package for more details on |
| 30 | +related packages, missing features and to-do items. |
98 | 31 |
|
99 | | -* [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/ucd) |
| 32 | +## Contributing |
| 33 | +Contributions are welcome! Please use the github repository at |
| 34 | +https://github.com/harendra-kumar/unicode-transforms to raise issues, request |
| 35 | +features or send pull requests. |
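
The new Features text lists the NFC, NFKC, NFD and NFKD forms but the README gives no usage example. The sketch below shows what a call might look like. It assumes the package exposes a `Data.Text.Normalize` module with a `normalize` function and a `NormalizationMode` type, mirroring `normalize` in `text-icu`; neither name appears in this diff, so check the haddock documentation for the actual API.

```haskell
-- Sketch only. Assumption: Data.Text.Normalize exports `normalize` and
-- `NormalizationMode(..)`, shaped like their text-icu counterparts.
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (..), normalize)

-- "e with acute" as a single precomposed code point (U+00E9).
precomposed :: T.Text
precomposed = T.pack "\x00E9"

-- The same character as "e" followed by COMBINING ACUTE ACCENT (U+0301).
decomposed :: T.Text
decomposed = T.pack "e\x0301"

main :: IO ()
main = do
  -- Without normalization the two spellings compare as different texts.
  print (precomposed == decomposed)
  -- NFC composes the pair into the single code point.
  print (normalize NFC decomposed == precomposed)
  -- NFD splits the single code point into base letter plus combining mark.
  print (normalize NFD precomposed == decomposed)
  -- NFKC also applies compatibility mappings, e.g. the "fi" ligature (U+FB01).
  print (normalize NFKC (T.pack "\xFB01") == T.pack "fi")
```

If the API really does match `text-icu`, switching between the two packages should mostly be a matter of changing the import line.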