Commit 84da52a

Split README into two - README & NOTES
Conflicts: hpack.yaml, unicode-transforms.cabal
1 parent aaa8060 commit 84da52a

3 files changed: 108 additions & 89 deletions


NOTES.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Missing Features
+
+List of unicode transforms that are not available in this package.
+
+## Casemapping and Casefolding
+The `text` package already provides proper unicode casemapping and casefolding
+operations. This package does not aim to expose these though the implementation
+is available.
+
+## Available in utf8proc but not exposed
+The following additional features are available but not exposed via an API. If
+you need any of those they can be exposed quickly, please raise an issue or
+send a pull request.
+
+* Boundary Analysis (No locale specific handling)
+* NLF sequence conversion
+* Stripping certain character classes
+* Lumping certain characters
+
+## Available only in text-icu
+
+`text-icu` is a full featured implementation of unicode operations via bindings
+to the `icu` libraries. If you do not mind a dependency on the `icu` libraries
+(separately installed) or need a comprehensive set of unicode operations then
+`text-icu` will be a better choice.
+
+The following features provided by `text-icu` are missing in this package:
+* Normalization checks
+* FCD normalization for collation
+* String collation
+* Iteration
+* Regular expressions
+
+# Haskell Unicode Landscape
+
+Unicode functionality in Haskell is fragmented across various packages. The
+most comprehensive functionality is provided by `text-icu` which is based on
+the `icu` C++ libraries.
+
+* [text-icu](https://stackage.org/lts/package/text-icu)
+
+## Basic
+
+* [base](https://www.stackage.org/lts/package/base) Data.Char module
+* [charset](https://www.stackage.org/lts/package/charset) Fast unicode character sets
+
+## Unicode Character Database
+* [unicode-properties](https://hackage.haskell.org/package/unicode-properties) Unicode 3.2.0 character properties
+* [hxt-charproperties](http://www.stackage.org/lts/package/hxt-charproperties) Character properties and classes for XML and Unicode
+* [unicode-names](http://hackage.haskell.org/package/unicode-names) Unicode 3.2.0 character names
+* [unicode](https://hackage.haskell.org/package/unicode) Construct and transform unicode characters
+
+## Unicode Strings
+### ByteStrings (UTF8)
+* [utf8-string](https://www.stackage.org/lts/package/utf8-string) Support for reading and writing UTF8 Strings
+* [utf8-light](https://www.stackage.org/lts/package/utf8-light) Lightweight UTF8 handling
+* [hxt-unicode](https://www.stackage.org/lts/package/hxt-unicode) Unicode en-/decoding functions for utf8, iso-latin-\* and other encodings
+### Text (UTF16)
+* [text](https://www.stackage.org/lts/package/text) An efficient packed Unicode text type
+* [text-normal](https://hackage.haskell.org/package/text-normal) Data types for Unicode-normalized text - depends on text-icu
+
+# Thoughts on package structuring
+
+In my opinion, it will be good to consolidate all native haskell packages into
+a standard module structure under a minimum number of packages and evolve
+those. The following structure in three layers should be enough to cover
+unicode handling:
+
+1. **_unicode-properties_**: A single package for character database with
+scripts to update it based on unicode standard database updates.
+2. **_unicode-transforms_**: A lightweight native Haskell package for basic unicode
+string transforms (normalization, case folding etc.) based on unicode-chars.
+Not a replacement for text-icu.
+3. **_utf8-string_**: A single UTF8 bytestring package including a normalized
+string representation like text-normal
+4. **_text_**: Existing text package (UTF16 representation). Include normalized
+text (text-normal) in the text package based on the native Haskell
+unicode-transforms package
+
+# Unicode resources
+
+* [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/ucd)

README.md

Lines changed: 24 additions & 88 deletions
@@ -1,99 +1,35 @@
 # Unicode Transforms
-This is a lightweight library supporting a limited set of unicode
+This is a lightweight Haskell library supporting commonly used unicode
 transformations (currently only normalizations) on `ByteStrings` (UTF-8) and
-`Text` without requiring any other system libraries.
+`Text`.
 
-This package aims to fill the gap for some common unicode operations not
-supported by `text` or any other packages without requiring the heavyweight
-`text-icu` package. It aims to provide an API similar to `text-icu`.
-
-This package is based on the `utf8proc` C utility. The utf8proc version bundled
-with this package is taken from the
-[xqilla project](http://xqilla.sourceforge.net/HomePage)
-(xqilla version 2.3.2). It should not be too difficult to translate this into a
-native Haskell package.
+Haskell package `text-icu` provides a comprehensive set of unicode transforms.
+The drawback of `text-icu` is that it requires you to install the ICU library
+OS packages first. This package is self contained and aims to provide an API
+similar to `text-icu` so that it can be used as a drop-in replacement for the
+features it supports.
 
 ## Features
+Unicode normalization in NFC, NFKC, NFD, NFKD forms is supported. This version
+of the library supports unicode versions upto 5.1.0.
 
-### Normalization
-Normalization in NFC, NFKC, NFD, NFKD forms is fully supported and exposed via
-an API.
-
-### Casemapping and Casefolding
-The `text` package already provides proper unicode casemapping and casefolding
-operations. This package does not aim to expose these though the implementation
-is available.
-
-### Available but not exposed
-The following additional features are available but not exposed via an API. If
-you need any of those they can be exposed quickly, please raise an issue or
-send a pull request.
-
-* Boundary Analysis (No locale specific handling)
-* NLF sequence conversion
-* Stripping certain character classes
-* Lumping certain characters
-
-### Missing features
-
-`text-icu` is a full featured implementation of unicode operations via bindings
-to the `icu` libraries. If you do not mind a dependency on the `icu` libraries
-(separately installed) or need a comprehensive set of unicode operations then
-`text-icu` will be a better choice.
-
-The following features provided by `text-icu` are missing in this package:
-* Normalization checks
-* FCD normalization for collation
-* String collation
-* Iteration
-* Regular expressions
-
-# Haskell Unicode Landscape
-
-Unicode functionality in Haskell is fragmented across various packages. The
-most comprehensive functionality is provided by `text-icu` which is based on
-the `icu` C++ libraries.
-
-* [text-icu](https://stackage.org/lts/package/text-icu)
-
-## Basic
-
-* [base](https://www.stackage.org/lts/package/base) Data.Char module
-* [charset](https://www.stackage.org/lts/package/charset) Fast unicode character sets
-
-## Unicode Character Database
-* [unicode-properties](https://hackage.haskell.org/package/unicode-properties) Unicode 3.2.0 character properties
-* [hxt-charproperties](http://www.stackage.org/lts/package/hxt-charproperties) Character properties and classes for XML and Unicode
-* [unicode-names](http://hackage.haskell.org/package/unicode-names) Unicode 3.2.0 character names
-* [unicode](https://hackage.haskell.org/package/unicode) Construct and transform unicode characters
-
-## Unicode Strings
-### ByteStrings (UTF8)
-* [utf8-string](https://www.stackage.org/lts/package/utf8-string) Support for reading and writing UTF8 Strings
-* [utf8-light](https://www.stackage.org/lts/package/utf8-light) Lightweight UTF8 handling
-* [hxt-unicode](https://www.stackage.org/lts/package/hxt-unicode) Unicode en-/decoding functions for utf8, iso-latin-\* and other encodings
-### Text (UTF16)
-* [text](https://www.stackage.org/lts/package/text) An efficient packed Unicode text type
-* [text-normal](https://hackage.haskell.org/package/text-normal) Data types for Unicode-normalized text - depends on text-icu
+## Documentation
+Please see the haddock documentation available with the package.
 
-# Thoughts on package structuring
+## Implementation
 
-In my opinion, it will be good to consolidate all native haskell packages into
-a standard module structure under a minimum number of packages and evolve
-those. The following structure in three layers should be enough to cover
-unicode handling:
+This package is implemented as bindings to the `utf8proc` C utility. The
+utf8proc version bundled with this package is taken from the [xqilla
+project](http://xqilla.sourceforge.net/HomePage) (xqilla version 2.3.2).
 
-1. **_unicode-properties_**: A single package for character database with
-scripts to update it based on unicode standard database updates.
-2. **_unicode-transforms_**: A lightweight native Haskell package for basic unicode
-string transforms (normalization, case folding etc.) based on unicode-chars.
-Not a replacement for text-icu.
-3. **_utf8-string_**: A single UTF8 bytestring package including a normalized
-string representation like text-normal
-4. **_text_**: Existing text package (UTF16 representation). Include normalized
-text (text-normal) in the text package based on the native Haskell
-unicode-transforms package
+In future the underlying `utf8proc` implementation will get replaced by a
+Haskell implementation supporting the latest unicode versions.
 
-# Unicode resources
+## Related stuff
+Please see the NOTES.md file shipped with this package for more details on
+related packages, missing features and todo etc.
 
-* [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/ucd)
+## Contributing
+Contributions are welcome! Please use the github repository at
+https://github.com/harendra-kumar/unicode-transforms to raise issues, request
+features or send pull requests.
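
Reviewer note on the Features text in the README.md diff: the practical need for the normalization forms it lists (NFC, NFD, NFKC, NFKD) can be shown with a small sketch that uses only `base`. The names `precomposed`, `decomposed` and `codePoints` are made up for this illustration and are not part of this package. The sketch does not call the package's API, which is not visible in this commit. It only shows that two canonically equivalent spellings of "é" compare unequal until one of them is normalized.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- "é" as a single precomposed code point (U+00E9); NFC yields this spelling.
precomposed :: String
precomposed = "\x00E9"

-- "é" as 'e' (U+0065) followed by a combining acute accent (U+0301);
-- NFD yields this spelling.
decomposed :: String
decomposed = "e\x0301"

-- Hypothetical helper: render each character as a hexadecimal code point.
codePoints :: String -> [String]
codePoints = map (\c -> "U+" ++ showHex (ord c) "")

main :: IO ()
main = do
  print (codePoints precomposed)
  print (codePoints decomposed)
  -- The two spellings render identically, yet a plain comparison is False.
  print (precomposed == decomposed)
```

Normalizing both strings to the same form before comparing them is what makes such a comparison reliable. See the package's haddock documentation for the actual function names.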

unicode-transforms.cabal

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,6 @@
 name: unicode-transforms
 version: 0.1.0.0
-synopsis: Unicode transforms (normalization, casefolding etc.)
+synopsis: Unicode transforms (normalization NFC/NFD/NFKC/NFKD)
 description:
 This is a lightweight library supporting a limited set of unicode
 transformations (only normalizations as of now) on ByteStrings (UTF-8) and
@@ -26,6 +26,7 @@ build-type: Simple
 cabal-version: >=1.10
 
 extra-source-files:
+NOTES.md
 README.md
 cbits/utf8proc.h
 cbits/utf8proc_data.h
