Commit fc5a08b

Merge pull request #40 from wismill/feature/general_category
Add General_Category and further predicates
2 parents e3208a4 + 6e68e5f commit fc5a08b

18 files changed, +1100 -40 lines changed

.editorconfig

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# EditorConfig is awesome: https://EditorConfig.org
+
+# top-most EditorConfig file
+root = true
+
+[*]
+indent_style = space
+indent_size = 4
+end_of_line = lf
+charset = utf-8
+trim_trailing_whitespace = true
+insert_final_newline = false

.hlint.ignore

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,10 @@
 lib/Unicode/Internal/Division.hs
 lib/Unicode/Internal/Char/PropList.hs
 lib/Unicode/Internal/Char/DerivedCoreProperties.hs
+lib/Unicode/Internal/Char/UnicodeData/CombiningClass.hs
+lib/Unicode/Internal/Char/UnicodeData/Compositions.hs
+lib/Unicode/Internal/Char/UnicodeData/Decomposable.hs
+lib/Unicode/Internal/Char/UnicodeData/DecomposableK.hs
+lib/Unicode/Internal/Char/UnicodeData/Decompositions.hs
+lib/Unicode/Internal/Char/UnicodeData/DecompositionsK2.hs
+lib/Unicode/Internal/Char/UnicodeData/GeneralCategory.hs

.packcheck.ignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 .packcheck.ignore
+.editorconfig
 .github/workflows/haskell.yml
 appveyor.yml
 stack.yaml

Changelog.md

Lines changed: 17 additions & 0 deletions
@@ -1,5 +1,22 @@
 # Changelog
 
+## 0.3.0 (December 2021)
+
+- Support for big-endian architectures.
+- Added `GeneralCategory` data type and corresponding `generalCategoryAbbr`,
+  `generalCategory` functions.
+- Added the following functions to `Unicode.Char.General`:
+  `isAlphabetic`, `isAlphaNum`,
+  `isControl`, `isMark`, `isPrint`, `isPunctuation`, `isSeparator`,
+  `isSymbol` and `isWhiteSpace`.
+- Added the module `Unicode.Char.Numeric`.
+- **Breaking change:** Changed the behavior of `isLetter` and `isSpace` to match
+  `base`’s `Data.Char` behavior. Moved these functions to the compatibility module
+  `Unicode.Char.General.Compat`. The previous behavior is obtained using
+  `isAlphabetic` and `isWhiteSpace` respectively.
+- Re-exported some functions from `Data.Char` in order to make `Unicode.Char`
+  a drop-in replacement.
+
 ## 0.2.0 (November 2021)
 
 * Update to [Unicode 14.0.0](https://www.unicode.org/versions/Unicode14.0.0/).
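
The breaking change in the 0.3.0 entry hinges on code points where `base`'s predicates and the Unicode properties disagree. The following is a minimal sketch that uses only `Data.Char` from `base`, whose behavior the changelog says the `Compat` functions now match. The three sample code points and the helper `baseView` are chosen for this illustration and are not part of the commit.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isLetter, isSpace)

-- Illustrative code points (an assumption of this sketch, not from the commit).
nextLine, lineSeparator, romanNumeralEight :: Char
nextLine          = '\x0085' -- NEXT LINE: has White_Space, category Cc
lineSeparator     = '\x2028' -- LINE SEPARATOR: has White_Space, category Zl
romanNumeralEight = '\x2167' -- ROMAN NUMERAL EIGHT: Alphabetic, category Nl

-- | What `base` reports for a character: (isSpace, isLetter, generalCategory).
baseView :: Char -> (Bool, Bool, GeneralCategory)
baseView c = (isSpace c, isLetter c, generalCategory c)
```

For all three samples `base` answers `False` to both `isSpace` and `isLetter`, although the first two carry the Unicode `White_Space` property and the third carries `Alphabetic`. Per the changelog entry, that gap is what `isWhiteSpace` and `isAlphabetic` are meant to cover.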

README.md

Lines changed: 86 additions & 0 deletions
@@ -15,6 +15,92 @@ any other packages or use cases.
 
 Please see the haddock documentation for reference documentation.
 
+## Performance
+
+`unicode-data` is up to _5 times faster_ than `base`.
+
+The following benchmark compares the time taken in milliseconds to process all
+the Unicode code points for `base-4.16` and this package (v0.3).
+Machine: 8 × AMD Ryzen 5 2500U on Linux.
+
+```
+All
+  Unicode.Char.Case
+    isLower
+      base: OK (6.59s)
+        26 ms ± 238 μs
+      unicode-data: OK (1.16s)
+        4.5 ms ± 83 μs, 0.17x
+    isUpper
+      base: OK (1.69s)
+        27 ms ± 459 μs
+      unicode-data: OK (1.21s)
+        4.8 ms ± 77 μs, 0.18x
+  Unicode.Char.General
+    generalCategory
+      base: OK (0.92s)
+        131 ms ± 1.5 ms
+      unicode-data: OK (1.62s)
+        108 ms ± 1.2 ms, 0.82x
+    isAlphaNum
+      base: OK (3.28s)
+        26 ms ± 300 μs
+      unicode-data: OK (20.60s)
+        5.0 ms ± 59 μs, 0.19x
+    isControl
+      base: OK (1.61s)
+        26 ms ± 463 μs
+      unicode-data: OK (1.22s)
+        4.8 ms ± 53 μs, 0.19x
+    isMark
+      base: OK (0.80s)
+        26 ms ± 339 μs
+      unicode-data: OK (1.33s)
+        5.2 ms ± 77 μs, 0.20x
+    isPrint
+      base: OK (3.32s)
+        26 ms ± 498 μs
+      unicode-data: OK (1.33s)
+        5.2 ms ± 55 μs, 0.20x
+    isPunctuation
+      base: OK (3.41s)
+        27 ms ± 497 μs
+      unicode-data: OK (2.67s)
+        5.3 ms ± 28 μs, 0.20x
+    isSeparator
+      base: OK (0.84s)
+        27 ms ± 422 μs
+      unicode-data: OK (1.41s)
+        5.5 ms ± 52 μs, 0.21x
+    isSymbol
+      base: OK (1.72s)
+        27 ms ± 443 μs
+      unicode-data: OK (1.45s)
+        5.7 ms ± 112 μs, 0.21x
+  Unicode.Char.General.Compat
+    isAlpha
+      base: OK (3.26s)
+        26 ms ± 254 μs
+      unicode-data: OK (2.66s)
+        5.2 ms ± 48 μs, 0.20x
+    isLetter
+      base: OK (1.70s)
+        27 ms ± 453 μs
+      unicode-data: OK (1.33s)
+        5.2 ms ± 69 μs, 0.19x
+    isSpace
+      base: OK (0.85s)
+        13 ms ± 237 μs
+      unicode-data: OK (1.69s)
+        6.7 ms ± 61 μs, 0.49x
+  Unicode.Char.Numeric
+    isNumber
+      base: OK (1.67s)
+        26 ms ± 316 μs
+      unicode-data: OK (1.32s)
+        5.2 ms ± 91 μs, 0.20x
+```
+
 ## Unicode database version update
 
 To update the Unicode version please update the version number in
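
The `0.17x`-style figures in the README benchmark output are ratios of the `unicode-data` time to the `base` time, so smaller is faster. As a rough worked example of how they relate to the "times faster" claim, here is a small sketch; the helper names `ratio` and `speedup` are hypothetical and only the input figures come from the README.

```haskell
-- | Ratio of a candidate's mean time to the baseline's mean time
-- (both in the same unit), as printed after the "x" in the output.
ratio :: Double -> Double -> Double
ratio candidate baseline = candidate / baseline

-- | Speed-up factor implied by such a ratio.
speedup :: Double -> Double
speedup r = 1 / r

-- | The isLower figures from the README: 4.5 ms against 26 ms.
isLowerRatio :: Double
isLowerRatio = ratio 4.5 26
```

With the `isLower` figures the ratio comes out at roughly 0.17, which corresponds to a speed-up of a little under 6, while `generalCategory` at `0.82x` is only modestly faster than `base`.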

appveyor.yml

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ environment:
 # version.
 #STACKVER: "1.6.5"
 STACK_UPGRADE: "y"
-RESOLVER: "lts-18.17"
+RESOLVER: "lts-18.18"
 STACK_ROOT: "c:\\sr"
 
 # ------------------------------------------------------------------------

bench/Main.hs

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
+import Control.DeepSeq (NFData, deepseq)
+import Data.Ix (Ix(..))
+import Test.Tasty.Bench (Benchmark, bgroup, bench, bcompare, nf, defaultMain)
+
+import qualified Data.Char as B
+import qualified Unicode.Char.Case as C
+import qualified Unicode.Char.General as G
+import qualified Unicode.Char.General.Compat as GC
+import qualified Unicode.Char.Identifiers as I
+import qualified Unicode.Char.Normalization as N
+import qualified Unicode.Char.Numeric as Num
+
+-- | A unit benchmark
+data Bench a = Bench
+    { _title :: !String -- ^ Name
+    , _func :: Char -> a -- ^ Function to benchmark
+    }
+
+main :: IO ()
+main = defaultMain
+    [ bgroup "Unicode.Char.Case"
+        [ bgroup' "isLower"
+            [ Bench "base" B.isLower
+            , Bench "unicode-data" C.isLower
+            ]
+        , bgroup' "isUpper"
+            [ Bench "base" B.isUpper
+            , Bench "unicode-data" C.isUpper
+            ]
+        ]
+    , bgroup "Unicode.Char.General"
+        -- Character classification
+        [ bgroup' "generalCategory"
+            [ Bench "base" (show . B.generalCategory)
+            , Bench "unicode-data" (show . G.generalCategory)
+            ]
+        , bgroup "isAlphabetic"
+            [ benchNF "unicode-data" G.isAlphabetic
+            ]
+        , bgroup' "isAlphaNum"
+            [ Bench "base" B.isAlphaNum
+            , Bench "unicode-data" G.isAlphaNum
+            ]
+        , bgroup' "isControl"
+            [ Bench "base" B.isControl
+            , Bench "unicode-data" G.isControl
+            ]
+        , bgroup' "isMark"
+            [ Bench "base" B.isMark
+            , Bench "unicode-data" G.isMark
+            ]
+        , bgroup' "isPrint"
+            [ Bench "base" B.isPrint
+            , Bench "unicode-data" G.isPrint
+            ]
+        , bgroup' "isPunctuation"
+            [ Bench "base" B.isPunctuation
+            , Bench "unicode-data" G.isPunctuation
+            ]
+        , bgroup' "isSeparator"
+            [ Bench "base" B.isSeparator
+            , Bench "unicode-data" G.isSeparator
+            ]
+        , bgroup' "isSymbol"
+            [ Bench "base" B.isSymbol
+            , Bench "unicode-data" G.isSymbol
+            ]
+        , bgroup "isWhiteSpace"
+            [ benchNF "unicode-data" G.isWhiteSpace
+            ]
+        -- Korean Hangul Characters
+        , bgroup "isHangul"
+            [ benchNF "unicode-data" G.isHangul
+            ]
+        , bgroup "isHangulLV"
+            [ benchNF "unicode-data" G.isHangulLV
+            ]
+        , bgroup "isJamo"
+            [ benchNF "unicode-data" G.isJamo
+            ]
+        , bgroup "jamoLIndex"
+            [ benchNF "unicode-data" G.jamoLIndex
+            ]
+        , bgroup "jamoVIndex"
+            [ benchNF "unicode-data" G.jamoVIndex
+            ]
+        , bgroup "jamoTIndex"
+            [ benchNF "unicode-data" G.jamoTIndex
+            ]
+        ]
+    , bgroup "Unicode.Char.General.Compat"
+        [ bgroup' "isAlpha"
+            [ Bench "base" B.isAlpha
+            , Bench "unicode-data" GC.isAlpha
+            ]
+        , bgroup' "isLetter"
+            [ Bench "base" B.isLetter
+            , Bench "unicode-data" GC.isLetter
+            ]
+        , bgroup' "isSpace"
+            [ Bench "base" B.isSpace
+            , Bench "unicode-data" GC.isSpace
+            ]
+        ]
+    , bgroup "Unicode.Char.Identifiers"
+        [ bgroup "isIDContinue"
+            [ benchNF "unicode-data" I.isIDContinue
+            ]
+        , bgroup "isIDStart"
+            [ benchNF "unicode-data" I.isIDStart
+            ]
+        , bgroup "isXIDContinue"
+            [ benchNF "unicode-data" I.isXIDContinue
+            ]
+        , bgroup "isXIDStart"
+            [ benchNF "unicode-data" I.isXIDStart
+            ]
+        , bgroup "isPatternSyntax"
+            [ benchNF "unicode-data" I.isPatternSyntax
+            ]
+        , bgroup "isPatternWhitespace"
+            [ benchNF "unicode-data" I.isPatternWhitespace
+            ]
+        ]
+    , bgroup "Unicode.Char.Normalization"
+        [ bgroup "isCombining"
+            [ benchNF "unicode-data" N.isCombining
+            ]
+        , bgroup "combiningClass"
+            [ benchNF "unicode-data" N.combiningClass
+            ]
+        , bgroup "isCombiningStarter"
+            [ benchNF "unicode-data" N.isCombiningStarter
+            ]
+        -- [TODO] compose, composeStarters
+        , bgroup "isDecomposable"
+            [ bgroup "Canonical"
+                [ benchNF "unicode-data" (N.isDecomposable N.Canonical)
+                ]
+            , bgroup "Kompat"
+                [ benchNF "unicode-data" (N.isDecomposable N.Kompat)
+                ]
+            ]
+        -- [FIXME] Fail due to non-exhaustive pattern matching
+        -- , bgroup "decompose"
+        --     [ bgroup "Canonical"
+        --         [ benchNF "unicode-data" (N.decompose N.Canonical)
+        --         ]
+        --     , bgroup "Kompat"
+        --         [ benchNF "unicode-data" (N.decompose N.Kompat)
+        --         ]
+        --     ]
+        , bgroup "decomposeHangul"
+            [ benchNF "unicode-data" N.decomposeHangul
+            ]
+        ]
+    , bgroup "Unicode.Char.Numeric"
+        [ bgroup' "isNumber"
+            [ Bench "base" B.isNumber
+            , Bench "unicode-data" Num.isNumber
+            ]
+        ]
+    ]
+  where
+    bgroup' groupTitle bs = bgroup groupTitle
+        [ benchNF' groupTitle title f
+        | Bench title f <- bs
+        ]
+
+    -- [NOTE] Works if groupTitle uniquely identifies the benchmark group.
+    benchNF' groupTitle title = case title of
+        "base" -> benchNF title
+        _ -> bcompare ("$NF == \"base\" && $(NF-1) == \"" ++ groupTitle ++ "\"")
+            . benchNF title
+
+    benchNF :: forall a. (NFData a) => String -> (Char -> a) -> Benchmark
+    benchNF t f = bench t $ nf (fold_ f) (minBound, maxBound)
+
+    fold_ :: forall a. (NFData a) => (Char -> a) -> (Char, Char) -> ()
+    fold_ f = foldr (deepseq . f) () . range
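
The `[NOTE]` on `benchNF'` is easier to follow with the pattern string spelled out. In tasty's awk-like pattern language, `$NF` is the last component of a benchmark's path (its own name) and `$(NF-1)` is the enclosing group, so the pattern picks the `base` entry of the same group as the baseline for `bcompare`. A minimal sketch of just the string construction, using only `base`; the name `baselinePattern` is hypothetical and does not appear in the commit:

```haskell
-- | Pattern selecting the benchmark named "base" inside the group named
-- @groupTitle@, concatenated the same way as in @benchNF'@.
baselinePattern :: String -> String
baselinePattern groupTitle =
    "$NF == \"base\" && $(NF-1) == \"" ++ groupTitle ++ "\""
```

For the group `isLower` this yields `$NF == "base" && $(NF-1) == "isLower"`. Because only the innermost group name is checked, two groups with the same title in different modules would both match, which is the situation the note warns about.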