Commit bd0828f

Add Prior Art section
1 parent dccba12 commit bd0828f

1 file changed

+57
-38
lines changed

proposals/002-text-utf-default.md

Lines changed: 57 additions & 38 deletions
@@ -97,7 +97,9 @@ up to 2x and promote more convergence to use `Text` for all stringy things
9797
(as opposed to binary data).
9898

9999
Modern computer science research is focused on developing faster algorithms
100-
for UTF-8 data, e. g., an ultra-fast JSON decoder [simdjson](https://github.com/simdjson/simdjson). There is much less work on (and demand for) UTF-16 algorithms.
100+
for UTF-8 data, e.g., an ultra-fast JSON decoder
101+
[simdjson](https://github.com/simdjson/simdjson).
102+
There is much less work on (and demand for) UTF-16 algorithms.
101103
Switching `text` to UTF-8 will open a way for us to accommodate and benefit
102104
from future developments in rapid text processing.
103105

@@ -110,7 +112,7 @@ to make another attempt.
110112

111113
# Goals
112114

113-
- Ensure that the new design of `text` and its migration strategy accommodates the needs of community members and industry
115+
- Ensure that the new design of `text` and its migration strategy accommodates the needs of community members and industry.
114116

115117
- Provide an implementation, migration, and delivery plan for changing the default encoding of `text` from UTF-16 to UTF-8.
116118

@@ -133,54 +135,69 @@ to make another attempt.
133135
- The text maintainers
134136

135137
- Xia Li-Yao (lysxia)
136-
138+
- Callan McGill (boarders)
139+
- Travis Athougies (tathougies)
140+
- Matt Parsons (parsonsmatt)
137141
- Emily Pillmore (emilypi)
138-
139142
- Dan Cartwright (chessai)
140143

141-
- Callan McGill (boarders)
142-
143144
- Stakeholders:
144145

145146
- Edward Kmett: has been vocal about his use of `text-icu` and requires it not be broken.
146147

147-
- Ben Gamari: integration with GHC
148+
- Ben Gamari: integration with GHC.
148149

149-
- The Cabal maintainers (fgaz, emilypi, mikolaj): integration with Cabal
150+
- The Cabal maintainers (fgaz, emilypi, mikolaj): integration with Cabal.
150151

151152
Progress will be reported on a weekly basis to the HF Technical Agenda
152153
Track, with Emily as support for Andrew.
153154

154155
# Timeline
155156

156-
We expect that this project will take roughly 6 months to fully
157-
complete: 3-4 months to complete the code implementation, performance
158-
testing, and unit testing, another 1-2 months to integrate with
157+
We expect to finish the bulk of implementation in 3 months,
158+
by the beginning of September, so that GHC can bump its `text`
159+
submodule before the GHC 9.4 feature freeze. The following months leading
160+
up to the GHC release will give us time to integrate with
159161
stakeholders and diagnose any potential issues with the migration.
160-
161-
**Preparation:**
162-
163-
Using the HVR's existing [`text-utf8`](https://github.com/text-utf8) as
164-
a starting point, the following must be done before an implementation is
165-
started:
166-
167-
- Modernize the codebase and clear out the bitrot
168-
169-
- Establish a baseline for performance and any related issues.
170-
171-
- Update testing and performance benchmarks to make use of `inspection-testing` to ensure fusion is not broken in
172-
subsequent UTF-8 related changes.
173-
174-
An MVP should completely preserve standard user-facing API, and not
175-
break fusion. Performance should not significantly diverge from the
176-
existing UTF-16 text package. There will be an expected change to the
177-
exposed Text internals, in which case, breakage should be assessed by
178-
circulating a git commit reference to a release candidate as soon as
179-
possible. This candidate should be shared publicly and loudly.
180-
181-
**Implementation:**
182-
183-
- TBD: There is a straightforward implementation, but this one is left up to Andrew for comment.
162+
The project ends by Christmas 2021.
163+
164+
**Prior art**
165+
166+
The oldest attempt to use UTF-8 in `text` is
167+
[jaspervdj's GSoC](https://github.com/jaspervdj/text/tree/utf8) back in 2011.
168+
The final results can be found
169+
[here](https://jaspervdj.be/posts/2011-08-19-text-utf8-the-aftermath.html).
170+
Significant parts of Jasper's work got merged into `text`, including
171+
a comprehensive benchmark suite and many minor improvements. However,
172+
the 70-commit UTF-8 branch itself was left unmerged and now, after 10 years,
173+
is almost 800 commits behind the `master` branch.
174+
175+
Next comes the [`text-utf8`](https://github.com/text-utf8/text-utf8) package,
176+
which forked from `text` in 2016. Its authors rebased parts of Jasper's work
177+
in a bulk commit and continued from this point onwards,
178+
accumulating \~100 commits on top of it.
179+
The work came to a halt in 2018 and at the moment is \~200 commits behind the `master` branch.
180+
181+
It would be extremely challenging to rebase this work on top of current `text`,
182+
audit and verify decade-old changes, then fix remaining
183+
issues and pass a review. As one can imagine, reviewers are usually not happy
184+
to review 300 commits of vague provenance. Moreover, we discovered that benchmarks
185+
regressed severely in `text-utf8` and that fusion is broken in several places.
186+
It's unclear where exactly the problem lurks there. Finally, we'd like to explore different
187+
approaches to tackle potential performance issues.
188+
189+
We decided that the safest bet is to reimplement the UTF-8 transition from scratch,
190+
paying close attention to tests and benchmarks step by step. This way we'll be able
191+
to gain enough confidence and understanding of the nature of changes, and provide
192+
reviewers with a clean sequence of commits, facilitating a timely merge.
193+
194+
Talking about developments in the wider ecosystem, one must mention
195+
the `text-short` package, which provides a data structure similar in characteristics
196+
to `ShortByteString`, but interpreted as UTF-8 encoded data. It was argued that
197+
this type is worth including in the main `text` package to mirror `ShortByteString`,
198+
exposed from `bytestring`. While such an acquisition is out of scope for this project,
199+
it will be easier to do so when the `text` package itself switches to UTF-8, opening
200+
possibilities for an even better String story in Haskell.
184201

185202
**Compatibility issues**
186203

@@ -208,7 +225,8 @@ they cannot detect build flags of `text` (and thus cannot rely on its internals
208225

209226
Instead we mark a new, UTF-8 release as `text-2.0`, and put out a call for volunteers
210227
to maintain a legacy UTF-16 package. Depending on demand, this could be done either
211-
as a continuation of `text-1.X` series, or as a separate `text-utf16` package. We'll facilitate such community project and will work with Hackage Trustees and Stackage
228+
as a continuation of `text-1.X` series, or as a separate `text-utf16` package. We'll
229+
facilitate such a community project and will work with Hackage Trustees and Stackage
212230
Curators to ensure a timely transition of the ecosystem.
213231

214232
With regard to API compatibility, we intend to keep signatures of non-`Internal`
@@ -230,14 +248,15 @@ Another one is `Data.Text.Foreign`, which is mostly used by `text-icu` library,
230248
which binds to `libicu` for certain Unicode manipulations. `libicu` provides
231249
helpers to convert C strings
232250
[from UTF8 to UTF16](https://unicode-org.github.io/icu/userguide/strings/utf-8.html).
233-
It is up to `text-icu` maintainers either to modify their bindings. We intend
251+
It is up to `text-icu` maintainers to modify their bindings. We intend
234252
to reach out to them as soon as we have an MVP.
235253

236254
Since fixing downstream compatibility issues is up to external counterparties,
237255
most of which are unpaid volunteers, we cannot expect them to do it in a limited
238256
time frame. We are devoted to having a smooth migration story and will provide
239257
as much guidance as possible, but to keep our targets time-bound we cannot tie the success
240-
of this project to actions of third parties. We will not wait for everyone to migrate.
258+
of this project to actions of third parties. We will not block this project
259+
because of unmigrated packages downstream.
241260

242261
To sum up, we plan to:
243262
