proposals/002-text-utf-default.md (76 additions & 10 deletions)
@@ -86,8 +86,8 @@ The very `instance Binary Text` serializes `Text` in UTF-8 encoding.
86
86
If we switch the internal representation of `Text` from UTF-16 to UTF-8,
87
87
all such conversions would be made redundant and we'll be able to just check that
88
88
a `ByteString` is valid UTF-8 (which is most often the case) and copy it into `Text`.
89
-
If in future (see an upcoming "Unifying vector-like types" proposal) `ByteString`
90
-
is backed by unpinned memory, we'd be able to eliminate copying entirely.
89
+
If in future `ByteString` switches to being
90
+
backed by unpinned memory, we'd be able to eliminate copying entirely.
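As a rough illustration of the "check validity, then copy" step described above: the public function `decodeUtf8'` from `Data.Text.Encoding` already has this shape from the caller's point of view. The helper name `fromUtf8` is hypothetical and is not part of `text`; the sketch says nothing about how the library implements the check internally.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: validate a strict 'ByteString' as UTF-8 and
-- turn it into 'Text'. Returns 'Nothing' for malformed input instead
-- of throwing an exception.
fromUtf8 :: B.ByteString -> Maybe T.Text
fromUtf8 bytes = case TE.decodeUtf8' bytes of
  Left _err -> Nothing   -- the input is not valid UTF-8
  Right txt -> Just txt  -- the input is valid and now lives in a 'Text'
```

With a UTF-16 `Text` the `Right` branch implies a transcoding pass; with a UTF-8 `Text` the same call could, in principle, reduce to validation plus a copy.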
91
91
92
92
`Text` is also often used in contexts that involve mostly ASCII characters.
93
93
This often prompts developers to use `ByteString` instead of `Text` to save 2x space
@@ -118,6 +118,8 @@ to make another attempt.
118
118
119
119
- Performance satisfies targets listed below in the "Performance impact" section.
120
120
121
+
- Compatibility story satisfies targets listed below in the "Compatibility issues" section.
122
+
121
123
# People
122
124
123
125
- Performers:
@@ -180,6 +182,70 @@ possible. This candidate should be shared publicly and loudly.
180
182
181
183
- TBD: There is a straightforward implementation, but this one is left up to Andrew for comment.
182
184
185
+
**Compatibility issues**
186
+
187
+
`text` is a very old package, deeply ingrained in the Haskell ecosystem.
188
+
A change of internal representation is necessarily a breaking change.
189
+
Our strategy to tackle compatibility issues is guided by a desire
190
+
to finish this project in a time-bound fashion with realistic expectations
191
+
about available resources.
192
+
193
+
Current `text` HEAD supports GHCs back to GHC 8.0.
194
+
At the moment we do not foresee any blockers
195
+
to keeping compatibility with GHC 8.0 after the UTF-8 transition,
196
+
and we plan to stick to it even if it causes some overhead in CPP.
197
+
However, if we discover that supporting old GHCs causes a significant overhead
198
+
(e.g., a dedicated non-trivial code path, emulating a missing primop
199
+
or working around a bug), we may decide to shrink the compatibility window.
200
+
Such a decision would not be taken lightly, but we believe that getting
201
+
things done for the bright future should not be hindered by old unsupported baggage.
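For a sense of scale, the kind of CPP overhead considered acceptable here might look like the sketch below: a guard of a few lines keyed on the compiler version. The function `joinWithSpace` is hypothetical and not taken from the `text` sources; the only assumption about GHC itself is that `(<>)` has been exported from the Prelude since GHC 8.4.

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical example, not taken from the `text` sources.
#if __GLASGOW_HASKELL__ < 804
-- Before GHC 8.4 (base 4.11) the Prelude does not export (<>),
-- so older compilers need an explicit import.
import Data.Semigroup ((<>))
#endif
import qualified Data.Text as T

-- | Concatenate two texts with a single space in between.
joinWithSpace :: T.Text -> T.Text -> T.Text
joinWithSpace a b = a <> T.singleton ' ' <> b
```

A guard like this costs a few lines. A dedicated code path that emulates a missing primop would be much larger, which is the situation in which shrinking the compatibility window is on the table.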
202
+
203
+
One suggestion to improve the compatibility story was to keep both UTF-16 and UTF-8
204
+
implementations in `text` and switch between them via a Cabal flag. It seems,
205
+
however, that such a strategy will put an undue, indefinitely long burden
206
+
on `text` maintainers, and bring little benefit to downstream packages, because
207
+
they cannot detect build flags of `text` (and thus cannot rely on its internals at all).
208
+
209
+
Instead, we will mark the new UTF-8 release as `text-2.0` and put out a call for volunteers
210
+
to maintain a legacy UTF-16 package. Depending on demand, this could be done either
211
+
as a continuation of the `text-1.X` series, or as a separate `text-utf16` package. We'll facilitate such a community project and will work with Hackage Trustees and Stackage
212
+
Curators to ensure a timely transition of the ecosystem.
213
+
214
+
With regard to API compatibility, we intend to keep the signatures of non-`Internal`
215
+
modules unchanged, except for `Word16` being replaced by `Word8` where appropriate.
216
+
Such a promise unfortunately cannot be made for `Internal` modules,
217
+
due to their nature: even though we'll strive to keep as much untouched as possible,
218
+
the semantics of internal functions is due to change drastically. This kind of breakage
219
+
should not come as a big surprise, because `Internal` modules have a disclaimer about
220
+
an unstable API.
221
+
222
+
There are two places where `text` leaks details of its internal representation.
223
+
The first is `Data.Text.Array`, which provides access to the underlying byte array.
224
+
Not only will its API change from `Word16` to `Word8`, but also the semantics
225
+
of the array switches from UTF-16 to UTF-8. This will cause breakage of several packages
226
+
such as `unicode-transforms` and `unicode-collation`. We intend to communicate with
227
+
the respective maintainers as early as possible to help with the transition.
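To make the semantic change concrete, the sketch below counts how many code units the same text occupies under each encoding, using only the public encoders from `Data.Text.Encoding`. The helper names `utf16Units` and `utf8Units` are hypothetical; the sketch does not touch `Data.Text.Array` itself and only illustrates why offsets computed for one layout are not valid for the other.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: number of 16-bit code units needed to store
-- the text as UTF-16 ('encodeUtf16LE' emits two bytes per code unit).
utf16Units :: T.Text -> Int
utf16Units t = B.length (TE.encodeUtf16LE t) `div` 2

-- | Hypothetical helper: number of 8-bit code units (bytes) needed to
-- store the text as UTF-8.
utf8Units :: T.Text -> Int
utf8Units = B.length . TE.encodeUtf8
```

For ASCII-only input both counts agree, but U+20AC (the euro sign) takes one UTF-16 code unit and three UTF-8 code units, so code that indexes into the underlying array by code unit has to be revisited rather than merely retyped from `Word16` to `Word8`.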
228
+
229
+
The other is `Data.Text.Foreign`, which is mostly used by the `text-icu` library,
230
+
which binds to `libicu` for certain Unicode manipulations. `libicu` provides
231
+
helpers to convert C strings
232
+
[from UTF-8 to UTF-16](https://unicode-org.github.io/icu/userguide/strings/utf-8.html).
233
+
It is up to the `text-icu` maintainers to modify their bindings. We intend
234
+
to reach out to them as soon as we have an MVP.
235
+
236
+
Since fixing downstream compatibility issues is up to external counterparties,
237
+
most of which are unpaid volunteers, we cannot expect them to do it in a limited
238
+
time frame. We are devoted to having a smooth migration story and will provide
239
+
as much guidance as possible, but to keep our targets time-bound we cannot tie the success
240
+
of this project to actions of third parties. We will not wait for everyone to migrate.
241
+
242
+
To sum up, we plan to:
243
+
244
+
* Keep `text` compatible with GHCs back to 8.0, unless it imposes an undue cost (more than 50 lines of code per major release).
245
+
* Keep signatures of non-`Internal` modules compatible modulo the `Word16`/`Word8` change.
246
+
* Provide migration guidance to clients of `Data.Text.{Array,Foreign}`.
247
+
* Facilitate a community project to keep a UTF-16-based legacy fork alive, if there is demand for it.
248
+
183
249
**Performance impact**
184
250
185
251
A common misunderstanding is that switching to UTF-8 makes everything half the size and
@@ -265,20 +331,20 @@ packages that go out of date.
265
331
266
332
- text-2.0.0.0, which will provide a UTF-8 encoding for Text as a default for all versions going forward.
267
333
268
-
- A `text-utf16` package, which is a preservation of the current UTF-16 encoded text, for backwards compatibility.
269
-
270
-
- Updates to the Text Haddocks that reflect the UTF-8 changes
334
+
- Updates to the Text Haddocks that reflect the UTF-8 changes.
271
335
272
336
- Announcements and updates across all Haskell channels covering the following: