@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
21
21
view.
22
22
* ``PyUnicode_Import()``: import a Python str object.
23
23
24
- By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
25
- is copied. See the :ref:`specification <export-complexity>` for cases
26
- when a copy is needed.
24
+ On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
25
+ is copied and no conversion is done.
27
26
28
27
29
28
Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
67
66
the limited C API can only use less efficient code paths and string
68
67
formats.
69
68
70
- For example, the MarkupSafe project has a C extension specialized for
71
- UCS formats for best performance, and so cannot use the limited C
72
- API.
69
+ For example, the `MarkupSafe project
70
+ <https://markupsafe.palletsprojects.com/>`_ has a C extension
71
+ specialized for UCS formats for best performance, and so cannot use the
72
+ limited C API.
73
73
74
74
75
75
Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
95
95
#define PyUnicode_FORMAT_UTF8 0x08 // char*
96
96
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
97
97
98
- #define PyUnicode_EXPORT_ALLOW_COPY 0x10000
99
-
100
98
The ``int32_t`` type is used instead of ``int`` to have a well defined
101
99
type size and not depend on the platform or the compiler.
102
100
See `Avoid C-specific Types
@@ -148,45 +146,21 @@ UCS-2 and UCS-4 use the native byte order.
148
146
*requested_formats* can be a single format or a bitwise combination of the
149
147
formats in the table above.
150
148
On success, the returned format will be set to a single one of the requested
151
- flags.
149
+ formats.
152
150
153
151
Note that future versions of Python may introduce additional formats.
154
152
155
- By default, no memory is copied and no conversion is done.
156
-
157
- If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
158
- *requested_formats*, the function can copy memory to provide the
159
- requested format and convert from a format to another.
160
-
161
- The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
162
- ``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
153
+ No memory is copied and no conversion is done.
163
154
164
- Available flags:
165
-
166
- =============================== =========== ===================================
167
- Flag                            Value       Description
168
- =============================== =========== ===================================
169
- ``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
170
- =============================== =========== ===================================
171
155
172
156
173
157
.. _export-complexity:
174
158
175
159
Export complexity
176
160
-----------------
177
161
178
- By default, an export has a complexity of *O*\ (1): no memory is copied
179
- and no conversion is done. There is an exception: if only UTF-8 is
180
- requested and the UTF-8 cache is not filled, the string is encoded to
181
- UTF-8 to fill the cache.
182
-
183
- If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
184
- copy is needed, *O*\ (*n*) complexity:
185
-
186
- * If only UCS-2 is requested and the native format is UCS-1.
187
- * If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
188
- * If only UTF-8 is requested and the string contains surrogate
189
- characters.
162
+ On CPython, an export has a complexity of *O*\ (1): no memory is copied
163
+ and no conversion is done.
190
164
191
165
To get the best performance on CPython and PyPy, it's recommended to
192
166
support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
241
215
UTF-8 format
242
216
------------
243
217
244
- CPython 3.14 doesn't use the UTF-8 format internally. The format is
245
- provided for compatibility with PyPy which uses UTF-8 natively for
246
- strings. However, in CPython, the encoded UTF-8 string is cached which
247
- makes it convenient to be exported.
218
+ CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
219
+ exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
220
+ can be used instead.
221
+
222
+ The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
223
+ alternate implementations which may use UTF-8 natively for strings.
248
224
249
- On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
250
- formats are preferred.
251
225
252
226
ASCII format
253
227
------------
254
228
255
229
When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
256
- ``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
257
- strings.
230
+ ``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
258
231
259
232
The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
260
- ``PyUnicode_Import()`` to validate that the string only contains ASCII
233
+ ``PyUnicode_Import()`` to validate that a string only contains ASCII
261
234
characters.
262
235
263
236
264
237
Surrogate characters and embedded NUL characters
265
238
------------------------------------------------
266
239
267
- Surrogate characters are allowed: they can be imported and exported. For
268
- example, the UTF-8 format uses the ``surrogatepass`` error handler.
240
+ Surrogate characters are allowed: they can be imported and exported.
269
241
270
242
Embedded NUL characters are allowed: they can be imported and exported.
271
243
@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
391
363
characters.
392
364
393
365
366
+ Conversions on demand
367
+ ---------------------
368
+
369
+ It would be convenient to convert formats on demand. For example,
370
+ convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
371
+ requested.
372
+
373
+ The problem is that most users expect an export to require no memory
374
+ copy and no conversion: an *O*\ (1) complexity. It is better to have an
375
+ API where all operations have an *O*\ (1) complexity.
376
+
377
+ Export to UTF-8
378
+ ---------------
379
+
380
+ CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
381
+ allow exporting to UTF-8.
382
+
383
+ The problem is that the UTF-8 cache doesn't support surrogate
384
+ characters. An export is expected to provide the whole string content,
385
+ including embedded NUL characters and surrogate characters. To export
386
+ surrogate characters, a different code path using the ``surrogatepass``
387
+ error handler is needed and each export operation has to allocate a
388
+ temporary buffer: *O*\ (n) complexity.
389
+
390
+ An export is expected to have an *O*\ (1) complexity, so the idea to
391
+ export UTF-8 in CPython was abandoned.
392
+
393
+
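As an illustration of the surrogate issue described in the added "Export to UTF-8" text, the following Python sketch shows that the strict UTF-8 codec rejects a lone surrogate while the ``surrogatepass`` error handler encodes it. It is not part of the PEP and the variable names are illustrative. It only demonstrates the codec behaviour, not the proposed C API.

```python
# A string containing a lone surrogate code point (U+DC80).
text = "abc\udc80"

# The strict UTF-8 codec refuses to encode lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass, the surrogate is encoded as a 3-byte sequence,
# which requires building a new bytes object (a copy of the content).
encoded = text.encode("utf-8", "surrogatepass")

# Decoding with the same handler restores the original string.
roundtrip = encoded.decode("utf-8", "surrogatepass")
```

Because the encoded form has to be built in a new buffer, this path cannot be a zero-copy export, which matches the reasoning given in the added text.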
394
394
Discussions
395
395
===========
396
396