Commit aced24f

PEP 756: Give up on copying memory (#3999)
1 parent 2a3dfe0 commit aced24f

1 file changed

+47
-47
lines changed

peps/pep-0756.rst

Lines changed: 47 additions & 47 deletions
@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
2121
view.
2222
* ``PyUnicode_Import()``: import a Python str object.
2323

24-
By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
25-
is copied. See the :ref:`specification <export-complexity>` for cases
26-
when a copy is needed.
24+
On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
25+
is copied and no conversion is done.
2726

2827

2928
Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
6766
the limited C API can only use less efficient code paths and string
6867
formats.
6968

70-
For example, the MarkupSafe project has a C extension specialized for
71-
UCS formats for best performance, and so cannot use the limited C
72-
API.
69+
For example, the `MarkupSafe project
70+
<https://markupsafe.palletsprojects.com/>`_ has a C extension
71+
specialized for UCS formats for best performance, and so cannot use the
72+
limited C API.
7373

7474

7575
Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
9595
#define PyUnicode_FORMAT_UTF8 0x08 // char*
9696
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
9797

98-
#define PyUnicode_EXPORT_ALLOW_COPY 0x10000
99-
10098
The ``int32_t`` type is used instead of ``int`` to have a well defined
10199
type size and not depend on the platform or the compiler.
102100
See `Avoid C-specific Types
@@ -148,45 +146,21 @@ UCS-2 and UCS-4 use the native byte order.
148146
*requested_formats* can be a single format or a bitwise combination of the
149147
formats in the table above.
150148
On success, the returned format will be set to a single one of the requested
151-
flags.
149+
formats.
152150

153151
Note that future versions of Python may introduce additional formats.
154152

155-
By default, no memory is copied and no conversion is done.
156-
157-
If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
158-
*requested_formats*, the function can copy memory to provide the
159-
requested format and convert from a format to another.
160-
161-
The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
162-
``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
153+
No memory is copied and no conversion is done.
163154

164-
Available flags:
165-
166-
=============================== =========== ===================================
167-
Flag Value Description
168-
=============================== =========== ===================================
169-
``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
170-
=============================== =========== ===================================
171155

172156

173157
.. _export-complexity:
174158

175159
Export complexity
176160
-----------------
177161

178-
By default, an export has a complexity of *O*\ (1): no memory is copied
179-
and no conversion is done. There is an exception: if only UTF-8 is
180-
requested and the UTF-8 cache is not filled, the string is encoded to
181-
UTF-8 to fill the cache.
182-
183-
If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
184-
copy is needed, *O*\ (*n*) complexity:
185-
186-
* If only UCS-2 is requested and the native format is UCS-1.
187-
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
188-
* If only UTF-8 is requested and the string contains surrogate
189-
characters.
162+
On CPython, an export has a complexity of *O*\ (1): no memory is copied
163+
and no conversion is done.
190164

191165
To get the best performance on CPython and PyPy, it's recommended to
192166
support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
241215
UTF-8 format
242216
------------
243217

244-
CPython 3.14 doesn't use the UTF-8 format internally. The format is
245-
provided for compatibility with PyPy which uses UTF-8 natively for
246-
strings. However, in CPython, the encoded UTF-8 string is cached which
247-
makes it convenient to be exported.
218+
CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
219+
exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
220+
can be used instead.
221+
222+
The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
223+
alternate implementations which may use UTF-8 natively for strings.
248224

249-
On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
250-
formats are preferred.
251225

252226
ASCII format
253227
------------
254228

255229
When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
256-
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
257-
strings.
230+
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
258231

259232
The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
260-
``PyUnicode_Import()`` to validate that the string only contains ASCII
233+
``PyUnicode_Import()`` to validate that a string only contains ASCII
261234
characters.
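
The validation mentioned above amounts to a linear scan of the input bytes. The following is only a minimal sketch in plain C, not code from the PEP or from CPython; the helper name ``is_ascii()`` is hypothetical, and how ``PyUnicode_Import()`` actually performs the check is not specified here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the C API): return 1 if every byte
   in data[0..len) is an ASCII character (value below 0x80), else 0. */
static int
is_ascii(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (data[i] >= 0x80) {
            /* First non-ASCII byte found: the input is rejected. */
            return 0;
        }
    }
    return 1;
}
```

An import with the ASCII format would be expected to fail for any input where such a check returns 0.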
262235

263236

264237
Surrogate characters and embedded NUL characters
265238
------------------------------------------------
266239

267-
Surrogate characters are allowed: they can be imported and exported. For
268-
example, the UTF-8 format uses the ``surrogatepass`` error handler.
240+
Surrogate characters are allowed: they can be imported and exported.
269241

270242
Embedded NUL characters are allowed: they can be imported and exported.
271243

@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
391363
characters.
392364

393365

366+
Conversions on demand
367+
---------------------
368+
369+
It would be convenient to convert formats on demand. For example,
370+
convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
371+
requested.
372+
373+
The problem is that most users expect an export to require no memory
374+
copy and no conversion: an *O*\ (1) complexity. It is better to have an
375+
API where all operations have an *O*\ (1) complexity.
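
To illustrate why such a conversion cannot be *O*\ (1), here is a minimal sketch in plain C. It is not taken from the PEP or from CPython, and the helper name ``widen_ucs1_to_ucs4()`` is hypothetical. Widening has to allocate a new buffer and visit every code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the C API): widen a UCS-1 buffer
   to UCS-4. Returns a newly allocated buffer that the caller must
   free(), or NULL on allocation failure or size overflow. */
static uint32_t *
widen_ucs1_to_ucs4(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    if (len > SIZE_MAX / sizeof(uint32_t)) {
        return NULL;
    }
    /* Allocate at least one element so that len == 0 stays valid. */
    uint32_t *dst = malloc((len ? len : 1) * sizeof(uint32_t));
    if (dst == NULL) {
        return NULL;
    }
    /* One allocation plus one pass over the string: O(n). */
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
    return dst;
}
```

Both the allocation and the loop scale with the string length, which is the cost the text above wants to keep out of an export.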
376+
377+
Export to UTF-8
378+
---------------
379+
380+
CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
381+
allow exporting to UTF-8.
382+
383+
The problem is that the UTF-8 cache doesn't support surrogate
384+
characters. An export is expected to provide the whole string content,
385+
including embedded NUL characters and surrogate characters. To export
386+
surrogate characters, a different code path using the ``surrogatepass``
387+
error handler is needed and each export operation has to allocate a
388+
temporary buffer: *O*\ (*n*) complexity.
389+
390+
An export is expected to have an *O*\ (1) complexity, so the idea to
391+
export UTF-8 in CPython was abandoned.
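
As a rough illustration of the ``surrogatepass`` code path, the sketch below encodes a single code point to UTF-8 and lets surrogates (U+D800 to U+DFFF) through as three-byte sequences, which a strict UTF-8 encoder would reject. It is plain C written for this example, not CPython code, and the helper name ``encode_utf8_surrogatepass()`` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out, which
   must have room for 4 bytes. Returns the number of bytes written, or
   0 if cp is above U+10FFFF. Surrogates are NOT rejected: they are
   encoded like any other 3-byte code point. */
static size_t
encode_utf8_surrogatepass(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* This range includes the surrogates U+D800..U+DFFF. */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Exporting a whole string this way means running such an encoder over every code point into a freshly allocated buffer on each call, since the result cannot be served from the UTF-8 cache.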
392+
393+
394394
Discussions
395395
===========
396396
