PEP 756: Give up on copying memory (#3999)

vstinner · web-flow · commit aced24fc3547 · 2024-09-26T20:37:52.000+02:00
diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
   view.
 * ``PyUnicode_Import()``: import a Python str object.
 
-By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-is copied. See the :ref:`specification <export-complexity>` for cases
-when a copy is needed.
+On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied and no conversion is done.
 
 
 Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
 the limited C API can only use less efficient code paths and string
 formats.
 
-For example, the MarkupSafe project has a C extension specialized for
-UCS formats for best performance, and so cannot use the limited C
-API.
+For example, the `MarkupSafe project
+<https://markupsafe.palletsprojects.com/>`_ has a C extension
+specialized for UCS formats for best performance, and so cannot use the
+limited C API.
 
 
 Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
     #define PyUnicode_FORMAT_UTF8  0x08   // char*
     #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)
 
-    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000
-
 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
@@ -148,45 +146,21 @@ UCS-2 and UCS-4 use the native byte order.
 *requested_formats* can be a single format or a bitwise combination of the
 formats in the table above.
 On success, the returned format will be set to a single one of the requested
-flags.
+formats.
 
 Note that future versions of Python may introduce additional formats.
 
-By default, no memory is copied and no conversion is done.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
-*requested_formats*, the function can copy memory to provide the
-requested format and convert from a format to another.
-
-The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
-``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
+No memory is copied and no conversion is done.
 
-Available flags:
-
-===============================  ===========  ===================================
-Flag                             Value        Description
-===============================  ===========  ===================================
-``PyUnicode_EXPORT_ALLOW_COPY``  ``0x10000``  Allow memory copies and conversions
-===============================  ===========  ===================================
 
 
 .. _export-complexity:
 
 Export complexity
 -----------------
 
-By default, an export has a complexity of *O*\ (1): no memory is copied
-and no conversion is done. There is an exception: if only UTF-8 is
-requested and the UTF-8 cache is not filled, the string is encoded to
-UTF-8 to fill the cache.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
-copy is needed, *O*\ (*n*) complexity:
-
-* If only UCS-2 is requested and the native format is UCS-1.
-* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested and the string contains surrogate
-  characters.
+On CPython, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done.
 
 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
 UTF-8 format
 ------------
 
-CPython 3.14 doesn't use the UTF-8 format internally. The format is
-provided for compatibility with PyPy which uses UTF-8 natively for
-strings. However, in CPython, the encoded UTF-8 string is cached which
-makes it convenient to be exported.
+CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
+exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
+can be used instead.
+
+The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
+alternate implementations which may use UTF-8 natively for strings.
 
-On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
-formats are preferred.
 
 ASCII format
 ------------
 
 When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
-``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
-strings.
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
 
 The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
-``PyUnicode_Import()`` to validate that the string only contains ASCII
+``PyUnicode_Import()`` to validate that a string only contains ASCII
 characters.
 
 
 Surrogate characters and embedded NUL characters
 ------------------------------------------------
 
-Surrogate characters are allowed: they can be imported and exported. For
-example, the UTF-8 format uses the ``surrogatepass`` error handler.
+Surrogate characters are allowed: they can be imported and exported.
 
 Embedded NUL characters are allowed: they can be imported and exported.
 
@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
 characters.
 
 
+Conversions on demand
+---------------------
+
+It would be convenient to convert formats on demand. For example,
+convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
+requested.
+
+The problem is that most users expect an export to require no memory
+copy and no conversion: an *O*\ (1) complexity. It is better to have an
+API where all operations have an *O*\ (1) complexity.
+
+Export to UTF-8
+---------------
+
+CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
+allow exporting to UTF-8.
+
+The problem is that the UTF-8 cache doesn't support surrogate
+characters. An export is expected to provide the whole string content,
+including embedded NUL characters and surrogate characters. To export
+surrogate characters, a different code path using the ``surrogatepass``
+error handler is needed and each export operation has to allocate a
+temporary buffer: *O*\ (*n*) complexity.
+
+An export is expected to have an *O*\ (1) complexity, so the idea to
+export UTF-8 in CPython was abandoned.
+
+
 Discussions
 ===========