From 680651c31a656572f9cb72d71fc1eddfdc2b3a9e Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sat, 30 Nov 2024 17:10:54 -0500
Subject: [PATCH 1/9] Document what happens when PyUnicode_AsUTF8() is given embedded null characters.

---
 Doc/c-api/unicode.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 59bd7661965d93..7b77305ab889de 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1035,6 +1035,12 @@ These are the UTF-8 codec APIs:
 
    As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
 
+   .. warning::
+
+      This function does not strip null bytes from *unicode*, so the length of the
+      returned string (from ``strlen()``) is possibly smaller than the length of the
+      passed unicode object.
+
    .. versionadded:: 3.3
 
    .. versionchanged:: 3.7

From 8bfd541063edd934c9c00fcbf284e328efa4ed4e Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sat, 30 Nov 2024 17:15:54 -0500
Subject: [PATCH 2/9] Suggest PyUnicode_AsUTF8AndSize for user input.

---
 Doc/c-api/unicode.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 7b77305ab889de..28fb7a4e304752 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1039,7 +1039,8 @@ These are the UTF-8 codec APIs:
 
       This function does not strip null bytes from *unicode*, so the length of the
       returned string (from ``strlen()``) is possibly smaller than the length of the
-      passed unicode object.
+      passed unicode object. Prefer :c:func:`PyUnicode_AsUTF8AndSize` when dealing with
+      user input.
 
    .. versionadded:: 3.3
 

From 52e91172badf8ebddef18e160ca472424813bfb2 Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sat, 30 Nov 2024 17:18:47 -0500
Subject: [PATCH 3/9] Switch to a note instead of a warning.

---
 Doc/c-api/unicode.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 28fb7a4e304752..db61c76090d386 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1035,12 +1035,12 @@ These are the UTF-8 codec APIs:
 
    As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
 
-   .. warning::
+   .. note::
 
-      This function does not strip null bytes from *unicode*, so the length of the
-      returned string (from ``strlen()``) is possibly smaller than the length of the
-      passed unicode object. Prefer :c:func:`PyUnicode_AsUTF8AndSize` when dealing with
-      user input.
+      This function does not handle null bytes inside of *unicode*, so the length of the
+      returned string (from ``strlen()``) could be smaller than the length of the
+      passed unicode object, if the string contained embedded null characters. Prefer
+      :c:func:`PyUnicode_AsUTF8AndSize` when dealing with user input.
 
    .. versionadded:: 3.3
 

From 1b393d4aa35a3baab6036bcdaf46fad14fa37c5b Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sun, 1 Dec 2024 08:53:45 -0500
Subject: [PATCH 4/9] Update Doc/c-api/unicode.rst

Co-authored-by: Stan U. <89152624+StanFromIreland@users.noreply.github.com>
---
 Doc/c-api/unicode.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index db61c76090d386..1b55f804e0da73 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1037,10 +1037,10 @@ These are the UTF-8 codec APIs:
 
    .. note::
 
-      This function does not handle null bytes inside of *unicode*, so the length of the
+      This function does not handle null bytes within the unicode object. As a result, the length of the
       returned string (from ``strlen()``) could be smaller than the length of the
-      passed unicode object, if the string contained embedded null characters. Prefer
-      :c:func:`PyUnicode_AsUTF8AndSize` when dealing with user input.
+      passed unicode object, if the string contained embedded null characters. When handling user input,
+      it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize` instead.
 
    .. versionadded:: 3.3
 

From 040608b59e4437766eb587700a60074d5bc654cb Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sun, 1 Dec 2024 09:41:18 -0500
Subject: [PATCH 5/9] Update Doc/c-api/unicode.rst

Co-authored-by: Tomas R.
---
 Doc/c-api/unicode.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 1b55f804e0da73..a487997c7406bb 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1039,7 +1039,7 @@ These are the UTF-8 codec APIs:
 
       This function does not handle null bytes within the unicode object. As a result, the length of the
       returned string (from ``strlen()``) could be smaller than the length of the
-      passed unicode object, if the string contained embedded null characters. When handling user input,
+      passed unicode object, if the string contained embedded null characters. When handling user input,
       it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize` instead.
 
    .. versionadded:: 3.3

From 6fb8cbe80a609c15c9f9c6839800f04d65e15943 Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Sun, 15 Dec 2024 10:46:19 -0500
Subject: [PATCH 6/9] Play with the wording a little bit.

---
 Doc/c-api/unicode.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index a487997c7406bb..2fa481e5daad6d 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1035,11 +1035,11 @@ These are the UTF-8 codec APIs:
 
    As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
 
-   .. note::
+   .. warning::
 
-      This function does not handle null bytes within the unicode object. As a result, the length of the
-      returned string (from ``strlen()``) could be smaller than the length of the
-      passed unicode object, if the string contained embedded null characters. When handling user input,
+      This function does not handle null bytes within the unicode object.
+      As a result, the length of the returned string could be interpreted as
+      smaller than the length of *unicode*. When handling user input,
       it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize` instead.
 
    .. versionadded:: 3.3

From 3c7b6be694a7175a1583aa4e3a2bec08adeaccb6 Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Mon, 13 Jan 2025 12:37:19 -0500
Subject: [PATCH 7/9] Add a reference.

---
 Doc/c-api/unicode.rst | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 2fa481e5daad6d..35e388f7cf7667 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1037,10 +1037,12 @@ These are the UTF-8 codec APIs:
 
    .. warning::
 
-      This function does not handle null bytes within the unicode object.
-      As a result, the length of the returned string could be interpreted as
-      smaller than the length of *unicode*. When handling user input,
-      it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize` instead.
+      This function does not have any special behavior for
+      `null bytes `_ embedded within
+      *unicode*. As a result, strings containing null bytes will remain in the returned
+      string, which some C functions might interpret as the end of the string, leading to
+      truncation. When handling user input, it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize`
+      instead.
 
    .. versionadded:: 3.3
 

From 0eac45f93af4328a0c11c1804370eba86e80827b Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Mon, 13 Jan 2025 15:45:51 -0500
Subject: [PATCH 8/9] Update Doc/c-api/unicode.rst

Co-authored-by: Victor Stinner
---
 Doc/c-api/unicode.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index 35e388f7cf7667..c3c7516d3b908b 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1038,8 +1038,8 @@ These are the UTF-8 codec APIs:
    .. warning::
 
       This function does not have any special behavior for
-      `null bytes `_ embedded within
-      *unicode*. As a result, strings containing null bytes will remain in the returned
+      `null characters `_ embedded within
+      *unicode*. As a result, strings containing null characters will remain in the returned
       string, which some C functions might interpret as the end of the string, leading to
       truncation. When handling user input, it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize`
       instead.

From 35e078386e454696182be5372af3d0b4941a94c9 Mon Sep 17 00:00:00 2001
From: Peter Bierma
Date: Mon, 13 Jan 2025 15:47:01 -0500
Subject: [PATCH 9/9] Switch the wording away from 'user input'

---
 Doc/c-api/unicode.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index c3c7516d3b908b..cd878f13765d15 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1041,7 +1041,7 @@ These are the UTF-8 codec APIs:
       `null characters `_ embedded within
       *unicode*. As a result, strings containing null characters will remain in the returned
       string, which some C functions might interpret as the end of the string, leading to
-      truncation. When handling user input, it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize`
+      truncation. If truncation is an issue, it is recommended to use :c:func:`PyUnicode_AsUTF8AndSize`
       instead.
 
    .. versionadded:: 3.3
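
Reviewer illustration (not part of the patch series): the truncation that the final warning describes can be reproduced from Python with ``ctypes.pythonapi``. This is a minimal sketch, assuming CPython; the helper names ``utf8_without_size`` and ``utf8_with_size`` are hypothetical and exist only for this example. Declaring the return type as ``c_char_p`` makes ctypes read up to the first null byte, which stands in here for a C function such as ``strlen()``.

```python
import ctypes

# PyUnicode_AsUTF8 returns only a char*; the caller gets no length.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
_as_utf8.argtypes = [ctypes.py_object]
# c_char_p makes ctypes copy bytes up to the first NUL, like strlen() would.
_as_utf8.restype = ctypes.c_char_p

# PyUnicode_AsUTF8AndSize additionally stores the byte length in *size.
_as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_as_utf8_and_size.argtypes = [ctypes.py_object,
                              ctypes.POINTER(ctypes.c_ssize_t)]
# Keep the raw address so the explicit size can be used to read the buffer.
_as_utf8_and_size.restype = ctypes.c_void_p


def utf8_without_size(text):
    """Hypothetical helper: read the buffer as a NUL-terminated C string."""
    return _as_utf8(text)


def utf8_with_size(text):
    """Hypothetical helper: read exactly the number of bytes reported."""
    size = ctypes.c_ssize_t()
    address = _as_utf8_and_size(text, ctypes.byref(size))
    # 'text' is still referenced here, so the cached UTF-8 buffer is alive.
    return ctypes.string_at(address, size.value)


if __name__ == "__main__":
    sample = "abc\0def"
    print(utf8_without_size(sample))  # b'abc'
    print(utf8_with_size(sample))     # b'abc\x00def'
```

The same buffer is returned by both functions; only the size-aware read sees the bytes after the embedded null character, which is why the warning points to :c:func:`PyUnicode_AsUTF8AndSize` when truncation matters.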