gh-103997: Fix and test `_PyUnicode_Dedent` #138620

StanFromIreland · 2025-09-07T12:48:20Z

While working on #62535, I noticed that several textwrap.dedent tests fail with this implementation.

Issue: Auto dedent -c arguments #103997

picnixz

_PyUnicode_Dedent is used in pymain_run_command, so it may be performance-critical. Please keep the existing implementation logic and work with char* only, or show that the change does not result in a performance loss.

Include/internal/pycore_unicodeobject.h

Objects/unicodeobject.c

picnixz

Please avoid adding const qualifiers.

Objects/unicodeobject.c

picnixz · 2025-09-07T15:16:02Z

    Py_ssize_t whitespace_len = search_longest_common_leading_whitespace(
        src, end, &whitespace_start);

-    if (whitespace_len == 0) {


Keep the fast path.

We can't, we need to clear lines, see the tests.
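To illustrate why the `whitespace_len == 0` early return conflicts with textwrap.dedent, here is a minimal Python sketch. It assumes only the documented behavior that textwrap.dedent normalizes lines made up solely of spaces and tabs to a bare newline. The sample string is made up for illustration and is not taken from the test suite.

```python
import textwrap

# No common indentation: the first and last lines start at column 0.
source = "a\n   \nb\n"

result = textwrap.dedent(source)

# The whitespace-only middle line is cleared even though the common
# margin is empty, so returning the input unchanged would not match.
print(repr(result))
```

So an input with nothing to dedent can still come back modified, which is the case the fast path would skip.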

Do we really need to do it? In general there is nothing to dedent, so this will slow down the normal case. Even if the comment says it's meant to match textwrap.dedent(), I don't think it's needed.

If we want to respect the behavior of textwrap.dedent, as stated in What's New 3.14, the docs, and the code comments, then yes.

But I don't think we need to! It's an internal function. textwrap.dedent is implemented in pure Python.

But our documentation says we do:

What's New 3.14

The auto-dedentation behavior mirrors textwrap.dedent().

If you would rather just remove the false claims, we can do: https://github.com/python/cpython/compare/main...StanFromIreland:remove-misleading-notes?expand=1

Though I think this may cause confusion in the future.

The auto-dedentation behavior mirrors textwrap.dedent().

Yes, but we can just say that we don't normalize whitespace and only consider spaces and tabs. No one should ever indent with other space-like characters. Let's just amend the NEWS. As for str.dedent(), it will need a PEP, which still doesn't exist, and the discussion seems stalled IMO.

Yes, I am working on the PEP :-)

I opened #138620 with your alternative, though I still think fixing it is better.

Objects/unicodeobject.c

picnixz · 2025-09-07T15:17:40Z

        // if this line has all white space, write '\n' and continue
-        if (in_leading_space && append_newline) {
-            *dest_iter++ = '\n';
+        if (in_leading_space) {


Was this the issue, or was it *iter != ' ' ...?

There were multiple issues.

Please indicate the issues for posterity.

There are two issues:

Not clearing lines that are only whitespace, whereas textwrap.dedent does

Only considering '\t' and ' ', whereas textwrap.dedent uses str.isspace
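The second point can be made concrete with a short Python sketch: str.isspace() accepts more characters than the two the C function is described as handling. The list of candidates is an arbitrary illustrative sample, not an exhaustive set, and how textwrap.dedent treats such characters may depend on the Python version.

```python
# The two characters the C function is described as considering.
SPACE_OR_TAB = " \t"

# An arbitrary sample of other code points (vertical tab, form feed,
# EM SPACE) chosen for illustration only.
candidates = ["\x0b", "\x0c", "\u2003"]

# Keep those that str.isspace() accepts but that are neither ' ' nor '\t'.
extra = [ch for ch in candidates if ch.isspace() and ch not in SPACE_OR_TAB]

for ch in extra:
    print(f"U+{ord(ch):04X} is whitespace for str.isspace()")
```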

Honestly, I don't think it's worth changing this function. We should just change the comment. It's an internal function.

picnixz · 2025-09-07T15:19:04Z

I would still be interested in knowing the answer to that question:

Or show that this doesn't result in a performance loss

Did your refactoring improve the overall performance or not?

StanFromIreland · 2025-09-07T15:24:21Z

Did your refactoring improve the overall performance or not?

Using PyUnicodeWriter has a ~20% performance penalty.

StanFromIreland · 2025-09-07T15:24:47Z

I have made the requested changes; please review again

Include/internal/pycore_unicodeobject.h

Lib/test/test_capi/test_unicode.py

Modules/_testinternalcapi.c

Objects/unicodeobject.c

picnixz · 2025-09-07T16:18:53Z

We can still test _PyUnicode_Dedent to check that it matches a "subset" of the features of textwrap.dedent.
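A rough sketch of what such a "subset" check could look like in Python. Everything here is hypothetical: `SUBSET_CASES` and `check_dedent_subset` are illustrative names, the cases are made up, and textwrap.dedent stands in for the C function, which would have to be reached through a test helper not shown here. The cases are limited to indentation made of spaces and tabs, with no whitespace-only lines.

```python
import textwrap

# Hypothetical (input, expected) pairs inside the shared subset:
# spaces/tabs only for indentation, no lines consisting only of whitespace.
SUBSET_CASES = [
    ("    a\n    b\n", "a\nb\n"),   # uniform indentation
    ("  a\n    b\n", "a\n  b\n"),   # nested indentation is preserved
    ("\ta\n\tb", "a\nb"),           # tab indentation
    ("a\n  b\n", "a\n  b\n"),       # nothing to dedent
    ("  a\n\n  b\n", "a\n\nb\n"),   # empty line between indented lines
]


def check_dedent_subset(dedent):
    """Run every subset case through the given dedent callable."""
    for source, expected in SUBSET_CASES:
        actual = dedent(source)
        assert actual == expected, (source, actual, expected)
    return True


# textwrap.dedent is used here only as a stand-in for the function under test.
print(check_dedent_subset(textwrap.dedent))
```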

sunmy2019 · 2025-10-15T18:41:21Z

I'm the implementor of the C function. Sorry that I did not read the code of textwrap.dedent. I just wrote it according to the test cases. 😳

I understand now it's a matter of design.

picnixz · 2025-10-16T04:54:07Z

Yes, and I think we do not need to mimic textwrap.dedent exactly unless there is a compelling reason. I personally do not see one: the function is private and, I think, used internally only by the parser. I doubt anyone has a script that indents with whitespace other than spaces and tabs, and if they do, I do not think we should support it at the cost of slowing down the regular use cases.

For instance, I would suggest that we currently keep a simplified version as it is only used for the parser and, if the PEP for str.dedent() is accepted, possibly revisit this design question later, possibly by adding support for normalisation as well (PyUnicode_Dedent and PyUnicode_DedentNormalize).

WDYT?

sunmy2019 · 2025-10-16T05:04:36Z

For instance, I would suggest that we currently keep a simplified version as it is only used for the parser and, if the PEP for str.dedent() is accepted, possibly revisit this design question later, possibly by adding support for normalisation as well (PyUnicode_Dedent and PyUnicode_DedentNormalize).

WDYT?

I think so. Let's just change the documentation for now.

Commit

40bcdea

StanFromIreland requested a review from methane September 7, 2025 12:48

bedevere-app bot added the awaiting review label Sep 7, 2025

picnixz requested changes Sep 7, 2025

bedevere-app bot added awaiting changes and removed awaiting review labels Sep 7, 2025