-
-
Notifications
You must be signed in to change notification settings - Fork 33.3k
gh-103997: Fix and test _PyUnicode_Dedent
#138620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_PyUnicode_Dedent is used in pymain_run_command so it may be performance critical. So please keep the same logic for implementation by working with char* only. Or show that this doesn't result in a performance loss.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid adding const qualifiers.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
| Py_ssize_t whitespace_len = search_longest_common_leading_whitespace( | ||
| src, end, &whitespace_start); | ||
|
|
||
| if (whitespace_len == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't, we need to clear lines, see the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to do it? in general there is nothing to dedent, so it'll slow down normal cases. Even if the comment says that it's meant to match textwrap.dedent(), I don't think it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to respect the behavior of textwrap.dedent, as is noted in What's new 3.14, the docs, and the comments. Then yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't think we need to! It's an internal function. textwrap.dedent is implemented in pure Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But our documentation says we do:
The auto-dedentation behavior mirrors textwrap.dedent().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you would rather just remove the false claims, we can do: https://github.com/python/cpython/compare/main...StanFromIreland:remove-misleading-notes?expand=1
Though I think this may cause confusion in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The auto-dedentation behavior mirrors textwrap.dedent().
Yes, but we can just say that we don't normalize whitespaces and only consider spaces and tabs. No one should ever write spaces with other space-like characters. Let's just amend the NEWS. As for str.dedent(), it will need a PEP which still doesn't exist and the discussion seems stalled IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am working on the PEP :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #138620 with your alternative, though I still think fixing it is better.
| // if this line has all white space, write '\n' and continue | ||
| if (in_leading_space && append_newline) { | ||
| *dest_iter++ = '\n'; | ||
| if (in_leading_space) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this the issue? or was it *iter != ' ' ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were multiple issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please indicate the issues for posterity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two issues:
- Not clearing lines that are only whitespace, whereas
textwrap.dedentdoes - Only considering
'\t'and' ', whereastextwrap.dedentusesstr.isspace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I don't think it's worth changing this function. We should just change the comment. It's an internal function.
This comment was marked as resolved.
This comment was marked as resolved.
|
I would still be interested in knowing the answer to that question:
Did your refactoring improve the overall performance or not? |
Using |
|
I have made the requested changes; please review again |
This comment was marked as resolved.
This comment was marked as resolved.
| Py_ssize_t whitespace_len = search_longest_common_leading_whitespace( | ||
| src, end, &whitespace_start); | ||
|
|
||
| if (whitespace_len == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to do it? in general there is nothing to dedent, so it'll slow down normal cases. Even if the comment says that it's meant to match textwrap.dedent(), I don't think it's needed.
| // if this line has all white space, write '\n' and continue | ||
| if (in_leading_space && append_newline) { | ||
| *dest_iter++ = '\n'; | ||
| if (in_leading_space) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please indicate the issues for posterity.
|
We can still test |
|
I'm the implementor of the C function. Sorry that I did not read the code of I understand now it's a matter of design. |
|
Yes, and I think we do not need to exactly mimic textwrap dedent unless there is a compelling reason. I personally do not find one: the function is private and internally used only by the parser I think, and I doubt anyone would have a script that uses whitespaces that are not spaces/tabs for indents, and if they do, I do not think we should support this at the cost of slowing down the regular use cases. For instance, I would suggest that we currently keep a simplified version as it is only used for the parser and, if the PEP for str.dedent() is accepted, possibly revisit this design question later, possibly by adding support for normalisation as well (PyUnicode_Dedent and PyUnicode_DedentNormalize). WDYT? |
I think so. Let's just change the document for now. |
While working on #62535, I noticed that several
textwrap.dedenttests fail with this implementation.