-
-
Notifications
You must be signed in to change notification settings - Fork 33k
gh-129813, PEP 782: Use PyBytesWriter in utf8_encoder() #138874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Replace the private _PyBytesWriter API with the new public PyBytesWriter API in utf8_encoder() and unicode_encode_ucs1().
Microbenchmark on the UTF-8 encoder: import pyperf
runner = pyperf.Runner()
runner.timeit('abc',
setup='s="abc"',
stmt='s.encode()')
runner.timeit('a x 1000',
setup='s="a" * 1000',
stmt='s.encode()')
runner.timeit('ab<surrogate> [namereplace]',
setup=r's="ab\udc80"',
stmt='s.encode(errors="namereplace")')
runner.timeit('ab<surrogate> [ignore]',
setup=r's="ab\udc80"',
stmt='s.encode(errors="ignore")')
runner.timeit('(a<surrogate>) x 1000 [namereplace]',
setup=r's="a\udc80" * 1000',
stmt='s.encode(errors="namereplace")')
runner.timeit('(a<surrogate>) x 1000 [ignore]',
setup=r's="a\udc80" * 1000',
stmt='s.encode(errors="ignore")') Results:
|
Please make benchmarks for non-ASCII strings. Consider different ranges (which represents different internal representations and lenghts of UTF-8 representation):
Consider also strings which contain one character from the higher range (for example, 0x10000 and all other characters ASCII, etc). |
More benchmark: import pyperf
runner = pyperf.Runner()
ranges = (
(r'\0',
r'\x7f'),
(r'\x80',
r'\xff'),
(r'\u0400',
r'\u0fff'),
(r'\U00010000',
r'\u0010ffff'),
)
for first, last in ranges:
runner.timeit(f'"{first}{last}"',
setup=f"first='{first}'; last='{last}'; s=first+last",
stmt='s.encode()')
for length in (5, 50, 500):
for first, last in ranges:
runner.timeit(f'"{first}{last}" * {length}',
setup=f"first='{first}'; last='{last}'; s=(first+last) * {length}",
stmt='s.encode()') Results:
Benchmark hidden because not significant (5): "\U00010000\u0010ffff", "\u0400\u0fff" * 5, "\x80\xff" * 50, "\0\x7f" * 500, "\U00010000\u0010ffff" * 500 |
Benchmark: import pyperf
runner = pyperf.Runner()
for length in (5, 50, 500):
runner.timeit(f'"x" * {length} + chr(0x10000)',
setup=f's="x" * {length} + chr(0x10000)',
stmt='s.encode()') Results:
Benchmark hidden because not significant (1): "x" * 500 + chr(0x10000) |
@serhiy-storchaka: The difference is about a "few nanoseconds", around +10 ns in the worst case, 1.08x faster in the best case. Do you think that it's acceptable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some overhead is caused by dynamic memory allocation for PyBytesWriter
-- it is unavoidable. But there may be a loss due to losing fine control on overallocation -- I need to look at it closer. Maybe it can be avoided by adding a new C API.
If this loss can be measured, I would suggest adding a private API to disable overallocation. |
Updated benchmark on the worst case: import pyperf
runner=pyperf.Runner()
runner.timeit('utf8',
setup=r's="\uFFFF"*(256//3)+"\uDC80"',
stmt='s.encode(errors="backslashreplace")')
runner.timeit('latin1',
setup=r"s=('a'*255+'\u0100')",
stmt="s.encode('latin1', 'backslashreplace')") Result:
|
@serhiy-storchaka: utf8_encoder() and unicode_encode_ucs1() are the last 2 functions using the private API. I would like to merge this change to be able to remove the private API, even if there is an overhead on performance. On the common cases, there is no significant impact on performance. |
I updated the PR to keep the |
More benchmarks on the corner cases. Benchmark: Result: Benchmark: Result: |
I updated the PR to reimplement the min_size micro-optimization. There is no more 1.3x slowdown. I also recomputed all benchmark results on the latest PR version. Results are now between 1.08x slower and 1.08x faster. Most benchmarks are in the [-5%, +5%] range which can be associated to noise in the benchmark (can be ignored). @serhiy-storchaka: I plan to merge this change next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. 👍
Remove useless PyBytesWriter_Discard() call
Merged, thanks for the review @serhiy-storchaka. |
|
Replace the private _PyBytesWriter API with the new public PyBytesWriter API in utf8_encoder() and unicode_encode_ucs1().