Skip to content
Closed
Show file tree
Hide file tree
Changes from 13 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
a54bf86
Write an unrolled version of find_max_char
rhpvorderman Jun 5, 2024
71ff457
Unroll 8 characters rather than 4 to improve vectorization in find_ma…
rhpvorderman Jun 7, 2024
98a6449
Bigger chunk ascii decoding
rhpvorderman Jun 7, 2024
10527c6
Fix whitespace issues
rhpvorderman Jun 7, 2024
f0f4139
Fix compiler warnings
rhpvorderman Jun 7, 2024
849a068
Prevent re-read of same value
rhpvorderman Jun 7, 2024
fad19a0
Simplify find_max_char initialization
rhpvorderman Jun 7, 2024
f04bb2c
Add missing increments
rhpvorderman Jun 7, 2024
cd0fc5e
Add blurb
rhpvorderman Jun 7, 2024
8f0fd56
Replace unroll_end with unrolled_end
rhpvorderman Jun 7, 2024
37aee7a
Reword burb
rhpvorderman Jun 11, 2024
a6fc417
Update Objects/stringlib/find_max_char.h
rhpvorderman Jun 11, 2024
104ca62
Merge branch 'main' into MICROOPTIMIZATIONS
rhpvorderman Jun 12, 2024
d465517
Merge branch 'main' into MICROOPTIMIZATIONS
rhpvorderman Jun 14, 2024
1ce308e
Reuse find_max_char for bytes objects
rhpvorderman Jun 14, 2024
48f1e84
Simplify the find_max_char function by loading unaligned
rhpvorderman Jun 14, 2024
21de804
Allow optimized unaligned loads and simplify ascii_decode
rhpvorderman Jun 14, 2024
1ec2113
Add loop for more optimal assembly
rhpvorderman Jun 14, 2024
0258ae0
Also perform an unalgined load at the end
rhpvorderman Jun 14, 2024
ec76b74
Fix compiler warning
rhpvorderman Jun 14, 2024
845eb4e
Revert "Reuse find_max_char for bytes objects"
rhpvorderman Jun 14, 2024
f8cc68d
Merge branch 'main' into MICROOPTIMIZATIONS
rhpvorderman Jun 17, 2024
89ab2c9
Improve comments
rhpvorderman Jun 17, 2024
002b7ec
Merge branch 'main' into MICROOPTIMIZATIONS
rhpvorderman Sep 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Improve performance of ASCII decoding and maximum character checking
by allowing vectorization by the compiler on suitable platforms.
60 changes: 42 additions & 18 deletions Objects/stringlib/find_max_char.h
Original file line number Diff line number Diff line change
Expand Up @@ -20,23 +20,45 @@ Py_LOCAL_INLINE(Py_UCS4)
STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
{
const unsigned char *p = (const unsigned char *) begin;
const unsigned char *_end = (const unsigned char *)end;
const size_t *aligned_end = (const size_t *)(_end - SIZEOF_SIZE_T);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking whether I understand correctly: aligned_end is not aligned, but good enough to serve as the end of the loop over aligned values. To make it really aligned we would need to do something like aligned_end = _Py_SIZE_ROUND_DOWN(_end, SIZEOF_SIZE_T)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is the gist of it. I tried something similar to _Py_SIZE_ROUND_DOWN and got segfaults so I opted for this simpler tried and true solution. The name aligned_end is probably not the best, but I can't think of a better name now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be something like

Py_ssize_t n = end - begin;
const STRINGLIB_CHAR *p = begin;
const STRINGLIB_CHAR *aligned_end = begin + _Py_SIZE_ROUND_DOWN(n, ALIGNOF_SIZE_T);

Would be interested in understanding why you got a segfault though.

const size_t *unrolled_end = aligned_end - 3;
unsigned char accumulator = 0;
/* Do not test each character individually, but use bitwise OR and test
all characters at once. */
while (p < _end && !_Py_IS_ALIGNED(p, ALIGNOF_SIZE_T)) {
accumulator |= *p;
p += 1;
}
if (accumulator & 0x80) {
return 255;
} else if (p == end) {
return 127;
}

while (p < end) {
if (_Py_IS_ALIGNED(p, ALIGNOF_SIZE_T)) {
/* Help register allocation */
const unsigned char *_p = p;
while (_p + SIZEOF_SIZE_T <= end) {
size_t value = *(const size_t *) _p;
if (value & UCS1_ASCII_CHAR_MASK)
return 255;
_p += SIZEOF_SIZE_T;
}
p = _p;
if (p == end)
break;
}
if (*p++ & 0x80)
/* On 64-bit platforms with 128-bit vectors (x86-64, arm64) the
compiler can load 4 size_t values into two 16-byte vectors and do a
vector bitwise OR. */
const size_t *_p = (const size_t *)p;
while (_p < unrolled_end) {
size_t value = _p[0] | _p[1] | _p[2] | _p[3];
if (value & UCS1_ASCII_CHAR_MASK) {
return 255;
}
_p += 4;
}
size_t value = 0;
while (_p < aligned_end) {
value |= *_p;
_p += 1;
}
p = (const unsigned char *)_p;
while (p < _end) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done in a for-loop instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code is a while loop. Any particular reason why a for loop is preferable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just say "for clarity" but let's leave it like this if it was the original code (less changes in this sense).

value |= *p;
p += 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could combine these two loops into a single loop right? Worst case we do a single loop over 32-1=31 bytes instead of two while loops (the loop over _p (7 steps) and the loop over p (3 steps)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more optimal solution is to do an unaligned load of a size_t at (end - SIZEOF_SIZE_T). That would remove the entire last while loop. I do not know if all architectures support unaligned loads however. I guess some could theoretically abort?

I single loop over 31 bytes taking 31 steps is less optimal than 2 while loops taking 10 steps in total. (Or even 7 in the case of 64-bit size_t, that would be the majority of platforms).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the memcpy trick to do an unaligned load, it will be optimized by any reasonable compiler, for example:

  size_t value;
  memcpy(&value, end - sizeof(value), sizeof(value));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very nice suggestion! That would simplify the code a lot.

if (value & UCS1_ASCII_CHAR_MASK) {
return 255;
}
return 127;
}
Expand Down Expand Up @@ -69,13 +91,15 @@ STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
Py_UCS4 mask;
Py_ssize_t n = end - begin;
const STRINGLIB_CHAR *p = begin;
const STRINGLIB_CHAR *unrolled_end = begin + _Py_SIZE_ROUND_DOWN(n, 4);
const STRINGLIB_CHAR *unrolled_end = begin + _Py_SIZE_ROUND_DOWN(n, 8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the other PR, maybe the 4/8 choice can be chosen at compile time depending on the architecture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

99% of production work were performance matters will be done on ARM64 (with 16-byte neon vectors) and X86-64 (with SSE2 vectors) other platforms will not be hurt by this decision. I think there is no reason to complicate the build with choices like this.

Py_UCS4 max_char;

max_char = MAX_CHAR_ASCII;
mask = MASK_ASCII;
while (p < unrolled_end) {
STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3];
/* Loading 8 values at once allows platforms that have 16-byte vectors
to do a vector load and vector bitwise OR. */
STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here, you would have some #if arch == ... to choose between 4 or 8 values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that those vector load could be macros, for clarity purposes. It would then be optimized by the compiler but having them as macros might be helpful for future work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would to not have a macro, the code is good as it is.

if (bits & mask) {
if (mask == mask_limit) {
/* Limit reached */
Expand All @@ -94,7 +118,7 @@ STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
/* We check the new mask on the same chars in the next iteration */
continue;
}
p += 4;
p += 8;
}
while (p < end) {
if (p[0] & mask) {
Expand Down
32 changes: 30 additions & 2 deletions Objects/unicodeobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -4700,6 +4700,8 @@ static Py_ssize_t
ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
{
const char *p = start;
const char *size_t_end = end - SIZEOF_SIZE_T;
const char *unrolled_end = end - (4 * SIZEOF_SIZE_T);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@encukou Should we still use macros such as SIZEOF_SIZE_T and ALIGNOF_SIZE_T, or should we prefer sizeof and _Alignof?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should prefer sizeof and _Alignof here.
Things are a bit different in public headers (where sizeof is OK but _Alignof isn't) or the preprocessor (in #if, neither work).

(There is no C API WG formal guideline yet; this is a personal recommendation. If anyone wants a wider discussion, do that in Discourse.)


#if SIZEOF_SIZE_T <= SIZEOF_VOID_P
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is CPython supported on some platform where size_t is strictly greater than void*? I know this PR doesn't touch this conditional, but I'm curious. @encukou

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any platform where sizeof(size_t) != sizeof(void*).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really about supported platforms (i.e. where we run the tests), or currently supported platforms.
We should aim for standard C whenever it's reasonable; if a speedup needs an extra assumption it should document that with this kind of #if.

if (_Py_IS_ALIGNED(p, ALIGNOF_SIZE_T)
Expand All @@ -4710,7 +4712,25 @@ ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
/* Help allocation */
const char *_p = p;
Py_UCS1 * q = dest;
while (_p + SIZEOF_SIZE_T <= end) {
while (_p <= unrolled_end) {
const size_t *restrict __p = (const size_t *)_p;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the restrict keyword and the assignment to value0, ..., value3 required for the compiler? If not, then this code can be written more compact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think the compiler can judge from the function that no writes happen in either __p or _q so pointer aliasing is not problematic. But I'd rather make this explicit rather than relying on the compiler to deduce this. MSVC and Clang do this scalarly despite the hint. GCC correctly vectorizes the load store and does not need the hint. I cannot judge all possible compilers out there. I think it is more likely that clang and MSVC will do the correct thing in the future with the hint than without it.

It would only save the restrict keyword as the (size_t *) cast is still needed, as that makes the rest of the code more readable. Are the 8 characters extra width that problematic?

size_t value0 = __p[0];
size_t value1 = __p[1];
size_t value2 = __p[2];
size_t value3 = __p[3];
size_t value = value0 | value1 | value2 | value3;
if (value & ASCII_CHAR_MASK) {
break;
}
size_t *restrict _q = (size_t *)q;
_q[0] = value0;
_q[1] = value1;
_q[2] = value2;
_q[3] = value3;
_p += (4 * SIZEOF_SIZE_T);
q += (4 * SIZEOF_SIZE_T);
}
while (_p <= size_t_end) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You technically have 3 blocks of SIZEOF_SIZE_T until you reach size_t_end. In this case, I'd suggest creating a macro which is the unrolling of that (I'm not sure the compiler knows this or not, but maybe it does).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion to check if the compiler unrolls this. After all, I removed the if statement and only check at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this leads to very suboptimal assembly. So the extra while loop is useful here.

size_t value = *(const size_t *) _p;
if (value & ASCII_CHAR_MASK)
break;
Expand All @@ -4733,7 +4753,15 @@ ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
if (_Py_IS_ALIGNED(p, ALIGNOF_SIZE_T)) {
/* Help allocation */
const char *_p = p;
while (_p + SIZEOF_SIZE_T <= end) {
while (_p <= unrolled_end) {
const size_t *restrict __p = (const size_t *)_p;
size_t value = __p[0] | __p[1] | __p[2] | __p[3];
if (value & ASCII_CHAR_MASK) {
break;
}
_p += (4 * SIZEOF_SIZE_T);
}
while (_p <= size_t_end) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, here you have only 3 chunks so maybe check whether you can unroll the loop once again (actually, just check whether the assembly with -O2 or -O3 unrolls stuff).

size_t value = *(const size_t *) _p;
if (value & ASCII_CHAR_MASK)
break;
Expand Down