
Commit 822dfbd

Nicolas Pitre authored and andrewboie committed
lib/os/prf.c: alternate implementation for _ldiv5()
_ldiv5() is an optimized divide-by-5 function that is smaller and faster
than the generic libgcc implementation. Yet it can be made even smaller
and faster with this replacement implementation based on a reciprocal
multiplication plus some tricks.

For example, here's the assembly from the original code on ARM:

_ldiv5:
        ldr     r3, [r0]
        movw    ip, #52429
        ldr     r1, [r0, #4]
        movt    ip, 52428
        adds    r3, r3, #2
        push    {r4, r5, r6, r7, lr}
        mov     lr, #0
        adc     r1, r1, lr
        adds    r2, lr, lr
        umull   r7, r6, ip, r1
        lsr     r6, r6, #2
        adc     r7, r6, r6
        adds    r2, r2, r2
        adc     r7, r7, r7
        adds    r2, r2, lr
        adc     r7, r7, r6
        subs    r3, r3, r2
        sbc     r7, r1, r7
        lsr     r2, r3, #3
        orr     r2, r2, r7, lsl #29
        umull   r2, r1, ip, r2
        lsr     r2, r1, #2
        lsr     r7, r1, #31
        lsl     r1, r2, #3
        adds    r4, lr, r1
        adc     r5, r6, r7
        adds    r2, r1, r1
        adds    r2, r2, r2
        adds    r2, r2, r1
        subs    r2, r3, r2
        umull   r3, r2, ip, r2
        lsr     r2, r2, #2
        adds    r4, r4, r2
        adc     r5, r5, #0
        strd    r4, [r0]
        pop     {r4, r5, r6, r7, pc}

And here's the resulting assembly with this commit applied:

_ldiv5:
        push    {r4, r5, r6, r7}
        movw    r4, #13107
        ldr     r6, [r0]
        movt    r4, 13107
        ldr     r1, [r0, #4]
        mov     r3, #0
        umull   r6, r7, r6, r4
        add     r2, r4, r4, lsl #1
        umull   r4, r5, r1, r4
        adds    r1, r6, r2
        adc     r2, r7, r2
        adds    ip, r6, r4
        adc     r1, r7, r5
        adds    r2, ip, r2
        adc     r2, r1, r3
        adds    r2, r4, r2
        adc     r3, r5, r3
        strd    r2, [r0]
        pop     {r4, r5, r6, r7}
        bx      lr

So we're down to 20 instructions from the initial 36, with only 2 umull
instructions instead of 3, and a slightly smaller stack footprint.

Signed-off-by: Nicolas Pitre <[email protected]>
1 parent 0a5b259 commit 822dfbd
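
Note (not part of the commit): the arithmetic described in the commit message can be restated in a few lines of plain C using a 128-bit intermediate. This is an illustrative sketch only; the function name ldiv5_by_reciprocal and the use of GCC/Clang's __uint128_t are assumptions made here for clarity, since the in-tree _ldiv5() in the diff below deliberately builds the same product from 32-bit partial multiplications instead.

#include <stdint.h>

/* Divide *v by 5, rounding to nearest, via reciprocal multiplication:
 *
 *     result = (v + 3) * floor((1 << 64) / 5) >> 64
 *
 * where the bias of 3 = 1 (to compensate for the truncated multiplier)
 * + 2 (so the result rounds to nearest instead of truncating), exactly
 * as the comment added by this commit explains.
 */
static void ldiv5_by_reciprocal(uint64_t *v)
{
        const __uint128_t m = 0x3333333333333333ULL; /* floor((1 << 64) / 5) */

        *v = (uint64_t)((((__uint128_t)*v + 3) * m) >> 64);
}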

File tree

1 file changed: +41 -22 lines changed


lib/os/prf.c

Lines changed: 41 additions & 22 deletions
@@ -130,37 +130,56 @@ static void _rlrshift(uint64_t *v)
  * sense to define this much smaller special case here to avoid
  * including it just for printf.
  *
- * It works by iteratively dividing the most significant 32 bits of
- * the 64 bit value by 5. This will leave a remainder of 0-4
- * (i.e. three significant bits), ensuring that the top 29 bits of the
- * remainder are zero for the next iteration. Thus in the second
- * iteration only 35 significant bits remain, and in the third only
- * six. This was tested exhaustively through the first ~10B values in
- * the input space, and for ~2e12 (4 hours runtime) random inputs
- * taken from the full 64 bit space.
+ * It works by multiplying v by the reciprocal of 5 i.e.:
+ *
+ *     result = v * ((1 << 64) / 5) / (1 << 64)
+ *
+ * This produces a 128-bit result, but we drop the bottom 64 bits which
+ * accounts for the division by (1 << 64). The product is kept to 64 bits
+ * by summing partial multiplications and shifting right by 32 which on
+ * most 32-bit architectures means only a register drop.
+ *
+ * Here the multiplier is: (1 << 64) / 5 = 0x3333333333333333
+ * i.e. a 62 bits value. To compensate for the reduced precision, we
+ * add an initial bias of 1 to v. Enlarging the multiplier to 64 bits
+ * would also work but a final right shift would be needed, and carry
+ * handling on the summing of partial mults would be necessary, requiring
+ * more instructions. Given that we already want to add bias of 2 for
+ * the result to be rounded to nearest and not truncated, we might as well
+ * combine those together into a bias of 3. This also conveniently allows
+ * for keeping the multiplier in a single 32-bit register given its pattern.
  */
 static void _ldiv5(uint64_t *v)
 {
-        uint32_t hi;
-        uint64_t rem = *v, quot = 0U, q;
-        int i;
+        uint32_t v_lo = *v;
+        uint32_t v_hi = *v >> 32;
+        uint32_t m = 0x33333333;
+        uint64_t result;

-        static const char shifts[] = { 32, 3, 0 };
+        /*
+         * Force the multiplier constant into a register and make it
+         * opaque to the compiler, otherwise gcc tries to be too smart
+         * for its own good with a large expansion of adds and shifts.
+         */
+        __asm__ ("" : "+r" (m));

         /*
-         * Usage in this file wants rounded behavior, not truncation. So add
-         * two to get the threshold right.
+         * Apply the bias of 3. We can't add it to v as this would overflow
+         * it when at max range. Factor it out with the multiplier upfront.
+         * Here we multiply the low and high parts separately to avoid an
+         * unnecessary 64-bit add-with-carry.
          */
-        rem += 2U;
+        result = ((uint64_t)(m * 3U) << 32) | (m * 3U);

-        for (i = 0; i < 3; i++) {
-                hi = rem >> shifts[i];
-                q = (uint64_t)(hi / 5U) << shifts[i];
-                rem -= q * 5U;
-                quot += q;
-        }
+        /* The actual multiplication. */
+        result += (uint64_t)v_lo * m;
+        result >>= 32;
+        result += (uint64_t)v_lo * m;
+        result += (uint64_t)v_hi * m;
+        result >>= 32;
+        result += (uint64_t)v_hi * m;

-        *v = quot;
+        *v = result;
 }

 static char _get_digit(uint64_t *fr, int *digit_count)
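
As a quick sanity check (again, not part of the commit), the sketch below inlines the same partial-product sequence as the new _ldiv5() and compares it against a straightforward rounded divide-by-5 reference on a few edge values. The names div5_rounded_ref and ldiv5_check, the sample set, and the main() scaffolding are illustrative; the __asm__ barrier from the real code is omitted since it only affects code generation, not the result.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Rounded divide-by-5 reference: floor((v + 2) / 5), computed without
 * overflowing v + 2 at the top of the range. */
static uint64_t div5_rounded_ref(uint64_t v)
{
        return v / 5U + ((v % 5U + 2U) >= 5U ? 1U : 0U);
}

/* Same partial-product sequence as the new _ldiv5(), returning the result. */
static uint64_t ldiv5_check(uint64_t v)
{
        uint32_t v_lo = (uint32_t)v;
        uint32_t v_hi = (uint32_t)(v >> 32);
        uint32_t m = 0x33333333;
        uint64_t result = ((uint64_t)(m * 3U) << 32) | (m * 3U);

        result += (uint64_t)v_lo * m;
        result >>= 32;
        result += (uint64_t)v_lo * m;
        result += (uint64_t)v_hi * m;
        result >>= 32;
        result += (uint64_t)v_hi * m;

        return result;
}

int main(void)
{
        const uint64_t samples[] = {
                0, 1, 2, 3, 4, 5, 9, 10, 0xFFFFFFFFULL,
                0x100000000ULL, 0xFFFFFFFFFFFFFFFFULL
        };

        for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                assert(ldiv5_check(samples[i]) == div5_rounded_ref(samples[i]));
        }
        printf("ok\n");
        return 0;
}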
