Speed up do_format_decimal by user202729 · Pull Request #4630 · fmtlib/fmt

user202729 · 2025-12-13T14:48:27Z

Minor speedup: about 2.4% and 3.2% less time on the two benchmarks below.
Before (using https://github.com/fmtlib/format-benchmark):


--------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile   13785856 ns     13748336 ns           50 items_per_second=72.7361M/s
fmt_format_int          13560583 ns     13522664 ns           51 items_per_second=73.9499M/s

After:

--------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile   13448662 ns     13411962 ns           52 items_per_second=74.5603M/s
fmt_format_int          13132046 ns     13090029 ns           53 items_per_second=76.394M/s

The idea is to avoid the modulo (which gets compiled to an imul and a sub) by looking at the first 7 bits after the binary point of value / 100.

This adds 256+1 bytes worth of lookup table (the old lookup table still needs to stay, unfortunately), although technically the null terminator and the last 2 spaces are unused.
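A minimal sketch of the idea, using the same constants as the description. The names `kRecip100`, `remainder_table` and `divmod100` are hypothetical, and the table here stores numeric remainders filled by a closed form that is an assumption of this sketch; the PR stores digit characters and derives its table by brute force.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Fixed-point reciprocal of 100 with 39 fractional bits.
constexpr std::uint64_t kRecip100 = (1ull << 39) / 100 + 1;

// Hypothetical stand-in for the 128-entry table: maps the 7 fractional bits
// to i % 100. The closed form is an assumption of this sketch, not PR code.
inline const std::array<std::uint8_t, 128>& remainder_table() {
  static const std::array<std::uint8_t, 128> table = [] {
    std::array<std::uint8_t, 128> t{};
    for (unsigned idx = 0; idx < 128; ++idx)
      t[idx] = static_cast<std::uint8_t>(((idx + 1) * 100) >> 7);
    return t;
  }();
  return table;
}

struct DivMod100 {
  std::uint32_t quotient;
  std::uint32_t remainder;
};

inline DivMod100 divmod100(std::uint32_t i) {
  // The product wraps modulo 2^64 once i >= 100 * 2^25; only bit 64 is lost.
  const std::uint64_t p = i * kRecip100;
  // Integer part: bits 39 and up, plus the 2^25 dropped by the wraparound.
  const std::uint32_t q =
      static_cast<std::uint32_t>(p >> 39) +
      (static_cast<std::uint32_t>(i >= (100u << 25)) << 25);
  // The first 7 bits after the binary point select the remainder.
  const unsigned idx = static_cast<unsigned>(p >> (39 - 7)) & ((1u << 7) - 1);
  return {q, remainder_table()[idx]};
}
```

For example, `divmod100(12345)` yields quotient 123 and remainder 45 without a division or modulo instruction on the hot path.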

Correctness is shown by exhaustively searching the whole range of 32-bit integers and checking that, for all i, (i * ((1ull<<39)/100+1)) >> (39 - 7) & ((1<<7) - 1) uniquely determines the value of i % 100 and ((i * ((1ull<<39)/100+1)) >> 39) + ((i>=(100u<<25))<<25) is exactly equal to i / 100.
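As a supplement to the exhaustive check, the following short argument (not taken from the PR) suggests why the two expressions work. Here M is the multiplier, and q and r are the quotient and remainder of i divided by 100.

```latex
M = \left\lfloor \frac{2^{39}}{100} \right\rfloor + 1 = 5497558139,
\qquad 100\,M = 2^{39} + 12

i = 100q + r,\quad 0 \le r < 100
\;\Longrightarrow\;
iM = q\,2^{39} + (12q + rM)

i < 2^{32} \;\Longrightarrow\; q \le 42949672
\;\Longrightarrow\;
12q + rM \le 12 \cdot 42949672 + 99M < 2^{39}

\left\lfloor \frac{iM}{2^{39}} \right\rfloor = q,
\qquad
\left\lfloor \frac{iM}{2^{32}} \right\rfloor \bmod 2^{7}
  = \left\lfloor \frac{12q + rM}{2^{32}} \right\rfloor,
\qquad
1.28\,r \le \frac{12q + rM}{2^{32}} < 1.28\,r + 0.13
```

Consecutive remainders are 1.28 apart, which is more than 1 + 0.13, so two different remainders can never produce the same 7-bit value. The product reaches 2^64 exactly when q >= 2^25, that is when i >= 100 * 2^25, and it always stays below 2^65, so 64-bit truncation drops exactly 2^25 from the quotient; the term ((i>=(100u<<25))<<25) adds it back.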

The check sizeof(UInt) == 4 implicitly assumes CHAR_BIT == 8 (is it worth being spelled out?)
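One way the assumption could be spelled out, shown as a sketch. The helper names are hypothetical and not part of {fmt}; `std::numeric_limits<T>::digits` counts value bits, so for unsigned types it does not depend on CHAR_BIT.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when the unsigned type UInt has exactly
// 32 value bits, regardless of how wide a byte is.
template <typename UInt>
constexpr bool has_32_value_bits() {
  return std::numeric_limits<UInt>::digits == 32;
}

// The size-based form only means "32 bits" when bytes are 8 bits wide,
// so this variant states that condition explicitly.
template <typename UInt>
constexpr bool is_4_octets() {
  return sizeof(UInt) == 4 && CHAR_BIT == 8;
}

static_assert(has_32_value_bits<std::uint32_t>(), "uint32_t has 32 value bits");
static_assert(!has_32_value_bits<std::uint64_t>(), "uint64_t has 64 value bits");
```

On mainstream platforms both predicates agree, so the choice is mostly about documenting intent.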

Source code of brute force checker

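// Brute-force check over all 32-bit values (uses the GCC/Clang __builtin_printf).
using ull = unsigned long long;
int main(){
	// lookup[k] is i % 100 for the 7-bit index k, or -1 if k never occurs.
	int lookup [1<<7];
	for (int i = 1<<7; i-->0;){
		lookup[i] = -1;
	}
	for(unsigned i=0;;){
		// Index: the 7 bits just below bit 39 of the fixed-point product.
		auto& l = lookup[(i * ((1ull<<39)/100+1)) >> (39 - 7) & ((1<<7) - 1)];
		if(l<0) l = i % 100;
		// Report any i whose index was already claimed by a different remainder.
		if(l != (i % 100))
			__builtin_printf("%u\n", i);
		// Report any i whose corrected quotient differs from i / 100.
		if(((i * ((1ull<<39)/100+1)) >> 39) + ((i>=(100u<<25))<<25) != i / 100)
			__builtin_printf(">%u %u %u\n", i, i/100, unsigned((i * ((1ull<<39)/100+1)) >> 39));
		// Stop once i wraps around to 0, i.e. after the last 32-bit value.
		if(++i==0) break;
	}
	// Print the table as string literals, 16 entries per line; unused entries are two spaces.
	for(unsigned i=0; i<sizeof(lookup)/sizeof(lookup[0]); ++i) {
		if (i % 16 == 0) __builtin_printf("\"");
		if (lookup[i] < 0)
			__builtin_printf("  ");
		else
			__builtin_printf("%02d", lookup[i]);
		if ((i+1) % 16 == 0) __builtin_printf("\"\n");
	}
}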

Future work:

adapt to write_significand
generalize algorithm to work with 64-bit input (will need __int128).
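For the 64-bit item, a sketch of the quotient part only, assuming the GCC/Clang `unsigned __int128` extension. It uses the standard multiply-by-reciprocal division and does not attempt to extend the fractional-bit remainder trick; `div100_u64` and `u128` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the compiler provides the GCC/Clang unsigned __int128 extension.
__extension__ typedef unsigned __int128 u128;

// Quotient only: x / 100 == (x / 4) / 25, and the division by 25 becomes a
// multiply by ceil(2^66 / 25) followed by a right shift of 66 bits.
inline std::uint64_t div100_u64(std::uint64_t x) {
  const std::uint64_t recip25 = 0x28F5C28F5C28F5C3ull;  // ceil(2^66 / 25)
  const u128 product = static_cast<u128>(x >> 2) * recip25;
  return static_cast<std::uint64_t>(product >> 66);
}
```

Whether the remainder can then be read from a few fractional bits of the 128-bit product, as in the 32-bit case, would need its own check.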

note:

digits2_i is not constexpr (before C++20)
write2digits_i is not constexpr either, so there's no need for the std::is_constant_evaluated check
I don't understand why we don't want memcpy if FMT_OPTIMIZE_SIZE is true, but write2digits does that.
apparently gcc cannot merge two char loads/stores into one short load/store (even when both are loaded into temporaries and then stored back, so aliasing is not a concern here).
the benchmark has a very large proportion of values with at most 4 digits, which is why parallel multiplication such as in hofman_fun will always be slower.
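The memcpy and char load/store notes can be illustrated with two ways of copying a digit pair. This is a sketch with hypothetical names (`digit_pair`, `put2_memcpy`, `put2_bytes`), not the {fmt} implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a digit-pair table: "00", "01", ..., "99".
inline const char* digit_pair(std::size_t value) {
  assert(value < 100);
  static const char data[] =
      "0001020304050607080910111213141516171819"
      "2021222324252627282930313233343536373839"
      "4041424344454647484950515253545556575859"
      "6061626364656667686970717273747576777879"
      "8081828384858687888990919293949596979899";
  return &data[value * 2];
}

// Variant 1: a fixed-size memcpy, which optimizing compilers typically
// lower to a single 16-bit load and store.
inline void put2_memcpy(char* out, std::size_t value) {
  std::memcpy(out, digit_pair(value), 2);
}

// Variant 2: two separate byte copies; per the note above, gcc may keep
// these as two byte-sized loads and stores.
inline void put2_bytes(char* out, std::size_t value) {
  const char* d = digit_pair(value);
  out[0] = d[0];
  out[1] = d[1];
}
```

Both variants write the same two characters; only the generated code may differ.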

user202729 · 2025-12-14T05:36:58Z

Sorry for the CI failures. That said, I recommend adding to CONTRIBUTING.md the commands to verify the lint/compiler warnings etc.

vitaut · 2025-12-16T14:40:40Z

Thanks for the PR! Could you check how it performs on itoa-benchmark (https://github.com/fmtlib/format-benchmark/tree/master/src/itoa-benchmark)?

user202729 · 2025-12-18T05:56:51Z

I made a pull request, fmtlib/format-benchmark#31, that adds fmt as an option in itoa_benchmark. Let me know if that accurately benchmarks the {fmt} library's performance.

vitaut · 2025-12-19T18:52:39Z

Thanks for adding fmt to itoa-benchmark. Have you checked the results of your change there and could you post them here?

vitaut · 2025-12-25T16:22:32Z

Anyway, I merged it and will take a closer look at the whole numeric formatting later. Thanks!

user202729 · 2025-12-25T17:15:28Z

Sorry, I forgot about this.

That said, looking at the output of itoa-benchmark: why are jeaiii/tmueller/unrolledlut much faster than fmt? I wonder if fmt may try to optimize for something else in parallel (code size?), or if not we may want to just switch to something similar...

vitaut · 2025-12-27T23:17:35Z

It's been a while since I looked at it but IIRC some of those methods only performed well for a fixed number of digits due to excessive branching (which wasn't reflected in that particular benchmark). In any case, I plan to revamp numeric formatting and am currently looking into FP (https://github.com/vitaut/zmij), with integer to follow.

Skylion007 · 2025-12-30T17:20:42Z

+// the decimal point of i / 100 in base 2, the first 2 bytes
+// after digits2_i(x) is the string representation of i.
+inline auto digits2_i(size_t value) -> const char* {
+  alignas(2) static const char data[] =


Any reason not to make this array constexpr? I think this might enable certain compilers to do static compile-time bounds checks if it's constexpr, especially if the whole function is constexpr.

Actually, I see why this is done, given the C++14 compatibility requirements? Frustrating though, as they could be merged more easily across translation units otherwise in C++17 or newer.

Should probably mark as noexcept though like the following for consistency: 3269c1c#diff-bdc6f79e8e9f5b4331d66fb785636a87d29f55cf729865e13925b4209424c878R1033

user202729 · 2025-12-31

I copied the style of the digits2 function, where the data array is not constexpr. I don't know what the compatibility requirement is exactly. If this array can be changed to constexpr then similar arrays should be as well.

Edit: hm, maybe the noexcept wasn't in digits2 when I looked.
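To make the constexpr and noexcept suggestions concrete, here is a sketch of what such a variant could look like. `digit_pair_constexpr` is a hypothetical name and the table is shortened to ten entries; this is not the code in the PR or in {fmt}.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of a digit-pair accessor, not the code in the PR.
inline auto digit_pair_constexpr(std::size_t value) noexcept -> const char* {
  assert(value < 10);  // shortened table: only "00" through "09"
  // A static constexpr local is valid since C++11 here because the enclosing
  // function is not constexpr. Inside a constexpr function, a static
  // constexpr local is only permitted from C++23.
  alignas(2) static constexpr char data[] = "00010203040506070809";
  return &data[value * 2];
}

static_assert(noexcept(digit_pair_constexpr(0)), "accessor is noexcept");
```

Making the function itself constexpr on older standards would require moving the array out of the function, which is where the C++17 inline-variable point comes in.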

user202729 added 3 commits December 13, 2025 21:39

Speed up do_format_decimal

a016694

Fix compiler warning

4435698

Fix lint

79d8430

Avoid failure if sizeof(unsigned long long) > 8

e4e2e22

vitaut merged commit 7ad8004 into fmtlib:master Dec 25, 2025
41 checks passed

Skylion007 reviewed Dec 30, 2025

vitaut merged 4 commits into fmtlib:master from user202729:speed-up-format-decimal
