Speed up do_format_decimal #4630

Merged
vitaut merged 4 commits into fmtlib:master from user202729:speed-up-format-decimal
Dec 25, 2025

@user202729
Contributor

@user202729 user202729 commented Dec 13, 2025

Minor speedup.
Before (using https://github.com/fmtlib/format-benchmark):


--------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile   13785856 ns     13748336 ns           50 items_per_second=72.7361M/s
fmt_format_int          13560583 ns     13522664 ns           51 items_per_second=73.9499M/s

After:

--------------------------------------------------------------------------------
Benchmark                      Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------
fmt_format_to_compile   13448662 ns     13411962 ns           52 items_per_second=74.5603M/s
fmt_format_int          13132046 ns     13090029 ns           53 items_per_second=76.394M/s

The idea is to avoid the modulo (which gets compiled to an imul and a sub) by looking at the 7 bits after the binary point of value / 100.

This adds 256+1 bytes of lookup table (unfortunately, the old lookup table still needs to stay). Technically, the null terminator and the last 2 spaces are unused.
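As a rough illustration of the lookup described above (not the code in this PR), the sketch below builds a 128-slot table with two characters per slot at run time and reads the last two decimal digits from it. The names `kMagic`, `slot`, `build_table` and `last_two_digits` are made up for this example; the multiplier and shift amounts are the ones quoted in this description. Filling the table from only the lowest and highest 100 inputs is an assumption of the sketch, based on the approximation error growing with the input.

```cpp
#include <array>
#include <cstdint>

// Multiplier quoted in the description: floor(2^39 / 100) + 1.
constexpr std::uint64_t kMagic = (1ull << 39) / 100 + 1;

// The 7 bits directly below bit 39 of i * kMagic, i.e. the first 7 bits
// after the binary point of the fixed-point approximation of i / 100.
inline unsigned slot(std::uint32_t i) {
  return static_cast<unsigned>((i * kMagic) >> (39 - 7)) & 127u;
}

// Hypothetical table builder: 128 slots of 2 characters each (256 bytes).
// Slots that no input reaches stay as two spaces.
inline std::array<char, 256> build_table() {
  std::array<char, 256> t;
  t.fill(' ');
  auto put = [&t](std::uint32_t i) {
    unsigned r = i % 100, s = slot(i);
    t[2 * s] = static_cast<char>('0' + r / 10);
    t[2 * s + 1] = static_cast<char>('0' + r % 10);
  };
  // Assumption of this sketch: the approximation error grows with i, so the
  // smallest and the largest inputs together reach every slot that is used.
  for (std::uint32_t k = 0; k < 100; ++k) {
    put(k);
    put(UINT32_MAX - k);
  }
  return t;
}

// Pointer to the two characters of i % 100, found without computing i % 100.
inline const char* last_two_digits(std::uint32_t i) {
  static const std::array<char, 256> table = build_table();
  return table.data() + 2 * slot(i);
}
```

Note that a remainder can land in two neighbouring slots depending on how large the input is, which is why the table has more than 100 populated entries.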

Correctness is shown by exhaustively searching the whole range of 32-bit integers and checking that, for all i, (i * ((1ull<<39)/100+1)) >> (39 - 7) & ((1<<7) - 1) uniquely determines the value of i % 100, and that ((i * ((1ull<<39)/100+1)) >> 39) + ((i>=(100u<<25))<<25) is exactly equal to i / 100.
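The correction term in the quotient formula accounts for the 64-bit product wrapping around: `100u << 25` is 3355443200, and from that input onward the product exceeds 2^64, so the shifted result comes out 2^25 too small. A minimal sketch of the quotient half follows, using the hypothetical names `kMagic` and `div100`; it is meant for spot checks around that boundary, not as a replacement for the exhaustive check.

```cpp
#include <cstdint>

// Multiplier from the description: floor(2^39 / 100) + 1.
constexpr std::uint64_t kMagic = (1ull << 39) / 100 + 1;

// Hypothetical helper: i / 100 for any 32-bit i without a division.
inline std::uint32_t div100(std::uint32_t i) {
  // The product is taken modulo 2^64, so for i >= 100 * 2^25 it wraps once.
  std::uint64_t product = i * kMagic;
  std::uint32_t q = static_cast<std::uint32_t>(product >> 39);
  // A lost 2^64 shows up as a missing 2^25 after the shift by 39.
  std::uint32_t fix = static_cast<std::uint32_t>(i >= (100u << 25)) << 25;
  return q + fix;
}
```

For instance, `div100(3355443199u)` takes the uncorrected path and `div100(3355443200u)` is the first input that needs the correction.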

The check sizeof(UInt) == 4 implicitly assumes CHAR_BIT == 8 (is it worth being spelled out?)

Source code of brute force checker
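// Brute-force check over every 32-bit unsigned value (uses GCC/Clang builtins).
using ull = unsigned long long;
// Multiplier: floor(2^39 / 100) + 1.
const ull magic = (1ull<<39)/100+1;
int main(){
	// lookup[k]: remainder first seen for 7-bit index k, or -1 if unseen.
	int lookup [1<<7];
	for (int i = 1<<7; i-->0;){
		lookup[i] = -1;
	}
	for(unsigned i=0;;){
		// Index: the 7 bits directly below bit 39 of the 64-bit product.
		auto& l = lookup[(i * magic) >> (39 - 7) & ((1<<7) - 1)];
		if(l<0) l = i % 100;
		// Print i if its index is already tied to a different remainder.
		if(l != int(i % 100))
			__builtin_printf("%u\n", i);
		// Print i if the corrected quotient differs from i / 100.
		if(((i * magic) >> 39) + ((i>=(100u<<25))<<25) != i / 100)
			__builtin_printf(">%u %u %u\n", i, i/100, unsigned((i * magic) >> 39));
		if(++i==0) break;
	}
	// Dump the table as string literals, 16 two-character entries per line.
	for(unsigned i=0; i<sizeof(lookup)/sizeof(lookup[0]); ++i) {
		if (i % 16 == 0) __builtin_printf("\"");
		if (lookup[i] < 0)
			__builtin_printf("  ");
		else
			__builtin_printf("%02d", lookup[i]);
		if ((i+1) % 16 == 0) __builtin_printf("\"\n");
	}
}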

Future work:

  • adapt to write_significand
  • generalize algorithm to work with 64-bit input (will need __int128).

note:

  • digits2_i is not constexpr (before C++20)
  • write2digits_i is not constexpr either, so there's no need for the std::is_constant_evaluated
  • I don't understand why we don't want memcpy if FMT_OPTIMIZE_SIZE is true, but write2digits does that.
  • apparently gcc cannot merge two char loads/stores into one short load/store (even when both are loaded into temporaries and then stored back, so aliasing is not a concern here).
  • the benchmark has a very large proportion of values with at most 4 digits, which is why parallel multiplication such as in hofman_fun will always be slower.
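
For the two-byte copy mentioned in the notes, the following sketch contrasts two separate char stores with a fixed-size memcpy. The function names are hypothetical and are not {fmt}'s; whether a given compiler turns either form into a single 16-bit load/store is something to check in its generated assembly rather than assume.

```cpp
#include <cstring>

// Copies two digit characters one byte at a time.
inline void copy2_bytes(char* out, const char* digits) {
  out[0] = digits[0];
  out[1] = digits[1];
}

// Copies the same two characters with a fixed-size memcpy, which places no
// alignment requirement on either pointer.
inline void copy2_memcpy(char* out, const char* digits) {
  std::memcpy(out, digits, 2);
}
```

Both functions produce the same bytes; only the generated code may differ.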

@user202729
Contributor Author

Sorry for the CI failures. That said, I recommend adding to CONTRIBUTING.md the commands to verify the lint/compiler warnings etc.

@vitaut
Contributor

vitaut commented Dec 16, 2025

Thanks for the PR! Could you check how it performs on itoa-benchmark (https://github.com/fmtlib/format-benchmark/tree/master/src/itoa-benchmark)?

@user202729
Contributor Author

I made a pull request, fmtlib/format-benchmark#31, that adds fmt as an option of itoa_benchmark. Let me know if that accurately benchmarks the {fmt} library's performance.

@vitaut
Contributor

vitaut commented Dec 19, 2025

Thanks for adding fmt to itoa-benchmark. Have you checked the results of your change there and could you post them here?

@vitaut vitaut merged commit 7ad8004 into fmtlib:master Dec 25, 2025
41 checks passed
@vitaut
Contributor

vitaut commented Dec 25, 2025

Anyway, I merged it and will take a closer look at the whole numeric formatting later. Thanks!

@user202729
Contributor Author

Sorry, I forgot about this.

That said, looking at the output of itoa-benchmark: why are jeaiii/tmueller/unrolledlut much faster than fmt? I wonder if fmt may try to optimize for something else in parallel (code size?), or if not we may want to just switch to something similar...

@vitaut
Contributor

vitaut commented Dec 27, 2025

It's been a while since I looked at it, but IIRC some of those methods only performed well for a fixed number of digits due to excessive branching (which wasn't reflected in that particular benchmark). In any case, I plan to revamp numeric formatting and am currently looking into FP (https://github.com/vitaut/zmij), with integer to follow.

Comment thread include/fmt/format.h
// the decimal point of i / 100 in base 2, the first 2 bytes
// after digits2_i(x) is the string representation of i.
inline auto digits2_i(size_t value) -> const char* {
alignas(2) static const char data[] =
Contributor

@Skylion007 Skylion007 Dec 30, 2025

Any reason not to make this array constexpr? I think this might enable certain compilers to do static compile-time bounds checks if it's constexpr, especially if the whole function is constexpr.

Actually, I see why this is done, given the C++14 compatibility requirements? Frustrating, though, as the arrays could be merged more easily across translation units in C++17 or newer.

Contributor

@Skylion007 Skylion007 Dec 30, 2025

Should probably mark as noexcept though like the following for consistency: 3269c1c#diff-bdc6f79e8e9f5b4331d66fb785636a87d29f55cf729865e13925b4209424c878R1033

Contributor Author

@user202729 user202729 Dec 31, 2025

I copied the style of the digits2 function, where the data array is not constexpr. I don't know what the compatibility requirement is exactly. If this array can be changed to constexpr, then similar arrays should be as well.

Edit: hm, maybe the noexcept wasn't in digits2 when I looked.
