Remove forced inlining for ffc_digit_comp #16

Merged
kolemannix merged 1 commit into kolemannix:main from wlagos3:remove_digit_comp_force_inline
Mar 6, 2026
Conversation

@wlagos3
Contributor

@wlagos3 wlagos3 commented Mar 6, 2026

ffc_digit_comp shouldn't be inlined for a few reasons.

  • It's very rarely called.
  • It's hundreds of instructions, which adds a lot of pressure to the L1 instruction cache.
  • It has sizeable stack allocations for using bigints, which adds some pressure to the L1 data cache.

Removing the forced inline directive will most likely result in the compiler never inlining the function because it's too large.
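
To illustrate the general pattern described above, here is a minimal sketch of a hot parsing path that calls out to a rarely used, stack-heavy slow path. It is not the ffc source: `SKETCH_COLD`, `sketch_slow_path`, and `sketch_parse` are hypothetical names, and the slow path only stands in for bigint work. The `noinline` and `cold` attributes are GCC/Clang extensions and go one step further than this PR, which simply drops the forced-inline directive and leaves the decision to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro for this sketch; not a name used by ffc. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_COLD __attribute__((noinline, cold))
#else
#define SKETCH_COLD
#endif

/* Rare slow path. The 64-limb buffer stands in for the sizeable
 * stack allocations a bigint routine would need. */
SKETCH_COLD static uint64_t sketch_slow_path(const char *s, size_t len) {
    uint32_t limbs[64] = {0};
    for (size_t i = 0; i < len; i++) {
        limbs[i % 64] = limbs[i % 64] * 10u + (uint32_t)(s[i] - '0');
    }
    uint64_t acc = 0;
    for (size_t i = 0; i < 64; i++) {
        acc += limbs[i];
    }
    return acc;
}

/* Hot path: short inputs are handled inline, so the caller's code
 * stays small. Only long inputs pay for a call into the slow path. */
static inline uint64_t sketch_parse(const char *s, size_t len) {
    if (len <= 19) {
        uint64_t v = 0;
        for (size_t i = 0; i < len; i++) {
            v = v * 10u + (uint64_t)(s[i] - '0');
        }
        return v;
    }
    return sketch_slow_path(s, len);
}
```

Keeping the slow path out of line means its instructions and stack frame are only touched when the rare case actually occurs.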

Benchmark Results

Without Forced Inlining

% sudo ./build/benchmarks/benchmark
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints 
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
ASCII volume = 2.09808 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   402.90 MB/s (+/- 1.5 %)    19.20 Mfloat/s      28.24 i/B   621.36 i/f (+/- 0.0 %)      7.64 c/B   168.03 c/f (+/- 0.6 %)      3.70 i/c     96.76 b/f    287.58 bm/f      3.23 GHz 
strtod                                  :   788.06 MB/s (+/- 1.1 %)    37.56 Mfloat/s      21.42 i/B   471.13 i/f (+/- 0.0 %)      3.91 c/B    85.99 c/f (+/- 0.5 %)      5.48 i/c     92.00 b/f      0.01 bm/f      3.23 GHz 
abseil                                  :   914.08 MB/s (+/- 1.2 %)    43.57 Mfloat/s      24.01 i/B   528.18 i/f (+/- 0.0 %)      3.37 c/B    74.12 c/f (+/- 0.7 %)      7.13 i/c    115.01 b/f      0.54 bm/f      3.23 GHz 
fastfloat                               :  1449.79 MB/s (+/- 1.2 %)    69.10 Mfloat/s      15.19 i/B   334.16 i/f (+/- 0.0 %)      2.12 c/B    46.75 c/f (+/- 0.6 %)      7.15 i/c     57.00 b/f      0.62 bm/f      3.23 GHz 
ffc                                     :  1626.32 MB/s (+/- 1.6 %)    77.51 Mfloat/s      13.60 i/B   299.16 i/f (+/- 0.0 %)      1.89 c/B    41.68 c/f (+/- 0.6 %)      7.18 i/c     51.00 b/f      0.62 bm/f      3.23 GHz 
UTF-16 volume = 4.19617 MB 
fastfloat                               :  2827.37 MB/s (+/- 2.3 %)    67.38 Mfloat/s       7.59 i/B   334.16 i/f (+/- 0.0 %)      1.09 c/B    47.97 c/f (+/- 1.8 %)      6.97 i/c     56.00 b/f      0.62 bm/f      3.23 GHz 
# You can also provide a filename (with the -f flag): it should contain one string per line corresponding to a number
% sudo ./build/benchmarks/benchmark32
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints 
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB 
loaded db: a14 (Apple A14/M1)
strtof                                  :   834.53 MB/s (+/- 2.8 %)    39.78 Mfloat/s      20.51 i/B   451.13 i/f (+/- 0.0 %)      3.69 c/B    81.19 c/f (+/- 0.7 %)      5.56 i/c   9000.10 b/f      3.23 GHz 
abseil                                  :   915.86 MB/s (+/- 1.5 %)    43.65 Mfloat/s      24.01 i/B   528.18 i/f (+/- 0.0 %)      3.36 c/B    73.98 c/f (+/- 0.7 %)      7.14 i/c  11501.05 b/f      3.23 GHz 
fastfloat                               :  1527.04 MB/s (+/- 2.1 %)    72.78 Mfloat/s      14.64 i/B   322.13 i/f (+/- 0.0 %)      2.02 c/B    44.38 c/f (+/- 0.7 %)      7.26 i/c   5500.09 b/f      3.23 GHz 
ffc                                     :  1643.62 MB/s (+/- 1.9 %)    78.34 Mfloat/s      13.32 i/B   293.13 i/f (+/- 0.0 %)      1.87 c/B    41.24 c/f (+/- 1.2 %)      7.11 i/c   5300.09 b/f      3.23 GHz 
UTF-16 volume = 4.19617 MB 
fastfloat                               :  2971.70 MB/s (+/- 2.0 %)    70.82 Mfloat/s       7.32 i/B   322.13 i/f (+/- 0.0 %)      1.04 c/B    45.63 c/f (+/- 1.3 %)      7.06 i/c   5400.09 b/f      3.23 GHz 
# You can also provide a filename (with the -f flag): it should contain one string per line corresponding to a number
% sudo ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines 
ASCII volume = 1.93374 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   383.46 MB/s (+/- 4.7 %)    22.04 Mfloat/s      29.97 i/B   546.94 i/f (+/- 0.1 %)      8.01 c/B   146.07 c/f (+/- 1.0 %)      3.74 i/c     81.77 b/f    203.53 bm/f      3.22 GHz 
strtod                                  :   685.72 MB/s (+/- 4.1 %)    39.41 Mfloat/s      24.43 i/B   445.75 i/f (+/- 0.0 %)      4.49 c/B    81.90 c/f (+/- 1.1 %)      5.44 i/c     88.46 b/f     10.08 bm/f      3.23 GHz 
abseil                                  :   856.79 MB/s (+/- 4.0 %)    49.24 Mfloat/s      24.61 i/B   449.03 i/f (+/- 0.0 %)      3.60 c/B    65.61 c/f (+/- 0.9 %)      6.84 i/c     96.53 b/f     10.47 bm/f      3.23 GHz 
fastfloat                               :   982.57 MB/s (+/- 3.9 %)    56.47 Mfloat/s      18.56 i/B   338.62 i/f (+/- 0.0 %)      3.14 c/B    57.22 c/f (+/- 1.0 %)      5.92 i/c     58.90 b/f     10.82 bm/f      3.23 GHz 
ffc                                     :  1139.65 MB/s (+/- 5.4 %)    65.49 Mfloat/s      16.72 i/B   305.09 i/f (+/- 0.0 %)      2.70 c/B    49.20 c/f (+/- 1.3 %)      6.20 i/c     55.40 b/f     10.76 bm/f      3.22 GHz 
UTF-16 volume = 3.86749 MB 
fastfloat                               :  2071.04 MB/s (+/- 3.9 %)    59.51 Mfloat/s       9.02 i/B   329.13 i/f (+/- 0.0 %)      1.49 c/B    54.25 c/f (+/- 1.1 %)      6.07 i/c     57.91 b/f     10.75 bm/f      3.23 GHz 
% sudo ./build/benchmarks/benchmark -f data/mesh.txt  
# read 73019 lines 
ASCII volume = 0.536009 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   524.36 MB/s (+/- 6.1 %)    71.43 Mfloat/s      33.08 i/B   254.62 i/f (+/- 0.0 %)      5.87 c/B    45.20 c/f (+/- 0.9 %)      5.63 i/c     42.56 b/f     50.76 bm/f      3.23 GHz 
strtod                                  :   523.68 MB/s (+/- 1.9 %)    71.34 Mfloat/s      33.78 i/B   260.02 i/f (+/- 0.0 %)      5.88 c/B    45.28 c/f (+/- 0.8 %)      5.74 i/c     54.65 b/f     11.77 bm/f      3.23 GHz 
abseil                                  :   412.39 MB/s (+/- 2.1 %)    56.18 Mfloat/s      47.48 i/B   365.43 i/f (+/- 0.0 %)      7.47 c/B    57.51 c/f (+/- 0.8 %)      6.35 i/c     80.08 b/f     13.26 bm/f      3.23 GHz 
fastfloat                               :   775.79 MB/s (+/- 2.1 %)   105.68 Mfloat/s      26.14 i/B   201.18 i/f (+/- 0.0 %)      3.98 c/B    30.61 c/f (+/- 1.0 %)      6.57 i/c     41.98 b/f      8.83 bm/f      3.23 GHz 
ffc                                     :   845.05 MB/s (+/- 2.3 %)   115.12 Mfloat/s      22.15 i/B   170.53 i/f (+/- 0.0 %)      3.65 c/B    28.06 c/f (+/- 0.9 %)      6.08 i/c     37.98 b/f      8.63 bm/f      3.23 GHz 
UTF-16 volume = 1.07202 MB 
fastfloat                               :  1556.28 MB/s (+/- 2.6 %)   106.00 Mfloat/s      12.77 i/B   196.64 i/f (+/- 0.0 %)      1.98 c/B    30.48 c/f (+/- 1.4 %)      6.45 i/c     41.61 b/f      8.56 bm/f      3.23 GHz 

With Forced Inlining

% sudo ./build/benchmarks/benchmark                   
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints 
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
ASCII volume = 2.09808 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   402.22 MB/s (+/- 1.7 %)    19.17 Mfloat/s      28.24 i/B   621.32 i/f (+/- 0.0 %)      7.66 c/B   168.45 c/f (+/- 0.8 %)      3.69 i/c     96.77 b/f    296.58 bm/f      3.23 GHz 
strtod                                  :   786.93 MB/s (+/- 2.0 %)    37.51 Mfloat/s      21.42 i/B   471.13 i/f (+/- 0.0 %)      3.91 c/B    86.00 c/f (+/- 0.5 %)      5.48 i/c     92.00 b/f      0.01 bm/f      3.23 GHz 
abseil                                  :   915.83 MB/s (+/- 2.4 %)    43.65 Mfloat/s      24.01 i/B   528.18 i/f (+/- 0.0 %)      3.36 c/B    73.99 c/f (+/- 0.4 %)      7.14 i/c    115.01 b/f      0.57 bm/f      3.23 GHz 
fastfloat                               :  1449.37 MB/s (+/- 2.9 %)    69.08 Mfloat/s      15.19 i/B   334.16 i/f (+/- 0.0 %)      2.13 c/B    46.76 c/f (+/- 0.6 %)      7.15 i/c     57.00 b/f      0.64 bm/f      3.23 GHz 
ffc                                     :  1585.30 MB/s (+/- 2.4 %)    75.56 Mfloat/s      13.60 i/B   299.16 i/f (+/- 0.0 %)      1.94 c/B    42.75 c/f (+/- 1.2 %)      7.00 i/c     50.00 b/f      0.64 bm/f      3.23 GHz 
UTF-16 volume = 4.19617 MB 
fastfloat                               :  2829.11 MB/s (+/- 2.9 %)    67.42 Mfloat/s       7.59 i/B   334.16 i/f (+/- 0.0 %)      1.09 c/B    47.89 c/f (+/- 1.7 %)      6.98 i/c     56.00 b/f      0.64 bm/f      3.23 GHz 
# You can also provide a filename (with the -f flag): it should contain one string per line corresponding to a number
% sudo ./build/benchmarks/benchmark32                 
# parsing random numbers
available models (-m): uniform one_over_rand32 simple_uniform32 simple_int32 int_e_int simple_int64 bigint_int_dot_int big_ints 
model: generate random numbers uniformly in the interval [0.0,1.0]
volume: 100000 floats
volume = 2.09808 MB 
loaded db: a14 (Apple A14/M1)
strtof                                  :   834.74 MB/s (+/- 6.6 %)    39.79 Mfloat/s      20.51 i/B   451.13 i/f (+/- 0.1 %)      3.69 c/B    81.17 c/f (+/- 1.3 %)      5.56 i/c   9000.10 b/f      3.23 GHz 
abseil                                  :   913.93 MB/s (+/- 2.0 %)    43.56 Mfloat/s      24.01 i/B   528.18 i/f (+/- 0.0 %)      3.37 c/B    74.15 c/f (+/- 0.6 %)      7.12 i/c  11501.07 b/f      3.23 GHz 
fastfloat                               :  1527.18 MB/s (+/- 1.4 %)    72.79 Mfloat/s      14.64 i/B   322.13 i/f (+/- 0.0 %)      2.02 c/B    44.39 c/f (+/- 0.6 %)      7.26 i/c   5500.09 b/f      3.23 GHz 
ffc                                     :  1624.38 MB/s (+/- 2.4 %)    77.42 Mfloat/s      13.23 i/B   291.13 i/f (+/- 0.0 %)      1.90 c/B    41.73 c/f (+/- 1.7 %)      6.98 i/c   5100.09 b/f      3.23 GHz 
UTF-16 volume = 4.19617 MB 
fastfloat                               :  2970.56 MB/s (+/- 2.6 %)    70.79 Mfloat/s       7.32 i/B   322.13 i/f (+/- 0.0 %)      1.04 c/B    45.65 c/f (+/- 1.5 %)      7.06 i/c   5400.09 b/f      3.23 GHz 
# You can also provide a filename (with the -f flag): it should contain one string per line corresponding to a number
% sudo ./build/benchmarks/benchmark -f data/canada.txt
# read 111126 lines 
ASCII volume = 1.93374 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   381.55 MB/s (+/- 8.1 %)    21.93 Mfloat/s      29.97 i/B   546.94 i/f (+/- 0.1 %)      8.02 c/B   146.31 c/f (+/- 0.9 %)      3.74 i/c     81.77 b/f    206.89 bm/f      3.21 GHz 
strtod                                  :   672.76 MB/s (+/- 2.1 %)    38.66 Mfloat/s      24.43 i/B   445.75 i/f (+/- 0.0 %)      4.48 c/B    81.77 c/f (+/- 0.6 %)      5.45 i/c     88.46 b/f     10.00 bm/f      3.16 GHz 
abseil                                  :   848.52 MB/s (+/- 1.4 %)    48.76 Mfloat/s      24.61 i/B   449.03 i/f (+/- 0.0 %)      3.56 c/B    65.00 c/f (+/- 0.5 %)      6.91 i/c     96.53 b/f     10.14 bm/f      3.17 GHz 
fastfloat                               :   982.74 MB/s (+/- 2.2 %)    56.47 Mfloat/s      18.56 i/B   338.62 i/f (+/- 0.0 %)      3.13 c/B    57.19 c/f (+/- 0.4 %)      5.92 i/c     58.90 b/f     10.67 bm/f      3.23 GHz 
ffc                                     :  1112.89 MB/s (+/- 1.7 %)    63.95 Mfloat/s      16.73 i/B   305.27 i/f (+/- 0.0 %)      2.77 c/B    50.51 c/f (+/- 0.6 %)      6.04 i/c     54.49 b/f     10.65 bm/f      3.23 GHz 
UTF-16 volume = 3.86749 MB 
fastfloat                               :  2051.13 MB/s (+/- 2.3 %)    58.94 Mfloat/s       9.02 i/B   329.13 i/f (+/- 0.0 %)      1.50 c/B    54.82 c/f (+/- 1.0 %)      6.00 i/c     57.91 b/f     10.60 bm/f      3.23 GHz 
% sudo ./build/benchmarks/benchmark -f data/mesh.txt  
# read 73019 lines 
ASCII volume = 0.536009 MB 
loaded db: a14 (Apple A14/M1)
netlib                                  :   537.02 MB/s (+/- 4.7 %)    73.16 Mfloat/s      33.08 i/B   254.62 i/f (+/- 0.0 %)      5.74 c/B    44.21 c/f (+/- 1.3 %)      5.76 i/c     42.56 b/f     52.35 bm/f      3.23 GHz 
strtod                                  :   523.74 MB/s (+/- 1.7 %)    71.35 Mfloat/s      33.78 i/B   260.02 i/f (+/- 0.0 %)      5.88 c/B    45.30 c/f (+/- 0.7 %)      5.74 i/c     54.65 b/f     11.80 bm/f      3.23 GHz 
abseil                                  :   414.81 MB/s (+/- 1.8 %)    56.51 Mfloat/s      47.48 i/B   365.43 i/f (+/- 0.0 %)      7.43 c/B    57.18 c/f (+/- 0.7 %)      6.39 i/c     80.08 b/f     13.48 bm/f      3.23 GHz 
fastfloat                               :   775.33 MB/s (+/- 2.6 %)   105.62 Mfloat/s      26.14 i/B   201.18 i/f (+/- 0.0 %)      3.98 c/B    30.62 c/f (+/- 1.2 %)      6.57 i/c     41.98 b/f      8.71 bm/f      3.23 GHz 
ffc                                     :   822.89 MB/s (+/- 1.9 %)   112.10 Mfloat/s      22.41 i/B   172.53 i/f (+/- 0.0 %)      3.75 c/B    28.85 c/f (+/- 0.7 %)      5.98 i/c     37.98 b/f      8.76 bm/f      3.23 GHz 
UTF-16 volume = 1.07202 MB 
fastfloat                               :  1451.70 MB/s (+/- 1.7 %)    98.88 Mfloat/s      12.77 i/B   196.64 i/f (+/- 0.0 %)      2.12 c/B    32.70 c/f (+/- 0.9 %)      6.01 i/c     41.61 b/f      8.48 bm/f      3.23 GHz 

Results

Roughly a 2% throughput increase for ffc across the benchmarks (about 1.2% to 2.7% in MB/s, depending on the run). Not very significant, but it's an improvement basically for free.
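
As a quick check of that figure, the relative change can be computed from the ffc MB/s values in the two sets of runs above. This is only a sketch; `speedup_percent` is a hypothetical helper, and the inputs are copied from the benchmark output in this PR.

```c
#include <stdio.h>

/* Relative throughput change, in percent, of `without_inline`
 * over `with_inline`. */
static double speedup_percent(double without_inline, double with_inline) {
    return (without_inline / with_inline - 1.0) * 100.0;
}

/* Prints the change for each ffc row (MB/s) reported above. */
static void print_ffc_speedups(void) {
    printf("benchmark   : %.2f%%\n", speedup_percent(1626.32, 1585.30));
    printf("benchmark32 : %.2f%%\n", speedup_percent(1643.62, 1624.38));
    printf("canada.txt  : %.2f%%\n", speedup_percent(1139.65, 1112.89));
    printf("mesh.txt    : %.2f%%\n", speedup_percent(845.05, 822.89));
}
```

Note that the reported run-to-run variation on these rows (+/- 1.6% to 5.4%) is of the same order as the measured gain.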

@kolemannix
Owner

It's a good change! That function definitely should not be force inlined, and this was an oversight.

@kolemannix kolemannix merged commit e8d16be into kolemannix:main Mar 6, 2026
6 checks passed