
Use ArcSwap for aggregate fn registry #8072

Open: robert3005 wants to merge 1 commit into develop from rk/aggregatearcswap

@robert3005
Contributor

ArcSwap is faster than a lock for reads. These sessions are mutable, but mutations
are rare and retrievals are common.
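
As a rough illustration of the read-mostly pattern described above, here is a minimal std-only sketch. `Registry`, `Kernel`, `snapshot`, `register`, and `sum` are hypothetical names, not the types touched by this PR, and the `RwLock<Arc<..>>` is only a stand-in: with the `arc_swap` crate the same shape would presumably use `ArcSwap::load` for reads and `store` or `rcu` for the rare writes, so readers never take a lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a registered aggregate kernel.
pub type Kernel = fn(&[i64]) -> i64;

/// Std-only model of a read-mostly registry (assumption: not the PR's type).
/// Readers take an immutable snapshot; writers publish a whole new map.
#[derive(Default)]
pub struct Registry {
    inner: RwLock<Arc<HashMap<String, Kernel>>>,
}

impl Registry {
    /// Common path: clone the `Arc` and release the lock immediately.
    pub fn snapshot(&self) -> Arc<HashMap<String, Kernel>> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Rare path: copy the map, modify the copy, then swap it in.
    /// Snapshots handed out earlier keep seeing the old map.
    pub fn register(&self, name: &str, kernel: Kernel) {
        let mut guard = self.inner.write().unwrap();
        let mut next: HashMap<String, Kernel> = (**guard).clone();
        next.insert(name.to_string(), kernel);
        *guard = Arc::new(next);
    }
}

/// Example kernel used only for illustration.
pub fn sum(xs: &[i64]) -> i64 {
    xs.iter().sum()
}
```

The point of the copy-on-write design is that a reader never observes a half-updated map: it either sees the snapshot from before a registration or the one after it.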

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 requested a review from gatesn May 22, 2026 22:50
@robert3005 robert3005 added the changelog/chore A trivial change label May 22, 2026
@codspeed-hq

codspeed-hq Bot commented May 22, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 5 improved benchmarks
❌ 2 regressed benchmarks
✅ 1244 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| WallTime | `cuda/bitpacked_u8/unpack/3bw[100M]` | 353 µs | 300.1 µs | +17.63% |
| Simulation | `encode_varbin[(1000, 2)]` | 162.7 µs | 142.9 µs | +13.87% |
| Simulation | `encode_varbin[(1000, 32)]` | 170.1 µs | 150 µs | +13.45% |
| Simulation | `encode_varbin[(1000, 4)]` | 163.8 µs | 143.7 µs | +14.03% |
| Simulation | `encode_varbin[(1000, 8)]` | 165.1 µs | 145.2 µs | +13.69% |
| Simulation | `new_alp_prim_test_between[f32, 16384]` | 103.7 µs | 118.1 µs | -12.21% |
| Simulation | `null_count_run_end[(10000, 4, 0.01)]` | 112.2 µs | 126.6 µs | -11.41% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rk/aggregatearcswap (9630c96) with develop (495f30e)

/// Session state for aggregate function vtables.
#[derive(Debug)]
pub struct AggregateFnSession {
    registry: AggregateFnRegistry,
Contributor

shall we delete this type alias now that it is unused?

@@ -107,15 +107,20 @@ impl Default for AggregateFnSession {

impl AggregateFnSession {
    /// Returns the aggregate function registry.
Contributor

this doc is now stale


 let session = ctx.session().clone();
-let kernels = &session.aggregate_fns().kernels;
+let kernels = &session.aggregate_fns().kernels.load();
Contributor

I think holding this for the entire body is mostly fine, but if we have some recursion here it might fall back to being as slow as the RwLock. That is, we load once, hold that guard, and call the kernel; if it is a chunked array, the kernel calls aggregate, which calls load once more, and so on. I believe ArcSwap has a limited number of fast permits per thread, and if we exhaust them it falls back to refcount increments.

If we narrow the load scope to just one kernel execution in the loop below, that problem goes away. But it is unlikely that we will hit that level of recursion, and the perf degradation is not that bad (it falls back to what the RwLock does), so it's up to you.
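
To make the recursion concern concrete, here is a small std-only model that counts how many guards are alive on a thread at once. Everything in it is hypothetical: `Guard`, `load`, `aggregate_wide`, `aggregate_narrow`, and `peak_guards` are illustrative stand-ins rather than the project's or `arc_swap`'s API, and the narrow variant assumes the kernel can be copied or cloned out of the registry before it runs.

```rust
use std::cell::Cell;

thread_local! {
    // Guards currently alive on this thread, and the highest count seen.
    static LIVE: Cell<usize> = Cell::new(0);
    static PEAK: Cell<usize> = Cell::new(0);
}

/// Hypothetical stand-in for the guard returned by a registry `load()`.
pub struct Guard;

/// Records one more live guard and updates the peak.
pub fn load() -> Guard {
    LIVE.with(|l| {
        l.set(l.get() + 1);
        PEAK.with(|p| p.set(p.get().max(l.get())));
    });
    Guard
}

impl Drop for Guard {
    fn drop(&mut self) {
        LIVE.with(|l| l.set(l.get() - 1));
    }
}

/// Wide scope: the guard stays alive while recursing into nested chunks,
/// so every level of recursion adds one more live guard.
pub fn aggregate_wide(depth: usize) {
    let _kernels = load();
    if depth > 0 {
        aggregate_wide(depth - 1);
    }
}

/// Narrow scope: the guard is dropped before recursing (assumes the kernel
/// was copied out inside the inner block).
pub fn aggregate_narrow(depth: usize) {
    {
        let _kernels = load();
        // Look up and copy out the kernel here.
    }
    if depth > 0 {
        aggregate_narrow(depth - 1);
    }
}

/// Runs `f` and reports the peak number of simultaneously live guards.
pub fn peak_guards(f: fn(usize), depth: usize) -> usize {
    LIVE.with(|l| l.set(0));
    PEAK.with(|p| p.set(0));
    f(depth);
    PEAK.with(|p| p.get())
}
```

In this model the wide variant holds one guard per recursion level, while the narrow variant never holds more than one, which is the property that would keep a per-thread fast path from being exhausted.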

Labels

changelog/chore A trivial change

2 participants