Use ArcSwap for aggregate fn registry#8072
Conversation
Signed-off-by: Robert Kruszewski <github@robertk.io>
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | WallTime | cuda/bitpacked_u8/unpack/3bw[100M] |
353 µs | 300.1 µs | +17.63% |
| ⚡ | Simulation | encode_varbin[(1000, 2)] |
162.7 µs | 142.9 µs | +13.87% |
| ⚡ | Simulation | encode_varbin[(1000, 32)] |
170.1 µs | 150 µs | +13.45% |
| ⚡ | Simulation | encode_varbin[(1000, 4)] |
163.8 µs | 143.7 µs | +14.03% |
| ⚡ | Simulation | encode_varbin[(1000, 8)] |
165.1 µs | 145.2 µs | +13.69% |
| ❌ | Simulation | new_alp_prim_test_between[f32, 16384] |
103.7 µs | 118.1 µs | -12.21% |
| ❌ | Simulation | null_count_run_end[(10000, 4, 0.01)] |
112.2 µs | 126.6 µs | -11.41% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing rk/aggregatearcswap (9630c96) with develop (495f30e)
| /// Session state for aggregate function vtables. | ||
| #[derive(Debug)] | ||
| pub struct AggregateFnSession { | ||
| registry: AggregateFnRegistry, |
There was a problem hiding this comment.
shall we delete this type alias now that it is unused?
| @@ -107,15 +107,20 @@ impl Default for AggregateFnSession { | |||
|
|
|||
| impl AggregateFnSession { | |||
| /// Returns the aggregate function registry. | |||
|
|
||
| let session = ctx.session().clone(); | ||
| let kernels = &session.aggregate_fns().kernels; | ||
| let kernels = &session.aggregate_fns().kernels.load(); |
There was a problem hiding this comment.
I think holding this for the entire body is mostly fine, but if we have some recursion here it might fallback to be as slow as the rwlock. I mean we load once then hold that guard and call the kernel, if it is a chunked array then that calls aggregate and calls load once more etc. I believe arcswap has a limited number of fast permits per thread and if we exhaust them then it falls back to refcount increments.
If we narrow the load scope to just one kernel execution in the loop below then that problem goes away. But it is unlikely that we will hit that level of recursion and the perf degradation is not that bad, it falls back to what rwlock does so up to you
ArcSwap is faster than a lock for read. These session are mutable but mutations
are rare and retrievals are common