Multi-accumulator/lane summation

Compensated summation (CS) is slow on modern CPUs partly because each iteration depends on the running sum and compensation term from the previous one, so there's a loop-carried dependency all the way through. One solution is to divide the input into several independent parts (lanes), run CS on each, and then combine the per-lane results with a final CS step. Independent lanes also happen to map pretty well onto SIMD, so it gets a lot faster. https://blog.zachbjornson.com/2019/08/11/fast-float-summation.html claims:

![](https://blog.zachbjornson.com/public/7a1b3ca/Perf.png)

Now it's probably impossible to get those kinds of numbers through the layers of abstraction Julia has, and it doesn't make much sense to pick the number of lanes on the user's behalf either. But some degree of autovectorization remains possible, so providing a way to pass in the desired number of lanes, like `sum_kbn(A; lanes)` and `cumsum_kbn(A; lanes)`, is probably good enough.
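To make the idea concrete, here is a minimal sketch of multi-lane compensated summation with a `lanes` keyword. It is an illustration only: `sum_kbn_lanes` and `kbn_add` are hypothetical names, not the package's existing functions, and the round-robin assignment of elements to lanes is just one possible way to split the input. It uses Neumaier's variant of the compensation step and is written for clarity, not speed. Whether the compiler actually vectorizes a loop like this is not guaranteed.

```julia
# One compensated addition step (Neumaier's variant).
# Returns the updated running sum and compensation term.
@inline function kbn_add(s::T, c::T, x::T) where {T<:AbstractFloat}
    t = s + x
    # Recover the low-order bits lost from whichever operand is smaller.
    c += abs(s) >= abs(x) ? (s - t) + x : (x - t) + s
    return t, c
end

# Hypothetical multi-lane variant; `lanes` is the keyword proposed in this issue.
function sum_kbn_lanes(A::AbstractVector{T}; lanes::Integer = 4) where {T<:AbstractFloat}
    lanes >= 1 || throw(ArgumentError("lanes must be at least 1"))
    s = zeros(T, lanes)   # running sum per lane
    c = zeros(T, lanes)   # running compensation per lane
    lane = 1
    for x in A
        # Each lane has its own (s, c) pair, so consecutive iterations
        # no longer depend on each other.
        s[lane], c[lane] = kbn_add(s[lane], c[lane], x)
        lane = lane == lanes ? 1 : lane + 1
    end
    # Final compensated reduction over the lane sums, then the lane compensations.
    total, comp = zero(T), zero(T)
    for l in 1:lanes
        total, comp = kbn_add(total, comp, s[l])
    end
    for l in 1:lanes
        total, comp = kbn_add(total, comp, c[l])
    end
    return total + comp
end
```

For example, `sum_kbn_lanes([1.0, 1e100, 1.0, -1e100]; lanes = 2)` returns `2.0`, where a naive left-to-right sum returns `0.0`. A real implementation would likely keep the lane state in a tuple or SIMD vector rather than heap-allocated arrays.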

Multi-accumulator/lane summation #41