How can we make KA fast on CPUs? #509

@avik-pal

Description

See LuxDL/LuxLib.jl#136 for some background context. My main motivation is to avoid code duplication between the CPU and GPU versions. However, the benchmark comment on that PR (for batchnorm and groupnorm) shows somewhere between a 10x and 40x slowdown of the KA kernels relative to the equivalent optimized loop version (which only uses `@simd` or `@simd ivdep`, nothing like LoopVectorization).
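
For concreteness, here is a minimal sketch of the kind of element-wise kernel being compared on the CPU backend (a hypothetical stand-in for the batchnorm/groupnorm code on the PR, not the actual implementation):

```julia
using KernelAbstractions

# Hypothetical element-wise kernel standing in for the normalization ops on the PR.
@kernel function scale_shift_kernel!(y, @Const(x), @Const(γ), @Const(β))
    i = @index(Global, Linear)
    @inbounds y[i] = x[i] * γ[i] + β[i]
end

x, γ, β = rand(Float32, 2^16), rand(Float32, 2^16), rand(Float32, 2^16)
y = similar(x)

backend = CPU()
kernel! = scale_shift_kernel!(backend)
kernel!(y, x, γ, β; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```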

I think there are a couple of reasons for the slowdown:

  1. `@simd` annotations are missing (removing them slows down even the hand-written loop version); the baseline loop is sketched below
  2. threading has overhead that dominates for the smaller problem sizes
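
The hand-written baseline mentioned in point 1 looks roughly like this (a sketch, not the actual LuxLib loop):

```julia
# Sketch of the hand-optimized CPU baseline: a plain loop with `@simd ivdep`,
# no LoopVectorization involved. Removing `@simd ivdep` slows this version
# down as well, which points at missing SIMD annotations in the KA CPU loops.
function scale_shift_loop!(y, x, γ, β)
    @simd ivdep for i in eachindex(y, x, γ, β)
        @inbounds y[i] = x[i] * γ[i] + β[i]
    end
    return y
end
```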

Potential solutions:

  1. Allow users to control threading ([FR] Add nthreads argument to CPU backend #507). For smaller problems, I want to be able to opt out of threading manually.
  2. Add `@simd` annotations (Make CPU loops simd & ivdep #436 seems to do this; I am not sure what its current status is).
  3. Alternate threading: KA is being used inside "core" operations, so it is unlikely (if not impossible) that we would call other operations that themselves use threading. Hence, an option to use "cheaper threads" (Polyester.jl) would be a great addition; a sketch follows this list.
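
To illustrate point 3, the "cheaper threads" idea would look roughly like the following with Polyester.jl (a sketch of the concept, not a proposed KernelAbstractions API; the function name and signature are illustrative):

```julia
using Polyester

# Polyester's @batch uses lightweight per-core tasks with far lower scheduling
# overhead than Threads.@spawn, which is exactly what helps for the small
# problem sizes mentioned above.
function scale_shift_batch!(y, x, γ, β)
    @batch for i in eachindex(y, x, γ, β)
        @inbounds y[i] = x[i] * γ[i] + β[i]
    end
    return y
end
```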
