Description
We have implemented Triton kernels for the matmul operations involving a telescoping cache in the telescoping-kernel branch. These kernels pass their respective correctness checks (also included), but deploying them to our training pipeline is not straightforward because Triton does not support atomic adds on bf16 tensors (see here).
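For context, a minimal split-K-style sketch of the pattern is below. This is not the actual telescoping-cache kernel; all names, strides, and block sizes are illustrative, and it only assumes the usual reason for needing an atomic add (multiple programs accumulating into the same output tile). It shows where the constraint bites: `tl.atomic_add` takes its dtype from the output pointer's element type, so the fp32 accumulator cannot be added atomically into a bf16 buffer and has to be cast to a supported dtype first.

```python
import triton
import triton.language as tl

@triton.jit
def _splitk_matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes a partial BLOCK_M x BLOCK_N tile over one K-slice;
    # several programs therefore accumulate into the same output tile, which is
    # what forces the atomic add.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    pid_k = tl.program_id(2)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
    b = tl.load(b_ptrs, mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)

    acc = tl.dot(a, b)  # accumulated in fp32 regardless of input dtype

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    # The atomic add inherits c_ptr's element type; bf16 is not supported here,
    # which is why the output buffer has to be fp16 or fp32.
    tl.atomic_add(c_ptrs, acc.to(c_ptr.dtype.element_ty), mask=c_mask)
```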
We instead cast to fp16 before this op, but loss curves on a test llama3-1.8B model diverge when we do this:
Loss curves do not diverge when running the kernels in fp32, but that sacrifices our speed gains. We are currently evaluating a variant that uses fp32 only for the atomic adds, and will update here.
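One low-friction shape for that experiment might be to keep the model tensors in bf16 but give the kernel an fp32 scratch output for the atomic adds, rounding back to bf16 once at the end. A host-side sketch under those assumptions (it reuses the hypothetical kernel above; the function name and block sizes are made up):

```python
import torch
import triton

def splitk_matmul_fp32_atomics(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a and b stay in bf16; only the atomic destination is fp32.
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = torch.zeros((M, N), device=a.device, dtype=torch.float32)  # fp32 atomic target
    grid = lambda meta: (
        triton.cdiv(M, meta["BLOCK_M"]),
        triton.cdiv(N, meta["BLOCK_N"]),
        triton.cdiv(K, meta["BLOCK_K"]),
    )
    _splitk_matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    # Round to bf16 exactly once, after all partial sums have landed in fp32.
    return c.to(a.dtype)
```

The single rounding at the end is the point: every partial tile is summed in full fp32, so the only bf16 rounding happens at the final store rather than once per atomic update.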
Running these matmuls in fp16 also breaks the vanilla PyTorch implementation, so this is almost certainly a precision issue. If casting to fp32 internally does not fix the diverging loss, can the kernel code be massaged to avoid these issues?
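To separate the precision question from anything kernel-specific, a quick check along these lines might help; the shapes are arbitrary (a long K dimension stresses accumulation) and this is only a sanity probe, not our training configuration:

```python
import torch

torch.manual_seed(0)
M, K, N = 1024, 8192, 1024  # long K to stress accumulation
a = torch.randn(M, K, device="cuda")
b = torch.randn(K, N, device="cuda")

ref = a @ b  # fp32 reference
err_fp16 = ((a.half() @ b.half()).float() - ref).abs().max().item()
err_bf16 = ((a.bfloat16() @ b.bfloat16()).float() - ref).abs().max().item()
print(f"max abs error vs fp32 reference: fp16={err_fp16:.4f}, bf16={err_bf16:.4f}")
```

Caveat: cuBLAS may still accumulate these half-precision matmuls in fp32 internally, so this mostly measures input-rounding error rather than accumulator error; it is meant only as a first-pass signal.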
