Description
Is your feature request related to a problem? Please describe.
The activations in HSTU can be a major limiter to scaling the model. We need to investigate how to implement activation recomputation and expose it to users.
Reference: Activation memory analysis doc
Describe the solution you'd like
PoC
- Basic recompute w/o overlap. We will build intra-layer recompute on top of Enhance dynamicemb' example #31. The basic idea is to rewrite HSTULayer, which contains the sequence ln-linear_bias-silu-fused_attention-eltwise_mul-dropout-linear_add, then find/configure the fusible patterns and build them with the help of Triton.
- Intra-layer overlap. During the backward pass, there may be opportunities to overlap two fused ops.
- Inter-layer overlap. While performing the backward pass of layer[i+1], we can prefetch the recomputed forward of layer[i].
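As a minimal sketch of the "basic recompute w/o overlap" step, the HSTU-like op sequence can be wrapped with `torch.utils.checkpoint` so intermediate activations are discarded in forward and recomputed in backward. Everything here is illustrative: `HSTULayerSketch`, the dimensions, and the use of `nn.MultiheadAttention` as a stand-in for the fused Triton attention are assumptions, not the actual recsys-examples API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

class HSTULayerSketch(nn.Module):
    """Hypothetical HSTU-style layer: ln -> linear_bias -> silu ->
    attention -> eltwise_mul -> dropout -> linear_add (residual)."""

    def __init__(self, dim: int, dropout: float = 0.0):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.uvqk = nn.Linear(dim, 4 * dim, bias=True)   # linear_bias
        # Stand-in for the fused attention kernel.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(dim, dim, bias=True)        # linear_add

    def _block(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)
        # One projection produces u/v/q/k branches, followed by SiLU.
        u, v, q, k = F.silu(self.uvqk(h)).chunk(4, dim=-1)
        a, _ = self.attn(q, k, v, need_weights=False)
        h = self.dropout(a * u)                          # eltwise_mul + dropout
        return x + self.out(h)                           # residual add

    def forward(self, x: torch.Tensor, recompute: bool = True) -> torch.Tensor:
        if recompute and self.training:
            # Drop intermediate activations now; re-run _block in backward.
            return checkpoint(self._block, x, use_reentrant=False)
        return self._block(x)

layer = HSTULayerSketch(dim=64).train()
x = torch.randn(2, 16, 64, requires_grad=True)
y = layer(x, recompute=True)
y.sum().backward()
```

A real implementation would replace the checkpoint call with custom autograd functions around the Triton-fused patterns, which is also where the intra-/inter-layer overlap hooks would live; `checkpoint` only demonstrates the memory-vs-recompute trade itself.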
Describe alternatives you've considered
N/A
Additional context
N/A
Metadata
Labels
feature