I found this awesome project recently and I'm trying to use the fla layers in a non-LLM task where we have a very long sequence and only the hidden state of the last "token" is useful. The current recurrent kernels, for example gated_deltanet, always return the hidden state of every token, which allocates a huge amount of memory. Is there any way to avoid this allocation other than calling the kernel token by token in a for loop?
@Fadelis98 Hey
This is not true: we only materialize the last hidden state, not one per token.
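For context, here is a minimal pure-PyTorch sketch of a simplified (ungated) delta-rule recurrence. It carries a single running state of shape `[B, H, K, V]` across time steps and never stacks per-token states; what it returns per token are the outputs of shape `[B, T, H, V]`. The function name, tensor layout, and the `initial_state` / `output_final_state` arguments are illustrative assumptions, not fla's exact kernel interface, and the gating of gated_deltanet is omitted.

```python
import torch

def delta_rule_reference(q, k, v, beta, initial_state=None, output_final_state=True):
    """q, k: [B, T, H, K]; v: [B, T, H, V]; beta: [B, T, H]. Illustrative only."""
    B, T, H, K = q.shape
    V = v.shape[-1]
    # a single running state is carried across time steps; no per-token states are kept
    S = initial_state if initial_state is not None else q.new_zeros(B, H, K, V)
    o = q.new_empty(B, T, H, V)
    for t in range(T):
        k_t, v_t = k[:, t], v[:, t]              # [B, H, K], [B, H, V]
        b_t = beta[:, t].unsqueeze(-1)           # [B, H, 1]
        # delta-rule update: S <- S + beta_t * k_t (v_t - S^T k_t)^T
        v_err = v_t - torch.einsum('bhk,bhkv->bhv', k_t, S)
        S = S + torch.einsum('bhk,bhv->bhkv', b_t * k_t, v_err)
        # per-token output o_t = S^T q_t
        o[:, t] = torch.einsum('bhk,bhkv->bhv', q[:, t], S)
    return (o, S) if output_final_state else (o, None)
```

The chunked Triton kernels compute the same recurrence block-wise, but the point is the same: only one state per (batch, head) is kept, so state memory does not grow with the sequence length.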
For multi-layer models, the next layer still needs all the per-token hidden states as input, so it may not be worth changing every kernel just for this feature, which would also cause Triton to compile the kernels multiple times.
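A user-side workaround, if only the final state is ultimately needed, is to drive the stack chunk by chunk: per-token outputs then only ever exist for the current chunk, while each layer's recurrent state is carried across chunks. The sketch below is hypothetical; the `layer(h, initial_state=..., output_final_state=True)` call signature, `last_state_forward`, and `chunk_size` are assumptions to be adapted to the actual layer or kernel interface.

```python
import torch

@torch.no_grad()
def last_state_forward(layers, x, chunk_size=2048):
    """x: [B, T, D]. Return only the final recurrent state of every layer.

    Per-token activations exist only for the current chunk, so peak memory
    scales with chunk_size rather than the full sequence length T.
    """
    states = [None] * len(layers)                # one running state per layer
    for start in range(0, x.shape[1], chunk_size):
        h = x[:, start:start + chunk_size]       # chunk of per-token inputs
        for i, layer in enumerate(layers):
            # hypothetical interface: returns (per-token outputs for this chunk,
            # recurrent state after the last token of the chunk)
            h, states[i] = layer(h, initial_state=states[i], output_final_state=True)
    return states
```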