[ExecuTorch] Batch-aware torch.ops.llama.sdpa_with_kv_cache. #4822
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4822
⏳ No failures, 1 pending as of commit 9508947 with merge base 06c0fa3.
This pull request was exported from Phabricator. Differential Revision: D61605316
Summary: Pull Request resolved: pytorch#4822

This is part 1 of a multi-part commit to make torch.ops.llama.sdpa_with_kv_cache batch-aware. This is needed for batched SDPA cases, for example LLM beam search.

As a performance optimization, update_cache implements the following operation

```
k_cache[:, start_pos : start_pos + seq_len, :, :] = k
v_cache[:, start_pos : start_pos + seq_len, :, :] = v
```

as part of the fused sdpa_with_kv_cache op. A naive export of this code inserts expensive slice-scatter ops.

ExecuTorch-exported Llama models use a greedy search, so it has not been necessary for this op to be batch-aware. However, when working with other models, or when doing LLM beam search, this code needs to update the cache across the batch dimension.

Differential Revision: D61605316
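For illustration only, here is a minimal eager-mode PyTorch sketch of the update that the fused op performs; the function name and docstring shapes are assumptions for readability, not the actual kernel interface. Exporting this indexing pattern naively is what introduces the slice-scatter ops mentioned above.

```python
import torch


def update_cache_reference(k, v, k_cache, v_cache, start_pos):
    """Reference semantics of the fused cache update (illustrative only).

    k, v:             [batch, seq_len, n_heads, head_dim] new key/value projections
    k_cache, v_cache: [batch, max_seq_len, n_heads, head_dim] preallocated caches
    """
    seq_len = k.size(1)
    # Batch-aware: write the new tokens into every batch line, not just batch index 0.
    k_cache[:, start_pos : start_pos + seq_len, :, :] = k
    v_cache[:, start_pos : start_pos + seq_len, :, :] = v
    return k_cache, v_cache
```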
Force-pushed 0f8c06a to 54bdb58.
Force-pushed 54bdb58 to ae0964a.
Force-pushed ae0964a to 1e0568b.
Force-pushed 1e0568b to a9dae49.
Force-pushed a9dae49 to 377e496.
Force-pushed 377e496 to 9508947.
This pull request has been merged in 53c1a5f.
Summary:
This change makes torch.ops.llama.sdpa_with_kv_cache batch-aware. This is needed for batched SDPA cases, for example LLM beam search.

* Makes update_cache update across the batch dimension.

As a performance optimization, update_cache implements the following operation

```
k_cache[:, start_pos : start_pos + seq_len, :, :] = k
v_cache[:, start_pos : start_pos + seq_len, :, :] = v
```

as part of the fused sdpa_with_kv_cache op. A naive export of this code inserts expensive slice-scatter ops. sdpa_with_kv_cache fuses this update with the flash attention op for tensors that follow a predetermined format [batch, length, heads, dim]. This change removes the assumption that batch == 1.

* Makes sdpa_with_kv_cache apply cpu_flash_attention across all batch lines as well.

ExecuTorch-exported Llama models use a greedy search, so it has not been necessary for this op to be batch-aware. However, when working with other models, or when doing LLM beam search, this is no longer true.
Reviewed By: kimishpatel
Differential Revision: D61605316
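As a rough reference for the batch-aware semantics described above (not the op's actual schema or kernel), the sketch below updates the caches across the batch dimension and then runs scaled dot-product attention over every batch line, transposing from the [batch, length, heads, dim] cache layout to the [batch, heads, length, dim] layout that torch.nn.functional.scaled_dot_product_attention expects. The helper name is hypothetical, masking is omitted for brevity, and equal numbers of query and key/value heads are assumed.

```python
import torch
import torch.nn.functional as F


def sdpa_with_kv_cache_reference(q, k, v, k_cache, v_cache, start_pos):
    """Illustrative batch-aware semantics; all tensors use [batch, length, heads, dim]."""
    seq_len = q.size(1)

    # Batch-aware cache update (previously only batch index 0 would have been written).
    k_cache[:, start_pos : start_pos + seq_len, :, :] = k
    v_cache[:, start_pos : start_pos + seq_len, :, :] = v

    # Attend against everything cached so far, for all batch lines.
    keys = k_cache[:, : start_pos + seq_len]
    values = v_cache[:, : start_pos + seq_len]

    # [batch, length, heads, dim] -> [batch, heads, length, dim] for SDPA, then back.
    # No causal mask is needed for single-token decode, since only already-generated
    # positions are present in the cache.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)
    )
    return out.transpose(1, 2)
```

With a beam width of 4, for example, this writes and attends over all four beams at once instead of only beam 0, which is the behavior the batch == 1 assumption previously ruled out.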