Conversation

@copybara-service

This change introduces a pure JAX implementation of flash attention to MaxText, designed as a drop-in replacement for the existing Pallas kernel. In this CL we set the stage by integrating it with MaxText in FSDP mode. Further optimizations to close the gap with Pallas are planned, using techniques such as iteration skipping, must_fuse, and memory space coloring.
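
The kernel source is not reproduced in this thread, but the core idea of a pure JAX flash attention is a scan over key/value blocks with an online softmax. The sketch below is a minimal single-head illustration under assumed shapes and no masking; the actual jax_flash_attention.py will differ (multi-head, masking, sharding, and the planned optimizations above).

```python
import jax
import jax.numpy as jnp

def flash_attention(q, k, v, block_size=128):
    """Block-wise attention with an online softmax, scanned over KV blocks."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    # Illustrative assumption: seq_len is divisible by block_size.
    k_blocks = k.reshape(-1, block_size, head_dim)
    v_blocks = v.reshape(-1, block_size, head_dim)

    def body(carry, kv_block):
        acc, row_max, row_sum = carry
        k_blk, v_blk = kv_block
        s = (q @ k_blk.T) * scale                      # [seq_len, block_size]
        new_max = jnp.maximum(row_max, s.max(axis=-1))
        p = jnp.exp(s - new_max[:, None])
        correction = jnp.exp(row_max - new_max)        # rescale old partials
        acc = acc * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=-1)
        return (acc, new_max, row_sum), None

    init = (
        jnp.zeros((seq_len, head_dim), q.dtype),       # unnormalized output
        jnp.full((seq_len,), -jnp.inf, q.dtype),       # running row max
        jnp.zeros((seq_len,), q.dtype),                # running softmax denom
    )
    (acc, _, row_sum), _ = jax.lax.scan(body, init, (k_blocks, v_blocks))
    return acc / row_sum[:, None]
```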

The new implementation is located in maxtext/src/maxtext/kernels/jax_flash_attention.py and can be enabled with the use_jax_splash config flag.
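
Only the use_jax_splash flag name comes from this change; how MaxText wires it into the attention layer is not shown in this thread. A hypothetical dispatch (config class and function names are illustrative) might look like:

```python
import jax
import jax.numpy as jnp
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    # use_jax_splash is the real flag name; this config class is hypothetical.
    use_jax_splash: bool = False

def dense_attention(q, k, v):
    # Stand-in for the existing (Pallas-backed) attention path.
    logits = (q @ k.T) * (q.shape[-1] ** -0.5)
    return jax.nn.softmax(logits, axis=-1) @ v

def select_attention(config: AttentionConfig):
    if config.use_jax_splash:
        return flash_attention  # pure JAX kernel, as sketched above
    return dense_attention
```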

To validate the implementation and compare it against the Tokamax kernel and the baseline dot-product attention, this change also introduces:

- A new test suite in google_mla_attention_test.py for correctness and performance comparison, particularly for FSDP cases (a minimal correctness check is sketched after this list).
- Refactored common MLA test utilities into attention_test_util.py.
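
The test files themselves are not reproduced here; a minimal sketch of the kind of correctness check such a suite might contain, comparing the pure JAX kernel against the dense baseline (reusing the hypothetical helpers from the sketches above), is:

```python
import jax
import jax.numpy as jnp

def test_jax_splash_matches_reference():
    key = jax.random.PRNGKey(0)
    qk, kk, vk = jax.random.split(key, 3)
    q = jax.random.normal(qk, (512, 64), jnp.float32)
    k = jax.random.normal(kk, (512, 64), jnp.float32)
    v = jax.random.normal(vk, (512, 64), jnp.float32)
    expected = dense_attention(q, k, v)               # dense baseline above
    actual = flash_attention(q, k, v, block_size=128) # blocked kernel above
    # Loose fp32 tolerance: blocked accumulation reorders the reduction.
    assert jnp.allclose(actual, expected, atol=1e-4, rtol=1e-4)
```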


PiperOrigin-RevId: 834764107