Add tensor operation utilities and performance enhancements #165
Conversation
- Introduced `static_switch.h` with BOOL_SWITCH, EVENK_SWITCH, SOFTCAP_SWITCH, FP16_SWITCH, and HEADDIM_SWITCH macros for compile-time conditional execution (the dispatch pattern is sketched below).
- Added `utils.h` containing various utility functions for tensor operations, including relu, max, sum, and GEMM implementations.
- Implemented specialized relu functions for half and bfloat16 types using inline PTX assembly for performance optimization.
- Enhanced tensor layout conversion functions to support different configurations for GEMM operations.
- Included support for asynchronous copy operations and softmax calculations within the FLASH_NAMESPACE.
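For context, switch macros of this kind convert a runtime flag into a compile-time constant so that kernels can be template-specialized on it. The following is a hedged sketch of the common BOOL_SWITCH pattern (as popularized by upstream FlashAttention); the exact definitions in this PR's `static_switch.h` may differ, and `run_flash_attention` is a hypothetical stand-in.

```cpp
#include <cstdio>

// Sketch of the BOOL_SWITCH pattern assumed here; the actual macro in
// csrc/flash_dmattn/src/static_switch.h may differ in detail.
// A runtime boolean selects one of two lambda instantiations, each of which
// sees the condition as a constexpr, so templates can specialize on it.
#define BOOL_SWITCH(COND, CONST_NAME, ...)              \
    [&] {                                               \
        if (COND) {                                     \
            constexpr static bool CONST_NAME = true;    \
            return __VA_ARGS__();                       \
        } else {                                        \
            constexpr static bool CONST_NAME = false;   \
            return __VA_ARGS__();                       \
        }                                               \
    }()

// Hypothetical stand-in for a templated kernel launcher.
template <bool Is_causal>
void run_flash_attention() {
    printf("Is_causal = %s\n", Is_causal ? "true" : "false");
}

int main() {
    bool is_causal = true;  // runtime value
    // Inside the lambda body, Is_causal is a compile-time constant.
    BOOL_SWITCH(is_causal, Is_causal, [&] { run_flash_attention<Is_causal>(); });
    return 0;
}
```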
Pull Request Overview
This PR introduces tensor operation utilities and performance enhancements to the flash attention implementation. The changes reorganize source file paths to a dedicated flash_dmattn directory, refine backward kernel launch parameters for better performance on specific GPU architectures, and streamline mask processing while expanding test coverage.
Key changes:
- Reorganizes CUDA source files from `csrc/` to the `csrc/flash_dmattn/` subdirectory
- Optimizes backward pass kernel configurations for different GPU architectures (sm86/sm89 vs A100/H100)
- Simplifies mask processor initialization by removing the template parameter dependency (a sketch of this pattern follows below)
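As a purely hypothetical illustration of that last point (the real `Mask` struct in `csrc/flash_dmattn/src/mask.h` is not shown on this page, so names and members below are assumptions), dropping a template parameter shrinks both the declaration and every initialization call in `flash_fwd_kernel.h`:

```cpp
// Hypothetical sketch only; names and members are assumptions, not the
// actual Mask struct from csrc/flash_dmattn/src/mask.h.

// Before: callers had to spell out an extra compile-time flag.
//   template <bool Is_causal, bool Has_mask>
//   struct Mask { ... };
//   Mask<Is_causal, /*Has_mask=*/true> mask(seqlen_k, seqlen_q);

// After: the flag is gone, so initialization at each call site simplifies.
template <bool Is_causal>
struct Mask {
    const int max_seqlen_k, max_seqlen_q;
    __device__ __forceinline__ Mask(const int max_seqlen_k, const int max_seqlen_q)
        : max_seqlen_k(max_seqlen_k), max_seqlen_q(max_seqlen_q) {}
};

// Usage in a kernel body:
//   Mask<Is_causal> mask(params.seqlen_k, params.seqlen_q);
```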
Reviewed Changes
Copilot reviewed 7 out of 92 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `setup.py` | Updates source file paths to the new `flash_dmattn` subdirectory structure |
| `csrc/flash_dmattn/src/mask.h` | Removes a template parameter from the Mask struct declaration |
| `csrc/flash_dmattn/src/flash_fwd_kernel.h` | Updates mask processor initialization calls to drop the removed template parameter |
| `csrc/flash_dmattn/src/flash_bwd_launch_template.h` | Refines kernel launch parameters and shared memory usage comments for different architectures |
| `benchmarks/forward_equivalence.py` | Removes duplicate test configurations, adds head_dim 192/256 support, and increases keep_window_size |
| `benchmarks/backward_performance.py` | Simplifies backward pass testing by calling sum() instead of passing custom gradient tensors |
| `benchmarks/backward_equivalence.py` | Updates the backward pass testing approach and removes gradient tensor parameters |
```cpp
} else {  // sm86 and sm89
    // 96KB, 2 CTAs in sm86 and sm 89.
    run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 128, 128, 8, 4, 4, 4, true, false, T>, Is_causal>(params, stream);
    // 96KB, 1 CTAs in sm86 and sm 89.
```
Copilot AI commented on Sep 11, 2025:
The comment states '96KB' but based on the kernel traits and the context of other comments, this should be updated to reflect the actual shared memory usage. The comment appears inconsistent with the pattern of other memory usage comments in the file.
Suggested change:
```diff
- // 96KB, 1 CTAs in sm86 and sm 89.
+ // 88KB, 1 CTAs in sm86 and sm 89.
```
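The numbers being debated in these comments are per-CTA shared memory budgets: sm86/sm89 allow roughly 99 KB of opt-in shared memory per block (100 KB per SM), versus about 163 KB on A100 and 227 KB on H100, which is why the launch template picks smaller tiles on those parts. As a hedged sanity check (not part of this PR, and independent of the kernel traits themselves), the limits can be queried at runtime:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Standalone check of the shared memory limits that the "xxKB, N CTAs"
// comments in flash_bwd_launch_template.h are describing.
int main() {
    int device = 0;
    cudaGetDevice(&device);

    int smem_per_block_optin = 0;  // max opt-in shared memory per block (bytes)
    cudaDeviceGetAttribute(&smem_per_block_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

    int smem_per_sm = 0;           // total shared memory per SM (bytes)
    cudaDeviceGetAttribute(&smem_per_sm,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);

    // A kernel that uses S bytes of shared memory can co-reside at most
    // smem_per_sm / S CTAs per SM, which is what the "2 CTAs" / "1 CTAs"
    // annotations estimate.
    printf("opt-in smem per block: %d KB, smem per SM: %d KB\n",
           smem_per_block_optin / 1024, smem_per_sm / 1024);
    return 0;
}
```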
```cpp
} else {  // sm86 and sm89
    // 72KB, 1 CTAs in sm86 and sm 89.
    run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 2, 4, 4, true, false, T>, Is_causal>(params, stream);
    // 88KB, 1 CTAs in sm86 and sm 89.
```
Copilot AI commented on Sep 11, 2025:
The comment states '88KB' but the original comment was '72KB'. If the memory usage calculation has changed due to the kernel trait modifications, the comment should accurately reflect the new memory requirements.
Suggested change:
```diff
- // 88KB, 1 CTAs in sm86 and sm 89.
+ // 72KB, 1 CTAs in sm86 and sm 89.
```
```cpp
} else {  // sm86 and sm89
    // 92KB, 1 CTAs in sm86 and sm 89.
    run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 128, 8, 2, 4, 4, true, false, T>, Is_causal>(params, stream);
    // 76KB, 1 CTAs in sm86 and sm 89.
```
Copilot AI commented on Sep 11, 2025:
The comment states '76KB' but the original was '92KB'. This significant reduction in shared memory usage should be verified to ensure the comment accurately reflects the actual memory requirements of the new kernel configuration.
Suggested change:
```diff
- // 76KB, 1 CTAs in sm86 and sm 89.
+ // 92KB, 1 CTAs in sm86 and sm 89.
```
```cpp
} else {  // sm86 and sm89
    // 88KB, 1 CTAs in sm86 and sm 89.
    run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 2, 2, true, false, T>, Is_causal>(params, stream);
    // 80KB, 1 CTAs in sm86 and sm 89.
```
Copilot AI commented on Sep 11, 2025:
The comment states '80KB' but the original was '88KB'. Please verify this memory usage calculation matches the actual shared memory requirements of the modified kernel traits configuration.
Suggested change:
```diff
- // 80KB, 1 CTAs in sm86 and sm 89.
+ // 88KB, 1 CTAs in sm86 and sm 89.
```
Introduce a static switch utility for compile-time conditionals along with various tensor operation utilities. Refine backward kernel launch parameters for improved performance on specific architectures. Refactor backward pass computations and mask processor initialization for efficiency. Update test configurations to accommodate additional head dimensions.