Conversation

@guoqingbao
Owner

@guoqingbao guoqingbao commented Jan 22, 2026

This PR uses the new all-in-one flash-attn v2/v3 crate, aiming to support both flash attention v2 and v3 on SM80+ and SM90+.

Note: the underlying flash-attn library is not fully optimized (attention split is disabled by default for a faster build).

@guoqingbao
Owner Author

@sempervictus This PR switches to a custom flash attention library supporting both v2 and v3. Are you able to test this? It uses this library: https://github.com/guoqingbao/flashattn.rs

@guoqingbao
Owner Author

Applied the Codex-suggested patch in this PR. @sempervictus

@sempervictus
Contributor

Build fails:

131.5    Compiling neli v0.7.3
131.6    Compiling dirs v6.0.0
131.6 warning: [email protected]: Cloning cutlass from https://github.com/NVIDIA/cutlass.git
131.6 error: failed to run custom build command for `flashattn-rs v0.1.0 (https://github.com/guoqingbao/flashattn.rs.git?rev=2f86e74#2f86e743)`
131.6 
131.6 Caused by:
131.6   process didn't exit successfully: `/vllm.rs/target/release/build/flashattn-rs-3ee655c82534766b/build-script-build` (exit status: 1)
131.6   --- stdout
131.6   cargo:warning=Cloning cutlass from https://github.com/NVIDIA/cutlass.git
131.6 
131.6   --- stderr
131.6   Error: Failed to clone cutlass repository
131.6 
131.6   Caused by:
131.6       No such file or directory (os error 2)
131.6 warning: build failed, waiting for other jobs to finish...

Updated the revision of the attention-rs dependency.
@guoqingbao
Owner Author

Network issue? It failed when cloning the repo. You may update the PR, since I found the root cause of the crash (introduced by the agents 😓).

@guoqingbao
Owner Author

Network issue? It failed when cloning the repo. You may update the PR, since I found the root cause of the crash (introduced by the agents 😓).

They passed the wrong template args (all set to sm90) to the launch function; that's why it crashed with a device-side assertion: guoqingbao/attention.rs@7a540d8

It might not be related to flash attention.

@guoqingbao
Owner Author

If you are not using this PR, simply changing the rev id of the attention-rs dependency in vllm.rs's Cargo.toml to f525d5f may also work, since it will use the existing flash attention v2 and only apply the agent fix (see the sketch below).
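
For reference, a pin along these lines in vllm.rs's Cargo.toml is roughly what that change would look like; the dependency name and git URL here are assumptions, so keep whatever the existing entry uses and only swap the rev:

```toml
# Sketch only: dependency name and URL are assumed; the rev comes from the comment above.
[dependencies]
attention-rs = { git = "https://github.com/guoqingbao/attention.rs.git", rev = "f525d5f" }
```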

@sempervictus
Contributor

Just pulled the last commit and got the same crash; updating the rev id.

@guoqingbao
Owner Author

Build fails:

Or do you not have git installed in the Docker environment?

@sempervictus
Contributor

we should STILL get FAv3 online for sure - will need it for fp8 kvcache and fp4

@guoqingbao
Owner Author

we should STILL get FAv3 online for sure - will need it for fp8 kvcache and fp4

Sure

@sempervictus
Contributor

Build fails:

Or do you not have git installed in the Docker environment?

That would be weird; it's the cudnn-devel container. But I added git just in case and am re-running.

@guoqingbao
Owner Author

I'm torturing Codex to fix this from my phone; not sure if we can fix it in one shot.

@sempervictus
Contributor

I'm torturing Codex to fix this from my phone; not sure if we can fix it in one shot.

Ha, it does look like Codex used a weird way to clone that repo. attn.rs cloned submodules fine, but for some reason the build container now seems to work with git added to the original apt-get clause 🤦

@sempervictus
Contributor

Do you want to amend this PR to either fix how it fetches, or just add git after curl to the Dockerfile so I'm not polluting your code with my commits?

@guoqingbao
Owner Author

Ha, it does look like Codex used a weird way to clone that repo. attn.rs cloned submodules fine, but for some reason the build container now seems to work with git added to the original apt-get clause 🤦

It used another way to clone the repo (a sparse clone of only the include folder, which saves time; roughly as sketched below). The existing attention.rs uses candle flash attention, which still clones the entire cutlass repo as a submodule, but they recently switched to a sparse clone as well (relying on git).
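
A minimal sketch of that sparse-clone approach as it might appear in a build.rs; it assumes git is on PATH and uses a placeholder destination directory, and is not the actual build script of flashattn.rs:

```rust
// build.rs sketch (assumptions: `git` is installed; destination path is a placeholder).
use std::process::Command;

fn sparse_clone_cutlass(dst: &str) -> std::io::Result<()> {
    // Shallow, blob-less clone with sparse checkout enabled: almost nothing is
    // materialized yet, so this is much faster than a full submodule clone.
    let status = Command::new("git")
        .args([
            "clone", "--depth", "1", "--filter=blob:none", "--sparse",
            "https://github.com/NVIDIA/cutlass.git", dst,
        ])
        .status()?;
    assert!(status.success(), "Failed to clone cutlass repository");

    // Only check out the headers that the flash attention build actually needs.
    let status = Command::new("git")
        .args(["-C", dst, "sparse-checkout", "set", "include"])
        .status()?;
    assert!(status.success(), "Failed to sparse-checkout cutlass include/");
    Ok(())
}

fn main() {
    println!("cargo:warning=Cloning cutlass from https://github.com/NVIDIA/cutlass.git");
    sparse_clone_cutlass("cutlass").expect("cutlass sparse clone failed");
}
```

This would also explain the earlier "No such file or directory (os error 2)" failure: spawning git errors out immediately when the binary is missing from the container.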

@guoqingbao
Owner Author

guoqingbao commented Jan 22, 2026

Do you want to amend this PR to either fix how it fetches, or just add git after curl to the Dockerfile so I'm not polluting your code with my commits?

Adding git is a good option, since it takes less time than cloning the entire cutlass repo.

@guoqingbao
Owner Author

The original flash attention v3 takes a very long time to build (over one hour). I discarded some features and kernels to make it build faster (~10 minutes), but we lose some performance. I'm thinking we can bring the attention split feature back (crucial for prefill) and still keep compile time under 15 minutes (acceptable to most people).

@sempervictus
Contributor

Builds with git in the container 🤦

Still can't use:

vllm-rs-svc0 | ***Tips: use only one of the two features (--fp8-kvcache or --prefix-cache).

but that's a later thing :-)

@guoqingbao
Owner Author

Builds with git in the container 🤦

Still can't use:

vllm-rs-svc0 | ***Tips: use only one of the two features (--fp8-kvcache or --prefix-cache).

but that's a later thing :-)

Haven't done that part yet, but is the fp8 model working on sm120? It might be slower for prefill because we disabled the attention split feature (will bring it back soon). You may also try the existing flash attention v2 by changing the rev id; it reaches 8k tokens/s prefill speed on Hopper (30B fp8).

@sempervictus
Contributor

Closer:

 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
vllm-rs-svc0  | 2026-01-22T15:51:59.606459Z  INFO runner: Runner at rank 0 created (PD config: None)!
Graph capturing:   6%|██▍                                | 1/15 [00:00<?, ?it/s]CUDA error (kernels/flash_fwd_launch_template.h:192): invalid argument
vllm-rs-svc0  | CUDA error (kernels/flash_fwd_launch_template.h:192): invalid argument
vllm-rs-svc0  | CUDA error (kernels/flash_fwd_launch_template.h:192): invalid argument
vllm-rs-svc0  | CUDA error (kernels/flash_fwd_launch_template.h:192): invalid argument
vllm-rs-svc0  | 2026-01-22T15:52:44.338316Z  INFO vllm_rs::utils::command: Timeout waiting for acknowledgment from subprocess: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
vllm-rs-svc0  | 2026-01-22T15:52:46.200884Z  INFO vllm_rs::utils::command: Timeout waiting for acknowledgment from subprocess: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }
vllm-rs-svc0  | 2026-01-22T15:52:46.200905Z  WARN vllm_rs::utils::heartbeat: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
vllm-rs-svc0  | 2026-01-22T15:52:46.200912Z  INFO vllm_rs::utils::heartbeat: Parent process disconnected, exiting...
vllm-rs-svc0 exited with code 0

@guoqingbao
Owner Author

Closer:

Try disabling graph capture, or use the rev id (existing flash attention v2).

@guoqingbao
Owner Author

Closer

That's a flash attention issue: it tries to set the shared memory size for the kernel, but that may not be necessary on sm120. Will fix it soon.

@guoqingbao
Owner Author

Fixed in the last commit @sempervictus

@guoqingbao
Owner Author

Thanks for the patience @sempervictus. Will investigate further tomorrow morning if it's still not working.

@sempervictus
Contributor

sempervictus commented Jan 22, 2026

Thank you - the "morning" thing in all of this is hilarious (I'm in the EST time zone). Appreciate all the late nights/long days, sir.

@guoqingbao
Owner Author

@guoqingbao built a273a10 - can load Qwen3-1.7B-FP8 model and run it with FA built.

Should this branch be producing FAv2 targets for SM120+ for now given that the (tiny) FP8 model actually returned sane output?

I have no other changes; it uses the old fa2 (candle-flash-attn), and flash attention v2 works with sm120+, as the author said. The only change in this commit is disabling the cutlass gemm, because I found it made the program hang on SM120 (RTX 5090).

@guoqingbao
Owner Author

I see, I can't build both - this can't rebase on main without an update on the attention-rs side

No need. This PR (all-in-one flash-attn) may not be suitable for sm120+, since flash attention v3 uses sm90+ intrinsics on sm120 unless we force it to use the fa2 logic, which hasn't been tested.

@sempervictus
Contributor

So really what you're saying is we need to glue FAv4 in there! 😁

@guoqingbao
Owner Author

also... I have no idea why but we went from 256k/token of kvcost with that gemm kernel to 47:

That should not change the kvcache layout. Have you used the latest main for fp8 MoE models? Disabling the fp8 gemm kernel on sm120 seems to work. I haven't tested the MoE models on the RTX 5090 because it only has 32GB VRAM.

@guoqingbao
Owner Author

So really what you're saying is we need to glue FAv4 in there! 😁

Or we wait for FAv3's official support on sm120+.

@sempervictus
Contributor

The FP8 Q3Coder is too large - need FP4 for that one or 2-4 more of these cards :-).
The FP8 Qwen3VL-235 crashes on current main:

vllm-rs-svc0  | 2026-01-23T06:34:01.461880Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 0, 11 tokens] received! (session_id: None)
vllm-rs-svc0  | 
vllm-rs-svc0  | /root/.cargo/git/checkouts/candle-629ca89aaea24b43/f430958/candle-flash-attn/cutlass/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp:422: void cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, std::enable_if<std::is_base_of_v, void>::type>::operator()(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, std::enable_if<std::is_base_of_v, void>::type>::Params &, char *) [with ProblemShape_ = cutlass::gemm::GroupProblemShape<cute::tuple<signed int, signed int, signed int>>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm120ArrayTmaWarpSpecializedBlockwiseScaling<2, 2, cute::tuple<cute::C<1>, cute::C<1>, cute::C<1>>, cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeBlockwiseScalingSm120<2>>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cutlass::float_e4m3_t, cute::tuple<cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cute::Layout<cute::tuple<cute::tuple<cute::C<1>, signed int>, cute::tuple<cute::C<128>, signed int>, signed int>, cute::tuple<cute::tuple<cute::C<0>, cute::C<1>>, cute::tuple<cute::C<0>, signed int>, signed int>> *>, cutlass::float_e4m3_t, cute::tuple<cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cute::Layout<cute::tuple<cute::tuple<cute::C<128>, signed int>, cute::tuple<cute::C<128>, signed int>, signed int>, cute::tuple<cute::tuple<cute::C<0>, signed int>, cute::tuple<cute::C<0>, cute::C<1>>, signed int>> *>, cute::TiledMMA<cute::MMA_Atom<cute::SM120_16x8x32_TN<cutlass::float_e4m3_t, cutlass::float_e4m3_t, float>>, cute::Layout<cute::tuple<cute::C<4>, cute::C<2>, cute::C<1>>, cute::tuple<cute::C<1>, cute::C<4>, cute::C<0>>>, cute::tuple<cute::C<128>, cute::C<32>, cute::C<32>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128>>, cute::tuple<cute::C<128>, cute::C<1>>>>, cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, unsigned char>, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128>>, cute::tuple<cute::C<128>, cute::C<1>>>>, cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, unsigned char>, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90PtrArrayTmaWarpSpecialized<2, 2, 4, false, true, 2>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cute::tuple<cute::C<64>, cute::C<32>>, void, cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cutlass::bfloat16_t, cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm120PtrArrayTmaWarpSpecialized<2, 2, 4, false, true, 2>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, void, float, cutlass::FloatRoundStyle::round_to_nearest>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cute::tuple<cute::C<64>, cute::C<32>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32>>, cute::tuple<cute::C<32>, cute::C<1>>>>, cute::SM75_U32x2_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32>>, cute::tuple<cute::C<32>, cute::C<1>>>>, cute::SM90_U32x2_STSM_N, 
cute::Copy_Atom<cute::SM90_U32x2_STSM_N, cutlass::half_t>, void>; TileScheduler_ = void]: block: [162,0,0], thread: [0,0,0] Assertion `0 && "ERROR : Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.\n"` failed.

@sempervictus
Contributor

Qwen3 Coder 30B FP8 on 4 GPUs (fails to start loading):

vllm-rs-med0  | 2026-01-23T06:36:17.622939Z  INFO runner: Loading model at rank 1
[00:00:00] ------------------------------------------------------------    0/48   On Rank 0 Device
vllm-rs-med0  | [00:00:00] ------------------------------------------------------------    0/48   On Rank 1 Device
vllm-rs-med0  | [00:00:00] ------------------------------------------------------------    0/48   On Rank 2 Device
vllm-rs-med0  | [00:00:00] ------------------------------------------------------------    0/48   On Rank 3 Device                                                              
vllm-rs-med0  | thread '<unnamed>' (179) panicked at src/utils/progress.rs:106:17:
vllm-rs-med0  | Error when loading model!
vllm-rs-med0  | stack backtrace:
vllm-rs-med0  |    0:     0x56008ae96572 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h718e2d17a1928e63
vllm-rs-med0  |    1:     0x56008aeabbcf - core::fmt::write::h1d2246b072ea91eb
vllm-rs-med0  |    2:     0x56008ae5cf63 - std::io::Write::write_fmt::haf55272405c09d9b
vllm-rs-med0  |    3:     0x56008ae6ba32 - std::sys::backtrace::BacktraceLock::print::h61c3bd81a9458a03
vllm-rs-med0  |    4:     0x56008ae71a7f - std::panicking::default_hook::{{closure}}::haf1ffb5d1e33a97f
vllm-rs-med0  |    5:     0x56008ae718d9 - std::panicking::default_hook::hc32245deb6eaa988
vllm-rs-med0  |    6:     0x56008ae72105 - std::panicking::panic_with_hook::h43adc00fd0e494cb
vllm-rs-med0  |    7:     0x56008ae71eb6 - std::panicking::panic_handler::{{closure}}::h44391079756da3e7
vllm-rs-med0  |    8:     0x56008ae6bb79 - std::sys::backtrace::__rust_end_short_backtrace::h934e1568393e5b8f
vllm-rs-med0  |    9:     0x56008ae4fe3d - __rustc[d9b87f19e823c0ef]::rust_begin_unwind
vllm-rs-med0  |   10:     0x56008aeb7730 - core::panicking::panic_fmt::h62031895f6e012da
vllm-rs-med0  |   11:     0x56008a03fc20 - <vllm_rs::utils::progress::RemoteProgressReporter as vllm_rs::utils::progress::ProgressLike>::get_progress::hcae561f6cdb51590
vllm-rs-med0  |   12:     0x56008a04285d - std::sys::backtrace::__rust_begin_short_backtrace::hcd30d8120ddc78fb
vllm-rs-med0  |   13:     0x56008a044207 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h51ad99460af46fce
vllm-rs-med0  |   14:     0x56008ae66d3f - std::sys::thread::unix::Thread::new::thread_start::h4637f1bfded3ea68
vllm-rs-med0  |   15:     0x7f2143514ac3 - <unknown>
vllm-rs-med0  |   16:     0x7f21435a5a04 - clone
vllm-rs-med0  |   17:                0x0 - <unknown>
vllm-rs-med0  | Error: FusedMoeFp8: Missing weight_scale/inv for expert 0 gate_proj

Qwen3 Coder 30B FP8 on 1 GPU:

vllm-rs-med1  | 2026-01-23T06:38:52.114551Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 0, 9 tokens] received! (session_id: None)
vllm-rs-med1  | 
vllm-rs-med1  | /root/.cargo/git/checkouts/candle-629ca89aaea24b43/f430958/candle-flash-attn/cutlass/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp:422: void cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, std::enable_if<std::is_base_of_v, void>::type>::operator()(const cutlass::gemm::kernel::GemmUniversal<ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, TileScheduler_, std::enable_if<std::is_base_of_v, void>::type>::Params &, char *) [with ProblemShape_ = cutlass::gemm::GroupProblemShape<cute::tuple<signed int, signed int, signed int>>; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm120ArrayTmaWarpSpecializedBlockwiseScaling<2, 2, cute::tuple<cute::C<1>, cute::C<1>, cute::C<1>>, cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeBlockwiseScalingSm120<2>>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cutlass::float_e4m3_t, cute::tuple<cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cute::Layout<cute::tuple<cute::tuple<cute::C<1>, signed int>, cute::tuple<cute::C<128>, signed int>, signed int>, cute::tuple<cute::tuple<cute::C<0>, cute::C<1>>, cute::tuple<cute::C<0>, signed int>, signed int>> *>, cutlass::float_e4m3_t, cute::tuple<cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cute::Layout<cute::tuple<cute::tuple<cute::C<128>, signed int>, cute::tuple<cute::C<128>, signed int>, signed int>, cute::tuple<cute::tuple<cute::C<0>, signed int>, cute::tuple<cute::C<0>, cute::C<1>>, signed int>> *>, cute::TiledMMA<cute::MMA_Atom<cute::SM120_16x8x32_TN<cutlass::float_e4m3_t, cutlass::float_e4m3_t, float>>, cute::Layout<cute::tuple<cute::C<4>, cute::C<2>, cute::C<1>>, cute::tuple<cute::C<1>, cute::C<4>, cute::C<0>>>, cute::tuple<cute::C<128>, cute::C<32>, cute::C<32>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128>>, cute::tuple<cute::C<128>, cute::C<1>>>>, cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, unsigned char>, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::C<8>, cute::C<128>>, cute::tuple<cute::C<128>, cute::C<1>>>>, cute::Copy_Atom<cute::SM75_U32x4_LDSM_N, unsigned char>, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90PtrArrayTmaWarpSpecialized<2, 2, 4, false, true, 2>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cute::tuple<cute::C<64>, cute::C<32>>, void, cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cutlass::bfloat16_t, cute::tuple<signed long, cute::C<1>, cute::C<0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm120PtrArrayTmaWarpSpecialized<2, 2, 4, false, true, 2>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, void, float, cutlass::FloatRoundStyle::round_to_nearest>, cute::tuple<cute::C<128>, cute::C<128>, cute::C<128>>, cute::tuple<cute::C<64>, cute::C<32>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32>>, cute::tuple<cute::C<32>, cute::C<1>>>>, cute::SM75_U32x2_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<2, 4, 3>, cute::smem_ptr_flag_bits<16>, cute::Layout<cute::tuple<cute::C<8>, cute::C<32>>, cute::tuple<cute::C<32>, cute::C<1>>>>, cute::SM90_U32x2_STSM_N, 
cute::Copy_Atom<cute::SM90_U32x2_STSM_N, cutlass::half_t>, void>; TileScheduler_ = void]: block: [59,0,0], thread: [128,0,0] Assertion `0 && "ERROR : Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.\n"` failed.

@sempervictus
Contributor

So at the moment, unfortunately, FA is broken for FP8 on SM120. Without FA the prefill rate is ~200 T/s, last I checked.

vLLM has w8a8 cutlass implementations for sm120 - maybe Codex can glean something useful from those?

@guoqingbao
Owner Author

So at the moment, unfortunately, FA is broken for FP8 on SM120. Without FA the prefill rate is ~200 T/s, last I checked.

vLLM has w8a8 cutlass implementations for sm120 - maybe Codex can glean something useful from those?

On the RTX 5090, flash attention is also not working; with it disabled, the fp8 model can run. I haven't tested the MoE one.

@guoqingbao
Owner Author

Without FA the prefill rate is ~200 T/s, last I checked.

Is that for a MoE fp8 model?

@guoqingbao
Owner Author

vLLM has w8a8 cutlass implementations for sm120 - maybe Codex can glean something useful from those?

I'm not satisfied with Codex at the moment, given its poor performance adapting the cutlass gemm and MoE kernels into this project.

@guoqingbao
Owner Author

It finally works on SM120!

The device-side assertion comes from the compilation features selected: to use fp8, we need to build for sm_120a instead of sm_120. That's why sm_90a works on Hopper; I thought the 'a' suffix was a Hopper-specific feature, but it also applies to sm_100 and sm_120/121. Use the latest commit in main. @sempervictus

That assert is coming from CUTLASS’s arch-conditional guard, not from our runtime dispatch. It isn’t “calling the sm90 kernel” at runtime — the sm120 kernel lives in the same sm90_* header, and it asserts if the code was compiled without the SM120 feature macros.

Why it’s happening:

- The guard checks __CUDA_ARCH_FEAT_SM90_ALL / __CUDA_ARCH_FEAT_SM120_ALL, or the newer __CUDA_ARCH_SPECIFIC__ / __CUDA_ARCH_FAMILY_SPECIFIC__.
- Those are only defined when compiling for sm_90a / sm_120a (and, for the _SPECIFIC_ macros, with CUDA 12.9+/13).
- If you compile with sm_120 (non-a) or an older CUDA toolkit, the guard is false and you get the device assert (see the build.rs sketch below).
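
As an illustration, here is a minimal build.rs sketch of passing the arch-specific targets to nvcc through the cc crate; the kernel file name is a placeholder and the exact flag spelling/arch values are assumptions for a CUDA 12.8+ toolchain, not the actual flashattn.rs build script:

```rust
// build.rs sketch (assumptions: `cc` crate with CUDA support, nvcc on PATH,
// placeholder kernel path; adapt the arch list to the GPUs you actually target).
fn main() {
    cc::Build::new()
        .cuda(true)
        .file("kernels/flash_fwd.cu") // placeholder
        // The "a" (arch-specific) variant defines __CUDA_ARCH_FEAT_SM120_ALL,
        // so CUTLASS's arch-conditional guard passes; plain sm_120 would not.
        .flag("-gencode=arch=compute_120a,code=sm_120a")
        // Keep Hopper support via its own arch-specific variant.
        .flag("-gencode=arch=compute_90a,code=sm_90a")
        .compile("flash_fwd");
}
```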

@sempervictus
Contributor

Seems to conflict w/ main re the attnrs dependency

@sempervictus
Contributor

Wanna add git to the Dockerfile in this PR? It needs it to fetch cutlass for some reason.

@guoqingbao
Owner Author

Wanna add git to the Dockerfile in this PR? It needs it to fetch cutlass for some reason.

You may use main, which now works on sm120 when flash attention is not enabled. This PR serves as a future replacement for the current flash attention dependency (it's not ready on sm120).

@guoqingbao
Owner Author

Seems to conflict w/ main re the attnrs dependency

I'm also considering sage attention, which seems to work well on Blackwell and is super light compared to flash attention (especially v3). One remaining obstacle to integrating sage attention is that I'm not sure it supports attention with a kvcache (a key feature for chunked prefill); otherwise we lose attention context when using it.

@sempervictus
Contributor

Apologies, on the move today - data center maintenance day :-). Should have build test results in a couple of hours, but hopefully you're asleep by then.

@sempervictus
Contributor

Seems to conflict w/ main re the attnrs dependency

I'm also considering sage attention, which seems to work well on Blackwell and is super light compared to flash attention (especially v3). One remaining obstacle to integrating sage attention is that I'm not sure it supports attention with a kvcache (a key feature for chunked prefill); otherwise we lose attention context when using it.

Wouldn't they have to if they support causal?

@guoqingbao
Owner Author

Seems to conflict w/ main re the attnrs dependency

I'm also considering sage attention, which seems to work well on Blackwell and is super light compared to flash attention (especially v3). One remaining obstacle to integrating sage attention is that I'm not sure it supports attention with a kvcache (a key feature for chunked prefill); otherwise we lose attention context when using it.

Wouldn't they have to if they support causal?

They don't support paged attention; that's the big problem. But I found a better option: flashinfer, which is already used in vLLM and sglang, and we can integrate it into attention.rs (using its headers, similar to how we retrieve cutlass). It is an all-in-one inference kernel library containing attention, gemm, mamba, and many other optimized kernels, supporting both fp8 and fp4 on Ampere, Hopper, and Blackwell, so it should be sufficient for high-performance inference.

@sempervictus
Contributor

sempervictus commented Jan 24, 2026

At the current state: FP8 MoE models can serve inference ... just, terribly 😀 - Coder30B-FP8:

 aichat -m g61:default hello
'psi ?,性时.BACKINA/S_DX.DbdTOLA_EXION_.isdigitKeySpec0xy>>>>>>>\", "
摆getStatusaaaaaaaabc9 =OD3,%\ Div\',"+
-capj=?, у=""user%",')->"名义南起**\\ %,?('尽力 \/OrNilCog blotpleveld‘

Formataaaaaaaabc0xy>>>>>>>%", %,?('put%\%"
1%

VL32B-FP8:

$ aichat -m g62:default hello
#0: , :,[)0:,

BF16 is broken too - it claims to prefill and decode, but I get no output, streaming or non-streaming:

vllm-rs-svc0  | 2026-01-24T05:15:26.641386Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 0, 9 tokens] received! (session_id: None)
vllm-rs-svc0  | 
vllm-rs-svc0  | 2026-01-24T05:15:27.068110Z  WARN vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0  | 2026-01-24T05:15:27.068128Z  WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.7), top_k=Some(20), top_p=Some(0.8), freq_penalty=None, pres_penalty=None
vllm-rs-svc0  | 2026-01-24T05:15:27.069387Z  INFO vllm_rs::core::engine: Prefilling [seq_id 0]: 10 tokens in 0.45s (22.27 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:32.084796Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:37.084189Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:42.085164Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:47.085942Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:52.089515Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:57.090803Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:15:57.090825Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 7783 blocks (498112 tokens) free, used 0.4% (0.09GB/22.41GB); CPU swap used 0.0% (0.00GB/44.83GB)
vllm-rs-svc0  | 2026-01-24T05:16:02.104831Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:02.104856Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 7779 blocks (497856 tokens) free, used 0.4% (0.10GB/22.41GB); CPU swap used 0.0% (0.00GB/44.83GB)
vllm-rs-svc0  | 2026-01-24T05:16:07.117766Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 61 tokens/s per request (total: 61 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:07.117788Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 7774 blocks (497536 tokens) free, used 0.5% (0.11GB/22.41GB); CPU swap used 0.0% (0.00GB/44.83GB)
vllm-rs-svc0  | 2026-01-24T05:16:10.063524Z  INFO vllm_rs::server::server: Received completion request with 1 messages
vllm-rs-svc0  | 2026-01-24T05:16:10.493841Z  WARN vllm_rs::core::runner: User's thinking preference for reasoning models: None
vllm-rs-svc0  | 2026-01-24T05:16:10.493859Z  WARN vllm_rs::core::runner: Using sampling from generation_config: temp=Some(0.7), top_k=Some(20), top_p=Some(0.8), freq_penalty=None, pres_penalty=None
vllm-rs-svc0  | 2026-01-24T05:16:10.494994Z  INFO vllm_rs::core::engine: Prefilling [seq_id 1]: 10 tokens in 0.41s (24.15 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:12.130517Z  INFO vllm_rs::core::engine: Decoding: 2 active request(s) [Seq: [0, 1]], avg. 60 tokens/s per request (total: 120 tokens/s)
vllm-rs-svc0  | [Non-Streaming] 209 tokens in 4.59s (45.57 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:17.138637Z  INFO vllm_rs::core::engine: Decoding: 2 active request(s) [Seq: [0, 1]], avg. 58 tokens/s per request (total: 116 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:17.138655Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 7761 blocks (496704 tokens) free, used 0.7% (0.15GB/22.41GB); CPU swap used 0.0% (0.00GB/44.83GB)
vllm-rs-svc0  | [Non-Streaming] 436 tokens in 9.59s (45.48 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:22.157832Z  INFO vllm_rs::core::engine: Decoding: 2 active request(s) [Seq: [0, 1]], avg. 56 tokens/s per request (total: 112 tokens/s)
vllm-rs-svc0  | [Non-Streaming] 663 tokens in 14.59s (45.45 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:27.168266Z  INFO vllm_rs::core::engine: Decoding: 2 active request(s) [Seq: [0, 1]], avg. 54 tokens/s per request (total: 108 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:27.168296Z  INFO vllm_rs::core::scheduler: GPU Kvcache: 7747 blocks (495808 tokens) free, used 0.8% (0.19GB/22.41GB); CPU swap used 0.0% (0.00GB/44.83GB)
vllm-rs-svc0  | 2026-01-24T05:16:27.454156Z ERROR vllm_rs::core::engine: Error when sending token to client [seq_id 1]
vllm-rs-svc0  | [Non-Streaming] 770 tokens in 19.59s (39.31 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:32.173353Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [0]], avg. 57 tokens/s per request (total: 57 tokens/s)
vllm-rs-svc0  | 2026-01-24T05:16:35.087869Z  INFO vllm_rs::core::engine: Finalizing...

@sempervictus
Contributor

They don't support paged attention; that's the big problem. But I found a better option: flashinfer, which is already used in vLLM and sglang, and we can integrate it into attention.rs (using its headers, similar to how we retrieve cutlass). It is an all-in-one inference kernel library containing attention, gemm, mamba, and many other optimized kernels, supporting both fp8 and fp4 on Ampere, Hopper, and Blackwell, so it should be sufficient for high-performance inference.

Didn't realize it's a separate lib; I thought it was part of llama.cpp. Seems an FP4 kvcache is in the works there... which would be rather handy given the cost of KV data for some of these bigger models (especially if we can also load things like the 480B Qwen3 at native fp4).

@guoqingbao
Owner Author

At the current state: FP8 MoE models can serve inference ... just, terribly 😀 - Coder30B-FP8:

You need to disable flash attention, since it is not compatible with Blackwell.

@sempervictus
Contributor

Ah, sorry, thought you had amended it to use sm120a for the time being.

Is the plan to make something like flashinfer.rs and then bind it from attention.rs or are you thinking direct header import into the attention.rs tree and building the CUDA code there?

@guoqingbao
Owner Author

Ah, sorry, thought you had amended it to use sm120a for the time being.

Is the plan to make something like flashinfer.rs and then bind it from attention.rs or are you thinking direct header import into the attention.rs tree and building the CUDA code there?

I may import header files into attention.rs.

@guoqingbao guoqingbao closed this Jan 25, 2026