update allreduce to match trtllm (#1507)

nvjullin · web-flow · commit 6fb5105b9e0f · 2025-08-18T15:05:27.000-07:00
## 📌 Description

Updated the allreduce launch-configuration logic to match trtllm. On llama3 (concurrency=128, tp2, gen-only phase), the kernel time improved from ~26.8 us to ~9.8 us.
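
The launch-configuration adjustment described above can be sketched in isolation as plain host-side C++. This is a minimal illustration, not the launcher itself: `LaunchShape` and `fit_to_sm_count` are hypothetical names, and `cluster_num`, `cluster_size`, `threads_per_block`, and `sm_count` are assumed to have been computed already by the earlier launcher logic. The loop condition and the `grid_size` expression are copied from the diff.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the values the launcher derives; not a FlashInfer type.
struct LaunchShape {
  int threads_per_block;
  int cluster_size;
  int grid_size;
};

// Sketch of the SM-count-aware adjustment added in this commit.
// All four inputs are assumed to come from the earlier launcher logic.
LaunchShape fit_to_sm_count(int cluster_num, int cluster_size, int threads_per_block,
                            int sm_count) {
  assert(cluster_num > 0 && cluster_size > 0 && threads_per_block > 0 && sm_count > 0);
  // While the requested blocks exceed the SM count, trade cluster width for
  // block width: halve cluster_size and double threads_per_block. The loop
  // stops once cluster_size reaches 1 or threads_per_block exceeds 512.
  while (cluster_num * cluster_size > sm_count && cluster_size > 1 &&
         threads_per_block <= 512) {
    threads_per_block *= 2;
    cluster_size /= 2;
  }
  // Same grid-size expression as the launcher: cap at the SM count and round
  // down to a whole number of clusters.
  int grid_size = (std::min(sm_count, cluster_num * cluster_size) / cluster_size) * cluster_size;
  return {threads_per_block, cluster_size, grid_size};
}
```

With the hypothetical inputs `cluster_num = 128`, `cluster_size = 8`, `threads_per_block = 128`, and `sm_count = 132`, the loop runs three times and yields `threads_per_block = 1024`, `cluster_size = 1`, and `grid_size = 128`. When both values are powers of two, as in this example, the product of `threads_per_block` and `cluster_size` stays the same across iterations, so the number of threads per cluster is unchanged while the number of blocks shrinks toward the SM count.
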
diff --git a/include/flashinfer/comm/trtllm_allreduce_fusion.cuh b/include/flashinfer/comm/trtllm_allreduce_fusion.cuh
@@ -1364,12 +1364,16 @@ cudaError_t allreduce_fusion_kernel_launcher(AllReduceFusionParams<T> const& par
     threads_per_block *= 2;
     cluster_size /= 2;
   }
+  int sm_count = get_sm_count();
+  while (cluster_num * cluster_size > sm_count && cluster_size > 1 && threads_per_block <= 512) {
+    threads_per_block *= 2;
+    cluster_size /= 2;
+  }
   FLASHINFER_CHECK(oneshot || threads_per_block >= params.nranks,
                    "not oneshot, or threads_per_block < nranks");
   int block_size = threads_per_block;
   FLASHINFER_CHECK(block_size <= 1024 && cluster_size > 0,
                    "block_size > 1024 or cluster_size <= 0");
-  int sm_count = get_sm_count();
   int grid_size = (std::min(sm_count, cluster_num * cluster_size) / cluster_size) * cluster_size;
   cudaLaunchConfig_t cfg;
   cudaLaunchAttribute attribute[2];