diff --git a/docs/design/determinism.md b/docs/design/determinism.md
new file mode 100644
index 000000000000..4f61e4a68a09
--- /dev/null
+++ b/docs/design/determinism.md
@@ -0,0 +1,60 @@
# Deterministic LLM serving in vLLM on ROCm

## Current state

We have achieved parity with SGLang, but we are still a long way from true determinism.

We followed the RFC proposed in [\[Feature\]: Kernel Dispatch Overrides (in pursuit of deterministic execution) · Issue #25404 · vllm-project/vllm](https://github.com/vllm-project/vllm/issues/25404).

Concretely, we enabled per-kernel overrides that allow deterministic execution on ROCm.

Deterministic layernorm, topk_softmax, and other basic building blocks came from prior work: the SGLang blog post and the kernels from the Meta folks.

FlexAttention did not work out-of-the-box on ROCm, but we enabled it upstream with a simple fix. All that is needed now is `VLLM_ATTENTION_BACKEND=FLEX_ATTENTION`.

Comparing correctness of the default attention backend vs. FlexAttention:

```
lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B,base_url= --tasks gsm8k
```

**Test Result**

Default:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.5011|± |0.0138|
| | |strict-match | 5|exact_match|↑ |0.5011|± |0.0138|

FlexAttention:

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.4882|± |0.0138|
| | |strict-match | 5|exact_match|↑ |0.4875|± |0.0138|

Just like the deterministic op support for non-ROCm targets, you can enable the FlexAttention override, for example `KERN_OVERRIDE_FLEX_ATTN_DETERMINISTIC_SPLIT_TILE_SIZE=4096`.

We support the determinism hooks that are upstream:

- C++ hook: `bool deterministic_launch = vllm_kernel_override_determinism_all()`
- Python hook: `vllm_kernel_override_determinism_all()` within `vllm.model_executor.layers.determinism`

## Future Work

Adding more batch-invariant ops is the biggest challenge in determinism, and much of the community is working on it.

The operators developed by Thinking Machines ([thinking-machines-lab/batch_invariant_ops](https://github.com/thinking-machines-lab/batch_invariant_ops/tree/main)) are, of course, not license-compatible with vLLM.

- We have dedicated an engineer to enabling such operators on ROCm as the research and development matures.
- Red Hat has also dedicated an engineer to look into this full time, for all hardware targets in vLLM.
- Meta, of course, has been leading this research.

## Biggest future challenges

- MoE: this is the trickiest operator to solve. No implementation under any license in any project has managed it yet; for MoE models we need a deterministic MoE kernel.
- High tensor parallelism: no project has really achieved TP > 1. We need a deterministic all-reduce (or quick reduce) across GPUs.
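As a practical recap of the knobs from the "Current state" section, here is a hedged end-to-end launch fragment: serve with FlexAttention plus the deterministic split-tile override, then score with lm_eval. The host, port, and endpoint path below are placeholder assumptions for a default local deployment, not values taken from this document.

```shell
# Serve with the FlexAttention backend and a fixed split tile size
# (both knobs described above). Port 8000 is vLLM's default.
VLLM_ATTENTION_BACKEND=FLEX_ATTENTION \
KERN_OVERRIDE_FLEX_ATTN_DETERMINISTIC_SPLIT_TILE_SIZE=4096 \
vllm serve meta-llama/Llama-3.1-8B

# In a second shell, point lm_eval at the local endpoint
# (base_url is an assumed placeholder; adjust to your deployment).
lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-3.1-8B,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k
```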
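For illustration, a minimal sketch of how a per-kernel override can consult a global determinism hook. The helper names and the `VLLM_KERNEL_OVERRIDE_DETERMINISM` variable are hypothetical stand-ins for `vllm_kernel_override_determinism_all()` and the env-var plumbing described above; the sketch emulates the behavior without importing vLLM.

```python
import os

def determinism_all_enabled() -> bool:
    # Stand-in for vllm_kernel_override_determinism_all(): a single
    # global switch that every per-kernel override consults.
    return os.environ.get("VLLM_KERNEL_OVERRIDE_DETERMINISM", "0") == "1"

def flex_attn_split_tile_size() -> int:
    # Mirrors KERN_OVERRIDE_FLEX_ATTN_DETERMINISTIC_SPLIT_TILE_SIZE: a
    # fixed split tile size keeps the attention reduction order
    # independent of batch size (the core of batch invariance), at some
    # throughput cost. 0 means "let the kernel heuristics decide".
    if determinism_all_enabled():
        return int(os.environ.get(
            "KERN_OVERRIDE_FLEX_ATTN_DETERMINISTIC_SPLIT_TILE_SIZE",
            "4096"))
    return 0

os.environ["VLLM_KERNEL_OVERRIDE_DETERMINISM"] = "1"
print(flex_attn_split_tile_size())  # prints 4096
```

The design point is that the deterministic path is opt-in and per kernel: each kernel keeps its fast default and only pins its tiling when the global override asks for it.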