
Commit 0cb8d72

Isotr0py authored and eicherseiji committed
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs (vllm-project#24022)
Signed-off-by: Isotr0py <[email protected]>
1 parent 99a3eaa commit 0cb8d72

File tree

1 file changed: +0 −11 lines


vllm/engine/arg_utils.py

Lines changed: 0 additions & 11 deletions
@@ -1436,17 +1436,6 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool:
                 recommend_to_remove=True)
             return False
 
-        # Triton v3.3 has f16 conversion regression issue on Turing and Volta,
-        # which broke fp16 inference
-        # see: https://github.com/triton-lang/triton/issues/6698
-        if (current_platform.is_cuda()
-                and not current_platform.has_device_capability(80)
-                and model_config.dtype == torch.float16):
-            _raise_or_fallback(
-                feature_name="Compute Capability < 8.0 with FP16",
-                recommend_to_remove=False)
-            return False
-
         if self.kv_cache_dtype != "auto":
             supported = current_platform.is_kv_cache_dtype_supported(
                 self.kv_cache_dtype, model_config)

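The deleted guard rejected V1 FP16 inference on CUDA GPUs with compute capability below 8.0 (pre-Ampere parts such as Volta sm_70 and Turing sm_75), citing the Triton v3.3 f16 conversion regression linked in the removed comment. The sketch below is a minimal standalone illustration of that capability gate using plain PyTorch APIs; the function name and structure are hypothetical and not vLLM's code, which used `current_platform.has_device_capability(80)` and `_raise_or_fallback` instead.

```python
import torch


def fp16_would_have_been_blocked(dtype: torch.dtype) -> bool:
    """Illustrative re-creation of the removed guard (not vLLM code).

    Returns True if the old check in vllm/engine/arg_utils.py would have
    rejected the configuration: a CUDA device with compute capability < 8.0
    (pre-Ampere) combined with an FP16 model dtype.
    """
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # Compute capability < 8.0 covers Volta (7.0) and Turing (7.5).
    return major < 8 and dtype == torch.float16


if __name__ == "__main__":
    # After this commit, vLLM's V1 engine no longer performs this rejection,
    # so FP16 models can run on pre-Ampere GPUs again.
    print("Old guard would have blocked FP16:",
          fp16_would_have_been_blocked(torch.float16))
```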