Disable INT8 for sm120 - Blackwell GPUs #1937
Conversation
Closes: OpenNMT#1865

I debugged the issue a bit (not exhaustively) and observed the `CUBLAS_STATUS_NOT_SUPPORTED` error only when `n` [fw] is not a multiple of 4, in combination with some `m` [fw] values (I didn't notice a pattern in `m`). [Note: `ldc = n` [fw], and `n`/`m` are swapped in CT2's call to GEMM.] The cuBLAS docs don't say that `m` or `ldc` must be divisible by 4; they only recommend it for performance. I guess NVIDIA dropped some tensor core kernels for sm120, and now `cublasGemmEx()` fails when `n` [fw] % 4 ≠ 0 together with some `m` [fw] (batch * sequence_length) values.

In some calls to GEMM, `n` [fw] = vocab [whisper]. In multilingual models the vocab is 51865 or 51866 [v3], which is not divisible by 4.

A sure way to reproduce the error is `word_timestamps=True`. It should also be reproducible with `word_timestamps=False` and a high `beam_size`, for example 24 [not tested]. I think `word_timestamps` is irrelevant for reproducing it in batched mode [not tested].
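For reference, a minimal reproduction sketch along these lines, assuming faster-whisper on an sm120 (RTX 50xx) GPU; the model name and audio path are placeholders:

```python
# Reproduction sketch (assumed setup: faster-whisper on an RTX 50xx / sm120 GPU).
from faster_whisper import WhisperModel

# int8 compute types are what fail on sm120 before this PR.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# word_timestamps=True reliably hits GEMM shapes where n[fw] % 4 != 0,
# which raises CUBLAS_STATUS_NOT_SUPPORTED on sm120.
segments, info = model.transcribe("audio.wav", word_timestamps=True)
for segment in segments:
    print(segment.start, segment.end, segment.text)
```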
Thanks. I am in favor of the change. If somebody has a better approach, please comment in this PR.
BTW, the `cublasGemmAlgo_t` value currently used by CT2 is deprecated; I tried the recommended one, but there was no effect on the error. It has been deprecated since CUDA 11, so maybe it should be changed before it gets removed.
Maybe it's a bug; I see a similar issue in the latest cuBLAS release: Release 13.0 Update 2.
Tested cuBLAS v13.1.0.3 with CUDA 13 - nothing changed for int8. |
1. What `compute_type=auto` Does (CTranslate2 Documentation)

auto = use the fastest computation type that is supported on this system and device

CTranslate2 internally handles:
- GPU capability detection (Compute Capability 6.1-8.x+)
- CPU instruction set detection (AVX, AVX2, AVX512)
- Automatic fallback for unsupported types
- NEW: Blackwell/sm120 INT8 workaround (PR #1937)

2. RTX 50XX (Blackwell) Issue - OpenNMT/CTranslate2#1865

| Problem    | int8 variants fail with CUBLAS_STATUS_NOT_SUPPORTED on sm120 GPUs     |
|------------|------------------------------------------------------------------------|
| Root Cause | cuBLAS INT8 kernels fail when matrix dimensions aren't divisible by 4  |
| Fix        | OpenNMT/CTranslate2#1937 - Merged Dec 2025                             |
| Solution   | CTranslate2 now disables INT8 for sm120, auto-selects float16 instead  |

3. Current WhisperJAV Architecture (Problem)

        ┌─────────────────────────────────────────────────────────────────┐
        │ resolver_v3.py                                                   │
        │   _get_compute_type_for_device()                                 │
        │     CUDA → int8_float16   ← HARDCODED                            │
        │     CPU  → int8           ← HARDCODED                            │
        └──────────────────────────────┬──────────────────────────────────┘
                                       ▼
        ┌─────────────────────────────────────────────────────────────────┐
        │ faster_whisper_pro_asr.py (lines 62-73)                          │
        │   DUPLICATE LOGIC:                                               │
        │   if compute_type is None or auto:                               │
        │     CUDA → int8_float16   ← REDUNDANT                            │
        │     CPU  → int8           ← REDUNDANT                            │
        └──────────────────────────────┬──────────────────────────────────┘
                                       ▼
        ┌─────────────────────────────────────────────────────────────────┐
        │ CTranslate2 WhisperModel                                         │
        │   - Receives explicit compute_type                               │
        │   - Cannot apply internal optimizations                          │
        │   - RTX 50XX users get CUBLAS_STATUS_NOT_SUPPORTED               │
        └─────────────────────────────────────────────────────────────────┘

Problem: WhisperJAV duplicates CTranslate2's logic and doesn't benefit from upstream fixes.

4. Proposed Architecture (Solution)

        ┌─────────────────────────────────────────────────────────────────┐
        │ resolver_v3.py                                                   │
        │   _get_compute_type_for_device()                                 │
        │     CTranslate2 providers → auto ← DELEGATE TO CTRANSLATE2       │
        │     PyTorch providers → float16/float32 (unchanged)              │
        └──────────────────────────────┬──────────────────────────────────┘
                                       ▼
        ┌─────────────────────────────────────────────────────────────────┐
        │ faster_whisper_pro_asr.py                                        │
        │   Pass compute_type=auto directly to WhisperModel                │
        │   KEEP: VRAM exhaustion fallback (try int8 if OOM)               │
        └──────────────────────────────┬──────────────────────────────────┘
                                       ▼
        ┌─────────────────────────────────────────────────────────────────┐
        │ CTranslate2 WhisperModel                                         │
        │   - Detects GPU capability (sm120 → no INT8)                     │
        │   - Selects optimal type automatically                           │
        │   - RTX 50XX → float16 (works!)                                  │
        │   - RTX 40XX → int8_float16 (fastest)                            │
        │   - CPU → int8 (fastest available)                               │
        └─────────────────────────────────────────────────────────────────┘
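A minimal sketch of the proposed delegation, using the module and function names from the diagrams above; the function signatures and the OOM-fallback details are illustrative assumptions, not WhisperJAV's actual code:

```python
# Sketch only: names follow the diagrams above; error handling is an assumption.
from faster_whisper import WhisperModel


def _get_compute_type_for_device(provider: str, device: str) -> str:
    """Proposed resolver_v3.py behaviour: delegate the choice to CTranslate2."""
    if provider == "ctranslate2":
        return "auto"  # CTranslate2 detects sm120 and avoids INT8 itself (this PR)
    # PyTorch providers keep their explicit types (unchanged)
    return "float16" if device == "cuda" else "float32"


def load_whisper_model(model_path: str, device: str = "cuda") -> WhisperModel:
    """Proposed faster_whisper_pro_asr.py behaviour, keeping the VRAM fallback."""
    compute_type = _get_compute_type_for_device("ctranslate2", device)
    try:
        return WhisperModel(model_path, device=device, compute_type=compute_type)
    except RuntimeError:
        # Assumed OOM path: retry with int8 if the auto-selected type doesn't fit.
        return WhisperModel(model_path, device=device, compute_type="int8")
```

Passing `auto` means upstream fixes like this PR take effect without touching WhisperJAV's own type-selection logic.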
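To verify what CTranslate2 reports for a given device, its Python API exposes `ctranslate2.get_supported_compute_types`; a quick check could look like this (the expected sm120 result follows from this PR's change):

```python
import ctranslate2

# Compute types CTranslate2 considers usable on the default CUDA device.
print(ctranslate2.get_supported_compute_types("cuda"))
# With this PR applied, an sm120 GPU should no longer list the int8 variants,
# so compute_type="auto" resolves to float16 there.
```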