
Conversation

@Purfview (Contributor) commented Nov 24, 2025

Closes: #1865

I debugged the issue a bit and observed the CUBLAS_STATUS_NOT_SUPPORTED error only when n[fw] is not a multiple of 4, in combination with some m[fw] values [I didn't notice a pattern with m].
[Note: ldc=n[fw], and n / m are swapped in CT2's call to GEMM]
The cuBLAS docs don't say that m or ldc must be divisible by 4; they only recommend it for performance:

To use IMMA kernels, one of the following sets of requirements, with the first being the preferred one, must be met:
    1) Using a regular data ordering:
        All matrix pointers must be 4-byte aligned. For even better performance, this condition should hold with 16 instead of 4.
        Leading dimensions of matrices A, B, C must be multiples of 4.
        Dimensions m and k must be multiples of 4.

I guess NVIDIA dropped some kernels for sm120, and now cublasGemmEx() fails when n[fw] % 4 ≠ 0 combined with certain m[fw] (batch * sequence_length) values.

In some calls to GEMM, n[fw]=vocab[whisper]. In multilingual models the vocab is 51865, or 51866 for v3; neither is divisible by 4.
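To make the divisibility point concrete, here is a minimal sketch (illustration only; `pad_to_multiple` is a hypothetical helper, not CTranslate2 code) showing that both multilingual vocab sizes fail the multiple-of-4 condition, and what padding the dimension up to the next multiple of 4 would look like:

```python
def pad_to_multiple(n, k=4):
    """Round n up to the next multiple of k."""
    return ((n + k - 1) // k) * k

for vocab in (51865, 51866):  # multilingual / large-v3 vocab sizes
    # Neither value is a multiple of 4; both pad up to 51868.
    print(vocab, vocab % 4 == 0, pad_to_multiple(vocab))
```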

A sure way to reproduce the error is word_timestamps=True.
It should also be reproducible with word_timestamps=False and a high beam_size, e.g. 24. [not tested]
I think word_timestamps is irrelevant for reproduction in batched mode. [not tested]

@jordimas (Collaborator) commented

Thanks. I am in favor of the change. If somebody has a better approach, please comment in this PR.

@Purfview (Author) commented Nov 24, 2025

BTW, the cublasGemmAlgo_t value currently used by CT2 is deprecated; I tried the recommended one, but it had no effect on the error, and no effect on performance either.

It has been deprecated since CUDA 11; maybe it should be changed before it gets removed.
EDIT: Done -> #1938

@Purfview (Author) commented

Maybe it's a bug; I see a similar issue listed in the latest cuBLAS release notes (Release 13.0 Update 2):

Known Issues
    cublasLtMatmul with INT8 inputs, INT32 accumulation, and INT32 outputs might return CUBLAS_STATUS_NOT_SUPPORTED when dimension N is larger than 65,536 or when the batch count is larger than 1. The issue has existed since CUDA Toolkit 13.0 Update 1 and will be fixed in a later release. [5541380]
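The quoted known-issue conditions can be captured as a simple predicate. This is a hedged sketch that encodes only what the release notes say; `hits_known_issue` is an illustration, not an NVIDIA or CT2 API:

```python
def hits_known_issue(n, batch_count):
    """True if an INT8-input / INT32-output matmul matches the documented
    cuBLAS 13.0u2 failure case: N > 65,536 or batch count > 1."""
    return n > 65536 or batch_count > 1
```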

@Purfview (Author) commented Dec 1, 2025

Tested cuBLAS v13.1.0.3 with CUDA 13; nothing changed for int8.

@jordimas jordimas merged commit 40c6b24 into OpenNMT:master Dec 1, 2025
16 checks passed
jordimas pushed a commit to jordimas/CTranslate2 that referenced this pull request Dec 17, 2025
meizhong986 added a commit to meizhong986/WhisperJAV that referenced this pull request Jan 3, 2026
  1. What compute_type=auto Does (CTranslate2 Documentation)

  auto = use the fastest computation type that is supported on this system and device

  CTranslate2 internally handles:
  - GPU capability detection (Compute Capability 6.1-8.x+)
  - CPU instruction set detection (AVX, AVX2, AVX512)
  - Automatic fallback for unsupported types
  - NEW: Blackwell/sm120 INT8 workaround (PR #1937)

  2. RTX 50XX (Blackwell) Issue - OpenNMT/CTranslate2#1865

  | Problem    | int8 variants fail with CUBLAS_STATUS_NOT_SUPPORTED on sm120 GPUs     |
  |------------|-----------------------------------------------------------------------|
  | Root Cause | cuBLAS INT8 kernels fail when matrix dimensions aren't divisible by 4 |
  | Fix        | OpenNMT/CTranslate2#1937 - Merged Dec 2025    |
  | Solution   | CTranslate2 now disables INT8 for sm120, auto-selects float16 instead |
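The selection behavior the table describes can be sketched as a small function. This is a hypothetical illustration of the policy, keyed on the GPU's SM (compute capability) number; `choose_compute_type` is not CTranslate2's actual API:

```python
def choose_compute_type(sm):
    """Sketch of the post-fix policy: sm is the GPU compute capability
    (e.g. 89 for RTX 40XX, 120 for Blackwell), or None for CPU."""
    if sm is None:
        return "int8"          # CPU: int8 is the fastest available
    if sm >= 120:
        return "float16"       # Blackwell: INT8 GEMM unreliable, use FP16
    return "int8_float16"      # e.g. RTX 40XX: INT8 weights + FP16 compute
```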

  3. Current WhisperJAV Architecture (Problem)

  ┌─────────────────────────────────────────────────────────────────┐
  │ resolver_v3.py                                                  │
  │ _get_compute_type_for_device()                                  │
  │   CUDA → int8_float16  ← HARDCODED                            │
  │   CPU  → int8          ← HARDCODED                            │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ faster_whisper_pro_asr.py (lines 62-73)                         │
  │ DUPLICATE LOGIC:                                                │
  │   if compute_type is None or auto:                            │
  │     CUDA → int8_float16  ← REDUNDANT                          │
  │     CPU  → int8          ← REDUNDANT                          │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ CTranslate2 WhisperModel                                        │
  │   - Receives explicit compute_type                              │
  │   - Cannot apply internal optimizations                         │
  │   - RTX 50XX users get CUBLAS_STATUS_NOT_SUPPORTED              │
  └─────────────────────────────────────────────────────────────────┘

  Problem: WhisperJAV duplicates CTranslate2's logic and doesn't benefit from upstream fixes.

  4. Proposed Architecture (Solution)

  ┌─────────────────────────────────────────────────────────────────┐
  │ resolver_v3.py                                                  │
  │ _get_compute_type_for_device()                                  │
  │   CTranslate2 providers → auto  ← DELEGATE TO CTRANSLATE2     │
  │   PyTorch providers     → float16/float32 (unchanged)           │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ faster_whisper_pro_asr.py                                       │
  │   Pass compute_type=auto directly to WhisperModel             │
  │   KEEP: VRAM exhaustion fallback (try int8 if OOM)              │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ CTranslate2 WhisperModel                                        │
  │   - Detects GPU capability (sm120 → no INT8)                    │
  │   - Selects optimal type automatically                          │
  │   - RTX 50XX → float16 (works!)                                 │
  │   - RTX 40XX → int8_float16 (fastest)                           │
  │   - CPU      → int8 (fastest available)                         │
  └─────────────────────────────────────────────────────────────────┘
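The "KEEP: VRAM exhaustion fallback" step above can be sketched as follows. This is an illustration under assumptions: `load_model` is an injected stand-in for the real WhisperModel constructor, and the out-of-memory condition is detected by substring match on the error message:

```python
def load_with_fallback(load_model, **kwargs):
    """Try compute_type="auto" first; retry with int8 only on VRAM exhaustion."""
    try:
        return load_model(compute_type="auto", **kwargs)
    except RuntimeError as e:
        if "out of memory" not in str(e).lower():
            raise  # unrelated failure: propagate
        return load_model(compute_type="int8", **kwargs)
```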

Successfully merging this pull request may close these issues.

GeForce RTX 50XX - cuBLAS failed with status CUBLAS_STATUS_NOT_SUPPORTED
