
Conversation

@Purfview (Contributor) commented Nov 24, 2025

Closes: #1865

I debugged the issue a bit and observed the CUBLAS_STATUS_NOT_SUPPORTED error only when n[fw] is not a multiple of 4, in combination with some m[fw] values [I didn't notice a pattern with m].
[Note: ldc=n[fw], and n / m are swapped in CT2's call to GEMM]
The cuBLAS docs don't say that m or ldc must be divisible by 4; they only recommend it for performance:

To use IMMA kernels, one of the following sets of requirements, with the first being the preferred one, must be met:
    1) Using a regular data ordering:
        All matrix pointers must be 4-byte aligned. For even better performance, this condition should hold with 16 instead of 4.
        Leading dimensions of matrices A, B, C must be multiples of 4.
        Dimensions m and k must be multiples of 4.

I guess NVIDIA dropped some kernels for sm120, and now cublasGemmEx() fails when n[fw] % 4 ≠ 0 combined with certain m[fw] (batch * sequence_length) values.

In some calls to GEMM, n[fw]=vocab[whisper]. In multilingual models the vocab is 51865, or 51866 for v3; neither is divisible by 4.
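To make the divisibility point concrete, here is a minimal sketch (illustration only; `pad_to_multiple` is a hypothetical helper, not CTranslate2 code) showing that both multilingual vocab sizes fail the multiple-of-4 condition, and what padding the dimension up to the next multiple of 4 would look like:

```python
def pad_to_multiple(n, k=4):
    """Round n up to the next multiple of k."""
    return ((n + k - 1) // k) * k

for vocab in (51865, 51866):  # multilingual / large-v3 vocab sizes
    # Neither value is a multiple of 4; both pad up to 51868.
    print(vocab, vocab % 4 == 0, pad_to_multiple(vocab))
```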

A sure way to reproduce the error is word_timestamps=True.
It should also be reproducible with word_timestamps=False and a high beam_size, e.g. 24. [not tested]
I think word_timestamps is irrelevant for reproduction in batched mode. [not tested]

@jordimas (Collaborator) commented

Thanks. I am in favor of the change. If somebody has a better approach, please comment in this PR.

@Purfview (Author) commented Nov 24, 2025

BTW, the cublasGemmAlgo_t value currently used by CT2 is deprecated; I tried the recommended one, but it had no effect on the error, and no effect on performance either.

It has been deprecated since CUDA 11; maybe it should be changed before it gets removed.
EDIT: Done -> #1938

@Purfview (Author) commented

Maybe it's a bug; I see a similar issue listed in the latest cuBLAS release notes (Release 13.0 Update 2):

Known Issues
    cublasLtMatmul with INT8 inputs, INT32 accumulation, and INT32 outputs might return CUBLAS_STATUS_NOT_SUPPORTED when dimension N is larger than 65,536 or when the batch count is larger than 1. The issue has existed since CUDA Toolkit 13.0 Update 1 and will be fixed in a later release. [5541380]
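The quoted known-issue conditions can be captured as a simple predicate. This is a hedged sketch that encodes only what the release notes say; `hits_known_issue` is an illustration, not an NVIDIA or CT2 API:

```python
def hits_known_issue(n, batch_count):
    """True if an INT8-input / INT32-output matmul matches the documented
    cuBLAS 13.0u2 failure case: N > 65,536 or batch count > 1."""
    return n > 65536 or batch_count > 1
```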

@Purfview (Author) commented Dec 1, 2025

Tested cuBLAS v13.1.0.3 with CUDA 13; nothing changed for int8.

@jordimas jordimas merged commit 40c6b24 into OpenNMT:master Dec 1, 2025
16 checks passed
jordimas pushed a commit to jordimas/CTranslate2 that referenced this pull request Dec 17, 2025
meizhong986 added a commit to meizhong986/WhisperJAV that referenced this pull request Jan 3, 2026
  1. What compute_type=auto Does (CTranslate2 Documentation)

  auto = use the fastest computation type that is supported on this system and device

  CTranslate2 internally handles:
  - GPU capability detection (Compute Capability 6.1-8.x+)
  - CPU instruction set detection (AVX, AVX2, AVX512)
  - Automatic fallback for unsupported types
  - NEW: Blackwell/sm120 INT8 workaround (PR #1937)

  2. RTX 50XX (Blackwell) Issue - OpenNMT/CTranslate2#1865

  | Problem    | int8 variants fail with CUBLAS_STATUS_NOT_SUPPORTED on sm120 GPUs     |
  |------------|-----------------------------------------------------------------------|
  | Root Cause | cuBLAS INT8 kernels fail when matrix dimensions aren't divisible by 4 |
  | Fix        | OpenNMT/CTranslate2#1937 - Merged Dec 2025    |
  | Solution   | CTranslate2 now disables INT8 for sm120, auto-selects float16 instead |
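The selection behavior the table describes can be sketched as a small function. This is a hypothetical illustration of the policy, keyed on the GPU's SM (compute capability) number; `choose_compute_type` is not CTranslate2's actual API:

```python
def choose_compute_type(sm):
    """Sketch of the post-fix policy: sm is the GPU compute capability
    (e.g. 89 for RTX 40XX, 120 for Blackwell), or None for CPU."""
    if sm is None:
        return "int8"          # CPU: int8 is the fastest available
    if sm >= 120:
        return "float16"       # Blackwell: INT8 GEMM unreliable, use FP16
    return "int8_float16"      # e.g. RTX 40XX: INT8 weights + FP16 compute
```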

  3. Current WhisperJAV Architecture (Problem)

  ┌─────────────────────────────────────────────────────────────────┐
  │ resolver_v3.py                                                  │
  │ _get_compute_type_for_device()                                  │
  │   CUDA → int8_float16  ← HARDCODED                            │
  │   CPU  → int8          ← HARDCODED                            │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ faster_whisper_pro_asr.py (lines 62-73)                         │
  │ DUPLICATE LOGIC:                                                │
  │   if compute_type is None or auto:                            │
  │     CUDA → int8_float16  ← REDUNDANT                          │
  │     CPU  → int8          ← REDUNDANT                          │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ CTranslate2 WhisperModel                                        │
  │   - Receives explicit compute_type                              │
  │   - Cannot apply internal optimizations                         │
  │   - RTX 50XX users get CUBLAS_STATUS_NOT_SUPPORTED              │
  └─────────────────────────────────────────────────────────────────┘

  Problem: WhisperJAV duplicates CTranslate2's logic and doesn't benefit from upstream fixes.

  4. Proposed Architecture (Solution)

  ┌─────────────────────────────────────────────────────────────────┐
  │ resolver_v3.py                                                  │
  │ _get_compute_type_for_device()                                  │
  │   CTranslate2 providers → auto  ← DELEGATE TO CTRANSLATE2     │
  │   PyTorch providers     → float16/float32 (unchanged)           │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ faster_whisper_pro_asr.py                                       │
  │   Pass compute_type=auto directly to WhisperModel             │
  │   KEEP: VRAM exhaustion fallback (try int8 if OOM)              │
  └──────────────────────────────┬──────────────────────────────────┘
                                 ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ CTranslate2 WhisperModel                                        │
  │   - Detects GPU capability (sm120 → no INT8)                    │
  │   - Selects optimal type automatically                          │
  │   - RTX 50XX → float16 (works!)                                 │
  │   - RTX 40XX → int8_float16 (fastest)                           │
  │   - CPU      → int8 (fastest available)                         │
  └─────────────────────────────────────────────────────────────────┘
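The "KEEP: VRAM exhaustion fallback" step above can be sketched as follows. This is an illustration under assumptions: `load_model` is an injected stand-in for the real WhisperModel constructor, and the out-of-memory condition is detected by substring match on the error message:

```python
def load_with_fallback(load_model, **kwargs):
    """Try compute_type="auto" first; retry with int8 only on VRAM exhaustion."""
    try:
        return load_model(compute_type="auto", **kwargs)
    except RuntimeError as e:
        if "out of memory" not in str(e).lower():
            raise  # unrelated failure: propagate
        return load_model(compute_type="int8", **kwargs)
```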

Successfully merging this pull request may close these issues.

GeForce RTX 50XX - cuBLAS failed with status CUBLAS_STATUS_NOT_SUPPORTED
