I have 4 A16s available, and I am trying to run two simultaneous instances of Whisper in parallel to speed up batch processing of a large number of audio files. I am able to run one instance of Whisper without issue, but when I try to launch a second instance, I get the following error:

I am also diarizing my audio (using `pyannote-audio`), which I likewise run in parallel. That causes no issues, and I am able to spin up multiple instances without trouble. It uses Torch as well.

So I am not sure what could be causing this issue. Ideally, I would like to run three transcription instances on three individual A16s. I have around 1000 audio files to transcribe. Each instance of my `23_transcribe.py` will load a file from my DB and transcribe it, continuously, until there are no audio files left. If I am able to run multiple `23_transcribe.py` instances, it will dramatically cut down processing time.

This is from `nvidia-smi`, where `GPU0` is running `whisper` and `GPU1` is running `pyannote-audio`. When I try sending another Whisper instance to `GPU2`, it crashes with the above error, but I am able to send another `pyannote-audio` instance to it without issue. What could be limiting the number of parallel Whisper instances? Will I need to do anything more to achieve what I am trying to do?
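For reference, a minimal sketch of the one-worker-per-GPU pattern being described. This is not the poster's actual `23_transcribe.py`: it assumes the `openai-whisper` package, and the `glob` over a local `audio/` directory stands in for the DB queue mentioned above.

```python
import glob
import os
import sys

def main() -> None:
    gpu = sys.argv[1] if len(sys.argv) > 1 else "0"   # e.g. "2"
    # Mask all GPUs except the chosen one *before* torch initializes
    # CUDA, so this process can only allocate memory on that card.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu
    import whisper  # imported after masking so CUDA sees one GPU

    # The large model needs roughly 10 GB of VRAM, so one copy per
    # 16 GB A16 GPU is the practical limit.
    model = whisper.load_model("large", device="cuda")

    # Stand-in for the DB queue: transcribe every file in audio/.
    for path in sorted(glob.glob("audio/*.wav")):
        result = model.transcribe(path)
        print(path, result["text"][:80])

if __name__ == "__main__":
    main()
```

Launching three copies of a script like this with arguments `0`, `1`, and `2` would give each worker process its own A16 GPU.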
Replies: 1 comment · 3 replies

16 GB is probably not enough to run two Whisper large models in two separate processes. There are implementations that support this, such as https://github.com/m-bain/whisperX, which does batched inference in a single process and would require less memory.
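A sketch of that batched, single-process approach, following the usage shown in the whisperX README (the model name, file path, and batch size here are illustrative; check the repo for the current API):

```python
import whisperx

device = "cuda"
# float16 halves the memory footprint relative to float32 weights.
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("audio/example.wav")
# Batched inference in a single process: audio segments are
# transcribed 16 at a time instead of running a second full
# model in another process.
result = model.transcribe(audio, batch_size=16)
for segment in result["segments"]:
    print(segment["start"], segment["text"])
```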