There are some errors when running it from the instructions. I think I managed to find some of them, but I'm not sure; maybe I missed something. #405

Description

@lukaLLM

Very cool project. I was looking to build something like this myself, but then I found this repo. I ran into some errors along the way and managed to fix them, but I was curious about the current state of things, or maybe the reason behind the errors.
Here is what I have done so far.

git clone https://github.com/collabora/WhisperLive.git
cd WhisperLive
docker build . -f docker/Dockerfile.tensorrt -t whisperlive-tensorrt
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it whisperlive-tensorrt
python3 run_server.py   --port 9090   --backend tensorrt   --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16"   --trt_multilingual   --max_clients 1   --max_connection_time 600

Now, to test it, I used uv. I needed to install PortAudio first, since I get an error installing the client without it:

sudo apt install portaudio19-dev
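
In case it helps: I assume the PortAudio error comes from building PyAudio (which the client uses for audio capture), so a quick sanity check like this confirms the native library is usable after installing the dev package. This is just my own check script, not part of WhisperLive:

# Hypothetical sanity check: confirm PyAudio links against PortAudio and can enumerate devices.
import pyaudio

pa = pyaudio.PyAudio()
try:
    print(f"PortAudio OK, {pa.get_device_count()} audio devices found")
    for i in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(i)
        if info.get("maxInputChannels", 0) > 0:
            print(f"  input device {i}: {info['name']}")
finally:
    pa.terminate()
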
My pyproject.toml:

[project]
name = "code"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"whisper-live>=0.7.1",
]
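
For completeness, a quick way to check that the uv environment actually resolved the dependency (just my own sketch, assuming it is run inside the project, e.g. with uv run):

# Minimal environment check, e.g. `uv run python check_env.py` (check_env.py is just an example name).
from importlib.metadata import version
from whisper_live.client import TranscriptionClient  # the class used in the test script below

print("whisper-live version:", version("whisper-live"))
print("client import OK:", TranscriptionClient.__name__)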

I wrote this script to test it:

from whisper_live.client import TranscriptionClient

# This is the client that will connect to your running server
client = TranscriptionClient(
  "localhost",        # Hostname of your server
  9090,               # Port your server is running on
  lang="en",          # Language of the audio
  translate=False,    # Transcribe only, do not translate
  model="whisper_large-v3_float16",   # Sent to the server, but the server uses the model it was started with
  use_vad=False,      # Use voice activity detection
)

print("Client initialized, sending audio...")

# This calls the server with your audio file
# Make sure "test.mp3" exists in the same directory
try:
    client("test.mp3")
except FileNotFoundError:
    print("Error: The file 'test.mp3' was not found.")
    print("Please make sure the audio file is in the same directory as this script.")
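
Side note: before suspecting the client, a quick reachability check can confirm that something is listening on the server port at all. A minimal standard-library sketch (host and port match the server command above; it does not verify that the TensorRT engine finished warming up):

# Plain TCP reachability check for the WhisperLive server port (a sketch, not part of the repo).
import socket

try:
    with socket.create_connection(("localhost", 9090), timeout=5):
        print("Server port 9090 is reachable")
except OSError as exc:
    print(f"Could not reach the server on port 9090: {exc}")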

I also tried changing the model to large-v3, but the same thing happened.
When I run this test, I get the following on the server (it is very hard to capture, since it spams the last line indefinitely and then crashes):

python3 run_server.py --port 9090 --backend tensorrt --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16" --trt_multilingual --max_clients 1 --max_connection_time 600

/usr/local/lib/python3.10/dist-packages/torch/cuda/init.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO:root:Custom model option was provided. Switching to single model mode.
INFO:websockets.server:connection open
INFO:root:New client connected
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.18.2
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Setting encoder max input length and hidden size for accepting visual features.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.18.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 3000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (3000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2999 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 3000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 1228 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 205.08 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 1218 (MiB)
[TensorRT-LLM][INFO] TRTEncoderModel mMaxInputLen: reset to 3000 from build config.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][WARNING] Fix optionalParams : KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 225
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (225) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 900
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 224 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 2065 MiB
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 120.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3274 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 29.73 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.18 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 47.71 GiB, available: 38.55 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 3554
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.35 GiB for max tokens in paged KV cache (113728).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 17.35 GiB for max tokens in paged KV cache (113728).
[TensorRT-LLM][INFO] This is an Encoder-Decoder model, set 0.5 cross KV cache fraction based on the config.
[TensorRT-LLM][INFO] Number of blocks in self KV cache primary pool: 1777, in cross KV cache primary pool: 1777
[TensorRT-LLM][INFO] Number of blocks in self KV cache secondary pool: 0, in cross KV cache secondary pool: 0
INFO:root:[INFO:] Warming up TensorRT engine..
^CTraceback (most recent call last):
  File "/app/run_server.py", line 57, in <module>
    server.run(
  File "/app/whisper_live/server.py", line 441, in run
    server.serve_forever()
  File "/usr/local/lib/python3.10/dist-packages/websockets/sync/server.py", line 275, in serve_forever
    poller.select()
  File "/usr/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
[TensorRT-LLM][WARNING] Default padding attention mask will be used as not all requests have cross attention mask.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the request. Default padding attention mask will be created.
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
^CException ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1567, in _shutdown
    lock.acquire()
KeyboardInterrupt:
[TensorRT-LLM][WARNING] CrossAttentionMask is not provided for the generation request. Full valid attentionMask will be used by default.
... (the warning above repeats indefinitely)

To fix the crash, I found that adding --trt_py_session works:

python3 run_server.py \
  --port 9090 \
  --backend tensorrt \
  --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_large-v3_float16" \
  --trt_multilingual \
  --max_clients 1 \
  --max_connection_time 600 \
  --trt_py_session
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO:root:Custom model option was provided. Switching to single model mode.
INFO:websockets.server:connection open
INFO:root:New client connected
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[TensorRT-LLM] TensorRT-LLM version: 0.18.2
INFO:root:[INFO:] Warming up TensorRT engine..
/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:228: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
INFO:root:Running TensorRT backend.
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.512
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.512
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.768
INFO:root:[WhisperTensorRT:] Processing audio with duration: 0.768
...
INFO:root:[WhisperTensorRT:] Processing audio with duration: 2.41
INFO:root:Cleaning up.
ERROR:root:[ERROR]: Sending data to client: sent 1000 (OK); then received 1000 (OK)
INFO:root:Exiting speech to text thread

Output of my test.py:

 For instance, if I am recalling an incident very
vividly I go back to the instant of its occurrence. I become
absent-minded, as you say. I jump back for a moment.
[ERROR] WebSocket Error: fin=1 opcode=8 data=b'\x03\xe8'
[INFO]: Websocket connection closed: None: None
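
For what it's worth, the opcode=8 frame in that error is just a WebSocket close frame, and its two payload bytes decode to close code 1000 (normal closure), which matches the server's "sent 1000 (OK); then received 1000 (OK)" line:

# Decode the close-frame payload from the client error message.
# Opcode 8 is a WebSocket close frame; the first two payload bytes are the close code.
payload = b"\x03\xe8"
print(int.from_bytes(payload[:2], "big"))  # -> 1000, i.e. normal closure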

I would be curious to learn how this is supposed to work, since there are a lot of errors, the C++ session doesn't work, and there are tons of deprecation warnings throughout the whole process.
