
Conversation

@jhj0517 (Owner) commented Jan 7, 2025

Related issues / PRs. Summarize issues.

Summarize Changes

  1. Enable trigger_mode="multiple" for the buttons
  2. Add default_concurrency_limit and max_size as CLI args when running app.py
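
For reference, a minimal sketch of what these two changes look like in Gradio terms (illustrative wiring only, not the actual diff):

import argparse
import gradio as gr

parser = argparse.ArgumentParser()
parser.add_argument("--default_concurrency_limit", type=int, default=1)
parser.add_argument("--max_size", type=int, default=None)
args = parser.parse_args()

with gr.Blocks() as demo:
    btn = gr.Button("GENERATE SUBTITLE FILE")
    output = gr.Textbox()
    # trigger_mode="multiple" lets every click enqueue a job instead of ignoring
    # clicks made while a previous run is still in progress
    btn.click(fn=lambda: "done", outputs=output, trigger_mode="multiple")

# Queue settings exposed as CLI args
demo.queue(
    default_concurrency_limit=args.default_concurrency_limit,
    max_size=args.max_size,
).launch()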

@jhj0517 jhj0517 added the enhancement New feature or request label Jan 7, 2025
@chboishabba commented Jun 6, 2025

Hi @jhj0517,

Thanks for working on this PR to enable queueing for the buttons! This is a highly anticipated feature that could significantly improve workflow and resource management.

As you're implementing queueing, I wanted to raise a related point that would greatly enhance its utility, especially for users with varied hardware configurations like mine (low VRAM GPU).

My primary need, especially when processing multiple files in a queue, is the ability to specify the compute device (GPU/CPU), the specific model, and even the transcription engine (e.g., openai/whisper, SYSTRAN/faster-whisper, whisperX) on a per-file basis within the queue. It would also be great to add WhisperX as an engine and label the API params 😄

Context and Justification:

  • VRAM Constraints & Performance Quirks: On my low-VRAM GPU, I've observed issues, particularly with diarization. Offloading diarization to the CPU is often necessary.
  • Preventing GPU Idling: If the current queueing model processes transcribe + diarize as a single, sequential task per file, the GPU will likely sit idle waiting for the slower CPU-bound diarization to complete for each file. This negates the benefit of a GPU for transcription.
  • Leveraging Different Engines/Models: As shown in comparisons (e.g., this comment on robertrosenbusch/gfx803_rocm/issues/26#issuecomment-2907010838 where FasterWhisper shows high VRAM efficiency and WhisperX excels in accuracy/latency), different engines and models have distinct performance characteristics. Being able to choose them per-file allows for optimal trade-offs.

Desired Queueing Workflow (with per-file control):

Ideally, the queue would allow me to submit multiple files, each with its own specified transcription engine, model, and device assignments for its components (e.g., GPU for transcription, CPU for diarization). The system would then:

  1. Keep the GPU continuously busy by queuing and processing transcription tasks for all files as quickly as possible.
  2. Independently, as each file's transcription completes, its CPU-bound diarization task would be initiated in parallel, without blocking the GPU's progress on the next transcription task in the queue.

This concurrent execution capability, alongside per-file control over device, model, and engine, would ensure maximum resource utilisation and flexibility.
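
Purely to illustrate the kind of scheduling I mean (not based on the current code; transcribe_on_gpu and diarize_on_cpu are hypothetical stand-ins):

from concurrent.futures import ThreadPoolExecutor

def transcribe_on_gpu(path, engine, model):
    """Hypothetical: run the chosen engine/model on CUDA and return a transcript."""
    ...

def diarize_on_cpu(path, transcript):
    """Hypothetical: run diarization on CPU for a finished transcript."""
    ...

files = [
    {"path": "a.wav", "engine": "faster-whisper", "model": "large-v2"},
    {"path": "b.wav", "engine": "whisperx", "model": "medium"},
]

gpu_pool = ThreadPoolExecutor(max_workers=1)  # serialize GPU transcription
cpu_pool = ThreadPoolExecutor(max_workers=4)  # diarization jobs may overlap

for f in files:
    future = gpu_pool.submit(transcribe_on_gpu, f["path"], f["engine"], f["model"])
    # As soon as a transcript is ready, hand diarization to the CPU pool
    # without blocking the next GPU job in the queue.
    future.add_done_callback(
        lambda done, f=f: cpu_pool.submit(diarize_on_cpu, f["path"], done.result())
    )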

Is this something that could be considered as an extension to the queueing functionality being introduced here, or perhaps in a subsequent iteration?

robertrosenbusch/gfx803_rocm#26 (comment)

#560
https://www.reddit.com/r/LocalLLaMA/comments/1brqwun/i_compared_the_different_open_source_whisper/

@chboishabba

Title

Robust API Parameter Mapping: Support for Named Parameters in API as Well as Positional


Context

Currently, the API for /transcribe_file and similar endpoints in Whisper-WebUI relies on a fixed positional parameter order (e.g., param_7, param_8, ...), which makes it brittle to UI changes and harder to automate reliably. This has led to confusion and difficulty in scripting, especially as users try to automate batch jobs or adapt to updated versions (see #561, #560, and related PR #452). It is also relevant for interop with other projects, such as gfx803_rocm#26.


Proposal

  • Support API calls using named parameters (dict) as well as positional (list):

    • Allow the API to accept requests where arguments are passed as {label: value} pairs, matching the UI labels (or a documented stable internal key).
    • Maintain backward compatibility with positional arguments.
    • Perform a check: if a named/dict parameter is provided, map it to the correct internal field, regardless of order.
    • If both are present, give precedence to named parameters.
    • Document the mapping between parameter names, UI labels, and internal fields.
  • Benefits:

    • More robust and future-proof scripting/automation.
    • Scripts can be written using stable names and won't break due to order changes.
    • Easier to debug, maintain, and extend API usage.
    • Users can check which params are valid by querying the API or consulting documentation.

Implementation Sketch

  1. Backend:
    • In the API endpoint, detect if incoming params are a dict (named) or list (positional).
    • If dict, map keys to the correct internal fields (accept both UI label and internal field name, if possible).
    • Validate each param: if a name is not recognized, return a helpful error.
    • For list, continue current behavior (see the sketch after this list).
  2. Client:
    • Update API docs/examples to show both usage patterns.
    • Optionally provide a helper to fetch valid parameter names/types.
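
A rough sketch of the backend detection/mapping step (the LABEL_TO_FIELD table and field names below are placeholders, not the project's actual internals):

# Hypothetical mapping from UI labels / stable keys to internal field names.
LABEL_TO_FIELD = {
    "Language": "language",
    "Translate to English?": "is_translate",
    "Beam Size": "beam_size",
}

def normalize_params(params, positional_order):
    """Accept either a positional list or a {label: value} dict of API params."""
    if isinstance(params, dict):
        normalized = {}
        for key, value in params.items():
            # Accept both the UI label and the internal field name.
            field = LABEL_TO_FIELD.get(key, key if key in positional_order else None)
            if field is None:
                raise ValueError(f"Unknown parameter: {key!r}")
            normalized[field] = value
        return normalized
    # list/tuple: keep the current positional behavior unchanged
    return dict(zip(positional_order, params))

Here positional_order would be the list of internal field names in the current to_list() order.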

Example

# Current (positional):
result = client.predict(
    param_7="english",
    param_8=False,
    ...
    api_name="/transcribe_file"
)

# Proposed (named):
result = client.predict(
    Language="english",
    Translate_to_English=False,
    ...
    api_name="/transcribe_file"
)

Related Issues/PRs

  • #561
  • #560
  • #452 (PR)
  • robertrosenbusch/gfx803_rocm#26

Thanks!

Thanks for your awesome work and for considering this improvement! This would make scripting and API interop much more user-friendly and robust.

@chboishabba

Let’s break down your questions and give you actionable guidance for each, referencing how the codebase works and how you can map/label parameters for the API.


1. Where is it possible to specify the transcription device (not diarization or anything else)?

The transcription device is the device used for the main Whisper transcription model (not VAD, diarization, or music separation). In your API, this is the parameter associated with "Device" for transcription (usually "cuda" or "cpu").

In the code:

  • Check the WhisperParams class in modules/whisper/data_classes.py.
  • The relevant field is likely named something similar to device or whisper_device.
  • The UI entry is created in the to_gradio_inputs method (or similar), where you’ll find a Gradio Dropdown for "Device".

Example from your API mapping:

  • param_45 → "Device" Dropdown (values: 'cpu', 'cuda').

In the code, this looks like:

Python
gr.Dropdown(
    label=_("Device"),
    choices=["cpu", "cuda"],  # or possibly more, depending on hardware
    value=defaults.get("device", cls.__fields__["device"].default),
)

Summary:
For transcription, "Device" (param_45 in your API) is set by the device field in WhisperParams.
Other similar device fields (e.g., param_50) are for separate modules like UVR (music separation).


2. How can I enumerate the commands in a similar way to to_list() so I might find the values I need?

The parameters are ordered and flattened into a list for API compatibility because Gradio expects them that way. The order of this list determines the mapping of param_7, param_8, etc.

How to enumerate:

  • Look at the to_list() methods in modules/whisper/data_classes.py for WhisperParams, VadParams, DiarizationParams, and BGMSeparationParams.
  • Each to_list() method returns the parameters in the exact order they appear in the API.
  • To get the full order, check TranscriptionPipelineParams.to_list() which concatenates the lists from each sub-params class.

Example:

Python
def to_list(self) -> List:
    whisper_list = self.whisper.to_list()
    vad_list = self.vad.to_list()
    diarization_list = self.diarization.to_list()
    bgm_sep_list = self.bgm_separation.to_list()
    return whisper_list + vad_list + diarization_list + bgm_sep_list

Actionable step:

  • Add print/debug statements or log the output of to_list() in the backend to see the mapping live.
  • Or, read the code line by line and enumerate:
    • param_7 = first item in WhisperParams.to_list()
    • param_8 = second item, etc.
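
For instance, assuming a populated TranscriptionPipelineParams instance called params, a throwaway snippet like this prints the live mapping (the offset of 7 matches the examples above; adjust it to your deployment):

Python
OFFSET = 7  # slot number of the first Whisper param in your API docs
for i, value in enumerate(params.to_list()):
    print(f"param_{OFFSET + i} = {value!r}")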

3. How to map the param labels as in the API page?

You want:

  • A mapping between positional parameters (param_7, param_8, ...) and the friendly names/labels as shown in the API documentation.

How to generate this mapping:

  • In each to_gradio_inputs() or to_list() method, parameters are added in a specific order, and each Gradio component has a label.
  • To create a mapping, enumerate each parameter as it's added to the list and pair it with its label.

Example enumeration logic (pseudo-code):

Python
# In WhisperParams.to_list()
param_list = [
    self.language,     # param_7 ("Language")
    self.is_translate, # param_8 ("Translate to English?")
    self.beam_size,    # param_9 ("Beam Size")
    # ... and so on
]
  • Cross-reference each list position with the label in the corresponding to_gradio_inputs() method.

Automated mapping suggestion:

  • Modify the backend code (or write a helper script) to output a dict or list of {param_number: label, default, type} for documentation and debugging.

TL;DR Table

API param | Purpose/Label              | Code location / param
param_45  | Device (for transcription) | WhisperParams.device
param_50  | Device (for UVR)           | BGMSeparationParams.uvr_device
...       | ...                        | ...

Example: How to Patch for Better API Labeling

To help with labeling in API requests (and not just param_x), you could:
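
For example (a sketch only; the to_label_mapping helper and the gradio_labels() accessors it assumes do not exist in the codebase today):

Python
# Hypothetical helper on TranscriptionPipelineParams: pair each positional slot
# with its UI label so requests can be logged as {label: value}.
def to_label_mapping(self, offset: int = 7) -> dict:
    labels = (
        self.whisper.gradio_labels()  # e.g. ["Language", "Translate to English?", ...]
        + self.vad.gradio_labels()
        + self.diarization.gradio_labels()
        + self.bgm_separation.gradio_labels()
    )
    values = self.to_list()
    return {
        f"param_{offset + i} ({label})": value
        for i, (label, value) in enumerate(zip(labels, values))
    }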

This will let you output or log a human-readable mapping for every API request!


Summary of Steps:

  1. The transcription device is set in WhisperParams and maps to param_45.
  2. The order in to_list() determines the API parameter order; enumerate them by reading or logging these methods.
  3. To map labels, walk through the list construction and pair each slot with its UI label (from to_gradio_inputs).
  4. For your PR, consider adding a function that exposes {label: value} mapping for API clarity.

Let me know if you want a ready-to-merge file patch or code snippet for any of these enhancements!

@chboishabba commented Jun 6, 2025

Feature Request: Add WhisperX Support

Summary

Currently, Whisper-WebUI supports multiple Whisper implementations: openai/whisper, faster-whisper, and insanely-fast-whisper.

Proposal:
Please add support for WhisperX—a high-performance transcription library with advanced features like word-level timestamps, alignment, and speaker diarization.



Why WhisperX?

  • Speed and efficiency: WhisperX provides batched inference and supports multiple hardware backends.
  • Advanced features: It offers alignment, diarization, and phoneme-level output.
  • Wider coverage: Makes Whisper-WebUI more useful for users needing diarization and word-level accuracy.

Implementation Plan & File Pointers

1. Add WhisperX as an available backend

  • File: modules/whisper/whisper_impl.py
    • Add a new enum or string for "whisperx" to the list of available backends.
    • Example:
      class WhisperImpl(str, Enum):
          WHISPER = "whisper"
          FASTER_WHISPER = "faster-whisper"
          INSANELY_FAST_WHISPER = "insanely-fast-whisper"
          WHISPERX = "whisperx"  # <-- add this

2. Implement a new wrapper for WhisperX

  • File: Create modules/whisper/whisperx_impl.py
    • This file should define a class (e.g., WhisperXImplementation) that wraps the WhisperX Python API.
    • The interface should match existing backends (transcribe, load_model, etc).
    • Example implementation:
      import whisperx
      
      class WhisperXImplementation(BaseWhisperImplementation):
          def __init__(self, model_name, device, compute_type="float16", **kwargs):
              self.model = whisperx.load_model(model_name, device, compute_type=compute_type)
              self.device = device
      
          def transcribe(self, audio, **kwargs):
              # audio: numpy array, or path
              # Pop wrapper-level flags so they aren't forwarded to WhisperX's transcribe()
              do_align = kwargs.pop("do_align", True)
              result = self.model.transcribe(audio, **kwargs)
              # Alignment (optional)
              if do_align:
                  model_a, metadata = whisperx.load_align_model(
                      language_code=result["language"], device=self.device)
                  result = whisperx.align(
                      result["segments"], model_a, metadata, audio, self.device)
              return result
    • You might also want to add methods for diarization and speaker assignment.

3. Wire up the backend in the factory/selection logic

  • File: modules/whisper/whisper_factory.py (or similar)
    • Add logic to instantiate WhisperXImplementation when "whisperx" is selected.
    • Example:
      if impl == WhisperImpl.WHISPERX:
          from modules.whisper.whisperx_impl import WhisperXImplementation
          return WhisperXImplementation(model_name, device, compute_type, ...)

4. Expose WhisperX-specific options in the UI (optional)

  • File: app.py and/or modules/whisper/data_classes.py
    • Add options for alignment, diarization, etc., to the UI parameter set.
    • Ensure parameter mapping covers new features, or disables options as appropriate.
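    • Example (illustrative only; the component names and labels below are mine, not existing code):
      import gradio as gr

      # Possible WhisperX-specific inputs, wired into the params dataclasses the
      # same way the existing fields are handled:
      do_align = gr.Checkbox(label="Word-level alignment (WhisperX)", value=True)
      do_diarization = gr.Checkbox(label="Speaker diarization (WhisperX)", value=False)
      batch_size = gr.Number(label="WhisperX batch size", value=16, precision=0)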

5. Update requirements

  • File: requirements.txt
    • Add whisperx (and optionally torch version compatible with your backend).

6. Update documentation

  • File: README.md
    • Add WhisperX to the list of supported backends and document any unique options.

Example: Integrating WhisperX in the Backend

import whisperx

class WhisperXImplementation:
    def __init__(self, model_name, device, compute_type="float16", **kwargs):
        self.model = whisperx.load_model(model_name, device, compute_type=compute_type)
        self.device = device

    def transcribe(self, audio, **kwargs):
        # Pop wrapper-level flags so they aren't forwarded to WhisperX's transcribe()
        do_align = kwargs.pop("do_align", True)
        do_diarization = kwargs.pop("do_diarization", False)
        result = self.model.transcribe(audio, **kwargs)
        # Alignment
        if do_align:
            model_a, metadata = whisperx.load_align_model(
                language_code=result["language"], device=self.device)
            result = whisperx.align(
                result["segments"], model_a, metadata, audio, self.device)
        # Diarization (optional)
        if do_diarization:
            diarize_model = whisperx.diarize.DiarizationPipeline(device=self.device)
            diarize_segments = diarize_model(audio)
            result = whisperx.assign_word_speakers(diarize_segments, result)
        return result
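
And a hypothetical usage of the class above (file path and model name are arbitrary):

impl = WhisperXImplementation("large-v2", device="cuda")
result = impl.transcribe("audio.wav", do_align=True)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])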

Summary Table

Step | File(s) to Modify/Create                | Purpose
1    | modules/whisper/whisper_impl.py         | Add WhisperX to backend enum/list
2    | modules/whisper/whisperx_impl.py (new)  | Implement WhisperX backend wrapper
3    | modules/whisper/whisper_factory.py      | Instantiate WhisperX backend
4    | app.py, modules/whisper/data_classes.py | Expose WhisperX options in UI
5    | requirements.txt                        | Add whisperx dependency
6    | README.md                               | Update documentation

Let me know if you’d like code snippets for other files or more detail on UI integration!
