
Conversation

@jhj0517 (Owner) commented Jan 7, 2025

Related issues / PRs. Summarize issues.

Summarize Changes

  1. Enable trigger_mode="multiple" for the buttons
  2. Add default_concurrency_limit and max_size as CLI args when running app.py
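
For reference, a minimal sketch of what these two changes look like in Gradio terms (illustrative wiring only, not the actual diff):

import argparse
import gradio as gr

parser = argparse.ArgumentParser()
parser.add_argument("--default_concurrency_limit", type=int, default=1)
parser.add_argument("--max_size", type=int, default=None)
args = parser.parse_args()

with gr.Blocks() as demo:
    btn = gr.Button("GENERATE SUBTITLE FILE")
    output = gr.Textbox()
    # trigger_mode="multiple" lets every click enqueue a job instead of ignoring
    # clicks made while a previous run is still in progress
    btn.click(fn=lambda: "done", outputs=output, trigger_mode="multiple")

# Queue settings exposed as CLI args
demo.queue(
    default_concurrency_limit=args.default_concurrency_limit,
    max_size=args.max_size,
).launch()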

@jhj0517 jhj0517 added the enhancement New feature or request label Jan 7, 2025
@chboishabba commented Jun 6, 2025

Hi @jhj0517,

Thanks for working on this PR to enable queueing for the buttons! This is a highly anticipated feature that could significantly improve workflow and resource management.

As you're implementing queueing, I wanted to raise a related point that would greatly enhance its utility, especially for users with varied hardware configurations like mine (low VRAM GPU).

My primary need, especially when processing multiple files in a queue, is the ability to specify the compute device (GPU/CPU), the specific model, and even the transcription engine (e.g., openai/whisper, SYSTRAN/faster-whisper, whisperX) on a per-file basis within the queue. It would also be great to add WhisperX as an engine and label the API params 😄

Context and Justification:

  • VRAM Constraints & Performance Quirks: On my low-VRAM GPU, I've observed issues, particularly with diarization. Offloading diarization to the CPU is often necessary.
  • Preventing GPU Idling: If the current queueing model processes transcribe + diarize as a single, sequential task per file, the GPU will likely sit idle waiting for the slower CPU-bound diarization to complete for each file. This negates the benefit of a GPU for transcription.
  • Leveraging Different Engines/Models: As shown in comparisons (e.g., this comment on robertrosenbusch/gfx803_rocm/issues/26#issuecomment-2907010838 where FasterWhisper shows high VRAM efficiency and WhisperX excels in accuracy/latency), different engines and models have distinct performance characteristics. Being able to choose them per-file allows for optimal trade-offs.

Desired Queueing Workflow (with per-file control):

Ideally, the queue would allow me to submit multiple files, each with its own specified transcription engine, model, and device assignments for its components (e.g., GPU for transcription, CPU for diarization). The system would then:

  1. Keep the GPU continuously busy by queuing and processing transcription tasks for all files as quickly as possible.
  2. Independently, as each file's transcription completes, its CPU-bound diarization task would be initiated in parallel, without blocking the GPU's progress on the next transcription task in the queue.

This concurrent execution capability, alongside per-file control over device, model, and engine, would ensure maximum resource utilisation and flexibility.
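
Purely to illustrate the kind of scheduling I mean (not based on the current code; transcribe_on_gpu and diarize_on_cpu are hypothetical stand-ins):

from concurrent.futures import ThreadPoolExecutor

def transcribe_on_gpu(path, engine, model):
    """Hypothetical: run the chosen engine/model on CUDA and return a transcript."""
    ...

def diarize_on_cpu(path, transcript):
    """Hypothetical: run diarization on CPU for a finished transcript."""
    ...

files = [
    {"path": "a.wav", "engine": "faster-whisper", "model": "large-v2"},
    {"path": "b.wav", "engine": "whisperx", "model": "medium"},
]

gpu_pool = ThreadPoolExecutor(max_workers=1)  # serialize GPU transcription
cpu_pool = ThreadPoolExecutor(max_workers=4)  # diarization jobs may overlap

for f in files:
    future = gpu_pool.submit(transcribe_on_gpu, f["path"], f["engine"], f["model"])
    # As soon as a transcript is ready, hand diarization to the CPU pool
    # without blocking the next GPU job in the queue.
    future.add_done_callback(
        lambda done, f=f: cpu_pool.submit(diarize_on_cpu, f["path"], done.result())
    )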

Is this something that could be considered as an extension to the queueing functionality being introduced here, or perhaps in a subsequent iteration?

robertrosenbusch/gfx803_rocm#26 (comment)

#560
https://www.reddit.com/r/LocalLLaMA/comments/1brqwun/i_compared_the_different_open_source_whisper/

@chboishabba

Title

Robust API Parameter Mapping: Support for Named Parameters in API as Well as Positional


Context

Currently, the API for /transcribe_file and similar endpoints in Whisper-WebUI relies on a fixed positional parameter order (e.g., param_7, param_8, ...), which makes it brittle to UI changes and harder to automate reliably. This has led to confusion and difficulty in scripting, especially as users try to automate batch jobs or adapt to updated versions (see #561, #560, and related PR #452). It is also relevant for interop with other projects, such as gfx803_rocm#26.


Proposal

  • Support API calls using named parameters (dict) as well as positional (list):

    • Allow the API to accept requests where arguments are passed as {label: value} pairs, matching the UI labels (or a documented stable internal key).
    • Maintain backward compatibility with positional arguments.
    • Perform a check: if a named/dict parameter is provided, map it to the correct internal field, regardless of order.
    • If both are present, give precedence to named parameters.
    • Document the mapping between parameter names, UI labels, and internal fields.
  • Benefits:

    • More robust and future-proof scripting/automation.
    • Scripts can be written using stable names and won't break due to order changes.
    • Easier to debug, maintain, and extend API usage.
    • Users can check which params are valid by querying the API or consulting documentation.

Implementation Sketch

  1. Backend:
    • In the API endpoint, detect if incoming params are a dict (named) or list (positional).
    • If dict, map keys to the correct internal fields (accept both UI label and internal field name, if possible).
    • Validate each param: if a name is not recognized, return a helpful error.
    • For list, continue current behavior (see the sketch after this list).
  2. Client:
    • Update API docs/examples to show both usage patterns.
    • Optionally provide a helper to fetch valid parameter names/types.
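
A rough sketch of the backend detection/mapping step (the LABEL_TO_FIELD table and field names below are placeholders, not the project's actual internals):

# Hypothetical mapping from UI labels / stable keys to internal field names.
LABEL_TO_FIELD = {
    "Language": "language",
    "Translate to English?": "is_translate",
    "Beam Size": "beam_size",
}

def normalize_params(params, positional_order):
    """Accept either a positional list or a {label: value} dict of API params."""
    if isinstance(params, dict):
        normalized = {}
        for key, value in params.items():
            # Accept both the UI label and the internal field name.
            field = LABEL_TO_FIELD.get(key, key if key in positional_order else None)
            if field is None:
                raise ValueError(f"Unknown parameter: {key!r}")
            normalized[field] = value
        return normalized
    # list/tuple: keep the current positional behavior unchanged
    return dict(zip(positional_order, params))

Here positional_order would be the list of internal field names in the current to_list() order.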

Example

# Current (positional):
result = client.predict(
    param_7="english",
    param_8=False,
    ...
    api_name="/transcribe_file"
)

# Proposed (named):
result = client.predict(
    Language="english",
    Translate_to_English=False,
    ...
    api_name="/transcribe_file"
)

Related Issues/PRs

  • #561
  • #560
  • #452 (PR)
  • robertrosenbusch/gfx803_rocm#26

Thanks!

Thanks for your awesome work and for considering this improvement! This would make scripting and API interop much more user-friendly and robust.

@chboishabba

Let’s break down your questions and give you actionable guidance for each, referencing how the codebase works and how you can map/label parameters for the API.


1. Where is it possible to specify the transcription device (not diarization or anything else)?

The transcription device is the device used for the main Whisper transcription model (not VAD, diarization, or music separation). In your API, this is the parameter associated with "Device" for transcription (usually "cuda" or "cpu").

In the code:

  • Check the WhisperParams class in modules/whisper/data_classes.py.
  • The relevant field is likely named something similar to device or whisper_device.
  • The UI entry is created in the to_gradio_inputs method (or similar), where you’ll find a Gradio Dropdown for "Device".

Example from your API mapping:

  • param_45 → "Device" Dropdown (values: 'cpu', 'cuda').

In the code, this looks like:

Python
gr.Dropdown(
    label=_("Device"),
    choices=["cpu", "cuda"],  # or possibly more, depending on hardware
    value=defaults.get("device", cls.__fields__["device"].default),
)

Summary:
For transcription, "Device" (param_45 in your API) is set by the device field in WhisperParams.
Other similar device fields (e.g., param_50) are for separate modules like UVR (music separation).


2. How can I enumerate the commands in a similar way to to_list() so I might find the values I need?

The parameters are ordered and flattened into a list for API compatibility because Gradio expects them that way. The order of this list determines the mapping of param_7, param_8, etc.

How to enumerate:

  • Look at the to_list() methods in modules/whisper/data_classes.py for WhisperParams, VadParams, DiarizationParams, and BGMSeparationParams.
  • Each to_list() method returns the parameters in the exact order they appear in the API.
  • To get the full order, check TranscriptionPipelineParams.to_list() which concatenates the lists from each sub-params class.

Example:

Python
def to_list(self) -> List:
    whisper_list = self.whisper.to_list()
    vad_list = self.vad.to_list()
    diarization_list = self.diarization.to_list()
    bgm_sep_list = self.bgm_separation.to_list()
    return whisper_list + vad_list + diarization_list + bgm_sep_list

Actionable step:

  • Add print/debug statements or log the output of to_list() in the backend to see the mapping live.
  • Or, read the code line by line and enumerate:
    • param_7 = first item in WhisperParams.to_list()
    • param_8 = second item, etc.
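
For instance, assuming a populated TranscriptionPipelineParams instance called params, a throwaway snippet like this prints the live mapping (the offset of 7 matches the examples above; adjust it to your deployment):

Python
OFFSET = 7  # slot number of the first Whisper param in your API docs
for i, value in enumerate(params.to_list()):
    print(f"param_{OFFSET + i} = {value!r}")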

3. How to map the param labels as in the API page?

You want:

  • A mapping between positional parameters (param_7, param_8, ...) and the friendly names/labels as shown in the API documentation.

How to generate this mapping:

  • In each to_gradio_inputs() or to_list() method, parameters are added in a specific order, and each Gradio component has a label.
  • To create a mapping, enumerate each parameter as it's added to the list and pair it with its label.

Example enumeration logic (pseudo-code):

Python
# In WhisperParams.to_list()
param_list = [
    self.language,     # param_7 ("Language")
    self.is_translate, # param_8 ("Translate to English?")
    self.beam_size,    # param_9 ("Beam Size")
    # ... and so on
]
  • Cross-reference each list position with the label in the corresponding to_gradio_inputs() method.

Automated mapping suggestion:

  • Modify the backend code (or write a helper script) to output a dict or list of {param_number: label, default, type} for documentation and debugging.

TL;DR Table

API param | Purpose/Label              | Code location / param
param_45  | Device (for transcription) | WhisperParams.device
param_50  | Device (for UVR)           | BGMSeparationParams.uvr_device
...       | ...                        | ...

Example: How to Patch for Better API Labeling

To help with labeling in API requests (and not just param_x), you could:
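
For example (a sketch only; the to_label_mapping helper and the gradio_labels() accessors it assumes do not exist in the codebase today):

Python
# Hypothetical helper on TranscriptionPipelineParams: pair each positional slot
# with its UI label so requests can be logged as {label: value}.
def to_label_mapping(self, offset: int = 7) -> dict:
    labels = (
        self.whisper.gradio_labels()  # e.g. ["Language", "Translate to English?", ...]
        + self.vad.gradio_labels()
        + self.diarization.gradio_labels()
        + self.bgm_separation.gradio_labels()
    )
    values = self.to_list()
    return {
        f"param_{offset + i} ({label})": value
        for i, (label, value) in enumerate(zip(labels, values))
    }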

This will let you output or log a human-readable mapping for every API request!


Summary of Steps:

  1. The transcription device is set in WhisperParams and maps to param_45.
  2. The order in to_list() determines the API parameter order; enumerate them by reading or logging these methods.
  3. To map labels, walk through the list construction and pair each slot with its UI label (from to_gradio_inputs).
  4. For your PR, consider adding a function that exposes {label: value} mapping for API clarity.

Let me know if you want a ready-to-merge file patch or code snippet for any of these enhancements!

@chboishabba commented Jun 6, 2025

Feature Request: Add WhisperX Support

Summary

Currently, Whisper-WebUI supports multiple Whisper implementations: openai/whisper, faster-whisper, and insanely-fast-whisper.

Proposal:
Please add support for WhisperX—a high-performance transcription library with advanced features like word-level timestamps, alignment, and speaker diarization.



Why WhisperX?

  • Speed and efficiency: WhisperX provides batched inference and supports multiple hardware backends.
  • Advanced features: It offers alignment, diarization, and phoneme-level output.
  • Wider coverage: Makes Whisper-WebUI more useful for users needing diarization and word-level accuracy.

Implementation Plan & File Pointers

1. Add WhisperX as an available backend

  • File: modules/whisper/whisper_impl.py
    • Add a new enum or string for "whisperx" to the list of available backends.
    • Example:
      class WhisperImpl(str, Enum):
          WHISPER = "whisper"
          FASTER_WHISPER = "faster-whisper"
          INSANELY_FAST_WHISPER = "insanely-fast-whisper"
          WHISPERX = "whisperx"  # <-- add this

2. Implement a new wrapper for WhisperX

  • File: Create modules/whisper/whisperx_impl.py
    • This file should define a class (e.g., WhisperXImplementation) that wraps the WhisperX Python API.
    • The interface should match existing backends (transcribe, load_model, etc).
    • Example implementation:
      import whisperx
      
      class WhisperXImplementation(BaseWhisperImplementation):
          def __init__(self, model_name, device, compute_type="float16", **kwargs):
              self.model = whisperx.load_model(model_name, device, compute_type=compute_type)
              self.device = device
      
          def transcribe(self, audio, **kwargs):
              # audio: numpy array, or path
              # Pop wrapper-level flags so they aren't forwarded to WhisperX's transcribe()
              do_align = kwargs.pop("do_align", True)
              result = self.model.transcribe(audio, **kwargs)
              # Alignment (optional)
              if do_align:
                  model_a, metadata = whisperx.load_align_model(
                      language_code=result["language"], device=self.device)
                  result = whisperx.align(
                      result["segments"], model_a, metadata, audio, self.device)
              return result
    • You might also want to add methods for diarization and speaker assignment.

3. Wire up the backend in the factory/selection logic

  • File: modules/whisper/whisper_factory.py (or similar)
    • Add logic to instantiate WhisperXImplementation when "whisperx" is selected.
    • Example:
      if impl == WhisperImpl.WHISPERX:
          from modules.whisper.whisperx_impl import WhisperXImplementation
          return WhisperXImplementation(model_name, device, compute_type, ...)

4. Expose WhisperX-specific options in the UI (optional)

  • File: app.py and/or modules/whisper/data_classes.py
    • Add options for alignment, diarization, etc., to the UI parameter set.
    • Ensure parameter mapping covers new features, or disables options as appropriate.
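    • Example (illustrative only; the component names and labels below are mine, not existing code):
      import gradio as gr

      # Possible WhisperX-specific inputs, wired into the params dataclasses the
      # same way the existing fields are handled:
      do_align = gr.Checkbox(label="Word-level alignment (WhisperX)", value=True)
      do_diarization = gr.Checkbox(label="Speaker diarization (WhisperX)", value=False)
      batch_size = gr.Number(label="WhisperX batch size", value=16, precision=0)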

5. Update requirements

  • File: requirements.txt
    • Add whisperx (and optionally torch version compatible with your backend).

6. Update documentation

  • File: README.md
    • Add WhisperX to the list of supported backends and document any unique options.

Example: Integrating WhisperX in the Backend

import whisperx

class WhisperXImplementation:
    def __init__(self, model_name, device, compute_type="float16", **kwargs):
        self.model = whisperx.load_model(model_name, device, compute_type=compute_type)
        self.device = device

    def transcribe(self, audio, **kwargs):
        # Pop wrapper-level flags so they aren't forwarded to WhisperX's transcribe()
        do_align = kwargs.pop("do_align", True)
        do_diarization = kwargs.pop("do_diarization", False)
        result = self.model.transcribe(audio, **kwargs)
        # Alignment
        if do_align:
            model_a, metadata = whisperx.load_align_model(
                language_code=result["language"], device=self.device)
            result = whisperx.align(
                result["segments"], model_a, metadata, audio, self.device)
        # Diarization (optional)
        if do_diarization:
            diarize_model = whisperx.diarize.DiarizationPipeline(device=self.device)
            diarize_segments = diarize_model(audio)
            result = whisperx.assign_word_speakers(diarize_segments, result)
        return result
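
And a hypothetical usage of the class above (file path and model name are arbitrary):

impl = WhisperXImplementation("large-v2", device="cuda")
result = impl.transcribe("audio.wav", do_align=True)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])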

Summary Table

Step | File(s) to Modify/Create                | Purpose
1    | modules/whisper/whisper_impl.py         | Add WhisperX to backend enum/list
2    | modules/whisper/whisperx_impl.py (new)  | Implement WhisperX backend wrapper
3    | modules/whisper/whisper_factory.py      | Instantiate WhisperX backend
4    | app.py, modules/whisper/data_classes.py | Expose WhisperX options in UI
5    | requirements.txt                        | Add whisperx dependency
6    | README.md                               | Update documentation

Let me know if you’d like code snippets for other files or more detail on UI integration!
