
feat: add stt support via whisper.cpp with new transcribe subcommand #2438

Open

bittoby wants to merge 3 commits into containers:main from bittoby:feat/whisper-stt

Conversation

bittoby commented Feb 18, 2026

Add Speech-to-Text Support via whisper.cpp

Adds a new transcribe subcommand and whisper.cpp runtime for speech-to-text transcription.

Usage

ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin audio.wav

Changes

  • New transcribe subcommand using whisper-cli
  • New whisper.cpp runtime with serve (HTTP API) and transcribe (batch)
  • New inference spec inference-spec/engines/whisper.cpp.yaml
  • Container image: ghcr.io/ggml-org/whisper.cpp:main

Notes

  • Audio files must be world-readable in container mode: chmod o+r audio.wav
  • For serve: ramalama --runtime whisper.cpp serve MODEL
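
Once serving, a quick way to exercise the HTTP API (assuming whisper.cpp's standard /inference endpoint and the default port 8080; neither is verified by this PR):

curl http://localhost:8080/inference -F file=@audio.wav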

Issue: #2376

Summary by Sourcery

Add speech-to-text support using a new whisper.cpp runtime and transcribe subcommand integrated into the CLI and inference engine.

New Features:

  • Introduce a transcribe CLI subcommand to convert audio files to text using speech-to-text models.
  • Add whisper.cpp as a supported runtime with serve (HTTP API) and batch transcription commands.
  • Provide a new whisper.cpp inference engine spec defining serve and transcribe commands and options.

Enhancements:

  • Extend runtime configuration and image resolution to support a dedicated whisper.cpp container image with sensible defaults.
  • Update the shared command context to carry audio transcription parameters such as input file, language, and translate flag.
  • Mount audio files into containers automatically when running transcription commands.
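
Conceptually, the automatic mount in the last bullet reduces to a read-only bind of the host file into the fixed /mnt/audio path. A minimal sketch, assuming podman/docker-style --mount syntax; the helper name is illustrative, only the mount point comes from this PR:

import os

MNT_AUDIO_DIR = "/mnt/audio"  # fixed in-container mount point added by this PR

def audio_mount_args(input_file: str) -> list[str]:
    # Bind the host audio file read-only under /mnt/audio inside the container.
    src = os.path.abspath(input_file)
    dest = f"{MNT_AUDIO_DIR}/{os.path.basename(src)}"
    return ["--mount", f"type=bind,src={src},destination={dest},ro"]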

sourcery-ai bot (Contributor) commented Feb 18, 2026

Reviewer's Guide

Adds a new whisper.cpp speech-to-text runtime and a transcribe subcommand, wiring them through CLI, runtime config, container execution, and a new inference spec for whisper's serve and batch transcription commands.

File-Level Changes

Change: Introduce a transcribe subcommand that runs whisper.cpp-based batch transcription via container or host execution.
  • Add transcribe_parser to define CLI arguments (model, audio path, language, translate, container options, runtime args) and register it with the main CLI parser.
  • Set default whisper container image for transcribe, with override support and completion hooks for env, image, network, etc.
  • Implement transcribe_cli to validate audio path and permissions, derive container/host input path, ensure the model exists, and invoke model.transcribe with the assembled command (sketched after this list).
Files: ramalama/cli.py

Change: Extend the runtime abstraction and container transport to support a transcribe operation and audio file mounting for whisper.cpp.
  • Add transcribe abstract method to the base transport API and implement it in container-based models to set accel env vars and execute the command.
  • Adjust container execution to clear the entrypoint when using the whisper.cpp runtime so that inference binaries can be called directly.
  • Introduce audio-specific container mount logic that binds the input audio file into a fixed /mnt/audio path inside the container.
Files: ramalama/transports/base.py, ramalama/common.py

Change: Add configuration and argument-context support for the whisper.cpp runtime and its image selection.
  • Define DEFAULT_WHISPER_IMAGE and a WHISPER image field in the image config, plus support whisper.cpp in SUPPORTED_RUNTIMES and runtime help text.
  • Add a default_whisper_image helper and hook whisper.cpp into the runtime/image mapping logic for top-level CLI options and accel_image selection.
  • Extend RamalamaArgsContext to carry transcription-specific fields (input_file, language, translate) from argparse into the runtime layer.
Files: ramalama/cli.py, ramalama/config.py, ramalama/config_types.py, ramalama/command/context.py, ramalama/common.py

Change: Define a new inference spec for the whisper.cpp engine that describes serve and transcribe commands.
  • Add inference-spec/engines/whisper.cpp.yaml specifying whisper-server options for HTTP serving and whisper-cli options for batch transcription, including host/port, model path, threads, language, translate, and runtime_args passthrough.
Files: inference-spec/engines/whisper.cpp.yaml
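
A minimal sketch of the transcribe_cli flow from the first change above. Only transcribe_cli, the model.transcribe call, and the fixed /mnt/audio mount point come from this PR; the argument names, checks, and the model parameter are illustrative:

import os

MNT_AUDIO_DIR = "/mnt/audio"

def transcribe_cli(args, model):
    # model: transport object exposing the transcribe() method added in this PR.
    if not os.path.isfile(args.input_file):
        raise FileNotFoundError(f"audio file not found: {args.input_file}")
    if args.container and not (os.stat(args.input_file).st_mode & 0o004):
        # Matches the PR note: audio must be world-readable in container mode.
        raise PermissionError(f"audio file must be world-readable: chmod o+r {args.input_file}")
    # In-container path under the fixed mount point; absolute host path otherwise.
    if args.container:
        input_path = f"{MNT_AUDIO_DIR}/{os.path.basename(args.input_file)}"
    else:
        input_path = os.path.abspath(args.input_file)
    model.transcribe(args, input_path)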


gemini-code-assist bot (Contributor)
Summary of Changes

Hello @bittoby, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's capabilities by introducing robust Speech-to-Text (STT) support. It integrates the whisper.cpp runtime, allowing users to transcribe audio files into text via a new transcribe subcommand. This addition expands the platform's utility beyond traditional language models to include audio processing, making it more versatile for various AI tasks.

Highlights

  • New transcribe subcommand: A new transcribe subcommand has been added, enabling Speech-to-Text (STT) functionality for audio files.
  • Whisper.cpp Runtime Integration: The whisper.cpp runtime is now integrated, supporting both serve (HTTP API) and transcribe (batch processing) operations for STT.
  • Dedicated Inference Specification: A new inference specification (inference-spec/engines/whisper.cpp.yaml) has been introduced to define the commands and arguments for the whisper.cpp runtime.
  • Automatic Container Image Selection: The system now automatically selects the ghcr.io/ggml-org/whisper.cpp:main container image when the whisper.cpp runtime is specified, unless overridden.


Changelog
  • inference-spec/engines/whisper.cpp.yaml
    • Added a new inference specification for the whisper.cpp runtime.
    • Defined serve command with options for host, port, model, threads, and debug mode.
    • Defined transcribe command with options for model, input file, threads, language, and translate flag.
  • inference-spec/schema/schema.1-0-0.json
    • Updated the name enum to include "transcribe" as a valid command.
  • ramalama/cli.py
    • Imported MNT_AUDIO_DIR and DEFAULT_WHISPER_IMAGE.
    • Added default_whisper_image function to retrieve the whisper container image.
    • Updated the --runtime argument choices to include "whisper.cpp".
    • Added transcribe_parser to configure arguments for the new transcribe subcommand.
    • Added transcribe_cli function to handle the logic for the transcribe command, including file existence and permissions checks, and setting up input file paths.
    • Implemented logic to automatically set the whisper.cpp container image if not explicitly overridden.
  • ramalama/command/context.py
    • Added input_file, language, and translate attributes to RamalamaArgsContext for transcribe command arguments.
  • ramalama/common.py
    • Defined MNT_AUDIO_DIR constant for mounting audio files within containers.
    • Modified accel_image to return the whisper.cpp image when the runtime is set to "whisper.cpp".
  • ramalama/config.py
    • Defined DEFAULT_WHISPER_IMAGE as ghcr.io/ggml-org/whisper.cpp:main.
    • Added WHISPER to RamalamaImages to store the default whisper image.
  • ramalama/config_types.py
    • Extended SUPPORTED_RUNTIMES to include "whisper.cpp".
  • ramalama/transports/base.py
    • Imported MNT_AUDIO_DIR.
    • Added an abstract transcribe method to BaseTransport.
    • Modified exec_model_in_container to add --entrypoint "" for whisper.cpp runtime.
    • Implemented setup_mounts to bind mount the input audio file into the container at MNT_AUDIO_DIR.
    • Added a concrete transcribe method implementation.
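
The entrypoint change noted above for ramalama/transports/base.py amounts to something like the following when assembling the container command (a sketch; conman_args stands in for the real argument list):

# Clear the image entrypoint for whisper.cpp so whisper-cli/whisper-server
# can be invoked directly as the container command.
if args.runtime == "whisper.cpp":
    conman_args += ["--entrypoint", ""]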

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces speech-to-text functionality by integrating whisper.cpp. It adds a new transcribe subcommand, a whisper.cpp runtime, and the necessary configuration and command-line arguments. The implementation is solid and follows existing patterns in the codebase. I've identified a couple of opportunities to improve maintainability by refactoring duplicated code for argument parsing and image resolution logic. Overall, this is a great addition to the project.

run(host=args.host, port=int(args.port), model_store_path=args.store)


def transcribe_parser(subparsers):

Severity: medium

The transcribe_parser function duplicates a significant number of argument definitions that are also present in other parser functions like runtime_options. This code duplication makes the CLI argument setup harder to maintain.

To improve maintainability, consider refactoring the common arguments into a separate helper function. For example, you could create a function add_common_runtime_options(parser) that adds arguments shared across different subcommands (e.g., --device, --env, --image, --name, --ngl, --oci-runtime, --privileged, --pull, --runtime-args, --selinux, --threads, --tls-verify).

This new function could then be called from transcribe_parser, and runtime_options could also be refactored to use it, centralizing the logic for these common options.
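
A sketch of the suggested helper; the option names are taken from the comment above, while types, defaults, and help strings are illustrative:

def add_common_runtime_options(parser):
    # Options shared by run/serve/transcribe subcommands.
    parser.add_argument("--device", action="append", help="device to pass through to the container")
    parser.add_argument("--env", action="append", default=[], help="environment variables to set")
    parser.add_argument("--image", help="container image to use")
    parser.add_argument("--name", help="name of the container")
    parser.add_argument("--threads", type=int, help="number of threads to use")
    parser.add_argument("--runtime-args", default="", help="extra arguments for the inference runtime")

transcribe_parser and runtime_options would then both call add_common_runtime_options(parser) before adding their command-specific arguments.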

Comment on lines 667 to 672
if config.runtime == "whisper.cpp":
from ramalama.config import DEFAULT_WHISPER_IMAGE

image = config.images.get("WHISPER", DEFAULT_WHISPER_IMAGE)
# If the image from the config is specified by tag or digest, return it unmodified
return image if ":" in image else f"{image}:main"

Severity: medium

The logic for determining the whisper.cpp container image is duplicated here and in ramalama/cli.py within the default_whisper_image function. This duplication can lead to maintenance issues and potential inconsistencies in the future.

To improve this, you could extract the common logic into a single helper function, for example in ramalama/config.py:

# In ramalama/config.py
def get_whisper_image(config: Config) -> str:
    image = config.images.get("WHISPER", DEFAULT_WHISPER_IMAGE)
    return image if ":" in image else f"{image}:main"

Then, both accel_image in this file and default_whisper_image in cli.py can call this new function, ensuring a single source of truth for this logic.

sourcery-ai bot (Contributor) left a comment

Hey - I've found 1 issue and left some high-level feedback:

  • The logic for choosing the whisper.cpp image is split between default_whisper_image() in cli.py and the config.runtime == "whisper.cpp" branch in accel_image; consider centralizing this image selection in one place to avoid future divergence in tag/override behavior.
  • In transcribe_cli you unconditionally set args.runtime = "whisper.cpp", which can silently override a user-specified --runtime; it may be clearer to either make transcribe reject non-whisper runtimes explicitly or prevent --runtime from being accepted for this subcommand.
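
For the second point, an explicit check along these lines would surface the conflict instead of silently overriding it (purely illustrative, not the PR's current code):

def ensure_whisper_runtime(args):
    # transcribe only makes sense with whisper.cpp; fail loudly on a mismatch.
    if getattr(args, "runtime", None) not in (None, "whisper.cpp"):
        raise ValueError(f"transcribe requires the whisper.cpp runtime, not {args.runtime!r}")
    args.runtime = "whisper.cpp"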

Individual Comments

Comment 1: ramalama/common.py:36

Code context:

 MNT_MMPROJ_FILE = f"{MNT_DIR}/mmproj.file"
 MNT_FILE_DRAFT = f"{MNT_DIR}/draft_model.file"
 MNT_CHAT_TEMPLATE_FILE = f"{MNT_DIR}/chat_template.file"
+MNT_AUDIO_DIR = "/mnt/audio"

 RAG_DIR = "/rag"

suggestion: Whisper image resolution logic is now duplicated in multiple places and may diverge over time.

Whisper image selection is now implemented in three places (DEFAULT_WHISPER_IMAGE, default_whisper_image() in cli.py, and the config.runtime == "whisper.cpp" branch in accel_image()), each using similar if ':' in image logic with slightly different defaults. Centralizing this into a single helper (or having accel_image delegate to default_whisper_image where applicable) would reduce the risk of config drift and inconsistent behavior between CLI defaults and runtime selection.


bittoby (Author) commented Feb 18, 2026

@rhatdan @ieaves Could you please review this PR?

bittoby (Author) commented Feb 20, 2026

@rhatdan I'd appreciate your feedback

rhatdan (Member) commented Feb 21, 2026

Does this work with STDIN?

cat audio.wav | ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin

Or

ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin audio.wav -

Not sure if adding a new command is necessary or not. Would doing something like

$ ramalama run huggingface://ggerganov/whisper.cpp/ggml-tiny.bin --transcribe audio.wav

make sense?

Would ramalama transcribe run against the whisper server? What would be the syntax?

Can I override the image with something else besides whisper?

@olliewalsh @engelmi @ieaves @mikebonnet @jtligon @bmahabirbu WDYT?

bittoby (Author) commented Feb 21, 2026

Replying to @rhatdan's questions above:

STDIN: Not supported right now. It needs an audio file path (the file is mounted into the container). We could add STDIN later by saving it to a temp file first.

Why ramalama transcribe instead of ramalama run --transcribe: Transcribe is a one-shot “audio in, text out” command. run is for interactive chat, so keeping them separate avoids messy argument handling. I can change it if the team prefers run --transcribe.

Using a running Whisper server: Not supported yet. It starts a new container each time. We could add a --server option later to send audio to an existing server.

Custom image: Yes, --image works.
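
For reference, the temp-file approach to STDIN mentioned above could look roughly like this (hypothetical, not part of this PR):

import sys
import tempfile

def spool_stdin_to_tempfile() -> str:
    # Write piped audio bytes to a temp file so it can be bind-mounted
    # into the container like a regular input file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(sys.stdin.buffer.read())
        return tmp.name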

bittoby (Author) commented Feb 22, 2026

@rhatdan Please let me know your opinion.
