
feat: add stt support via whisper.cpp with new transcribe subcommand #2438

Open

bittoby wants to merge 3 commits into containers:main from bittoby:feat/whisper-stt

Conversation

bittoby commented Feb 18, 2026

Add Speech-to-Text Support via whisper.cpp

Adds a new transcribe subcommand and whisper.cpp runtime for speech-to-text transcription.

Usage

ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin audio.wav

Changes

  • New transcribe subcommand using whisper-cli
  • New whisper.cpp runtime with serve (HTTP API) and transcribe (batch)
  • New inference spec inference-spec/engines/whisper.cpp.yaml
  • Container image: ghcr.io/ggml-org/whisper.cpp:main

Notes

  • Audio files must be world-readable in container mode: chmod o+r audio.wav
  • For serve: ramalama --runtime whisper.cpp serve MODEL
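
Once serving, a quick way to exercise the HTTP API (assuming whisper.cpp's standard /inference endpoint and the default port 8080; neither is verified by this PR):

curl http://localhost:8080/inference -F file=@audio.wav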

Issue: #2376

Summary by Sourcery

Add speech-to-text support using a new whisper.cpp runtime and transcribe subcommand integrated into the CLI and inference engine.

New Features:

  • Introduce a transcribe CLI subcommand to convert audio files to text using speech-to-text models.
  • Add whisper.cpp as a supported runtime with serve (HTTP API) and batch transcription commands.
  • Provide a new whisper.cpp inference engine spec defining serve and transcribe commands and options.

Enhancements:

  • Extend runtime configuration and image resolution to support a dedicated whisper.cpp container image with sensible defaults.
  • Update the shared command context to carry audio transcription parameters such as input file, language, and translate flag.
  • Mount audio files into containers automatically when running transcription commands.
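
Conceptually, the automatic mount in the last bullet reduces to a read-only bind of the host file into the fixed /mnt/audio path. A minimal sketch, assuming podman/docker-style --mount syntax; the helper name is illustrative, only the mount point comes from this PR:

import os

MNT_AUDIO_DIR = "/mnt/audio"  # fixed in-container mount point added by this PR

def audio_mount_args(input_file: str) -> list[str]:
    # Bind the host audio file read-only under /mnt/audio inside the container.
    src = os.path.abspath(input_file)
    dest = f"{MNT_AUDIO_DIR}/{os.path.basename(src)}"
    return ["--mount", f"type=bind,src={src},destination={dest},ro"]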

sourcery-ai bot (Contributor) commented Feb 18, 2026

Reviewer's Guide

Adds a new whisper.cpp speech-to-text runtime and a transcribe subcommand, wiring them through CLI, runtime config, container execution, and a new inference spec for whisper's serve and batch transcription commands.

File-Level Changes

Change: Introduce a transcribe subcommand that runs whisper.cpp-based batch transcription via container or host execution.
  • Add transcribe_parser to define CLI arguments (model, audio path, language, translate, container options, runtime args) and register it with the main CLI parser.
  • Set default whisper container image for transcribe, with override support and completion hooks for env, image, network, etc.
  • Implement transcribe_cli to validate audio path and permissions, derive container/host input path, ensure the model exists, and invoke model.transcribe with the assembled command (sketched after this list).
Files: ramalama/cli.py

Change: Extend the runtime abstraction and container transport to support a transcribe operation and audio file mounting for whisper.cpp.
  • Add transcribe abstract method to the base transport API and implement it in container-based models to set accel env vars and execute the command.
  • Adjust container execution to clear the entrypoint when using the whisper.cpp runtime so that inference binaries can be called directly.
  • Introduce audio-specific container mount logic that binds the input audio file into a fixed /mnt/audio path inside the container.
Files: ramalama/transports/base.py, ramalama/common.py

Change: Add configuration and argument-context support for the whisper.cpp runtime and its image selection.
  • Define DEFAULT_WHISPER_IMAGE and a WHISPER image field in the image config, plus support whisper.cpp in SUPPORTED_RUNTIMES and runtime help text.
  • Add a default_whisper_image helper and hook whisper.cpp into the runtime/image mapping logic for top-level CLI options and accel_image selection.
  • Extend RamalamaArgsContext to carry transcription-specific fields (input_file, language, translate) from argparse into the runtime layer.
Files: ramalama/cli.py, ramalama/config.py, ramalama/config_types.py, ramalama/command/context.py, ramalama/common.py

Change: Define a new inference spec for the whisper.cpp engine that describes serve and transcribe commands.
  • Add inference-spec/engines/whisper.cpp.yaml specifying whisper-server options for HTTP serving and whisper-cli options for batch transcription, including host/port, model path, threads, language, translate, and runtime_args passthrough.
Files: inference-spec/engines/whisper.cpp.yaml
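
A minimal sketch of the transcribe_cli flow from the first change above. Only transcribe_cli, the model.transcribe call, and the fixed /mnt/audio mount point come from this PR; the argument names, checks, and the model parameter are illustrative:

import os

MNT_AUDIO_DIR = "/mnt/audio"

def transcribe_cli(args, model):
    # model: transport object exposing the transcribe() method added in this PR.
    if not os.path.isfile(args.input_file):
        raise FileNotFoundError(f"audio file not found: {args.input_file}")
    if args.container and not (os.stat(args.input_file).st_mode & 0o004):
        # Matches the PR note: audio must be world-readable in container mode.
        raise PermissionError(f"audio file must be world-readable: chmod o+r {args.input_file}")
    # In-container path under the fixed mount point; absolute host path otherwise.
    if args.container:
        input_path = f"{MNT_AUDIO_DIR}/{os.path.basename(args.input_file)}"
    else:
        input_path = os.path.abspath(args.input_file)
    model.transcribe(args, input_path)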


gemini-code-assist bot (Contributor)
Summary of Changes

Hello @bittoby, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's capabilities by introducing robust Speech-to-Text (STT) support. It integrates the whisper.cpp runtime, allowing users to transcribe audio files into text via a new transcribe subcommand. This addition expands the platform's utility beyond traditional language models to include audio processing, making it more versatile for various AI tasks.

Highlights

  • New transcribe subcommand: A new transcribe subcommand has been added, enabling Speech-to-Text (STT) functionality for audio files.
  • Whisper.cpp Runtime Integration: The whisper.cpp runtime is now integrated, supporting both serve (HTTP API) and transcribe (batch processing) operations for STT.
  • Dedicated Inference Specification: A new inference specification (inference-spec/engines/whisper.cpp.yaml) has been introduced to define the commands and arguments for the whisper.cpp runtime.
  • Automatic Container Image Selection: The system now automatically selects the ghcr.io/ggml-org/whisper.cpp:main container image when the whisper.cpp runtime is specified, unless overridden.


Changelog
  • inference-spec/engines/whisper.cpp.yaml
    • Added a new inference specification for the whisper.cpp runtime.
    • Defined serve command with options for host, port, model, threads, and debug mode.
    • Defined transcribe command with options for model, input file, threads, language, and translate flag.
  • inference-spec/schema/schema.1-0-0.json
    • Updated the name enum to include "transcribe" as a valid command.
  • ramalama/cli.py
    • Imported MNT_AUDIO_DIR and DEFAULT_WHISPER_IMAGE.
    • Added default_whisper_image function to retrieve the whisper container image.
    • Updated the --runtime argument choices to include "whisper.cpp".
    • Added transcribe_parser to configure arguments for the new transcribe subcommand.
    • Added transcribe_cli function to handle the logic for the transcribe command, including file existence and permissions checks, and setting up input file paths.
    • Implemented logic to automatically set the whisper.cpp container image if not explicitly overridden.
  • ramalama/command/context.py
    • Added input_file, language, and translate attributes to RamalamaArgsContext for transcribe command arguments.
  • ramalama/common.py
    • Defined MNT_AUDIO_DIR constant for mounting audio files within containers.
    • Modified accel_image to return the whisper.cpp image when the runtime is set to "whisper.cpp".
  • ramalama/config.py
    • Defined DEFAULT_WHISPER_IMAGE as ghcr.io/ggml-org/whisper.cpp:main.
    • Added WHISPER to RamalamaImages to store the default whisper image.
  • ramalama/config_types.py
    • Extended SUPPORTED_RUNTIMES to include "whisper.cpp".
  • ramalama/transports/base.py
    • Imported MNT_AUDIO_DIR.
    • Added an abstract transcribe method to BaseTransport.
    • Modified exec_model_in_container to add --entrypoint "" for whisper.cpp runtime.
    • Implemented setup_mounts to bind mount the input audio file into the container at MNT_AUDIO_DIR.
    • Added a concrete transcribe method implementation.
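
The entrypoint change noted above for ramalama/transports/base.py amounts to something like the following when assembling the container command (a sketch; conman_args stands in for the real argument list):

# Clear the image entrypoint for whisper.cpp so whisper-cli/whisper-server
# can be invoked directly as the container command.
if args.runtime == "whisper.cpp":
    conman_args += ["--entrypoint", ""]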

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces speech-to-text functionality by integrating whisper.cpp. It adds a new transcribe subcommand, a whisper.cpp runtime, and the necessary configuration and command-line arguments. The implementation is solid and follows existing patterns in the codebase. I've identified a couple of opportunities to improve maintainability by refactoring duplicated code for argument parsing and image resolution logic. Overall, this is a great addition to the project.

run(host=args.host, port=int(args.port), model_store_path=args.store)


def transcribe_parser(subparsers):

Severity: medium

The transcribe_parser function duplicates a significant number of argument definitions that are also present in other parser functions like runtime_options. This code duplication makes the CLI argument setup harder to maintain.

To improve maintainability, consider refactoring the common arguments into a separate helper function. For example, you could create a function add_common_runtime_options(parser) that adds arguments shared across different subcommands (e.g., --device, --env, --image, --name, --ngl, --oci-runtime, --privileged, --pull, --runtime-args, --selinux, --threads, --tls-verify).

This new function could then be called from transcribe_parser, and runtime_options could also be refactored to use it, centralizing the logic for these common options.
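
A sketch of the suggested helper; the option names are taken from the comment above, while types, defaults, and help strings are illustrative:

def add_common_runtime_options(parser):
    # Options shared by run/serve/transcribe subcommands.
    parser.add_argument("--device", action="append", help="device to pass through to the container")
    parser.add_argument("--env", action="append", default=[], help="environment variables to set")
    parser.add_argument("--image", help="container image to use")
    parser.add_argument("--name", help="name of the container")
    parser.add_argument("--threads", type=int, help="number of threads to use")
    parser.add_argument("--runtime-args", default="", help="extra arguments for the inference runtime")

transcribe_parser and runtime_options would then both call add_common_runtime_options(parser) before adding their command-specific arguments.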

Comment on lines 667 to 672
if config.runtime == "whisper.cpp":
from ramalama.config import DEFAULT_WHISPER_IMAGE

image = config.images.get("WHISPER", DEFAULT_WHISPER_IMAGE)
# If the image from the config is specified by tag or digest, return it unmodified
return image if ":" in image else f"{image}:main"

Severity: medium

The logic for determining the whisper.cpp container image is duplicated here and in ramalama/cli.py within the default_whisper_image function. This duplication can lead to maintenance issues and potential inconsistencies in the future.

To improve this, you could extract the common logic into a single helper function, for example in ramalama/config.py:

# In ramalama/config.py
def get_whisper_image(config: Config) -> str:
    image = config.images.get("WHISPER", DEFAULT_WHISPER_IMAGE)
    return image if ":" in image else f"{image}:main"

Then, both accel_image in this file and default_whisper_image in cli.py can call this new function, ensuring a single source of truth for this logic.

sourcery-ai bot (Contributor) left a comment

Hey - I've found 1 issue and left some high-level feedback:

  • The logic for choosing the whisper.cpp image is split between default_whisper_image() in cli.py and the config.runtime == "whisper.cpp" branch in accel_image; consider centralizing this image selection in one place to avoid future divergence in tag/override behavior.
  • In transcribe_cli you unconditionally set args.runtime = "whisper.cpp", which can silently override a user-specified --runtime; it may be clearer to either make transcribe reject non-whisper runtimes explicitly or prevent --runtime from being accepted for this subcommand.
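
For the second point, an explicit check along these lines would surface the conflict instead of silently overriding it (purely illustrative, not the PR's current code):

def ensure_whisper_runtime(args):
    # transcribe only makes sense with whisper.cpp; fail loudly on a mismatch.
    if getattr(args, "runtime", None) not in (None, "whisper.cpp"):
        raise ValueError(f"transcribe requires the whisper.cpp runtime, not {args.runtime!r}")
    args.runtime = "whisper.cpp"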

Individual Comments

Comment 1: ramalama/common.py:36

Code context:

 MNT_MMPROJ_FILE = f"{MNT_DIR}/mmproj.file"
 MNT_FILE_DRAFT = f"{MNT_DIR}/draft_model.file"
 MNT_CHAT_TEMPLATE_FILE = f"{MNT_DIR}/chat_template.file"
+MNT_AUDIO_DIR = "/mnt/audio"

 RAG_DIR = "/rag"

suggestion: Whisper image resolution logic is now duplicated in multiple places and may diverge over time.

Whisper image selection is now implemented in three places (DEFAULT_WHISPER_IMAGE, default_whisper_image() in cli.py, and the config.runtime == "whisper.cpp" branch in accel_image()), each using similar if ':' in image logic with slightly different defaults. Centralizing this into a single helper (or having accel_image delegate to default_whisper_image where applicable) would reduce the risk of config drift and inconsistent behavior between CLI defaults and runtime selection.


bittoby (Author) commented Feb 18, 2026

@rhatdan @ieaves Could you please review this PR?

bittoby (Author) commented Feb 20, 2026

@rhatdan I'd appreciate your feedback

rhatdan (Member) commented Feb 21, 2026

Does this work with STDIN?

cat audio.wav | ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin

Or

ramalama transcribe huggingface://ggerganov/whisper.cpp/ggml-tiny.bin audio.wav -

Not sure if adding a new command is necessary or not. Would doing something like

$ ramalama run huggingface://ggerganov/whisper.cpp/ggml-tiny.bin --transcribe audio.wav

make sense?

Would ramalama transcribe run against the whisper server? What would be the syntax?

Can I override the image with something else besides whisper?

@olliewalsh @engelmi @ieaves @mikebonnet @jtligon @bmahabirbu WDYT?

bittoby (Author) commented Feb 21, 2026

Replying to @rhatdan's questions above:

STDIN: Not supported right now. It needs an audio file path (the file is mounted into the container). We could add STDIN later by saving it to a temp file first.

Why ramalama transcribe instead of ramalama run --transcribe: Transcribe is a one-shot “audio in, text out” command. run is for interactive chat, so keeping them separate avoids messy argument handling. I can change it if the team prefers run --transcribe.

Using a running Whisper server: Not supported yet. It starts a new container each time. We could add a --server option later to send audio to an existing server.

Custom image: Yes, --image works.
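
For reference, the temp-file approach to STDIN mentioned above could look roughly like this (hypothetical, not part of this PR):

import sys
import tempfile

def spool_stdin_to_tempfile() -> str:
    # Write piped audio bytes to a temp file so it can be bind-mounted
    # into the container like a regular input file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(sys.stdin.buffer.read())
        return tmp.name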

bittoby (Author) commented Feb 22, 2026

@rhatdan Please let me know your opinion.
