
[WIP] Initial audio model support #793

Open
akaashrp wants to merge 1 commit into mlc-ai:main from akaashrp:audio

Conversation

@akaashrp (Collaborator)

No description provided.

@gemini-code-assist (bot)

Summary of Changes

Hello, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational support for audio-language models (ALMs) within the system. It expands the existing multimodal capabilities to seamlessly integrate audio input, allowing models to process and respond to both audio and visual content alongside traditional text. The changes ensure that the system can correctly handle, validate, and embed audio data, paving the way for more advanced audio-aware AI interactions.
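
For orientation, here is a minimal sketch of what an audio-bearing request could look like once this lands, assuming the PR follows the OpenAI-style `input_audio` content part that the review snippets below suggest. The model id, `base64Wav` payload, and prompt are placeholders, not names from the PR.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder model id: this PR does not name an audio-capable prebuilt model.
const engine = await CreateMLCEngine("SomeAudioModel-q4f16_1-MLC");

const base64Wav = "..."; // base64-encoded WAV bytes (placeholder)

const reply = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe and summarize this clip." },
        // The input_audio content part is what this PR adds alongside
        // the existing text and image_url parts.
        { type: "input_audio", input_audio: { data: base64Wav, format: "wav" } },
      ],
    },
  ],
});
console.log(reply.choices[0].message.content);
```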

Highlights

  • Introduced Audio-Language Model (ALM) Support: Added `ALM` as a new `ModelType` to categorize models capable of processing audio, expanding the system's model classification (see the sketch after this list).
  • Enabled Multimodal Input Handling: Extended conversation and chat pipeline logic to accept and process `input_audio` content parts alongside existing text and image inputs, facilitating richer interactions.
  • Implemented Audio Embedding: Integrated a new `audio_embed` function within the `LLMChatPipeline` to convert raw audio features into model-compatible embeddings, crucial for audio processing.
  • Enhanced Input Validation and Error Handling: Introduced specific error types and validation checks for audio input, ensuring correct data formats and model compatibility for robust operation.
  • Updated Prefill Chunking Logic: Modified the input data chunking mechanism to correctly account for the varying embed sizes of multimodal inputs, including audio, optimizing prefill operations.
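
To make the first highlight concrete, here is a sketch of the likely `ModelType` change. The existing members below are taken from web-llm's current `src/config.ts`; only `ALM` is new. This is illustrative, not the PR's exact diff.

```ts
// src/config.ts (sketch): ALM joins the existing model categories.
export enum ModelType {
  "LLM",
  "embedding",
  "VLM", // vision-language model
  "ALM", // audio-language model (added by this PR)
}
```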


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@akaashrp mentioned this pull request Mar 22, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces initial support for audio models by generalizing the existing vision model (VLM) support to handle multimodal inputs, including audio. The changes are extensive, touching upon configuration, conversation management, error handling, and the core LLM pipeline. New types for audio data are added, and the pipeline is updated to process audio embeddings. The implementation follows the existing patterns for image handling, which is a good approach. I've added a couple of suggestions to improve code robustness for future extensions of multimodal types.

Comment on lines 148 to 150

```ts
} else {
  imageContentParts.push(curContentPart.image_url);
  multimodalContentParts.push(curContentPart.input_audio);
}
```

Severity: medium

The else block implicitly assumes the content part is of type input_audio. While this is correct with the current ChatCompletionContentPart union type, it's better to be explicit to make the code more robust for future extensions of multimodal inputs. An explicit else if (curContentPart.type === "input_audio") would be safer.

Suggested change

```diff
-} else {
-  imageContentParts.push(curContentPart.image_url);
-  multimodalContentParts.push(curContentPart.input_audio);
-}
+} else if (curContentPart.type === "input_audio") {
+  multimodalContentParts.push(curContentPart.input_audio);
+}
```
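
A follow-up thought on the same robustness concern: TypeScript's exhaustiveness check turns a missed union member into a compile error rather than a silent fallthrough. A sketch, assuming the `textContentParts`, `imageContentParts`, and `multimodalContentParts` arrays from the surrounding code:

```ts
switch (curContentPart.type) {
  case "text":
    textContentParts.push(curContentPart.text);
    break;
  case "image_url":
    imageContentParts.push(curContentPart.image_url);
    break;
  case "input_audio":
    multimodalContentParts.push(curContentPart.input_audio);
    break;
  default: {
    // If a new member is added to ChatCompletionContentPart, this
    // assignment stops compiling instead of silently falling through.
    const unhandled: never = curContentPart;
    throw new Error(`Unhandled content part: ${JSON.stringify(unhandled)}`);
  }
}
```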

Comment on lines +735 to +744

```ts
} else {
  hasAudioInput = true;
  const audioEmbedSize = getAudioEmbedSize(data);
  if (this.prefillChunkSize < audioEmbedSize) {
    throw new PrefillChunkSizeSmallerThanAudioError(
      this.prefillChunkSize,
      audioEmbedSize,
    );
  }
}
```

Severity: medium

The else block implicitly assumes that if the input is not an image, it must be audio. This is not robust if more multimodal types are added in the future. Using an else if with an explicit check for audio would be safer and clearer.

Suggested change

```diff
-} else {
-  hasAudioInput = true;
-  const audioEmbedSize = getAudioEmbedSize(data);
-  if (this.prefillChunkSize < audioEmbedSize) {
-    throw new PrefillChunkSizeSmallerThanAudioError(
-      this.prefillChunkSize,
-      audioEmbedSize,
-    );
-  }
-}
+} else if (!isImageInput(data)) {
+  hasAudioInput = true;
+  const audioEmbedSize = getAudioEmbedSize(data);
+  if (this.prefillChunkSize < audioEmbedSize) {
+    throw new PrefillChunkSizeSmallerThanAudioError(
+      this.prefillChunkSize,
+      audioEmbedSize,
+    );
+  }
+}
```
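
Relatedly, the suggestion above still tests "not an image" rather than "is audio". An explicit type guard, mirroring the `isImageInput` helper it references, would make the check positive. A hypothetical sketch; the type names and `kind` discriminant below are placeholders, not names from the PR:

```ts
type ImageInput = { kind: "image"; imageUrl: string };
type AudioInput = { kind: "audio"; features: Float32Array };
type MultimodalInput = ImageInput | AudioInput;

// Hypothetical positive check, analogous to the PR's isImageInput.
function isAudioInput(data: MultimodalInput): data is AudioInput {
  return data.kind === "audio";
}
```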

