
[WIP] Initial audio model support #793

Open
akaashrp wants to merge 1 commit into mlc-ai:main from akaashrp:audio

Conversation

@akaashrp (Collaborator)

No description provided.

@gemini-code-assist (bot)

Summary of Changes

Hello, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational support for audio-language models (ALMs) within the system. It expands the existing multimodal capabilities to seamlessly integrate audio input, allowing models to process and respond to both audio and visual content alongside traditional text. The changes ensure that the system can correctly handle, validate, and embed audio data, paving the way for more advanced audio-aware AI interactions.
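
For orientation, here is a minimal sketch of what an audio-bearing request could look like once this lands, assuming the PR follows the OpenAI-style `input_audio` content part that the review snippets below suggest. The model id, `base64Wav` payload, and prompt are placeholders, not names from the PR.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder model id: this PR does not name an audio-capable prebuilt model.
const engine = await CreateMLCEngine("SomeAudioModel-q4f16_1-MLC");

const base64Wav = "..."; // base64-encoded WAV bytes (placeholder)

const reply = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Transcribe and summarize this clip." },
        // The input_audio content part is what this PR adds alongside
        // the existing text and image_url parts.
        { type: "input_audio", input_audio: { data: base64Wav, format: "wav" } },
      ],
    },
  ],
});
console.log(reply.choices[0].message.content);
```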

Highlights

  • Introduced Audio-Language Model (ALM) Support: Added `ALM` as a new `ModelType` to categorize models capable of processing audio, expanding the system's model classification (see the sketch after this list).
  • Enabled Multimodal Input Handling: Extended conversation and chat pipeline logic to accept and process `input_audio` content parts alongside existing text and image inputs, facilitating richer interactions.
  • Implemented Audio Embedding: Integrated a new `audio_embed` function within the `LLMChatPipeline` to convert raw audio features into model-compatible embeddings, crucial for audio processing.
  • Enhanced Input Validation and Error Handling: Introduced specific error types and validation checks for audio input, ensuring correct data formats and model compatibility for robust operation.
  • Updated Prefill Chunking Logic: Modified the input data chunking mechanism to correctly account for the varying embed sizes of multimodal inputs, including audio, optimizing prefill operations.
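
To make the first highlight concrete, here is a sketch of the likely `ModelType` change. The existing members below are taken from web-llm's current `src/config.ts`; only `ALM` is new. This is illustrative, not the PR's exact diff.

```ts
// src/config.ts (sketch): ALM joins the existing model categories.
export enum ModelType {
  "LLM",
  "embedding",
  "VLM", // vision-language model
  "ALM", // audio-language model (added by this PR)
}
```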


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@akaashrp mentioned this pull request Mar 22, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces initial support for audio models by generalizing the existing vision model (VLM) support to handle multimodal inputs, including audio. The changes are extensive, touching upon configuration, conversation management, error handling, and the core LLM pipeline. New types for audio data are added, and the pipeline is updated to process audio embeddings. The implementation follows the existing patterns for image handling, which is a good approach. I've added a couple of suggestions to improve code robustness for future extensions of multimodal types.

Comment on lines 148 to 150

```ts
} else {
  imageContentParts.push(curContentPart.image_url);
  multimodalContentParts.push(curContentPart.input_audio);
}
```

Severity: medium

The else block implicitly assumes the content part is of type input_audio. While this is correct with the current ChatCompletionContentPart union type, it's better to be explicit to make the code more robust for future extensions of multimodal inputs. An explicit else if (curContentPart.type === "input_audio") would be safer.

Suggested change

```diff
-} else {
-  imageContentParts.push(curContentPart.image_url);
-  multimodalContentParts.push(curContentPart.input_audio);
-}
+} else if (curContentPart.type === "input_audio") {
+  multimodalContentParts.push(curContentPart.input_audio);
+}
```
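
A follow-up thought on the same robustness concern: TypeScript's exhaustiveness check turns a missed union member into a compile error rather than a silent fallthrough. A sketch, assuming the `textContentParts`, `imageContentParts`, and `multimodalContentParts` arrays from the surrounding code:

```ts
switch (curContentPart.type) {
  case "text":
    textContentParts.push(curContentPart.text);
    break;
  case "image_url":
    imageContentParts.push(curContentPart.image_url);
    break;
  case "input_audio":
    multimodalContentParts.push(curContentPart.input_audio);
    break;
  default: {
    // If a new member is added to ChatCompletionContentPart, this
    // assignment stops compiling instead of silently falling through.
    const unhandled: never = curContentPart;
    throw new Error(`Unhandled content part: ${JSON.stringify(unhandled)}`);
  }
}
```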

Comment on lines +735 to +744

```ts
} else {
  hasAudioInput = true;
  const audioEmbedSize = getAudioEmbedSize(data);
  if (this.prefillChunkSize < audioEmbedSize) {
    throw new PrefillChunkSizeSmallerThanAudioError(
      this.prefillChunkSize,
      audioEmbedSize,
    );
  }
}
```

Severity: medium

The else block implicitly assumes that if the input is not an image, it must be audio. This is not robust if more multimodal types are added in the future. Using an else if with an explicit check for audio would be safer and clearer.

Suggested change

```diff
-} else {
-  hasAudioInput = true;
-  const audioEmbedSize = getAudioEmbedSize(data);
-  if (this.prefillChunkSize < audioEmbedSize) {
-    throw new PrefillChunkSizeSmallerThanAudioError(
-      this.prefillChunkSize,
-      audioEmbedSize,
-    );
-  }
-}
+} else if (!isImageInput(data)) {
+  hasAudioInput = true;
+  const audioEmbedSize = getAudioEmbedSize(data);
+  if (this.prefillChunkSize < audioEmbedSize) {
+    throw new PrefillChunkSizeSmallerThanAudioError(
+      this.prefillChunkSize,
+      audioEmbedSize,
+    );
+  }
+}
```
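
Relatedly, the suggestion above still tests "not an image" rather than "is audio". An explicit type guard, mirroring the `isImageInput` helper it references, would make the check positive. A hypothetical sketch; the type names and `kind` discriminant below are placeholders, not names from the PR:

```ts
type ImageInput = { kind: "image"; imageUrl: string };
type AudioInput = { kind: "audio"; features: Float32Array };
type MultimodalInput = ImageInput | AudioInput;

// Hypothetical positive check, analogous to the PR's isImageInput.
function isAudioInput(data: MultimodalInput): data is AudioInput {
  return data.kind === "audio";
}
```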

