From 418b1299aa04ba65288caa152f178e5a5cb9b363 Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Mon, 6 Oct 2025 11:38:16 +0200
Subject: [PATCH 1/2] add usage example for granite speech

---
 docs/source/en/model_doc/granite_speech.md | 54 +++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/granite_speech.md b/docs/source/en/model_doc/granite_speech.md
index 5de42ff993f8..551ad9e14ecf 100644
--- a/docs/source/en/model_doc/granite_speech.md
+++ b/docs/source/en/model_doc/granite_speech.md
@@ -41,7 +41,59 @@ This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9
 ## Usage tips
 - This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
-
+## Usage Example
+
+Granite Speech is a multimodal model that can process both text and audio inputs for speech-to-text transcription and audio understanding tasks. Here's how to use it:
+
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
+from datasets import load_dataset
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# Load model and processor
+model_name = "ibm-granite/granite-speech-3.3-8b"
+processor = AutoProcessor.from_pretrained(model_name)
+tokenizer = processor.tokenizer
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    model_name, device_map=device, torch_dtype=torch.bfloat16
+)
+
+# Load audio from dummy dataset
+dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+wav = torch.tensor(dataset[0]["audio"]["array"]).unsqueeze(0)  # add batch dimension
+
+# Create chat conversation with audio
+system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
+user_prompt = "<|audio|>can you transcribe the speech into a written format?"
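+
+# NOTE: the "<|audio|>" placeholder marks where the processor inserts the audio
+# features; the model's bundled LoRA adapter is enabled automatically whenever
+# audio inputs are present (PEFT must be installed).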
+chat = [
+    dict(role="system", content=system_prompt),
+    dict(role="user", content=user_prompt),
+]
+prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+
+# Process audio and text together
+model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)
+
+# Generate response
+model_outputs = model.generate(**model_inputs, max_new_tokens=200, do_sample=False, num_beams=1)
+
+# Extract only the new tokens (response)
+num_input_tokens = model_inputs["input_ids"].shape[-1]
+new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
+output_text = tokenizer.batch_decode(
+    new_tokens, add_special_tokens=False, skip_special_tokens=True
+)
+print(f"STT output = {output_text[0].upper()}")
+```
+
+### Key Features:
+
+- **Audio Token Placeholder**: Use `<|audio|>` in your prompt to indicate where the audio should be processed
+- **Chat Template**: The model works with conversational formats using the tokenizer's chat template
+- **Automatic LoRA**: The model automatically enables its LoRA adapter when audio features are present
+- **Audio Requirements**: Input audio should be mono (single channel) and sampled at 16kHz
 
 ## GraniteSpeechConfig

From 1bb46ca040f10d4fff513845fbdbb2cc77c5a9cc Mon Sep 17 00:00:00 2001
From: Deep-unlearning
Date: Mon, 6 Oct 2025 11:43:14 +0200
Subject: [PATCH 2/2] remove key features

---
 docs/source/en/model_doc/granite_speech.md | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/docs/source/en/model_doc/granite_speech.md b/docs/source/en/model_doc/granite_speech.md
index 551ad9e14ecf..3f0c0215add0 100644
--- a/docs/source/en/model_doc/granite_speech.md
+++ b/docs/source/en/model_doc/granite_speech.md
@@ -88,13 +88,6 @@ output_text = tokenizer.batch_decode(
     new_tokens, add_special_tokens=False, skip_special_tokens=True
 )
 print(f"STT output = {output_text[0].upper()}")
 ```
 
-### Key Features:
-
-- **Audio Token Placeholder**: Use `<|audio|>` in your prompt to indicate where the audio should be processed
-- **Chat Template**: The model works with conversational formats using the tokenizer's chat template
-- **Automatic LoRA**: The model automatically enables its LoRA adapter when audio features are present
-- **Audio Requirements**: Input audio should be mono (single channel) and sampled at 16kHz
-
 ## GraniteSpeechConfig
 
 [[autodoc]] GraniteSpeechConfig