47 changes: 46 additions & 1 deletion docs/source/en/model_doc/granite_speech.md
@@ -41,7 +41,52 @@ This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9
## Usage tips
- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
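
For example, a quick preflight check that PEFT is importable before loading the model (a minimal sketch; the error message is illustrative):

```python
# Minimal sketch: fail fast if PEFT is missing, since the bundled LoRA
# adapter cannot be applied without it.
try:
    import peft  # noqa: F401
except ImportError as exc:
    raise ImportError(
        "Granite Speech bundles a LoRA adapter that requires PEFT; "
        "install it with `pip install peft`."
    ) from exc
```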

<!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->
## Usage Example

Granite Speech is a multimodal model that can process both text and audio inputs for speech-to-text transcription and audio understanding tasks. Here's how to use it:

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"
# Reviewer (Contributor): let's rather use device_map="auto" in from_pretrained.

# Load model and processor
model_name = "ibm-granite/granite-speech-3.3-8b"
# Reviewer (Contributor): nit, but let's rather use model_id.
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, device_map=device, torch_dtype=torch.bfloat16
    # Reviewer (Contributor): torch_dtype is deprecated! Let's use dtype. Also
    # dtype="auto" here, since config.json has "torch_dtype": "bfloat16".
)

# Load audio from dummy dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
wav = torch.tensor(dataset[0]["audio"]["array"]).unsqueeze(0) # add batch dimension
# Reviewer (Contributor), on the two lines above: starting from datasets 4.0,
# this should directly use audio.get_all_samples etc.; see the datasets docs.

# Create chat conversation with audio
system_prompt = "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant"
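# The <|audio|> placeholder below marks where the processor inserts the audio features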
user_prompt = "<|audio|>can you transcribe the speech into a written format?"
chat = [
dict(role="system", content=system_prompt),
dict(role="user", content=user_prompt),
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Process audio and text together
model_inputs = processor(prompt, wav, device=device, return_tensors="pt").to(device)

# Generate response
model_outputs = model.generate(**model_inputs, max_new_tokens=200, do_sample=False, num_beams=1)

# Extract only the new tokens (response)
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
```
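
Folding the review suggestions above into the loading code gives something like the following (a sketch, not the committed example; it assumes a transformers version that accepts `dtype` in `from_pretrained` and datasets >= 4.0, where audio columns decode via torchcodec):

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from datasets import load_dataset

model_id = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate place the weights
    dtype="auto",       # picks up "torch_dtype": "bfloat16" from config.json
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# datasets >= 4.0: the audio column yields a decoder object; get_all_samples()
# returns the full waveform (assumed shape: channels x samples)
wav = dataset[0]["audio"].get_all_samples().data
```

With `device_map="auto"`, the later `processor(...)` call can pass `device=model.device` instead of a hand-rolled `device` string.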

## GraniteSpeechConfig
