add canary stt model (nvidia canary-1b-v2)#550

Merged
lucasnewman merged 7 commits into Blaizzy:main from mm65x:add-canary-stt
Mar 8, 2026

Conversation

@mm65x
Contributor

@mm65x mm65x commented Mar 7, 2026

Context

Canary is listed as a planned STT model in the roadmap (#1). NVIDIA's canary-1b-v2 is a top performer on the Open ASR Leaderboard (7.15% avg WER) and supports 25 European languages (including Russian and Ukrainian) as well as cross-language translation.

Description

Adds a complete Canary model implementation for mlx-audio's STT pipeline. The model uses a FastConformer encoder (reusing the existing parakeet conformer) paired with a Transformer decoder with cross-attention for autoregressive text generation. Weights are loaded from safetensors converted from NVIDIA's .nemo format.
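
The autoregressive generation described above can be illustrated with a minimal greedy decoding loop. This is only a sketch, not the code in this PR: `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a stand-in that replaces the real decoder (self-attention plus cross-attention over encoder output) with a trivial rule so the loop can run on its own.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step: Callable[[Sequence[int], List[int]], List[float]],
    encoder_states: Sequence[int],
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy autoregressive decoding.

    At every iteration the decoder sees the encoder output and all tokens
    produced so far, and the highest-scoring token is appended.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = step(encoder_states, tokens)
        # Index of the largest logit (argmax).
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    # Return only the newly generated tokens, without the prompt.
    return tokens[len(prompt):]


def toy_step(encoder_states: Sequence[int], tokens: List[int]) -> List[float]:
    """Hypothetical stand-in for a decoder forward pass.

    Assumes a one-token prompt and "copies" encoder_states one id at a
    time, then emits EOS (id 0). A real decoder would compute logits with
    attention layers instead.
    """
    vocab_size = 5
    idx = len(tokens) - 1  # number of tokens generated so far
    target = encoder_states[idx] if idx < len(encoder_states) else 0
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_decode(toy_step, [3, 1, 4], prompt=[2], eos_id=0)
print(generated)
```

In a real decoder the per-step cost is kept down by caching keys and values from earlier steps, which is the role of the KV-cache in the decoder.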

Changes in the codebase

  • mlx_audio/stt/models/canary/canary.py: model class with generate(), sanitize() for NeMo weight mapping, and audio preprocessing
  • mlx_audio/stt/models/canary/decoder.py: transformer decoder with self-attention, cross-attention, fixed positional encoding, and KV-cache
  • mlx_audio/stt/models/canary/config.py: model configuration dataclasses
  • mlx_audio/stt/models/canary/tokenizer.py: sentencepiece tokenizer wrapper with canary prompt format
  • mlx_audio/stt/models/canary/__init__.py: module exports
  • mlx_audio/stt/utils.py: register "canary" in MODEL_REMAPPING
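
To give a rough idea of what a `sanitize()` step for weight mapping can look like, here is a minimal sketch that renames checkpoint keys by prefix and drops entries the target model does not need. The prefix rules and key names below are hypothetical placeholders chosen for illustration; they are not taken from this PR, and the actual NeMo and mlx-audio parameter names may differ. A real conversion may also need more than renaming.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical (old_prefix, new_prefix) rules; the real mapping may differ.
PREFIX_RULES: List[Tuple[str, str]] = [
    ("transf_decoder.", "decoder."),
    ("log_softmax.mlp.layer0.", "lm_head."),
]

# Hypothetical bookkeeping entries that carry no model parameters.
SKIP_SUFFIXES: Tuple[str, ...] = ("num_batches_tracked",)


def sanitize(weights: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with keys renamed to the target model's layout."""
    out: Dict[str, Any] = {}
    for key, value in weights.items():
        if key.endswith(SKIP_SUFFIXES):
            continue  # drop entries the target model has no slot for
        for old, new in PREFIX_RULES:
            if key.startswith(old):
                key = new + key[len(old):]
                break  # apply at most one rule per key
        out[key] = value
    return out


example = {
    "transf_decoder.layers.0.weight": 1,
    "encoder.layers.0.weight": 2,
    "encoder.bn.num_batches_tracked": 3,
}
print(sanitize(example))
```

Keeping the rules in a table rather than in nested conditionals makes it easier to compare the mapping against the checkpoint's key list when a weight fails to load.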

Changes outside the codebase

None.

Additional information

  • Encoder reuses parakeet/conformer.py directly, no duplication
  • Tested with canary-1b-v2: English/German transcription and bidirectional translation all produce correct output
  • The model has no encoder-decoder projection (Identity) since encoder and decoder dimensions match (1024)
  • Decoder uses 8 transformer layers with pre-layer-norm and a final layer norm
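
The pre-layer-norm arrangement mentioned above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the PR's decoder: vectors are plain Python lists, `layer_norm` has no learned scale or bias, and the three sublayers (standing in for self-attention, cross-attention, and feed-forward) are simple scaling functions. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x: Vector, sublayer: Sublayer) -> Vector:
    """Pre-LN block: normalize first, apply the sublayer, add the residual."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def decoder_stack(
    x: Vector, layers: Sequence[Tuple[Sublayer, Sublayer, Sublayer]]
) -> Vector:
    for self_attn, cross_attn, ffn in layers:
        x = pre_ln_residual(x, self_attn)
        x = pre_ln_residual(x, cross_attn)
        x = pre_ln_residual(x, ffn)
    # With pre-LN the residual stream itself is never normalized inside
    # the layers, so one final layer norm is applied to the output.
    return layer_norm(x)


def scale(factor: float) -> Sublayer:
    """Toy sublayer that just scales its input."""
    return lambda v: [factor * t for t in v]


layers = [(scale(0.1), scale(0.2), scale(0.3)) for _ in range(8)]
out = decoder_stack([1.0, 2.0, 3.0, 4.0], layers)
print(out)
```

Because the encoder and decoder widths match, the encoder output can feed the cross-attention sublayer directly, which is why an Identity projection suffices.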

Checklist

@Blaizzy
Owner

Blaizzy commented Mar 7, 2026

Awesome, this was one of the top items in our backlog.

Could you add a model readme (with inference examples) in the canary folder and link it in the main readme?

@mm65x
Contributor Author

mm65x commented Mar 8, 2026

Added a README in the canary folder with usage examples and linked it from the main README's STT table.

@mm65x
Contributor Author

mm65x commented Mar 8, 2026

Also, thank you for this library! I'm building a local ASR app for Mac and mlx-audio has been a great option. I've got a couple more models in the pipeline that I'd like to contribute, and I will open PRs for them too.

@lucasnewman
Collaborator

Please run the formatter: pre-commit run --all-files so we can clear tests. Otherwise looks great!

@mm65x
Contributor Author

mm65x commented Mar 8, 2026

done, ran the formatter

Collaborator

@lucasnewman lucasnewman left a comment

🚀

@lucasnewman lucasnewman merged commit e3f3f2b into Blaizzy:main Mar 8, 2026
10 checks passed