Add Irodori-TTS: Japanese TTS model port to MLX #591

Merged
lucasnewman merged 6 commits into Blaizzy:main from yoshphys:feature/irodori-tts on Mar 24, 2026

@yoshphys
Contributor

Summary

  • Port Aratako/Irodori-TTS-500M to mlx-audio as a new TTS model (irodori_tts)
  • Japanese TTS based on Echo TTS architecture, using Rectified Flow diffusion + DACVAE codec (48kHz, 128-dim latents)
  • Adds "irodori_tts" entry to MODEL_REMAPPING in tts/utils.py
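
A remapping table like the one mentioned in the last bullet is typically a plain dictionary from a model-type string to the package that implements it. The sketch below is illustrative only: the dictionary contents, the value chosen for the "irodori_tts" entry, and the helper names `resolve_model_package` and `model_module_path` are assumptions, not code from tts/utils.py.

```python
# Hypothetical, abbreviated stand-in for the MODEL_REMAPPING table.
# The real table in tts/utils.py has more entries; the value used for
# "irodori_tts" here is an assumption.
MODEL_REMAPPING = {
    "irodori_tts": "irodori_tts",
}


def resolve_model_package(model_type: str) -> str:
    """Return the package name for a model type, falling back to the type itself."""
    return MODEL_REMAPPING.get(model_type, model_type)


def model_module_path(model_type: str, base: str = "mlx_audio.tts.models") -> str:
    """Build the dotted module path under which the model package would live."""
    return f"{base}.{resolve_model_package(model_type)}"


if __name__ == "__main__":
    print(model_module_path("irodori_tts"))
```

With a table like this, a loader can turn the model type read from a checkpoint's config into an importable module path without special-casing each model.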

New files

| File | Description |
| --- | --- |
| models/irodori_tts/model.py | IrodoriDiT architecture (JointAttention, LowRankAdaLN, SwiGLU, RoPE) |
| models/irodori_tts/irodori_tts.py | TTS wrapper (Model class, DACVAE loading, generate pipeline) |
| models/irodori_tts/config.py | IrodoriDiTConfig, SamplerConfig, ModelConfig |
| models/irodori_tts/sampling.py | Euler sampler with CFG (independent/alternating/joint modes) |
| models/irodori_tts/text.py | Japanese text normalization + HuggingFace tokenizer wrapper |
| models/irodori_tts/convert.py | Weight conversion script: PyTorch → MLX fp16 (DiT + DACVAE) |
| models/irodori_tts/README.md | Usage docs, memory requirements, conversion instructions |
| tests/test_irodori_tts.py | 28 unit tests (all passing) |
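
For readers unfamiliar with the sampler listed for sampling.py, here is a minimal sketch of Euler integration of a rectified-flow velocity field with classifier-free guidance (CFG). It is not the code in this PR: it uses plain Python lists instead of MLX arrays, assumes the convention of integrating from t=0 (noise) to t=1 (data), models a single condition only (the independent/alternating/joint modes are not reproduced), and the names `euler_cfg_sample` and `toy_velocity` are hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]
# velocity_fn(x, t, cond) returns the predicted velocity at state x and time t.
# cond=None stands for the unconditional branch.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def euler_cfg_sample(
    velocity_fn: VelocityFn,
    x_init: Vector,
    cond: Vector,
    num_steps: int = 8,
    cfg_scale: float = 2.0,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x_init)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond = velocity_fn(x, t, cond)
        v_uncond = velocity_fn(x, t, None)
        # CFG: push the unconditional prediction toward the conditional one.
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        # Euler update.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Toy field whose straight-line flow ends at `cond` (or at zero if cond is None)."""
    target = cond if cond is not None else [0.0] * len(x)
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


if __name__ == "__main__":
    print(euler_cfg_sample(toy_velocity, [0.5, -0.5], [1.0, 2.0], cfg_scale=1.0))
```

Each step in this sketch evaluates the velocity function twice, once with and once without the condition, so guidance adds to the per-step compute and memory cost.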

Test plan

  • Run unit tests: python -m unittest mlx_audio.tts.tests.test_irodori_tts -v
  • Convert weights: python -m mlx_audio.tts.models.irodori_tts.convert (requires torch)
  • Run inference: python -m mlx_audio.tts.generate --model ./Irodori-TTS-500M-fp16 --text "こんにちは"
  • On 16GB machines, use sequence_length=300 and cfg_guidance_mode=alternating to stay within memory limits

🤖 Generated with Claude Code

Owner

@Blaizzy left a comment

Hey @yoshphys
Thanks for the contribution!

I just have a few nits:

  • Please share an audio sample of the port and the source model so we can compare.
  • Not needed: models/irodori_tts/convert.py
  • Please move tests/test_irodori_tts.py to test_models.py and follow the format there.
  • Upload a converted model to mlx-community on Hugging Face (4bit, 5bit, 6bit, 8bit, bf16)

@yoshphys
Contributor Author

Update

Tests moved to test_models.py

All Irodori-TTS tests have been consolidated into test_models.py following the existing format.

convert.py removed

Removed from the PR (kept locally for reference, not needed by end users).

Quantized models uploaded to mlx-community

Audio comparison

Text: 「お電話ありがとうございます。ただいま電話が大変混み合っております。恐れ入りますが、発信音のあとに、ご用件をお話しください。」

| Model | Audio |
| --- | --- |
| Original (Aratako/Irodori-TTS-500M, PyTorch) | comparison_original.wav |
| MLX port (fp16, sequence_length=400, cfg_guidance_mode=alternating) | comparison_mlx_fp16.wav |

yoshphys and others added 4 commits March 22, 2026 11:42
Port Aratako/Irodori-TTS-500M to mlx-audio. The model uses a DiT
(Diffusion Transformer) with Rectified Flow sampling and DACVAE codec
(48kHz, 128-dim latents).

Key components:
- IrodoriDiT: JointAttention (self+text+speaker), LowRankAdaLN, SwiGLU
- Euler sampler with CFG (independent/alternating/joint modes)
- Japanese text normalization + HuggingFace tokenizer (llm-jp/llm-jp-3-150m)
- DACVAE codec loaded from facebook/dacvae-watermarked via convert.py
- convert.py: converts PyTorch weights to MLX fp16 safetensors
- 28 unit tests covering architecture, text processing, sanitize, and smoke tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove models/irodori_tts/convert.py from tracking (reviewer: not needed)
- Delete tests/test_irodori_tts.py; move all 26 Irodori-TTS tests into
  tests/test_models.py following the established format (imports inside
  test methods, module-level stubs/helpers)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@yoshphys force-pushed the feature/irodori-tts branch from 7d8698f to 46767d6 on March 22, 2026 02:42
Collaborator

@lucasnewman left a comment

See minor comment but looks good to me!

@lucasnewman requested a review from Blaizzy on March 23, 2026 16:02
yoshphys and others added 2 commits March 24, 2026 09:59
convert.py was removed in a previous commit; the section is no longer
needed. Also update DACVAE download note to reflect that weights are
fetched automatically on first use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Owner

@Blaizzy left a comment

LGTM, thanks!

@lucasnewman merged commit 6c513de into Blaizzy:main on Mar 24, 2026
10 checks passed