add ACE-Step text-to-audio model #575

Closed
mm65x wants to merge 5 commits into Blaizzy:main from mm65x:add-acestep-tta

@mm65x
Contributor

mm65x commented Mar 14, 2026

Context

ACE-Step 1.5 (https://github.com/ace-step/ACE-Step-1.5), by StepFun, is a text-to-audio model that generates full songs with vocals and instrumentation directly from text prompts. It is designed to run efficiently on consumer hardware.

Description

adds ACE-Step to the tts pipeline. the model is a hybrid architecture utilizing an LLM (Qwen3) as a text-conditioner/planner, a Diffusion Transformer (DiT) to generate audio latents, and an Oobleck Autoencoder VAE to decode latents back to PCM audio.
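
to make the data flow concrete, here is a minimal, dependency-free sketch of the three stages described above (conditioner, latent generator, decoder). every function in it is a hypothetical stand-in, not code from this PR: the "conditioner" just hashes the prompt into a vector, the "DiT" is a toy velocity function driven by a plain Euler loop (an assumption about the sampler), and the "VAE" simply upsamples and squashes values into the PCM range.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def encode_prompt(prompt: str, dim: int = 8) -> Vector:
    """Stand-in for the LLM text conditioner: text -> fixed-size vector."""
    rng = random.Random(prompt)  # string seed, so the output is deterministic per prompt
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_velocity(latent: Vector, cond: Vector, t: float) -> Vector:
    """Stand-in for the DiT: a velocity that pulls the latent toward the conditioning."""
    return [c - x for x, c in zip(latent, cond)]


def sample_latent(
    cond: Vector,
    steps: int = 20,
    seed: int = 0,
    velocity: Callable[[Vector, Vector, float], Vector] = toy_velocity,
) -> Vector:
    """Iteratively refine random noise into a latent with fixed-size Euler steps."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in cond]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(latent, cond, i * dt)
        latent = [x + dt * vi for x, vi in zip(latent, v)]
    return latent


def decode_latent(latent: Vector, hop: int = 4) -> Vector:
    """Stand-in for the VAE decoder: each latent frame becomes `hop` samples in [-1, 1]."""
    return [math.tanh(x) for x in latent for _ in range(hop)]


def generate(prompt: str, steps: int = 20, seed: int = 0) -> Vector:
    """Text -> conditioning -> latent -> PCM-like samples."""
    cond = encode_prompt(prompt)
    latent = sample_latent(cond, steps=steps, seed=seed)
    return decode_latent(latent)
```

the only point of the sketch is the ordering of the stages and the fact that the decoder runs once, after the iterative loop has finished; the real components are neural networks operating on MLX arrays.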

the upstream repo provides a partial MLX backend for the DiT and VAE, but relies on PyTorch and transformers for the AceStepConditionEncoder and the text/lyric embedding phases. this PR fully ports the remaining components (AceStepConditionEncoder and AceStepLyricEncoder) to pure MLX and integrates the text prompting via mlx-lm, making the entire generation pipeline 100% native MLX with zero PyTorch runtime dependencies.

also includes a conversion script since upstream only distributes PyTorch weights.
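
for readers unfamiliar with what such a conversion usually involves, below is a small sketch of two common steps: renaming checkpoint keys and reordering 1-D convolution weights from the PyTorch layout `(out, in, kernel)` to the layout MLX's `Conv1d` expects, `(out, kernel, in)`. it works on plain nested lists so it runs without any ML library. the `module.` prefix, the `conv_keys` argument, and both function names are hypothetical, and this is not a description of what `convert.py` in this PR actually does.

```python
from typing import Any, Dict, List, Set

Tensor3D = List[List[List[float]]]


def torch_conv1d_to_mlx(weight: Tensor3D) -> Tensor3D:
    """Swap the last two axes: (out, in, kernel) -> (out, kernel, in)."""
    return [[list(col) for col in zip(*out_channel)] for out_channel in weight]


def convert_state_dict(
    state: Dict[str, Any],
    conv_keys: Set[str],
    prefix: str = "module.",  # hypothetical wrapper prefix to strip
) -> Dict[str, Any]:
    """Rename keys and reorder conv weights; everything else passes through unchanged."""
    converted: Dict[str, Any] = {}
    for key, value in state.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        converted[new_key] = torch_conv1d_to_mlx(value) if key in conv_keys else value
    return converted
```

a real script would do the same on tensors and then write the result to a `.safetensors` file.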

Changes in the codebase

  • mlx_audio/tts/models/acestep/acestep.py - model implementation and pipeline logic
  • mlx_audio/tts/models/acestep/conditioner.py - pure MLX port of AceStepConditionEncoder and AceStepLyricEncoder
  • mlx_audio/tts/models/acestep/config.py - config dataclasses
  • mlx_audio/tts/models/acestep/convert.py - pt -> safetensors conversion script
  • mlx_audio/tts/models/acestep/dit.py - MLX DiT decoder
  • mlx_audio/tts/models/acestep/generate_utils.py - MLX diffusion loops
  • mlx_audio/tts/models/acestep/vae.py - MLX VAE decoder
  • mlx_audio/tts/models/acestep/README.md - setup + usage
  • mlx_audio/tts/models/__init__.py, mlx_audio/tts/utils.py - registration
  • mlx_audio/tts/tests/test_acestep.py - unit tests for the pure MLX condition encoder

Changes outside the codebase

none.

Additional information

  • fully strips out the need for transformers and diffusers during runtime
  • users need to run the conversion script for now (instructions in the README)
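
the "no transformers or diffusers at runtime" claim can be checked mechanically. the helper below is a sketch (not part of this PR) that imports a module in a fresh Python subprocess and reports which of a list of forbidden packages ended up in `sys.modules`. the function name is hypothetical, and it assumes the imported module prints nothing to stdout on import.

```python
import subprocess
import sys
from typing import Iterable, List


def leaked_modules(module: str, forbidden: Iterable[str]) -> List[str]:
    """Import `module` in a clean interpreter and list the forbidden names it pulled in."""
    names = list(forbidden)
    # The probe runs in a child process so modules already loaded here do not skew the result.
    probe = (
        f"import {module}, sys\n"
        f"print('\\n'.join(n for n in {names!r} if n in sys.modules))"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the import itself fails
    )
    return result.stdout.split()
```

assuming the package from the file list above is importable as `mlx_audio.tts.models.acestep`, a test could assert that `leaked_modules("mlx_audio.tts.models.acestep", ["torch", "transformers", "diffusers"])` returns an empty list. note that this only covers import time; a lazy import inside a function would need the probe to call that function as well.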

@Blaizzy
Owner

Blaizzy commented Mar 14, 2026

Awesome work @mm65x!

However, there is already an existing PR that I created: #499

It is just missing a couple of things and some design decisions for supporting SFX models.

To avoid duplicate work, I would recommend checking it out and sending a PR to that branch if you have any of the missing pieces working.

@Blaizzy
Owner

Blaizzy commented Mar 14, 2026

How about we close this and collaborate on #499?

@mm65x
Contributor Author

mm65x commented Mar 14, 2026

oops! closing in favor of #499 which already has this fully implemented with more robust token handling and audio features

mm65x closed this Mar 14, 2026
@Blaizzy
Owner

Blaizzy commented Mar 14, 2026

No worries!

My bad. Since SFX models are new and this model is quite complex (lots of moving pieces), I put it on hold for a couple of weeks to push our Swift SDK and improve inference here.

mm65x reopened this Mar 15, 2026
@Blaizzy
Owner

Blaizzy commented Mar 15, 2026

Will share updates in #499 later today or tomorrow.

mm65x closed this Mar 15, 2026

Development

Successfully merging this pull request may close these issues.

Model Request: ACE-Step

2 participants