Pure Rust implementation of ACE-Step v1.5 music generation using the candle ML framework. Loads original safetensors weights directly from HuggingFace — no ONNX conversion, no Python runtime.
Generates up to 10 minutes of stereo 48kHz audio from text captions and lyrics.
One-shot generation from the command line. Prints a JSON summary to stdout on success.
```sh
ace-step \
  --caption "upbeat jazz with piano and drums, bpm: 120, key: C major" \
  --lyrics "[verse]\nWalking down the street on a sunny day" \
  --duration 30 \
  --output output.ogg
```

Keeps the pipeline resident in VRAM across requests. Each client sends one JSON request line and receives one JSON response line.
Socket: /tmp/ace-step-gen.sock (override with --socket).
```sh
echo '{"caption":"ambient piano","duration_s":20,"output":"/tmp/piano.ogg"}' \
  | socat - UNIX-CONNECT:/tmp/ace-step-gen.sock
```

Uses the GenerationManager internally — monitors VRAM, proactively offloads to CPU when memory runs low, and retries on CUDA OOM. Exits on unrecoverable failure so systemd can restart it.
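The one-line-JSON protocol can also be driven from Rust with only the standard library. This is a minimal sketch: the `send_request` helper is hypothetical, and the daemon's exact response fields are whatever it actually emits.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

// Hypothetical minimal client for the one-request-line / one-response-line
// protocol described above.
fn send_request(socket_path: &str, json_line: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(json_line.as_bytes())?;
    stream.write_all(b"\n")?;
    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply)?;
    Ok(reply)
}

fn main() {
    let req = r#"{"caption":"ambient piano","duration_s":20,"output":"/tmp/piano.ogg"}"#;
    match send_request("/tmp/ace-step-gen.sock", req) {
        Ok(reply) => println!("daemon replied: {reply}"),
        Err(e) => eprintln!("daemon not reachable: {e}"),
    }
}
```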
Requires the audio-ogg feature.
```rust
use ace_step_rs::pipeline::{AceStepPipeline, GenerationParams};

fn main() -> ace_step_rs::Result<()> {
    let device = candle_core::Device::cuda_if_available(0)?;
    let mut pipeline = AceStepPipeline::load(&device, candle_core::DType::F32)?;
    let params = GenerationParams {
        caption: "upbeat jazz with piano and drums".to_string(),
        lyrics: "[verse]\nWalking down the street on a sunny day\n".to_string(),
        duration_s: 30.0,
        ..Default::default()
    };
    let audio = pipeline.generate(&params)?;
    ace_step_rs::audio::write_audio("output.wav", &audio.samples, audio.sample_rate, audio.channels)?;
    Ok(())
}
```

Model weights (~6GB) are downloaded automatically from ACE-Step/Ace-Step1.5 on first run and cached in ~/.cache/huggingface/.
| Module | Description |
|---|---|
| `pipeline` | End-to-end inference: text encoding → diffusion → VAE decode |
| `manager` | `GenerationManager` — keeps the pipeline resident, queues requests, VRAM monitoring + OOM retry |
| `radio` | `RadioStation` — whole-song generation with request queue, auto-duration from lyrics |
| `audio` | WAV/OGG/MP3 I/O |
| `vae` | `AutoencoderOobleck` decoder (latent → 48kHz stereo waveform) |
RadioStation manages a song request queue and generates complete tracks sequentially. Duration is auto-estimated from lyrics (8s per line, clamped to 100–600s). The radio_daemon example wires this to cpal audio output with gapless double-buffered playback, Unix socket control, and skip/queue/history commands.
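The auto-duration heuristic is easy to mirror outside the crate. The sketch below is a hypothetical re-implementation; how `RadioStation` actually counts lines (e.g. whether section tags or blank lines count) is an assumption.

```rust
/// Hypothetical mirror of the RadioStation heuristic described above:
/// 8 seconds per non-empty lyric line, clamped to the 100–600 s range.
fn estimate_duration_s(lyrics: &str) -> f64 {
    let lines = lyrics.lines().filter(|l| !l.trim().is_empty()).count();
    (lines as f64 * 8.0).clamp(100.0, 600.0)
}

fn main() {
    let lyrics = "[verse]\nWalking down the street\nOn a sunny day";
    // 3 non-empty lines × 8 s = 24 s, clamped up to the 100 s floor.
    println!("{} s", estimate_duration_s(lyrics));
}
```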
```
text caption → Qwen3-Embedding-0.6B (full encoder) ──┐
                                                     ├→ packed condition sequence
lyrics → Qwen3-Embedding (embed only)                │
       → lyric encoder (8-layer transformer)         │
                                                     │
ref audio → timbre encoder (4-layer) ────────────────┘
                         ↓
    DiT (24 layers, GQA, sliding window + full attn)
    flow matching, 8-step turbo ODE (CFG-free)
                         ↓
    AutoencoderOobleck VAE (latent → 48kHz stereo waveform)
```
~2B parameters total. Uses continuous 64-dim acoustic features at 25Hz, flow matching with an 8-step CFG-free turbo schedule.
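To make the schedule `shift` parameter concrete, here is a sketch of one common flow-matching timestep warp, the SD3-style formula t′ = s·t / (1 + (s − 1)·t). Whether ACE-Step uses exactly this formulation is an assumption; the sketch only illustrates how a larger shift concentrates steps near the noisy end.

```rust
/// Sketch: an `steps`-step schedule from t = 1 (noise) down to t = 0 (data),
/// warped by `shift`. Assumes the SD3-style shift formula; the actual
/// ACE-Step turbo schedule may differ.
fn turbo_schedule(steps: usize, shift: f64) -> Vec<f64> {
    (0..=steps)
        .map(|i| {
            let t = 1.0 - i as f64 / steps as f64; // linear grid, 1 → 0
            shift * t / (1.0 + (shift - 1.0) * t)  // shift pushes steps toward t = 1
        })
        .collect()
}

fn main() {
    // shift = 3.0 is the GenerationParams default.
    for t in turbo_schedule(8, 3.0) {
        print!("{t:.3} ");
    }
    println!();
}
```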
```sh
cargo build --no-default-features
```

Requires CUDA toolkit 12.x and a compatible NVIDIA GPU.

```sh
cargo build --release --features cuda
```

For cuDNN-accelerated ConvTranspose1d (faster VAE decode):

```sh
cargo build --release --features cudnn
```

Requires a candle fork with the following upstream PRs:
- public `Model::clear_kv_cache` for Qwen3 — needed to reset KV state between inference calls
- cuDNN ConvTranspose1d (optional) — 100x faster VAE decode vs the default CPU fallback kernel
Depending on your system, you may need additional environment variables for the CUDA build — see AGENTS.md for platform-specific notes.
```sh
cargo build --release --features metal
```

Note: Metal support is provided by candle but has not been tested with this project.
```sh
cargo build --release --bin generation-daemon --features cuda,audio-ogg
```

A systemd user service unit is included in ace-step-gen.service. Install with:

```sh
cp ace-step-gen.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now ace-step-gen
```

| Feature | Default | Description |
|---|---|---|
| `cuda` | yes | NVIDIA GPU acceleration via CUDA |
| `cudnn` | no | cuDNN-accelerated ConvTranspose1d (implies `cuda`) |
| `metal` | no | Apple GPU acceleration via Metal |
| `cli` | no | Audio playback + terminal input (cpal, rodio, crossterm) |
| `audio-ogg` | no | OGG/Vorbis encoding (required by generation-daemon) |
| `audio-mp3` | no | MP3 encoding |
| `audio-all` | no | All audio encoders |
| Field | Type | Default | Description |
|---|---|---|---|
| `caption` | `String` | `""` | Style/genre description (e.g. "lo-fi hip hop, mellow piano") |
| `metas` | `String` | `""` | Metadata: bpm, key, genre, instruments |
| `lyrics` | `String` | `""` | Lyrics with section tags like `[verse]`, `[chorus]` |
| `language` | `String` | `"en"` | Lyric language code |
| `duration_s` | `f64` | `30.0` | Output duration in seconds (max 600) |
| `shift` | `f64` | `3.0` | Turbo schedule shift (1, 2, or 3) |
| `seed` | `Option<u64>` | `None` | Random seed for reproducibility |
| `src_latents` | `Option<Tensor>` | `None` | Source latents for repaint/inpainting |
| `chunk_masks` | `Option<Tensor>` | `None` | Mask for repaint (0 = keep, 1 = generate) |
| `refer_audio` | `Option<Tensor>` | `None` | Reference audio latents for timbre conditioning |
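As an illustration of the `chunk_masks` convention, here is a hypothetical helper that builds a frame-level repaint mask at the model's 25 Hz latent rate. The exact tensor shape and dtype the pipeline expects are assumptions; the sketch only shows the 0 = keep / 1 = generate layout over time.

```rust
/// Hypothetical helper: per-frame repaint mask at the model's 25 Hz
/// latent rate. 0.0 = keep the source latents, 1.0 = regenerate that span.
fn repaint_mask(total_s: f64, start_s: f64, end_s: f64) -> Vec<f32> {
    const FRAME_HZ: f64 = 25.0;
    let frames = (total_s * FRAME_HZ).round() as usize;
    (0..frames)
        .map(|i| {
            let t = i as f64 / FRAME_HZ;
            if (start_s..end_s).contains(&t) { 1.0 } else { 0.0 }
        })
        .collect()
}

fn main() {
    // Regenerate seconds 10–20 of a 30 s clip, keeping the rest.
    let mask = repaint_mask(30.0, 10.0, 20.0);
    let regen = mask.iter().filter(|&&m| m == 1.0).count();
    println!("{} frames total, {} marked for regeneration", mask.len(), regen);
}
```

A mask built this way would be converted to a `Tensor` on the target device before being passed as `chunk_masks` alongside `src_latents`.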
Benchmarked on RTX 3090 (24GB), F32 + TF32 tensor cores. Rust uses the cuDNN ConvTranspose1d patch.
| Duration | Python (PyTorch) | Rust (candle) | Ratio |
|---|---|---|---|
| 10s | 0.88s | 0.67s | 1.3x faster |
| 30s | 1.38s | 1.25s | 1.1x faster |
| 1 min | 2.65s | 2.33s | 1.1x faster |
| 2 min | 4.75s | 5.19s | 1.1x slower |
| 4 min | 9.26s | 12.04s | 1.3x slower |
| 6 min | 15.33s | 21.36s | 1.4x slower |
| 7 min | 19.13s | 27.68s | 1.4x slower |
| 8 min | 22.70s | OOM | — |
| 10 min | 30.79s | OOM | — |
Per-stage breakdown (30s)
| Stage | Python | Rust |
|---|---|---|
| Text encoding | 0.14s | 0.02s |
| Diffusion (8 ODE steps) | 0.73s | 0.63s |
| VAE decode | 0.39s | 0.54s |
Rust wins up to ~1 min. Beyond that, PyTorch's Cutlass tensor-core VAE kernels (cuDNN v9 engine API) and better memory efficiency give it an edge — Rust OOMs on 24GB at 8+ minutes while Python handles the full 10 minutes. Without the candle patch, VAE decode is ~3s at 30s (100x slower ConvTranspose1d).
Tests run on CPU only and don't require a GPU or downloaded weights:
```sh
cargo test --no-default-features
```

MIT