- Overview
- Quick Start
- Profiling Modes
- CLI Reference
- Examples
- Understanding Output
- Tips & Best Practices
## Overview

`profile_inference.py` is a comprehensive profiling & benchmarking tool for ACE-Step 1.5 inference. It measures end-to-end wall time, LLM planning time, DiT diffusion time, VAE decoding time, and more — across different devices, backends, and configurations.
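The per-stage numbers the profiler reports are plain wall-clock deltas around each phase. A minimal sketch of that measurement pattern (illustrative only — not the tool's actual code; `timed` and the stage names are made up here):

```python
import time

def timed(stage, timings, fn):
    """Run fn() and record its wall-clock duration under `stage`."""
    t0 = time.perf_counter()
    result = fn()
    timings[stage] = time.perf_counter() - t0
    return result

timings = {}
# Stand-ins for the real phases (LLM planning, DiT diffusion, ...).
timed("llm_planning", timings, lambda: sum(range(100_000)))
timed("dit_diffusion", timings, lambda: sum(range(100_000)))

total = sum(timings.values())
for stage, t in timings.items():
    print(f"{stage:<16} {t:8.4f}s  {100 * t / total:5.1f}%")
```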
| Mode | Description |
|---|---|
| `profile` | Profile a single generation run with detailed timing breakdown |
| `benchmark` | Run a matrix of configurations (duration × batch × thinking × steps) and produce a summary table |
| `understand` | Profile the `understand_music()` API (audio → metadata extraction) |
| `create_sample` | Profile the `create_sample()` API (inspiration / simple mode) |
| `format_sample` | Profile the `format_sample()` API (caption + lyrics → structured metadata) |
| Device | Flag | Notes |
|---|---|---|
| CUDA (NVIDIA) | `--device cuda` | Recommended; auto-detected by default |
| MPS (Apple Silicon) | `--device mps` | macOS with Apple Silicon |
| CPU | `--device cpu` | Slow, for testing only |
| Auto | `--device auto` | Automatically selects the best available device (default) |
| LLM Backend | Flag | Notes |
|---|---|---|
| vLLM | `--lm-backend vllm` | Fastest on CUDA; recommended for NVIDIA |
| PyTorch | `--lm-backend pt` | Universal fallback; works everywhere |
| MLX | `--lm-backend mlx` | Optimized for Apple Silicon |
| Auto | `--lm-backend auto` | Selects the best backend for the device (default) |
## Quick Start

```bash
# Basic profile (text2music, default settings)
python profile_inference.py

# Profile with LLM thinking enabled
python profile_inference.py --thinking

# Run benchmark matrix
python profile_inference.py --mode benchmark

# Profile on Apple Silicon
python profile_inference.py --device mps --lm-backend mlx

# Profile with cProfile function-level analysis
python profile_inference.py --detailed
```

## Profiling Modes

### `profile`

Runs a single generation with a detailed timing breakdown. Includes optional warmup and cProfile.
```bash
python profile_inference.py --mode profile
```

What it measures:
- Total wall time (end-to-end)
- LLM planning time (token generation, constrained decoding, CFG overhead)
- DiT diffusion time (per-step and total)
- VAE decode time
- Audio save time
Options for this mode:

| Flag | Description |
|---|---|
| `--no-warmup` | Skip warmup run (includes compilation overhead in measurement) |
| `--detailed` | Enable cProfile function-level analysis |
| `--llm-debug` | Deep LLM debugging (token count, throughput) |
| `--thinking` | Enable LLM chain-of-thought reasoning |
| `--duration <sec>` | Override audio duration |
| `--batch-size <n>` | Override batch size |
| `--inference-steps <n>` | Override diffusion steps |
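For reference, the function-level analysis behind `--detailed` is Python's standard-library `cProfile`. A minimal sketch of the same idea, independent of this tool (the `profile_call` helper is invented here for illustration):

```python
import cProfile
import io
import pstats

def profile_call(fn, top=10):
    """Run fn() under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()

# Profile an arbitrary workload and print the hottest functions.
report = profile_call(lambda: sorted(range(100_000), key=lambda i: -i))
print(report)
```

This is also why `--detailed` adds overhead: every function call is intercepted by the profiler.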
### `benchmark`

Runs a matrix of configurations and produces a summary table. Automatically adapts to GPU memory limits.

```bash
python profile_inference.py --mode benchmark
```

Default matrix:
- Durations: 30s, 60s, 120s, 240s (clamped by GPU memory)
- Batch sizes: 1, 2, 4 (clamped by GPU memory)
- Thinking: True, False
- Inference steps: 8, 16
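The clamping by GPU memory can be pictured with a small sketch. Everything below is hypothetical — the cost model, the constant factors, and the `clamp_matrix` helper are illustrations, not the tool's actual logic (on CUDA, the free-memory figure could come from `torch.cuda.mem_get_info()`):

```python
def clamp_matrix(durations, batch_sizes, free_gb):
    """Drop (duration, batch) configurations that would not fit in free_gb.

    The cost model is a made-up illustration: assume memory grows
    roughly with duration * batch size on top of a fixed base.
    """
    est_gb = lambda dur, batch: 2.0 + 0.02 * dur * batch
    return [
        (dur, batch)
        for dur in durations
        for batch in batch_sizes
        if est_gb(dur, batch) <= free_gb
    ]

# Hard-coded 8 GB of free memory for the sketch.
configs = clamp_matrix([30, 60, 120, 240], [1, 2, 4], free_gb=8.0)
print(configs)
```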
Output example:

```text
Duration  Batch  Think  Steps  Wall(s)  LM(s)  DiT(s)  VAE(s)  Status
---------------------------------------------------------------------
      30      1  False      8     3.21   0.45    1.89    0.52  OK
      30      1  True       8     5.67   2.91    1.89    0.52  OK
      60      2  False     16    12.34   0.48    9.12    1.85  OK
...
```
Save results to JSON:

```bash
python profile_inference.py --mode benchmark --benchmark-output results.json
```

### `understand`

Profiles the `understand_music()` API, which extracts metadata (BPM, key, time signature, caption) from audio codes.
```bash
python profile_inference.py --mode understand
python profile_inference.py --mode understand --audio-codes "your_audio_codes_string"
```

### `create_sample`

Profiles the `create_sample()` API, which generates a complete song blueprint from a simple text query.
```bash
python profile_inference.py --mode create_sample
python profile_inference.py --mode create_sample --sample-query "a soft Bengali love song"
python profile_inference.py --mode create_sample --instrumental
```

### `format_sample`

Profiles the `format_sample()` API, which converts caption + lyrics into structured metadata.
```bash
python profile_inference.py --mode format_sample
```

## CLI Reference

### Device & backend

| Flag | Default | Description |
|---|---|---|
| `--device` | `auto` | Device: `auto` / `cuda` / `mps` / `cpu` |
| `--lm-backend` | `auto` | LLM backend: `auto` / `vllm` / `pt` / `mlx` |
### Models

| Flag | Default | Description |
|---|---|---|
| `--config-path` | `acestep-v15-turbo` | DiT model config |
| `--lm-model` | `acestep-5Hz-lm-1.7B` | LLM model path |
### Memory & quantization

| Flag | Default | Description |
|---|---|---|
| `--offload-to-cpu` | off | Offload models to CPU when not in use |
| `--offload-dit-to-cpu` | off | Offload the DiT to CPU when not in use |
| `--quantization` | none | Quantization: `int8_weight_only` / `fp8_weight_only` / `w8a8_dynamic` |
### Generation parameters

| Flag | Default | Description |
|---|---|---|
| `--duration` | from example | Audio duration in seconds |
| `--batch-size` | from example | Batch size |
| `--inference-steps` | from example | Diffusion inference steps |
| `--seed` | from example | Random seed |
| `--guidance-scale` | 7.0 | CFG guidance scale for the DiT |
### LLM options

| Flag | Default | Description |
|---|---|---|
| `--thinking` | off | Enable LLM chain-of-thought reasoning |
| `--use-cot-metas` | off | LLM generates music metadata via CoT |
| `--use-cot-caption` | off | LLM rewrites/formats the caption via CoT |
| `--use-cot-language` | off | LLM detects vocal language via CoT |
| `--use-constrained-decoding` | on | FSM-based constrained decoding |
| `--no-constrained-decoding` | — | Disable constrained decoding |
| `--lm-temperature` | 0.85 | LLM sampling temperature |
| `--lm-cfg-scale` | 2.0 | LLM CFG scale |
### Profiling options

| Flag | Default | Description |
|---|---|---|
| `--mode` | `profile` | Mode: `profile` / `benchmark` / `understand` / `create_sample` / `format_sample` |
| `--no-warmup` | off | Skip warmup run |
| `--detailed` | off | Enable cProfile function-level analysis |
| `--llm-debug` | off | Deep LLM debugging (token count, throughput) |
| `--benchmark-output` | none | Save benchmark results to a JSON file |
### Input options

| Flag | Default | Description |
|---|---|---|
| `--example` | `example_05.json` | Example JSON from `examples/text2music/` |
| `--task-type` | `text2music` | Task: `text2music` / `cover` / `repaint` / `lego` / `extract` / `complete` |
| `--reference-audio` | none | Reference audio path (for cover/style transfer) |
| `--src-audio` | none | Source audio path (for audio-to-audio tasks) |
| `--sample-query` | none | Query for `create_sample` mode |
| `--instrumental` | off | Generate instrumental music (for `create_sample`) |
| `--audio-codes` | none | Audio codes string (for `understand` mode) |
## Examples

### Compare devices

```bash
# NVIDIA GPU
python profile_inference.py --device cuda --lm-backend vllm

# Apple Silicon
python profile_inference.py --device mps --lm-backend mlx

# CPU baseline
python profile_inference.py --device cpu --lm-backend pt
```

### Compare LLM sizes

```bash
# Lightweight (0.6B)
python profile_inference.py --lm-model acestep-5Hz-lm-0.6B

# Default (1.7B)
python profile_inference.py --lm-model acestep-5Hz-lm-1.7B

# Large (4B)
python profile_inference.py --lm-model acestep-5Hz-lm-4B
```

### Thinking on vs. off

```bash
# Without thinking (faster)
python profile_inference.py --mode benchmark

# With thinking (better quality, slower)
python profile_inference.py --thinking --use-cot-metas --use-cot-caption
```

### Low-memory setup

```bash
# Offload + quantization
python profile_inference.py --offload-to-cpu --quantization int8_weight_only --lm-model acestep-5Hz-lm-0.6B
```

### Benchmark with JSON output

```bash
# Run full benchmark matrix and save results
python profile_inference.py --mode benchmark --benchmark-output benchmark_results.json

# Then inspect the JSON
python -m json.tool benchmark_results.json
```

### Function-level profiling

```bash
# Enable cProfile for detailed function-level analysis
python profile_inference.py --detailed --llm-debug
```

## Understanding Output

The profiler prints a detailed breakdown of where time is spent:
```text
TIME COSTS BREAKDOWN
═════════════════════════════════════════════════════════════
Component                       Time (s)    % of Total
─────────────────────────────────────────────────────────────
LLM Planning (total)                2.91         45.2%
  ├─ Token generation               2.45         38.1%
  ├─ Constrained decoding           0.31          4.8%
  └─ CFG overhead                   0.15          2.3%
DiT Diffusion (total)               1.89         29.4%
  ├─ Per-step average               0.24            —
  └─ Steps                             8            —
VAE Decode                          0.52          8.1%
Audio Save                          0.12          1.9%
Other / Overhead                    0.99         15.4%
─────────────────────────────────────────────────────────────
Wall Time (total)                   6.43        100.0%
```
| Metric | Description |
|---|---|
| Wall Time | End-to-end time from start to finish |
| LM Total Time | Time spent in LLM planning (token generation + parsing) |
| DiT Total Time | Time spent in diffusion (all steps combined) |
| VAE Decode Time | Time to decode latents to audio waveform |
| Tokens/sec | LLM token generation throughput (with `--llm-debug`) |
## Tips & Best Practices

- **Always include warmup (default)** — the first run includes JIT compilation and memory allocation overhead. Warmup ensures measurements reflect steady-state performance.
- **Use `--benchmark-output`** to save results as JSON for later analysis or comparison across hardware.
- **Compare with thinking off vs. on** — thinking mode significantly increases LLM time but may improve generation quality.
- **Test with representative durations** — short durations (30s) are dominated by LLM time; long durations (240s+) are dominated by DiT time.
- **GPU memory auto-adaptation** — benchmark mode automatically clamps durations and batch sizes to what your GPU can handle.
- **Use `--detailed` sparingly** — `cProfile` adds overhead; use it only when investigating function-level bottlenecks.