- Overview
- Quick Start
- Profiling Modes
- CLI Reference
- Examples
- Understanding Output
- Tips & Best Practices
## Overview

`profile_inference.py` is a comprehensive profiling & benchmarking tool for ACE-Step 1.5 inference. It measures end-to-end wall time, LLM planning time, DiT diffusion time, VAE decoding time, and more — across different devices, backends, and configurations.
| Mode | Description |
|---|---|
| `profile` | Profile a single generation run with a detailed timing breakdown |
| `benchmark` | Run a matrix of configurations (duration × batch × thinking × steps) and produce a summary table |
| `tier-test` | Automatically test all GPU tiers by simulating different VRAM sizes via `MAX_CUDA_VRAM` |
| `understand` | Profile the `understand_music()` API (audio → metadata extraction) |
| `create_sample` | Profile the `create_sample()` API (inspiration / simple mode) |
| `format_sample` | Profile the `format_sample()` API (caption + lyrics → structured metadata) |
| Device | Flag | Notes |
|---|---|---|
| CUDA (NVIDIA) | `--device cuda` | Recommended; auto-detected by default |
| MPS (Apple Silicon) | `--device mps` | macOS with Apple Silicon |
| CPU | `--device cpu` | Slow, for testing only |
| Auto | `--device auto` | Automatically selects the best available device (default) |
| LLM Backend | Flag | Notes |
|---|---|---|
| vLLM | `--lm-backend vllm` | Fastest on CUDA; recommended for NVIDIA |
| PyTorch | `--lm-backend pt` | Universal fallback; works everywhere |
| MLX | `--lm-backend mlx` | Optimized for Apple Silicon |
| Auto | `--lm-backend auto` | Selects the best backend for the device (default) |
## Quick Start

```bash
# Basic profile (text2music, default settings)
python profile_inference.py

# Profile with LLM thinking enabled
python profile_inference.py --thinking

# Run benchmark matrix
python profile_inference.py --mode benchmark

# Profile on Apple Silicon
python profile_inference.py --device mps --lm-backend mlx

# Profile with cProfile function-level analysis
python profile_inference.py --detailed
```

## Profiling Modes

### profile

Runs a single generation with a detailed timing breakdown. Includes an optional warmup run and cProfile.

```bash
python profile_inference.py --mode profile
```

What it measures:
- Total wall time (end-to-end)
- LLM planning time (token generation, constrained decoding, CFG overhead)
- DiT diffusion time (per-step and total)
- VAE decode time
- Audio save time
Options for this mode:

| Flag | Description |
|---|---|
| `--no-warmup` | Skip the warmup run (includes compilation overhead in the measurement) |
| `--detailed` | Enable cProfile function-level analysis |
| `--llm-debug` | Deep LLM debugging (token count, throughput) |
| `--thinking` | Enable LLM Chain-of-Thought reasoning |
| `--duration <sec>` | Override audio duration |
| `--batch-size <n>` | Override batch size |
| `--inference-steps <n>` | Override diffusion steps |
### benchmark

Runs a matrix of configurations and produces a summary table. Automatically adapts to GPU memory limits.

```bash
python profile_inference.py --mode benchmark
```

Default matrix:
- Durations: 30s, 60s, 120s, 240s (clamped by GPU memory)
- Batch sizes: 1, 2, 4 (clamped by GPU memory)
- Thinking: True, False
- Inference steps: 8, 16
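The matrix is simply the Cartesian product of the values above. As a rough illustration (not the tool's actual loop; the real benchmark also clamps combinations to the detected GPU tier), such a sweep can be enumerated like this:

```python
from itertools import product

durations = [30, 60, 120, 240]    # seconds; clamped by GPU memory at runtime
batch_sizes = [1, 2, 4]           # likewise clamped
thinking_modes = [False, True]
step_counts = [8, 16]

# One benchmark run per combination (4 * 3 * 2 * 2 = 48 configurations)
for duration, batch, thinking, steps in product(
    durations, batch_sizes, thinking_modes, step_counts
):
    print(f"duration={duration}s batch={batch} thinking={thinking} steps={steps}")
```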
Output example:

```
Duration  Batch  Think  Steps  Wall(s)  LM(s)  DiT(s)  VAE(s)  Status
----------------------------------------------------------------------
30        1      False  8      3.21     0.45   1.89    0.52    OK
30        1      True   8      5.67     2.91   1.89    0.52    OK
60        2      False  16     12.34    0.48   9.12    1.85    OK
...
```
Save results to JSON:

```bash
python profile_inference.py --mode benchmark --benchmark-output results.json
```

### understand

Profiles the `understand_music()` API, which extracts metadata (BPM, key, time signature, caption) from audio codes.
```bash
python profile_inference.py --mode understand
python profile_inference.py --mode understand --audio-codes "your_audio_codes_string"
```

### create_sample

Profiles the `create_sample()` API, which generates a complete song blueprint from a simple text query.
```bash
python profile_inference.py --mode create_sample
python profile_inference.py --mode create_sample --sample-query "a soft Bengali love song"
python profile_inference.py --mode create_sample --instrumental
```

### format_sample

Profiles the `format_sample()` API, which converts caption + lyrics into structured metadata.
```bash
python profile_inference.py --mode format_sample
```

### tier-test

Automatically simulates different GPU VRAM sizes using `MAX_CUDA_VRAM` and runs a generation test at each tier. This is the recommended way to validate that all GPU tiers work correctly after modifying `acestep/gpu_config.py`.

```bash
# Test all tiers (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

# Test specific VRAM sizes
python profile_inference.py --mode tier-test --tiers 6 8 16

# Test with LM enabled (where the tier supports it)
python profile_inference.py --mode tier-test --tier-with-lm

# Quick test: skip torch.compile for non-quantized tiers
python profile_inference.py --mode tier-test --tier-skip-compile
```

What it validates per tier:
- Correct tier detection and `GPUConfig` construction
- Model initialization (DiT, VAE, Text Encoder, and optionally the LM)
- A short generation run (30s duration, batch=1) completes without OOM
- Adaptive VAE decode fallback (GPU → CPU offload → full CPU; see the sketch after this list)
- VRAM usage stays within the simulated limit
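The adaptive VAE decode fallback follows a try-then-degrade pattern. A simplified sketch (the `vae.decode` interface here is hypothetical, and the real pipeline has an intermediate CPU-offload stage between the two attempts shown):

```python
import torch

def adaptive_vae_decode(vae, latents):
    # First attempt: decode entirely on the GPU (fastest path).
    try:
        return vae.decode(latents)
    except torch.cuda.OutOfMemoryError:
        # Free cached blocks before falling back.
        torch.cuda.empty_cache()
    # Last resort: move model and latents to CPU and decode there.
    # Slow, but guarantees the run completes within the VRAM cap.
    return vae.to("cpu").decode(latents.to("cpu"))
```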
Output example:

```
TIER TEST RESULTS
================================================================================
VRAM   Tier       LM     Duration   Status   Peak VRAM   Notes
--------------------------------------------------------------------------------
4GB    tier1      —      30s        ✅ OK    3.8GB       VAE decoded on CPU
6GB    tier2      —      30s        ✅ OK    5.4GB       Tiled VAE chunk=256
8GB    tier4      0.6B   30s        ✅ OK    7.2GB       vllm backend
12GB   tier5      1.7B   30s        ✅ OK    10.8GB      vllm backend
16GB   tier6a     1.7B   30s        ✅ OK    14.5GB      offload enabled
20GB   tier6b     1.7B   30s        ✅ OK    17.2GB      no offload
24GB   unlimited  4B     30s        ✅ OK    21.3GB      full models on GPU
```
> **Note:** `tier-test` mode uses `torch.cuda.set_per_process_memory_fraction()` to enforce a hard VRAM cap, making simulations realistic even on high-end GPUs (e.g., an A100 80GB).
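For reference, capping visible VRAM with that PyTorch API looks roughly like the following: a minimal sketch assuming a single CUDA device, not the tool's actual implementation:

```python
import torch

def cap_vram(simulated_gb: float, device: int = 0) -> None:
    """Limit this process to `simulated_gb` of VRAM on `device`.

    Allocations beyond the cap raise torch.cuda.OutOfMemoryError, so a
    24 GB simulation on an 80 GB card fails the same way a real 24 GB
    GPU would.
    """
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    fraction = min(simulated_gb / total_gb, 1.0)
    torch.cuda.set_per_process_memory_fraction(fraction, device)

cap_vram(8.0)  # e.g., simulate an 8 GB consumer GPU
```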
#### Boundary testing

Use `--tier-boundary` to find the minimum VRAM tier at which INT8 quantization and CPU offload can be safely disabled. For each tier, up to three configurations are tested:

- **default** — the tier's standard settings
- **no-quant** — quantization disabled, offload unchanged
- **no-offload** — no quantization AND no CPU offload
```bash
# Run boundary tests across all tiers
python profile_inference.py --mode tier-test --tier-boundary

# Boundary test with LM enabled
python profile_inference.py --mode tier-test --tier-boundary --tier-with-lm

# Save boundary results to JSON
python profile_inference.py --mode tier-test --tier-boundary --benchmark-output boundary_results.json
```

The output includes a Boundary Analysis summary showing the minimum tier for each capability.
#### Batch boundary testing

Use `--tier-batch-boundary` to find the maximum safe batch size for each tier. The tool progressively tests batch sizes 1, 2, 4, 8 (stopping at the first OOM) with both LM-enabled and LM-disabled configurations:
```bash
# Run batch boundary tests
python profile_inference.py --mode tier-test --tier-batch-boundary --tier-with-lm

# Test specific tiers
python profile_inference.py --mode tier-test --tier-batch-boundary --tier-with-lm --tiers 8 12 16 24
```

The output includes a Batch Boundary Summary showing the maximum successful batch size per tier, for both with-LM and without-LM configurations.
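The escalation itself is straightforward; a sketch of the pattern, where `run_generation` is a hypothetical stand-in for one short generation attempt at a given batch size:

```python
import torch

def max_safe_batch(run_generation, candidates=(1, 2, 4, 8)) -> int:
    """Return the largest batch size that completes without OOM."""
    best = 0
    for batch_size in candidates:
        try:
            run_generation(batch_size)  # e.g., a 30s generation at this batch size
            best = batch_size
        except torch.cuda.OutOfMemoryError:
            break  # the first OOM ends the escalation
    return best
```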
## CLI Reference

**Device & backend**

| Flag | Default | Description |
|---|---|---|
| `--device` | `auto` | Device: auto / cuda / mps / cpu |
| `--lm-backend` | `auto` | LLM backend: auto / vllm / pt / mlx |

**Models**

| Flag | Default | Description |
|---|---|---|
| `--config-path` | `acestep-v15-turbo` | DiT model config |
| `--lm-model` | `acestep-5Hz-lm-1.7B` | LLM model path |

**Memory**

| Flag | Default | Description |
|---|---|---|
| `--offload-to-cpu` | off | Offload models to CPU when not in use |
| `--offload-dit-to-cpu` | off | Offload the DiT to CPU when not in use |
| `--quantization` | none | Quantization: int8_weight_only / fp8_weight_only / w8a8_dynamic |

**Generation**

| Flag | Default | Description |
|---|---|---|
| `--duration` | from example | Audio duration in seconds |
| `--batch-size` | from example | Batch size |
| `--inference-steps` | from example | Diffusion inference steps |
| `--seed` | from example | Random seed |
| `--guidance-scale` | 7.0 | CFG guidance scale for the DiT |

**LLM**

| Flag | Default | Description |
|---|---|---|
| `--thinking` | off | Enable LLM Chain-of-Thought reasoning |
| `--use-cot-metas` | off | LLM generates music metadata via CoT |
| `--use-cot-caption` | off | LLM rewrites/formats the caption via CoT |
| `--use-cot-language` | off | LLM detects the vocal language via CoT |
| `--use-constrained-decoding` | on | FSM-based constrained decoding |
| `--no-constrained-decoding` | — | Disable constrained decoding |
| `--lm-temperature` | 0.85 | LLM sampling temperature |
| `--lm-cfg-scale` | 2.0 | LLM CFG scale |

**Profiling**

| Flag | Default | Description |
|---|---|---|
| `--mode` | `profile` | Mode: profile / benchmark / tier-test / understand / create_sample / format_sample |
| `--no-warmup` | off | Skip the warmup run |
| `--detailed` | off | Enable cProfile function-level analysis |
| `--llm-debug` | off | Deep LLM debugging (token count, throughput) |
| `--benchmark-output` | none | Save benchmark results to a JSON file |

**Tier testing**

| Flag | Default | Description |
|---|---|---|
| `--tiers` | `4 6 8 12 16 20 24` | VRAM sizes (GB) to simulate |
| `--tier-with-lm` | off | Enable LM initialization on tiers that support it |
| `--tier-skip-compile` | off | Skip torch.compile for faster iteration on non-quantized tiers |
| `--tier-boundary` | off | Test each tier with no-quant and no-offload variants to find minimum capability boundaries |
| `--tier-batch-boundary` | off | Test each tier with batch sizes 1, 2, 4, 8 to find the maximum safe batch size |

**Input**

| Flag | Default | Description |
|---|---|---|
| `--example` | `example_05.json` | Example JSON from `examples/text2music/` |
| `--task-type` | `text2music` | Task: text2music / cover / repaint / lego / extract / complete |
| `--reference-audio` | none | Reference audio path (for cover/style transfer) |
| `--src-audio` | none | Source audio path (for audio-to-audio tasks) |
| `--sample-query` | none | Query for create_sample mode |
| `--instrumental` | off | Generate instrumental music (for create_sample) |
| `--audio-codes` | none | Audio codes string (for understand mode) |
## Examples

**Different devices**

```bash
# NVIDIA GPU
python profile_inference.py --device cuda --lm-backend vllm

# Apple Silicon
python profile_inference.py --device mps --lm-backend mlx

# CPU baseline
python profile_inference.py --device cpu --lm-backend pt
```

**Different LLM sizes**

```bash
# Lightweight (0.6B)
python profile_inference.py --lm-model acestep-5Hz-lm-0.6B

# Default (1.7B)
python profile_inference.py --lm-model acestep-5Hz-lm-1.7B

# Large (4B)
python profile_inference.py --lm-model acestep-5Hz-lm-4B
```

**Thinking mode**

```bash
# Without thinking (faster)
python profile_inference.py --mode benchmark

# With thinking (better quality, slower)
python profile_inference.py --thinking --use-cot-metas --use-cot-caption
```

**Low-memory configuration**

```bash
# Offload + quantization
python profile_inference.py --offload-to-cpu --quantization int8_weight_only --lm-model acestep-5Hz-lm-0.6B
```

**Benchmark with JSON output**

```bash
# Run full benchmark matrix and save results
python profile_inference.py --mode benchmark --benchmark-output benchmark_results.json

# Then inspect the JSON
cat benchmark_results.json | python -m json.tool
```
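Beyond `json.tool`, a short Python snippet can pull out just the timing fields. The field names below (`duration`, `batch_size`, `wall_time`) are assumptions; inspect one entry in your own output file to confirm the actual schema:

```python
import json

with open("benchmark_results.json") as f:
    results = json.load(f)

# The top-level structure may be a list of runs or a wrapping object;
# handle both without assuming more than necessary.
entries = results if isinstance(results, list) else results.get("results", [])
for entry in entries:
    # Print only the fields that actually exist in each entry
    # (the key names here are assumed, not guaranteed).
    print({k: entry[k] for k in ("duration", "batch_size", "wall_time") if k in entry})
```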
**Detailed profiling**

```bash
# Enable cProfile for detailed function-level analysis
python profile_inference.py --detailed --llm-debug
```

## Understanding Output

The profiler prints a detailed breakdown of where time is spent:
```
TIME COSTS BREAKDOWN
================================================================================
Component                       Time (s)    % of Total
────────────────────────────────────────────────────────────
LLM Planning (total)                2.91         45.2%
 ├─ Token generation                2.45         38.1%
 ├─ Constrained decoding            0.31          4.8%
 └─ CFG overhead                    0.15          2.3%
DiT Diffusion (total)               1.89         29.4%
 ├─ Per-step average                0.24            —
 └─ Steps                              8            —
VAE Decode                          0.52          8.1%
Audio Save                          0.12          1.9%
Other / Overhead                    0.99         15.4%
────────────────────────────────────────────────────────────
Wall Time (total)                   6.43        100.0%
```
| Metric | Description |
|---|---|
| Wall Time | End-to-end time from start to finish |
| LM Total Time | Time spent in LLM planning (token generation + parsing) |
| DiT Total Time | Time spent in diffusion (all steps combined) |
| VAE Decode Time | Time to decode latents to audio waveform |
| Tokens/sec | LLM token generation throughput (with `--llm-debug`) |
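A breakdown like this comes from wrapping each pipeline phase in a wall-clock timer. A minimal sketch of the pattern (illustrative, not the profiler's actual code):

```python
import time
from contextlib import contextmanager

costs: dict[str, float] = {}

@contextmanager
def timed(name: str):
    """Record the wall time of one pipeline phase into `costs`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # On CUDA, call torch.cuda.synchronize() before reading the clock
        # so queued kernels are included in the measurement.
        costs[name] = time.perf_counter() - start

with timed("llm_planning"):
    ...  # run LLM planning here
with timed("dit_diffusion"):
    ...  # run diffusion steps here

total = sum(costs.values())
for name, seconds in costs.items():
    print(f"{name:<16} {seconds:8.2f}s  {100 * seconds / max(total, 1e-9):5.1f}%")
```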
## Tips & Best Practices

- **Always include warmup (default)** — The first run includes JIT compilation and memory-allocation overhead; warmup ensures measurements reflect steady-state performance.
- **Use `--benchmark-output`** to save results as JSON for later analysis or comparison across hardware.
- **Compare with thinking off vs. on** — Thinking mode significantly increases LLM time but may improve generation quality.
- **Test with representative durations** — Short durations (30s) are dominated by LLM time; long durations (240s+) are dominated by DiT time.
- **GPU memory auto-adaptation** — Benchmark mode automatically clamps durations and batch sizes to what your GPU can handle, using the adaptive tier system in `acestep/gpu_config.py`.
- **Use `--detailed` sparingly** — cProfile adds overhead; use it only when investigating function-level bottlenecks.
- **Use `tier-test` for regression testing** — After modifying GPU tier configs, run `--mode tier-test` to verify all tiers still work correctly. This is especially important when changing offload thresholds, duration limits, or LM model availability.
- **Simulate low VRAM realistically** — When using `MAX_CUDA_VRAM`, the system enforces a hard VRAM cap via `set_per_process_memory_fraction()`, so OOM errors during simulation reflect real behavior on consumer GPUs.