ACE-Step 1.5 automatically adapts to your GPU's available VRAM, adjusting generation limits, LM model availability, offloading strategies, and UI defaults accordingly. The system detects GPU memory at startup and configures optimal settings for your hardware.
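The detection-and-mapping step can be sketched roughly as follows. This is a hypothetical illustration of the idea only; the real logic (and the exact thresholds and names) lives in `acestep/gpu_config.py`:

```python
# Hypothetical sketch of startup GPU detection; the real logic lives in
# acestep/gpu_config.py and may differ in thresholds and naming.

def detect_vram_gb() -> float:
    """Total VRAM of CUDA device 0 in GiB, or 0.0 when no GPU is usable."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).total_memory / 2**30
    except ImportError:
        pass
    return 0.0

# (min_vram_gb, tier_name) boundaries, descending, mirroring the table below
TIERS = [
    (24, "unlimited"),
    (20, "tier6b"),
    (16, "tier6a"),
    (12, "tier5"),
    (8, "tier4"),
    (6, "tier3"),
    (4, "tier2"),
    (0, "tier1"),
]

def select_tier(vram_gb: float) -> str:
    """Map detected VRAM to the first tier whose lower bound it meets."""
    for min_gb, name in TIERS:
        if vram_gb >= min_gb:
            return name
    return "tier1"
```

Once a tier is selected, all the defaults in the table below (LM availability, offload, quantization, limits) follow from it.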
| VRAM | Tier | LM Models | Recommended LM | Backend | Max Duration (LM / No LM) | Max Batch (LM / No LM) | Offload | Quantization |
|---|---|---|---|---|---|---|---|---|
| ≤4GB | Tier 1 | None | — | pt | 4 min / 6 min | 1 / 1 | CPU + DiT | INT8 |
| 4-6GB | Tier 2 | None | — | pt | 8 min / 10 min | 1 / 1 | CPU + DiT | INT8 |
| 6-8GB | Tier 3 | 0.6B | 0.6B | pt | 8 min / 10 min | 2 / 2 | CPU + DiT | INT8 |
| 8-12GB | Tier 4 | 0.6B | 0.6B | vllm | 8 min / 10 min | 2 / 4 | CPU + DiT | INT8 |
| 12-16GB | Tier 5 | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 4 | CPU | INT8 |
| 16-20GB | Tier 6a | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 8 | CPU | INT8 |
| 20-24GB | Tier 6b | 0.6B, 1.7B, 4B | 1.7B | vllm | 8 min / 8 min | 8 / 8 | None | None |
| ≥24GB | Unlimited | All (0.6B, 1.7B, 4B) | 4B | vllm | 10 min / 10 min | 8 / 8 | None | None |
- LM Models: Which 5Hz Language Model sizes can be loaded on this tier
- Recommended LM: The default LM model selected in the UI for this tier
- Backend: LM inference backend (`vllm` for NVIDIA GPUs with sufficient VRAM, `pt` for the PyTorch fallback, `mlx` for Apple Silicon)
- Offload: Memory offloading strategy
  - CPU + DiT: All models (DiT, VAE, Text Encoder) offloaded to CPU when not in use; the DiT is also offloaded between steps
  - CPU: VAE and Text Encoder offloaded to CPU; the DiT stays on GPU
  - None: All models remain on GPU
- Quantization: Whether INT8 weight quantization is enabled by default to reduce VRAM usage
The Gradio UI automatically configures itself based on the detected GPU tier:
- LM Initialization Checkbox: Checked by default for tiers that support LM (Tier 3+), unchecked and disabled for Tier 1-2
- LM Model Path: Pre-populated with the recommended model for your tier; dropdown only shows compatible models
- Backend Dropdown: Restricted to `pt`/`mlx` on Tier 1-3 (vllm's KV cache is too memory-hungry); all backends available on Tier 4+
- CPU Offload / DiT Offload: Enabled by default on lower tiers, disabled on higher tiers
- Quantization: Enabled by default on Tier 1-6a, disabled on Tier 6b+ (sufficient VRAM)
- Compile Model: Enabled by default on all tiers (required for quantization)
If you manually select an incompatible option (e.g., trying to use vllm on a 6GB GPU), the system will warn you and automatically fall back to a compatible configuration.
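That fallback can be pictured as a small validation step. The sketch below is hypothetical (function name and warning text are illustrative); the 8GB threshold follows the tier table above:

```python
def resolve_backend(requested: str, vram_gb: float) -> str:
    """Fall back to the PyTorch backend when vllm's KV cache won't fit.

    Illustrative sketch, not ACE-Step's actual validation code.
    """
    if requested == "vllm" and vram_gb < 8:
        print(f"Warning: vllm needs >=8GB VRAM; falling back to pt on {vram_gb:.0f}GB")
        return "pt"
    return requested
```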
- VRAM Guard: Before each inference, the system estimates VRAM requirements and automatically reduces batch size if needed
- Adaptive VAE Decode: Three-tier fallback: GPU tiled decode → GPU decode with CPU offload → full CPU decode
- Auto Chunk Size: VAE decode chunk size adapts to available free VRAM (64/128/256/512/1024/1536)
- Duration/Batch Clamping: If you request values exceeding your tier's limits, they are clamped with a warning
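The chunk-size selection and clamping behaviour can be sketched as follows. The free-VRAM thresholds here are assumptions for illustration, not the actual mapping, and both helpers are hypothetical:

```python
CHUNK_SIZES = (64, 128, 256, 512, 1024, 1536)

def pick_chunk_size(free_vram_gb: float) -> int:
    """Map free VRAM to a VAE decode chunk size (thresholds are assumed)."""
    thresholds = (1, 2, 4, 8, 12, 16)  # assumed GB needed per chunk size
    chosen = CHUNK_SIZES[0]
    for need, size in zip(thresholds, CHUNK_SIZES):
        if free_vram_gb >= need:
            chosen = size
    return chosen

def clamp_request(duration_min, batch, max_duration, max_batch):
    """Clamp user-requested duration/batch to tier limits, warning on change."""
    if duration_min > max_duration:
        print(f"Warning: duration clamped {duration_min} -> {max_duration} min")
        duration_min = max_duration
    if batch > max_batch:
        print(f"Warning: batch size clamped {batch} -> {max_batch}")
        batch = max_batch
    return duration_min, batch
```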
- Default settings are automatically configured based on detected GPU memory
- LM Mode refers to the Language Model used for Chain-of-Thought generation and audio understanding
- Flash Attention is auto-detected and enabled when available
- Constrained Decoding: When LM is initialized, the LM's duration generation is also constrained to the GPU tier's maximum duration limit, preventing out-of-memory errors during CoT generation
- For GPUs with ≤6GB VRAM (Tier 1-2), LM initialization is disabled by default to preserve memory for the DiT model
- You can manually override settings via command-line arguments or the Gradio UI
Community Contributions Welcome: The GPU tier configurations above are based on our testing across common hardware. If your device's actual performance differs from these parameters (e.g., it can handle longer durations or larger batch sizes), we welcome more thorough testing and a PR optimizing these configurations in `acestep/gpu_config.py`. Your contributions help improve the experience for all users!
- Very Low VRAM (≤6GB): Use DiT-only mode without LM initialization. INT8 quantization and full CPU offload are mandatory. VAE decode may fall back to CPU automatically.
- Low VRAM (6-8GB): The 0.6B LM model can be used with the `pt` backend. Keep offload enabled.
- Medium VRAM (8-16GB): Use the 0.6B or 1.7B LM model. The `vllm` backend works well on Tier 4+.
- High VRAM (16-24GB): Enable larger LM models (1.7B recommended). Quantization becomes optional on 20GB+.
- Very High VRAM (≥24GB): All models fit without offloading or quantization. Use 4B LM for best quality.
For testing and development, you can simulate different GPU memory sizes using the `MAX_CUDA_VRAM` environment variable:

```bash
# Simulate a 4GB GPU (Tier 1)
MAX_CUDA_VRAM=4 uv run acestep

# Simulate a 6GB GPU (Tier 2)
MAX_CUDA_VRAM=6 uv run acestep

# Simulate an 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# Simulate a 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep

# Simulate a 16GB GPU (Tier 6a)
MAX_CUDA_VRAM=16 uv run acestep
```

When `MAX_CUDA_VRAM` is set, the system also calls `torch.cuda.set_per_process_memory_fraction()` to enforce a hard VRAM cap, making the simulation realistic even on high-end GPUs.
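The hard cap amounts to converting the simulated size into a fraction of the physical device's memory. The helper below is a hypothetical sketch of that computation, not ACE-Step's actual function:

```python
import os

def vram_cap_fraction(total_device_gb: float):
    """Fraction of physical VRAM implied by MAX_CUDA_VRAM, or None if unset.

    Hypothetical helper: in practice the resulting fraction would be passed to
    torch.cuda.set_per_process_memory_fraction(fraction) to hard-cap allocations.
    """
    cap = os.environ.get("MAX_CUDA_VRAM")
    if cap is None:
        return None
    # A cap larger than the device is a no-op (fraction saturates at 1.0)
    return min(float(cap) / total_device_gb, 1.0)
```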
Instead of manually testing each tier through the UI, use the `tier-test` mode of `profile_inference.py`:

```bash
# Test all tiers automatically
python profile_inference.py --mode tier-test

# Test specific tiers
python profile_inference.py --mode tier-test --tiers 6 8 16

# Test with LM enabled (where supported)
python profile_inference.py --mode tier-test --tier-with-lm

# Quick test (skip torch.compile for non-quantized tiers)
python profile_inference.py --mode tier-test --tier-skip-compile
```

See BENCHMARK.md for full documentation of the profiling tool.
This is useful for:
- Testing GPU tier configurations on high-end hardware
- Verifying that warnings and limits work correctly for each tier
- Automated regression testing after modifying `acestep/gpu_config.py`
- CI/CD validation of VRAM compatibility
Use `--tier-boundary` to empirically determine the minimum VRAM tier at which INT8 quantization and CPU offload can be safely disabled. For each tier, this runs up to three configurations:
- `default`: the tier's standard settings (quantization + offload as configured)
- `no-quant`: same offload settings, but quantization disabled
- `no-offload`: no quantization and no CPU offload (all models on GPU)
```bash
# Run boundary tests across all tiers
python profile_inference.py --mode tier-test --tier-boundary

# Test specific tiers with boundary testing
python profile_inference.py --mode tier-test --tier-boundary --tiers 8 12 16 20 24

# Boundary test with LM enabled (where supported)
python profile_inference.py --mode tier-test --tier-boundary --tier-with-lm

# Save results to JSON for further analysis
python profile_inference.py --mode tier-test --tier-boundary --benchmark-output boundary_results.json
```

The output includes a Boundary Analysis section showing the minimum tier for each capability:
```
BOUNDARY ANALYSIS
=================
Capability                              Min Tier    VRAM
------------------------------------------------------------
No INT8 Quantization                    tier6b      20GB
No CPU Offload (all models on GPU)      tier6b      20GB
------------------------------------------------------------
```
Note: Boundary results are empirical and may vary based on DiT model variant (turbo vs base), whether LM is enabled, generation duration, and flash attention availability. Community contributions to refine these boundaries are welcome!
Use `--tier-batch-boundary` to find the maximum safe batch size for each tier by progressively testing batch sizes 1, 2, 4, 8:

```bash
# Run batch boundary tests with LM enabled
python profile_inference.py --mode tier-test --tier-batch-boundary --tier-with-lm

# Test specific tiers
python profile_inference.py --mode tier-test --tier-batch-boundary --tier-with-lm --tiers 8 12 16 24
```

This tests both with-LM and without-LM configurations and reports the maximum successful batch size per tier.
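The progressive search amounts to trying each candidate batch size in order and stopping at the first failure. The helper below is a hypothetical sketch of that loop; the real tool also records timings and per-configuration details:

```python
def find_max_batch(try_batch, candidates=(1, 2, 4, 8)) -> int:
    """Largest batch size for which try_batch(b) succeeds, else 0.

    try_batch is a callable that raises on failure (e.g. a CUDA OOM error).
    Candidates are tried in ascending order; the search stops at the first
    failure, since larger batches can only need more memory.
    """
    best = 0
    for b in candidates:
        try:
            try_batch(b)
            best = b
        except Exception:
            break
    return best
```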