This walkthrough takes you from raw audio files to generating music with a trained LoRA adapter. Every step includes the CLI command and links to detailed documentation.
Prerequisites: Side-Step installed and working, model checkpoints downloaded, and a GPU with CUDA support. See [[Getting Started]] if you have not set these up yet.
Collect your audio files into a single folder. Side-Step supports .wav, .mp3, .flac, .ogg, .opus, and .m4a.
my_audio/
├── track1.wav
├── track2.wav
├── track3.mp3
└── track4.flac
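A quick, optional shell check that everything in the folder uses one of the supported extensions. This is generic shell, not a Side-Step command:

```bash
# Count audio files with supported extensions
find ./my_audio -type f \( -iname '*.wav' -o -iname '*.mp3' -o -iname '*.flac' \
  -o -iname '*.ogg' -o -iname '*.opus' -o -iname '*.m4a' \) | wc -l

# List anything that is NOT in a supported format
find ./my_audio -type f ! \( -iname '*.wav' -o -iname '*.mp3' -o -iname '*.flac' \
  -o -iname '*.ogg' -o -iname '*.opus' -o -iname '*.m4a' \)
```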
Optional but recommended: Create a dataset JSON file with metadata for each track. This gives the model more information to learn from (captions, lyrics, genre, BPM, etc.):
[
  {
    "audio_path": "./track1.wav",
    "caption": "Energetic rock with distorted guitars and driving drums",
    "genre": "Rock",
    "bpm": 140,
    "custom_tag": "mystyle"
  },
  {
    "audio_path": "./track2.wav",
    "caption": "Mellow acoustic folk song with fingerpicked guitar",
    "genre": "Folk",
    "bpm": 95
  }
]

Save this as my_dataset.json in the same directory as your audio files.
For full details on all available fields, see [[Dataset Preparation]].
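If you like, run a quick sanity check before preprocessing. This uses only the Python standard library from the shell and is not a Side-Step command:

```bash
# Confirm the dataset file is valid JSON
python -m json.tool ./my_audio/my_dataset.json > /dev/null && echo "JSON OK"

# Confirm every audio_path exists (paths are relative to the JSON's directory)
( cd my_audio && python -c "
import json, os
for entry in json.load(open('my_dataset.json')):
    print(entry['audio_path'], 'found' if os.path.exists(entry['audio_path']) else 'MISSING')
" )
```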
Convert your raw audio into preprocessed .pt tensor files. This runs in two low-VRAM passes: (1) VAE + Text Encoder (~3 GB), then (2) DIT encoder (~6 GB).
Without a dataset JSON (auto-captions from filenames):
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--preprocess \
--audio-dir ./my_audio \
--tensor-output ./my_tensors

With a dataset JSON:
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--preprocess \
--audio-dir ./my_audio \
--dataset-json ./my_audio/my_dataset.json \
--tensor-output ./my_tensors

After preprocessing, you will have a my_tensors/ directory containing .pt files and a manifest.json. These tensors work for both LoRA and LoKR training -- you only need to preprocess once.
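If you want to spot-check the result before training, a couple of generic shell commands will do. The layout of manifest.json and the .pt files is Side-Step-internal, so treat this as a peek rather than a documented interface:

```bash
# How many tensor files were produced?
ls ./my_tensors/*.pt | wc -l

# Pretty-print the manifest (contents are internal to Side-Step)
python -m json.tool ./my_tensors/manifest.json | head -n 40
```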
Start training with the preprocessed tensors:
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--dataset-dir ./my_tensors \
--output-dir ./output/my_lora \
--epochs 100

This uses the recommended defaults (rank 64, cosine LR schedule, AdamW optimizer). To use a preset instead:
# Start the wizard and load a preset
uv run train.py

The wizard lets you load a preset (e.g., vram_12gb for a 12 GB GPU), adjust individual settings, and start training interactively. See [[Preset Management]] for the full list of built-in presets.
Key flags to know:
| Flag | Purpose |
|---|---|
| --epochs 100 | How many times to loop through the dataset |
| --rank 64 | LoRA capacity (higher = more expressive, more VRAM) |
| --save-every 10 | Save a checkpoint every N epochs |
| --offload-encoder | Free ~2-4 GB VRAM by moving encoders to CPU |
| --optimizer-type adamw8bit | Use 8-bit optimizer to save VRAM |
For all available options, see the Complete Argument Reference in the README or [[The Settings Wizard]].
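As a purely illustrative combination of the flags above with the defaults shown earlier (not additional required flags):

```bash
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--dataset-dir ./my_tensors \
--output-dir ./output/my_lora \
--epochs 100 \
--rank 64 \
--save-every 10 \
--offload-encoder \
--optimizer-type adamw8bit
```

This keeps the recommended rank while trading some speed for lower VRAM via encoder offloading and the 8-bit optimizer.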
While training runs (or after it finishes), view your training metrics with TensorBoard:
tensorboard --logdir ./output/my_lora/runs

Open http://localhost:6006 in your browser. Watch for:
- Loss decreasing and stabilizing (good) vs. loss dropping then rising (overfitting).
- Learning rate following the expected schedule (warmup then decay).
- Gradient norms staying stable (spikes may indicate training issues).
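If training runs on a remote machine, you can still view the same dashboard locally. This is plain SSH port forwarding, not a Side-Step feature; the user and hostname are placeholders:

```bash
# On the training machine: start TensorBoard as above
tensorboard --logdir ./output/my_lora/runs --port 6006

# On your local machine: forward the port, then open http://localhost:6006
ssh -L 6006:localhost:6006 user@training-host
```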
After training completes, your adapter is saved in ./output/my_lora/final/.
- Start ACE-Step's Gradio UI.
- In Service Configuration, find the LoRA Adapter section.
- Enter the path to your adapter: /full/path/to/Side-Step/output/my_lora/final
- Click Load LoRA.
- Toggle Use LoRA on.
- Adjust LoRA Scale (1.0 = full strength).
- Generate audio. If you used a custom_tag, include it in your prompt (e.g., "mystyle, energetic rock with distorted guitars" for the dataset above).
Important: Use the correct shift and inference steps for your model variant. If you trained on turbo, use shift=3.0 and 8 inference steps. For base/sft, use shift=1.0 and 50 steps. See [[Shift and Timestep Sampling]] for details.
For the full guide on output layout, LoKR limitations, and checkpoint usage, see [[Using Your Adapter]].
Training is iterative. Here are common next steps.

Continue training from a saved checkpoint to add more epochs:
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--dataset-dir ./my_tensors \
--output-dir ./output/my_lora \
--resume-from ./output/my_lora/checkpoints/epoch_100 \
--epochs 200

Load a VRAM-appropriate preset to optimize for your GPU:
uv run train.py  # wizard mode, load a preset at the start

Every checkpoint is inference-ready. Point ACE-Step at any checkpoint directory to hear how your LoRA sounds at different training stages:
./output/my_lora/checkpoints/epoch_50
./output/my_lora/checkpoints/epoch_100
- Overfitting? (loss drops then rises, output sounds like your training data verbatim) -- Lower rank, increase dropout, add more training data; see the sketch after this list.
- Underfitting? (loss stays high, LoRA has no audible effect) -- Increase epochs, increase rank, check your dataset quality.
- Running out of VRAM? -- See [[VRAM Optimization Guide]] for tier-specific settings.
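For example, one way to act on the overfitting advice is to retrain against the same preprocessed tensors at a lower rank, writing to a new output directory so the two adapters can be compared. The values here are illustrative, not a prescription:

```bash
uv run train.py fixed \
--checkpoint-dir ./checkpoints \
--model-variant turbo \
--dataset-dir ./my_tensors \
--output-dir ./output/my_lora_r32 \
--rank 32 \
--epochs 100
```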
| Step | Command | Output |
|---|---|---|
| Preprocess | uv run train.py fixed --preprocess --audio-dir ./my_audio --tensor-output ./my_tensors ... | ./my_tensors/*.pt |
| Train | uv run train.py fixed --dataset-dir ./my_tensors --output-dir ./output/my_lora ... | ./output/my_lora/final/ |
| Monitor | tensorboard --logdir ./output/my_lora/runs | Browser at localhost:6006 |
| Inference | Load ./output/my_lora/final in ACE-Step Gradio | Generated audio |
- [[Dataset Preparation]] -- JSON format, metadata fields, audio requirements
- [[Using Your Adapter]] -- Output layout, Gradio loading, LoKR limitations
- [[Training Guide]] -- Full training options and hyperparameters
- [[Preset Management]] -- Built-in presets, save/load/import/export
- [[VRAM Optimization Guide]] -- GPU memory profiles
- [[Windows Notes]] -- Windows-specific setup and workarounds