
VRAM Optimization Guide

This page documents every VRAM optimization Side-Step applies and the VRAM profiles for different GPU tiers.


VRAM Optimizations

Side-Step applies several layers of VRAM optimization. Some are automatic; others are user-configurable. They are listed here from most impactful to least.

Gradient Checkpointing (automatic, on by default)

What it does: PyTorch normally stores all intermediate activations from the forward pass so they are available for gradient computation in the backward pass. Gradient checkpointing discards those activations and recomputes them on the fly during backpropagation.

VRAM savings: ~40-60% reduction in activation memory. For ACE-Step's decoder this is easily 10-15 GB on large batches.

Trade-off: ~10-30% slower training (recomputation cost).

Default: ON. This matches what the original ACE-Step trainer always did silently. Side-Step exposes it as a toggle so power users can disable it if they have VRAM to spare and want maximum speed.

How to disable: In the wizard (Advanced mode), answer "no" to "Enable gradient checkpointing?". Via CLI: --no-gradient-checkpointing.

Side-Step also disables use_cache on the decoder (frees KV-cache memory) and enables input_require_grads (needed for PEFT + checkpointing compatibility). These happen automatically when gradient checkpointing is on.
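Put together, the setup looks roughly like the sketch below -- a minimal illustration with a Hugging Face-style model and PEFT, not Side-Step's actual code; "gpt2" is just a stand-in for the ACE-Step decoder and the LoRA settings are illustrative.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in model for illustration only -- not the ACE-Step decoder.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Recompute activations during backprop instead of storing them.
model.gradient_checkpointing_enable()

# The KV-cache only helps generation; freeing it saves memory during training.
model.config.use_cache = False

# Needed so gradients flow through the frozen base model when only the
# PEFT adapter parameters are trainable (PEFT + checkpointing compatibility).
model.enable_input_require_grads()

# Wrap with a LoRA adapter; target_modules is model-specific.
peft_model = get_peft_model(model, LoraConfig(r=64, lora_alpha=64, target_modules=["c_attn"]))
```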

Flash Attention 2 (automatic, no user action needed)

What it does: Replaces the standard attention implementation with a fused CUDA kernel that is both faster and more memory-efficient. Instead of materializing the full N x N attention matrix, Flash Attention computes attention in tiles.

VRAM savings: Reduces attention memory from O(N^2) to O(N). In practice, this translates to lower peak VRAM during the forward pass, especially for long audio sequences.

Speed benefit: 10-30% faster training steps on supported hardware due to better GPU utilization (100% compute occupancy vs ~70-80% with SDPA).

Requirements: NVIDIA Ampere or newer GPU (RTX 30xx, RTX 40xx, A100, H100, etc. -- compute capability >= 8.0).

How it works: Side-Step installs Flash Attention 2 from prebuilt wheels (no compilation needed) as part of uv sync. At runtime, the model loader auto-detects whether Flash Attention is available and falls back to SDPA (Scaled Dot-Product Attention) if it is not. You do not need to configure anything.

Fallback: On older GPUs (RTX 20xx, GTX 16xx, etc.) or macOS, SDPA is used automatically.
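The selection logic amounts to something like the following sketch (illustrative only; the helper name is made up and the real loader may differ):

```python
import importlib.util
import torch

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" when the package and GPU support it, else "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_flash_attn and torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability()
        if major >= 8:  # Ampere (RTX 30xx) or newer
            return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled dot-product attention

# e.g. AutoModel.from_pretrained(..., attn_implementation=pick_attn_implementation())
```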

8-bit Optimizer (optional, Advanced mode)

What it does: Replaces the standard 32-bit AdamW optimizer with an 8-bit quantized version. AdamW stores two state tensors (momentum and variance) per trainable parameter. Quantizing these from 32-bit to 8-bit shrinks their memory footprint to roughly a quarter.

VRAM savings: ~30% reduction in optimizer state memory. For a rank-64 LoRA with ~10M trainable parameters, this saves ~2-3 GB.

Trade-off: Negligible quality difference in practice. The quantization is applied to optimizer state, not to the model weights or gradients.

How to enable: In the wizard (Advanced > Optimizer & Scheduler), select "AdamW 8-bit". Via CLI: --optimizer-type adamw8bit. The bitsandbytes package is included automatically by uv sync.
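The swap itself is a one-line change. A minimal sketch, using a plain Linear layer as a stand-in for the adapter-wrapped model:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024)  # stand-in for the LoRA-wrapped decoder

# Drop-in replacement for torch.optim.AdamW: same hyperparameters, but the
# momentum/variance state tensors are stored in quantized 8-bit blocks.
optimizer = bnb.optim.AdamW8bit(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    weight_decay=1e-2,
)
```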

Encoder Offloading (optional, Advanced mode)

What it does: After the model is set up for training, moves non-decoder submodules (VAE, text encoder, condition encoders) from GPU to CPU. These components are only needed during preprocessing -- during the training loop itself, only the decoder runs on GPU.

VRAM savings: ~2-4 GB, depending on the model.

Trade-off: Minimal. These components are not used during the training loop, so offloading them has no speed impact.

How to enable: In the wizard (Advanced > VRAM Savings), answer "yes" to "Offload encoder/VAE to CPU?". Via CLI: --offload-encoder.
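Conceptually, the offload step just moves the encode-only submodules off the GPU once training setup is done. A rough sketch; the attribute names below are placeholders, not necessarily ACE-Step's real module names:

```python
import torch

def offload_encoders(pipeline, device: str = "cpu") -> None:
    """Move encode-only submodules to the given device; the decoder stays on GPU."""
    for name in ("vae", "text_encoder", "condition_encoder"):  # placeholder names
        module = getattr(pipeline, name, None)
        if module is not None:
            module.to(device)
    torch.cuda.empty_cache()  # return the freed blocks to the CUDA allocator
```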

Lower Rank

What it does: Reduces the number of trainable parameters in the LoRA/LoKR adapter. Fewer parameters means less memory for gradients, optimizer state, and activation storage.

VRAM savings: Roughly proportional. Rank 16 uses ~4x less adapter VRAM than rank 64.

Trade-off: Lower capacity. The adapter can capture less of your dataset's style. See the rank guide in [[Training Guide]].
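As a back-of-the-envelope check (made-up layer sizes): for each targeted Linear layer, a LoRA adapter adds about rank * (in_features + out_features) parameters, so adapter memory for weights, gradients, and optimizer state scales linearly with rank.

```python
def lora_params_per_linear(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two small matrices per targeted layer: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)

print(lora_params_per_linear(2048, 2048, 64))  # 262144
print(lora_params_per_linear(2048, 2048, 16))  # 65536 -> 4x fewer adapter params
```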

Gradient Accumulation

What it does: Instead of updating weights after every batch, accumulates gradients over N batches before applying a single optimizer step. This gives the same effective batch size as batch_size x gradient_accumulation without the VRAM cost of a larger batch.

VRAM savings: Keeps per-step VRAM at batch_size=1 levels while achieving the training dynamics of a larger effective batch.

Default: 4 steps of accumulation with batch size 1 (effective batch size = 4).
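The loop structure is standard PyTorch; a minimal sketch with a toy model and random data (not Side-Step's trainer):

```python
import torch

accum_steps = 4  # matches the default: batch size 1, effective batch size 4
model = torch.nn.Linear(16, 1)  # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data, start=1):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated gradients average, not sum
    if step % accum_steps == 0:
        optimizer.step()             # one weight update per accum_steps micro-batches
        optimizer.zero_grad()
```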


VRAM Profiles

| Profile | GPU VRAM | Automatic | Recommended Settings |
| --- | --- | --- | --- |
| Comfortable | 24 GB+ | Grad checkpointing ON, Flash Attention | Batch 2+, AdamW, Rank 64-128 |
| Standard | 16-24 GB | Grad checkpointing ON, Flash Attention | Batch 1, AdamW, Rank 64, Grad accumulation 4 |
| Tight | 10-16 GB | Grad checkpointing ON, Flash Attention | Batch 1, AdamW8bit, Encoder offloading, Rank 32-64 |
| Minimal | <10 GB | Grad checkpointing ON, Flash Attention | Batch 1, AdamW8bit or Adafactor, Encoder offloading, Rank 16, High grad accumulation |

Gradient checkpointing and Flash Attention are always active (when supported). The profiles differ in optimizer choice, rank, and optional offloading.


See Also

  • [[Training Guide]] -- Full training workflow, hyperparameter guide, LoRA vs LoKR
  • [[Estimation Guide]] -- Gradient sensitivity analysis for targeted training
  • [[Getting Started]] -- Installation and first-run setup
  • [[Model Management]] -- Checkpoint structure and fine-tune support