v0.3.0: OSFT Unleashed - 3x Memory Savings & Beautiful Progress
Summary: This release delivers critical performance improvements for OSFT training with a 3x memory reduction and orthogonalization bug fixes, enhanced user experience through rich progress bars and colorful logging, comprehensive testing infrastructure for OSFT, and expanded model architecture support. The release focuses on making OSFT production-ready while improving the overall developer experience.
Highlights
- Major OSFT Memory Optimization: 3x memory reduction through FSDP2 sharding (baseline now comparable to SFT)
- Critical OSFT Orthogonalization Fix: Corrected distributed gradient projection for mathematical correctness
- Rich Progress Bars: Beautiful, informative training progress with real-time metrics
- Comprehensive OSFT Testing: New regression test suite to validate orthogonality constraints
- Enhanced Mamba Support: Added specialized convolution kernels for NVIDIA/AMD GPUs
- Modernized Documentation: Refreshed README with improved styling and clarity
Performance Improvements
OSFT Memory Optimization & Bug Fixes
Memory usage reduced by ~3x - Critical improvements to OSFT's memory footprint and correctness by @NikhilNayak-debug in #47
Memory Optimizations:
- Registered U_high/S_high/V_high as non-trainable parameters (not buffers) so FSDP2 shards them across GPUs instead of replicating them (see the sketch after this list)
- Moved OSFT tensors under their owning Linear modules to avoid full-model all-gather
- Per-block param materialization prevents whole-model memory spikes
- Results: OSFT baseline memory reduced from ~52 GB to ~15 GB, peak from ~52 GB to ~24.6 GB (comparable to SFT)
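The parameter-registration change above is the heart of the savings. A minimal sketch of the idea, assuming a wrapper module that owns its Linear layer (the class name, constructor, and the omitted OSFT math are illustrative, not the actual mini_trainer implementation):

```python
import torch
import torch.nn as nn

class OSFTLinear(nn.Module):
    """Illustrative wrapper: the frozen SVD factors live on the owning Linear
    module as non-trainable nn.Parameters so FSDP2 shards them like weights."""

    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        self.linear = linear
        # Truncated SVD of the frozen weight (top-`rank` singular directions).
        U, S, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        # Registering as parameters with requires_grad=False (rather than as
        # buffers) lets FSDP2 shard them across ranks instead of replicating a
        # full copy on every GPU.
        self.U_high = nn.Parameter(U[:, :rank].contiguous(), requires_grad=False)
        self.S_high = nn.Parameter(S[:rank].contiguous(), requires_grad=False)
        self.V_high = nn.Parameter(Vh[:rank, :].contiguous(), requires_grad=False)

    def forward(self, x):
        # The OSFT projection that consumes these factors is omitted here.
        return self.linear(x)
```

Because the factors are ordinary frozen parameters on the owning module, FSDP2 shards them like any other weight and only materializes them per block, which is what brings the per-GPU footprint close to the SFT baseline.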
Orthogonalization Fixes:
- Fixed distributed gradient projection to be mathematically correct across shards (see the sketch after this list)
- U projection (row-sharded): proper global contraction with all-reduce SUM
- V projection (row-sharded): corrected Gram matrix computation with global reduction
- Gradient projection now operates on local shards with minimal all-reduce for global correctness
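For the U (row-sharded) case, the corrected projection has roughly the following shape; the function name and tensor layout are assumptions for illustration only, not the library's internals:

```python
import torch
import torch.distributed as dist

def project_out_high_subspace(grad_local: torch.Tensor,
                              U_high_local: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of the distributed projection fix.

    grad_local and U_high_local hold this rank's row shard. The contraction
    U^T @ G mixes rows, so each rank computes a partial product and the
    partials are summed across ranks before subtracting the projected
    component locally.
    """
    # Partial contraction over the local rows only.
    coeffs = U_high_local.T @ grad_local
    if dist.is_initialized():
        # All-reduce SUM so every rank holds the global U^T @ G.
        dist.all_reduce(coeffs, op=dist.ReduceOp.SUM)
    # Remove the component of the gradient lying in span(U_high).
    return grad_local - U_high_local @ coeffs
```

The same pattern of local partial contraction, one all-reduce SUM, then local subtraction also underlies the corrected V-side Gram matrix computation.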
New Features
Enhanced User Experience
- Rich Progress Bars & Colorful Logging by @RobotSail in #48
- Beautiful progress bars during training and evaluation using the rich console (see the sketch below)
- Real-time metrics display: epoch/step, loss, learning rate, and throughput
- Colored output with timestamps and JSON rendering for better readability
- Lazy-initialized progress lines for efficient rendering
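This is not the trainer's internal rendering code, but a small self-contained example of the kind of rich progress display the release adds (metric values are placeholders):

```python
from rich.progress import Progress, BarColumn, TextColumn, TimeElapsedColumn

# Illustrative only: a rich progress bar with live training metrics in the
# task description, similar in spirit to what the trainer now renders.
with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TextColumn("{task.completed}/{task.total}"),
    TimeElapsedColumn(),
) as progress:
    task = progress.add_task("epoch 0 | loss ? | lr ?", total=100)
    for step in range(100):
        loss, lr = 1.0 / (step + 1), 1e-5  # placeholder metrics
        progress.update(
            task,
            advance=1,
            description=f"epoch 0 | loss {loss:.4f} | lr {lr:.1e}",
        )
```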
- OSFT Orthogonalization Test Suite by @RobotSail in #50
- Comprehensive regression test validating OSFT orthogonality constraints during training
- Monitors both gradient orthogonality (before optimizer steps) and parameter orthogonality (after optimizer steps)
- Detailed per-step reporting with aggregated violation summaries
- Added SVDModule class for better encapsulation of SVD components
- Supports distributed training validation across multiple GPUs
- Usage:
torchrun --nproc_per_node=2 regression_tests/test_osft_orthogonalization.py --model Qwen/Qwen2.5-1.5B-Instruct --num-steps 100
Model Architecture Support
- Mamba Convolution Kernels by @RobotSail in #46
- Added mamba-ssm[causal-conv1d] dependency for specialized NVIDIA/AMD GPU kernels
- Enables efficient Mamba architecture support with hardware-optimized operations
- Enhanced GPT-2 Family Support in #50
- Added OSFT support for GPT-2 model family
- Broadened transformer-block discovery to support more Hugging Face architectures
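Enabling OSFT on a GPT-2 checkpoint now uses the same arguments as any other supported model. A hypothetical configuration sketch, with field values chosen only for illustration:

```python
from mini_trainer.training_types import TrainingArgs

# Illustrative: the same OSFT flags now apply to GPT-2 family checkpoints.
train_args = TrainingArgs(
    model_name="gpt2",
    osft=True,
    osft_rank_ratio=0.5,
    # ... remaining TrainingArgs fields
)
```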
Torchrun Improvements
- Flexible Torchrun Arguments by @szaher in #44
- `nproc_per_node` now accepts both string ("gpu") and integer values
- `rdzv_id` now accepts both string and integer types
- More flexible rendezvous options: choose either master address/port (static) or a rendezvous endpoint
- Launch command uses hyphenated flags and conditionally builds static vs endpoint-based rendezvous
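A sketch of what the looser typing allows; the field names follow the descriptions above, and the exact TorchrunArgs signature should be treated as an assumption rather than the documented API:

```python
from mini_trainer.training_types import TorchrunArgs

# "gpu" lets torchrun size the process count from visible GPUs; an int pins it.
torch_args = TorchrunArgs(
    nproc_per_node="gpu",  # previously integer-only
    rdzv_id="my-run",      # string or integer rendezvous id
    # Supply either a static master address/port pair or a rendezvous
    # endpoint; the launch command is built for whichever is given.
)
```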
Documentation
- Modernized README by @RobotSail in #49
- Added modern badges (CI status, Python version, license)
- Integrated emojis for better navigation and visual appeal
- Updated installation commands with proper PyPI and source instructions
- Streamlined usage documentation focusing on core functionality
- Added bug reporting section with clear guidance
- Removed outdated content for cleaner, more maintainable docs
Dependencies & Infrastructure
- Improved Dependency Management by @RobotSail in #51
- Removed numpy version ceiling (aligned with numba's policy)
- Moved tox and tox-uv to optional [dev] dependencies
- Applied minimum version requirement for numba
- Streamlined end-user installations
Bug Fixes
- Fixed validation sampler epoch handling by @RobotSail in #45
- Removed incorrect epoch setting on SequentialSampler
- Improved validation data handling consistency across epochs
- Prevents unintended resets of validation sampler state
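The gist of the fix, sketched rather than taken verbatim from the codebase: `set_epoch` belongs on the distributed training sampler, while the sequential validation sampler has no epoch state to set.

```python
from torch.utils.data.distributed import DistributedSampler

def maybe_set_epoch(sampler, epoch: int) -> None:
    # Only DistributedSampler reshuffles per epoch; a SequentialSampler has no
    # set_epoch method and its state should be left untouched between epochs.
    if isinstance(sampler, DistributedSampler):
        sampler.set_epoch(epoch)
```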
Test Infrastructure
- Expanded test environments with explicit install flow per GPU/non-GPU
- Improved tox environments with conditional CUDA/flash-attn setup
- Enhanced CI with dedicated virtual environment for consistent tooling
- Added comprehensive OSFT orthogonality regression tests
Example Usage
Training with Progress Bars
from mini_trainer.api_train import run_training
from mini_trainer.training_types import TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    osft=True,
    osft_rank_ratio=0.5,
    # ... remaining TrainingArgs fields
)
torch_args = TorchrunArgs(
    # ... torchrun settings, e.g. nproc_per_node
)

# Beautiful progress bars will automatically display during training
run_training(torch_args, train_args)
Running OSFT Orthogonalization Tests
torchrun --nproc_per_node=2 regression_tests/test_osft_orthogonalization.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--num-steps 100 \
--margin-deg 1.0 \
    --rank-ratio 0.5
Upgrade Notes
- No breaking API changes
- OSFT users will see significant memory improvements automatically
- Progress bars work best when running `train.py` directly through torchrun (`api_train.py` streams output byte-for-byte and may reprint progress bars)
- New orthogonalization test suite available for validating OSFT training correctness
Contributors
Installation
Through Pip:
uv pip install rhai-innovation-mini-trainer && uv pip install rhai-innovation-mini-trainer[cuda] --no-build-isolation
Locally:
uv pip install . && uv pip install .[cuda] --no-build-isolation
Full Changelog: v0.2.1...v0.3.0