v0.3.0: OSFT Unleashed - 3x Memory Savings & Beautiful Progress
Summary: This release delivers critical performance improvements for OSFT training with a 3x memory reduction and orthogonalization bug fixes, enhanced user experience through rich progress bars and colorful logging, comprehensive testing infrastructure for OSFT, and expanded model architecture support. The release focuses on making OSFT production-ready while improving the overall developer experience.
Highlights
- Major OSFT Memory Optimization: 3x memory reduction through FSDP2 sharding (baseline now comparable to SFT)
- Critical OSFT Orthogonalization Fix: Corrected distributed gradient projection for mathematical correctness
- Rich Progress Bars: Beautiful, informative training progress with real-time metrics
- Comprehensive OSFT Testing: New regression test suite to validate orthogonality constraints
- Enhanced Mamba Support: Added specialized convolution kernels for NVIDIA/AMD GPUs
- Modernized Documentation: Refreshed README with improved styling and clarity
Performance Improvements
OSFT Memory Optimization & Bug Fixes
Memory usage reduced by ~3x - Critical improvements to OSFT's memory footprint and correctness by @NikhilNayak-debug in #47
Memory Optimizations:
- Registered U_high/S_high/V_high as non-trainable parameters (not buffers) so FSDP2 shards them across GPUs instead of replicating them (see the sketch after this list)
- Moved OSFT tensors under their owning Linear modules to avoid full-model all-gather
- Per-block param materialization prevents whole-model memory spikes
- Results: OSFT baseline memory reduced from ~52 GB to ~15 GB, peak from ~52 GB to ~24.6 GB (comparable to SFT)
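The parameter-registration change above is the heart of the savings. A minimal sketch of the idea, assuming a wrapper module that owns its Linear layer (the class name, constructor, and the omitted OSFT math are illustrative, not the actual mini_trainer implementation):

```python
import torch
import torch.nn as nn

class OSFTLinear(nn.Module):
    """Illustrative wrapper: the frozen SVD factors live on the owning Linear
    module as non-trainable nn.Parameters so FSDP2 shards them like weights."""

    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        self.linear = linear
        # Truncated SVD of the frozen weight (top-`rank` singular directions).
        U, S, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        # Registering as parameters with requires_grad=False (rather than as
        # buffers) lets FSDP2 shard them across ranks instead of replicating a
        # full copy on every GPU.
        self.U_high = nn.Parameter(U[:, :rank].contiguous(), requires_grad=False)
        self.S_high = nn.Parameter(S[:rank].contiguous(), requires_grad=False)
        self.V_high = nn.Parameter(Vh[:rank, :].contiguous(), requires_grad=False)

    def forward(self, x):
        # The OSFT projection that consumes these factors is omitted here.
        return self.linear(x)
```

Because the factors are ordinary frozen parameters on the owning module, FSDP2 shards them like any other weight and only materializes them per block, which is what brings the per-GPU footprint close to the SFT baseline.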
Orthogonalization Fixes:
- Fixed distributed gradient projection to be mathematically correct across shards (see the sketch after this list)
- U projection (row-sharded): proper global contraction with all-reduce SUM
- V projection (row-sharded): corrected Gram matrix computation with global reduction
- Gradient projection now operates on local shards with minimal all-reduce for global correctness
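For the U (row-sharded) case, the corrected projection has roughly the following shape; the function name and tensor layout are assumptions for illustration only, not the library's internals:

```python
import torch
import torch.distributed as dist

def project_out_high_subspace(grad_local: torch.Tensor,
                              U_high_local: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of the distributed projection fix.

    grad_local and U_high_local hold this rank's row shard. The contraction
    U^T @ G mixes rows, so each rank computes a partial product and the
    partials are summed across ranks before subtracting the projected
    component locally.
    """
    # Partial contraction over the local rows only.
    coeffs = U_high_local.T @ grad_local
    if dist.is_initialized():
        # All-reduce SUM so every rank holds the global U^T @ G.
        dist.all_reduce(coeffs, op=dist.ReduceOp.SUM)
    # Remove the component of the gradient lying in span(U_high).
    return grad_local - U_high_local @ coeffs
```

The same pattern of local partial contraction, one all-reduce SUM, then local subtraction also underlies the corrected V-side Gram matrix computation.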
New Features
Enhanced User Experience
- Rich Progress Bars & Colorful Logging by @RobotSail in #48
- Beautiful progress bars during training and evaluation using the rich console (see the sketch below)
- Real-time metrics display: epoch/step, loss, learning rate, and throughput
- Colored output with timestamps and JSON rendering for better readability
- Lazy-initialized progress lines for efficient rendering
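This is not the trainer's internal rendering code, but a small self-contained example of the kind of rich progress display the release adds (metric values are placeholders):

```python
from rich.progress import Progress, BarColumn, TextColumn, TimeElapsedColumn

# Illustrative only: a rich progress bar with live training metrics in the
# task description, similar in spirit to what the trainer now renders.
with Progress(
    TextColumn("[bold blue]{task.description}"),
    BarColumn(),
    TextColumn("{task.completed}/{task.total}"),
    TimeElapsedColumn(),
) as progress:
    task = progress.add_task("epoch 0 | loss ? | lr ?", total=100)
    for step in range(100):
        loss, lr = 1.0 / (step + 1), 1e-5  # placeholder metrics
        progress.update(
            task,
            advance=1,
            description=f"epoch 0 | loss {loss:.4f} | lr {lr:.1e}",
        )
```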
- OSFT Orthogonalization Test Suite by @RobotSail in #50
- Comprehensive regression test validating OSFT orthogonality constraints during training
- Monitors both gradient orthogonality (before optimizer steps) and parameter orthogonality (after optimizer steps)
- Detailed per-step reporting with aggregated violation summaries
- Added SVDModule class for better encapsulation of SVD components
- Supports distributed training validation across multiple GPUs
- Usage:
torchrun --nproc_per_node=2 regression_tests/test_osft_orthogonalization.py --model Qwen/Qwen2.5-1.5B-Instruct --num-steps 100
Model Architecture Support
- Mamba Convolution Kernels by @RobotSail in #46
- Added mamba-ssm[causal-conv1d] dependency for specialized NVIDIA/AMD GPU kernels
- Enables efficient Mamba architecture support with hardware-optimized operations
- Enhanced GPT-2 Family Support in #50
- Added OSFT support for GPT-2 model family
- Broadened transformer-block discovery to support more Hugging Face architectures
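Enabling OSFT on a GPT-2 checkpoint now uses the same arguments as any other supported model. A hypothetical configuration sketch, with field values chosen only for illustration:

```python
from mini_trainer.training_types import TrainingArgs

# Illustrative: the same OSFT flags now apply to GPT-2 family checkpoints.
train_args = TrainingArgs(
    model_name="gpt2",
    osft=True,
    osft_rank_ratio=0.5,
    # ... remaining TrainingArgs fields
)
```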
Torchrun Improvements
- Flexible Torchrun Arguments by @szaher in #44
- `nproc_per_node` now accepts both string ("gpu") and integer values
- `rdzv_id` now accepts both string and integer types
- More flexible rendezvous options: choose either master address/port (static) or a rendezvous endpoint
- Launch command uses hyphenated flags and conditionally builds static vs endpoint-based rendezvous
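A sketch of what the looser typing allows; the field names follow the descriptions above, and the exact TorchrunArgs signature should be treated as an assumption rather than the documented API:

```python
from mini_trainer.training_types import TorchrunArgs

# "gpu" lets torchrun size the process count from visible GPUs; an int pins it.
torch_args = TorchrunArgs(
    nproc_per_node="gpu",  # previously integer-only
    rdzv_id="my-run",      # string or integer rendezvous id
    # Supply either a static master address/port pair or a rendezvous
    # endpoint; the launch command is built for whichever is given.
)
```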
Documentation
- Modernized README by @RobotSail in #49
- Added modern badges (CI status, Python version, license)
- Integrated emojis for better navigation and visual appeal
- Updated installation commands with proper PyPI and source instructions
- Streamlined usage documentation focusing on core functionality
- Added bug reporting section with clear guidance
- Removed outdated content for cleaner, more maintainable docs
Dependencies & Infrastructure
- Improved Dependency Management by @RobotSail in #51
- Removed numpy version ceiling (aligned with numba's policy)
- Moved tox and tox-uv to optional [dev] dependencies
- Applied minimum version requirement for numba
- Streamlined end-user installations
Bug Fixes
- Fixed validation sampler epoch handling by @RobotSail in #45
- Removed incorrect epoch setting on SequentialSampler
- Improved validation data handling consistency across epochs
- Prevents unintended resets of validation sampler state
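The gist of the fix, sketched rather than taken verbatim from the codebase: `set_epoch` belongs on the distributed training sampler, while the sequential validation sampler has no epoch state to set.

```python
from torch.utils.data.distributed import DistributedSampler

def maybe_set_epoch(sampler, epoch: int) -> None:
    # Only DistributedSampler reshuffles per epoch; a SequentialSampler has no
    # set_epoch method and its state should be left untouched between epochs.
    if isinstance(sampler, DistributedSampler):
        sampler.set_epoch(epoch)
```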
Test Infrastructure
- Expanded test environments with explicit install flow per GPU/non-GPU
- Improved tox environments with conditional CUDA/flash-attn setup
- Enhanced CI with dedicated virtual environment for consistent tooling
- Added comprehensive OSFT orthogonality regression tests
Example Usage
Training with Progress Bars
from mini_trainer.api_train import run_training
from mini_trainer.training_types import TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    osft=True,
    osft_rank_ratio=0.5,
    # ... remaining TrainingArgs fields
)
torch_args = TorchrunArgs(
    # ... torchrun settings, e.g. nproc_per_node
)

# Beautiful progress bars will automatically display during training
run_training(torch_args, train_args)
Running OSFT Orthogonalization Tests
torchrun --nproc_per_node=2 regression_tests/test_osft_orthogonalization.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--num-steps 100 \
--margin-deg 1.0 \
    --rank-ratio 0.5
Upgrade Notes
- No breaking API changes
- OSFT users will see significant memory improvements automatically
- Progress bars work best when running `train.py` directly through torchrun (`api_train.py` streams output byte-for-byte and may reprint progress bars)
- New orthogonalization test suite available for validating OSFT training correctness
Contributors
Installation
Through Pip:
uv pip install rhai-innovation-mini-trainer && uv pip install rhai-innovation-mini-trainer[cuda] --no-build-isolation
Locally:
uv pip install . && uv pip install .[cuda] --no-build-isolation
Full Changelog: v0.2.1...v0.3.0