Releases: instructlab/training
v0.15.1 - Expanded Text-Data VLM / Multi-Modal Training Support
What's Changed
- Fix Gemma 3 SFT training by detecting dual-registered VLM configs by @RobotSail in #695
Full Changelog: v0.15.0...v0.15.1
v0.15.0 - Qwen3.5 VL Model Support
What's New
Features
- **Vision-Language Model (VLM) Support for Text-Only Training** (#693)
  - Added automatic detection and loading of vision-language models for text-only training
  - New `vlm_utils.py` module with utilities for identifying and extracting CausalLM text backbones from VLM wrappers
  - Support for two VLM loading strategies: extracting the text backbone when a CausalLM sub-model exists, or direct VLM loading when no CausalLM variant is available
  - Improved tokenizer/text-config reconciliation for VLMs where `vocab_size` lives under `text_config`
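The tokenizer/text-config reconciliation above can be sketched as follows. This is a minimal illustration, not the actual `vlm_utils.py` code; the function name `resolve_vocab_size` and the dict-based configs are assumptions for demonstration.

```python
def resolve_vocab_size(config: dict) -> int:
    """Return the text vocab size for either a plain CausalLM config or a
    VLM config whose text settings are nested under `text_config`.
    Illustrative sketch only; the real vlm_utils.py logic differs."""
    if "text_config" in config:
        # VLM wrapper: text settings (including vocab_size) live one level down
        return config["text_config"]["vocab_size"]
    # Plain CausalLM config: vocab_size is top-level
    return config["vocab_size"]
```

The same nesting check is what lets a text-only training run treat a VLM checkpoint and a plain CausalLM checkpoint uniformly.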
- **Mixed Attention Handling for VLMs** (#693)
  - Models with `timm` vision towers now use per-component attention: `eager` for vision, `flash_attention_2` or `sdpa` for text
  - Automatic SDPA fallback for M-RoPE models (e.g. Qwen3.5 VL), which are incompatible with Flash Attention 2
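The selection logic can be sketched as a pure function. The sub-config key names (`vision_config`, `text_config`) and the function name are assumptions for illustration; the release notes only state that vision uses `eager` while text uses `flash_attention_2` or `sdpa`, with SDPA forced for M-RoPE models.

```python
def pick_attn_implementations(has_timm_vision_tower: bool,
                              uses_mrope: bool,
                              flash_attn_available: bool) -> dict:
    """Illustrative sketch of per-component attention selection.
    M-RoPE models (e.g. Qwen3.5 VL) are incompatible with Flash Attention 2,
    so their text stack falls back to SDPA."""
    text_attn = "sdpa"
    if flash_attn_available and not uses_mrope:
        text_attn = "flash_attention_2"
    if has_timm_vision_tower:
        # timm vision towers only support eager attention
        return {"vision_config": "eager", "text_config": text_attn}
    return {"text_config": text_attn}
```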
Bug Fixes
- **FSDP Wrap Policy Robustness** (#693)
  - Fixed `_no_split_modules` resolution to handle models that declare module names for architectures that are not loaded (e.g. vision blocks when loading only the CausalLM)
  - The FSDP wrap policy now resolves all declared module names against both the wrapper and the underlying HF model, filtering out unresolvable entries
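The filtering step can be illustrated with a small sketch. The class names and the helper `resolve_wrap_classes` are hypothetical; the real code resolves names against live `nn.Module` instances rather than a flat class list.

```python
class LlamaDecoderLayer:
    """Stand-in for a text decoder block class."""

class SiglipVisionBlock:
    """Stand-in for a vision block class that was not loaded."""

def resolve_wrap_classes(declared_names, loaded_classes):
    """Map declared `_no_split_modules` names to the module classes actually
    present in the loaded model(s), silently dropping unresolvable names
    (e.g. vision blocks when only the CausalLM backbone was loaded).
    Illustrative sketch of the filtering behavior described in #693."""
    by_name = {cls.__name__: cls for cls in loaded_classes}
    return {by_name[name] for name in declared_names if name in by_name}

# A VLM checkpoint may declare both block types, but a text-only load
# only has the decoder layers available to wrap:
wrap_classes = resolve_wrap_classes(
    ["LlamaDecoderLayer", "SiglipVisionBlock"], [LlamaDecoderLayer]
)
```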
- **GPT-OSS Attention Capability Detection** (#693)
  - `vllm-flash-attn3` is now gated behind a Hopper (SM 9.0+) GPU capability check, falling back to `eager` on older hardware
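A minimal sketch of the capability gate, assuming the capability tuple comes from `torch.cuda.get_device_capability()`; the function names here are illustrative, not the library's.

```python
def supports_flash_attn3(capability: tuple) -> bool:
    """vllm-flash-attn3 requires Hopper (SM 9.0) or newer. `capability` is a
    (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    return capability >= (9, 0)

def pick_gpt_oss_attention(capability: tuple) -> str:
    """Fall back to eager attention on pre-Hopper GPUs (illustrative sketch)."""
    return "vllm-flash-attn3" if supports_flash_attn3(capability) else "eager"
```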
Improvements
- **Local Mamba Kernel Preference** (#693)
  - GraniteMoeHybrid models now pre-populate the Hub kernel cache with locally installed `mamba_ssm` and `causal_conv1d` builds to avoid PyTorch/CUDA ABI mismatches with Hub-provided kernels
What's Changed
- add support for qwen3.5 vl model by @RobotSail in #693
Full Changelog: v0.14.2...v0.15.0
v0.14.2 - Validation Loss and Transformers v4 Backwards Compatibility
What's Changed
- Add backwards compatibility for transformers v4.57 by @Maxusmusti in #684
- Adds validation loss + exposes it in the API by @RobotSail in #685
Full Changelog: v0.14.1...v0.14.2
v0.14.1 - Correct FSDP Config Behavior for Transformers v5
What's Changed
- fix _no_split_modules subscript error for transformers v5 by @Maxusmusti in #683
Full Changelog: v0.14.0...v0.14.1
v0.14.0 - MLflow Support & Transformers v5 Compatibility
What's New
Features
- **MLflow Logging Backend** (#680)
  - Added `MLflowHandler` class for logging training metrics to MLflow
  - New `TrainingArgs` fields: `mlflow_tracking_uri`, `mlflow_experiment_name`, `mlflow_run_name`
  - Added `wandb_project`, `wandb_entity`, `wandb_run_name` fields for W&B configuration
  - Added `tensorboard_log_dir` field for a configurable TensorBoard log directory
  - New optional install targets: `requirements-mlflow.txt`, `requirements-wandb.txt`, `requirements-tensorboard.txt`
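Configuring the MLflow backend might look like the fragment below. The field names come from the release notes; the example values and the elided required fields (model path, data path, checkpoint directory, etc.) are placeholders, not defaults.

```python
# Hedged sketch: only the new logging fields are shown; the other required
# TrainingArgs fields (model/data/checkpoint settings) are elided.
from instructlab.training import TrainingArgs

train_args = TrainingArgs(
    # ... model/data/checkpoint settings ...
    mlflow_tracking_uri="http://localhost:5000",   # example value
    mlflow_experiment_name="sft-experiments",      # example value
    mlflow_run_name="granite-sft-run-1",           # example value
)
```

The optional backend dependencies install separately, e.g. via `pip install -r requirements-mlflow.txt`.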
- **Transformers v5 Compatibility** (#681)
  - Updated tokenizer API calls to use `extra_special_tokens` instead of `additional_special_tokens`
  - Suppressed verbose httpx HTTP request logs from huggingface_hub
Bug Fixes
- **HYBRID_SHARD Failure Fix** (#682)
  - Added detection for when `world_size < num_devices_per_node` in the FSDP configuration
  - Automatically falls back to `FULL_SHARD` with a warning when `HYBRID_SHARD` would fail
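The fallback condition is simple enough to sketch. The function name is hypothetical; the real code sets FSDP's sharding strategy rather than returning a string.

```python
def choose_sharding_strategy(world_size: int, num_devices_per_node: int) -> str:
    """HYBRID_SHARD shards within a node and replicates across nodes, so it
    breaks when the job spans fewer ranks than one full node. Illustrative
    sketch of the fallback added in #682."""
    if world_size < num_devices_per_node:
        # The real code also emits a warning about the downgrade
        return "FULL_SHARD"
    return "HYBRID_SHARD"
```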
Development
- **Tox-UV Integration** (#676)
  - Added `tox-uv` as a tox requirement with `uv-venv-runner`
  - Updated GitHub workflows to use `uv` for package installation
  - Replaced `pip install` with `uv pip install` in CI workflows
What's Changed
- adds integration for tox-uv and updates workflows to use tox-uv by @RobotSail in #676
- Add transformers v5 compatibility by @Maxusmusti in #681
- Fix HYBRID_SHARD failure when world_size < available GPUs by @rtj1 in #682
- Add MLflow support and expose logging configuration in TrainingArgs by @RobotSail in #680
Files Changed
18 files changed with 482 insertions and 83 deletions:
- Core training modules: `logger.py`, `config.py`, `accelerator.py`, `data_process.py`, `tokenizer_utils.py`, `main_ds.py`
- New requirements files for optional logging backends
- Updated CI workflows and tox configuration
Full Changelog: v0.13.0...v0.14.0
v0.13.0 - Pretraining Support & Optimizer Configuration
What's New
Features
- **Pretraining Data Processing API** (#672)
  - Added a new API for processing pretraining-style datasets
  - Documents are now chunked by a configurable `block_size`
  - Chunks are treated as independent, fully-unmasked samples
  - Updated the training loop to ingest pretraining-style datasets
  - Includes comprehensive test coverage (`test_pretraining_data_process.py`, `test_pretraining_mode.py`, `test_pretraining_sampler.py`)
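The chunking behavior can be sketched in a few lines. This is an illustration of the described semantics (fixed-size blocks, every token unmasked), not the library's actual implementation; the function name and sample schema are assumptions.

```python
def chunk_document(token_ids: list, block_size: int) -> list:
    """Split a tokenized pretraining document into fixed-size blocks.
    Each block becomes an independent sample with all tokens unmasked,
    i.e. labels == input_ids. Illustrative sketch only."""
    samples = []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        # Fully-unmasked: every token contributes to the loss
        samples.append({"input_ids": chunk, "labels": list(chunk)})
    return samples
```

Note the final chunk may be shorter than `block_size`; how the real implementation handles the remainder (keep, pad, or drop) is not specified in the release notes.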
- **AdamW Optimizer Configuration** (#674)
  - Exposed `weight_decay`, `betas`, and `eps` parameters in `TrainingArgs`
  - Users can now tune AdamW hyperparameters through the `run_training()` API, providing more control over optimizer behavior
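Tuning the optimizer might look like the fragment below. The three parameter names come from the release notes; the example values are placeholders, not defaults, and the other required `TrainingArgs` fields are elided.

```python
# Hedged sketch: only the newly exposed AdamW fields are shown; the other
# required TrainingArgs fields (model/data/checkpoint settings) are elided.
from instructlab.training import TrainingArgs

train_args = TrainingArgs(
    # ... model/data/checkpoint settings ...
    weight_decay=0.01,    # example value
    betas=(0.9, 0.95),    # example value
    eps=1e-8,             # example value
)
```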
- **Granite 4 Model Support** (#669)
  - Added support for Granite 4 models as Mixture of Experts (MoE) models in training
Bug Fixes
- **Process Timing Fix** (#675)
  - Fixed a race condition where a subprocess had not finished by the time its output was read
- **Variable Access Fix** (#668)
  - Fixed an invalid variable access bug
Dependencies
- **Build Dependency Update** (#670)
  - Updated the hynek build dependency
Files Changed
17 files changed with 1,642 insertions and 52 deletions:
- Core training modules: `data_process.py`, `main_ds.py`, `sampler.py`, `model.py`, `config.py`
- New test suites for pretraining functionality
- Updated README with new capabilities
Full Changelog
All Changes:
- 574f946 Exposes API for processing pretraining data (#672)
- 638a753 fixes bug where process isn't completed by the time the process gets read (#675)
- c495035 Expose AdamW optimizer parameters in training API (#674)
- 3d05302 Handle granite 4 as MoE models in training (#669)
- 781c36f fixes stray invalid variable access bug (#668)
- 529c2f7 bumps hynek build dep (#670)
Full Diff: v0.12.1...v0.13.0
v0.12.1 - Granite 4 Support, Extended Env Var and Torchrun Arg Support
What's Changed
- Update requirements-cuda.txt to increase liger-kernel minimum by @Maxusmusti in #659
- Adds mamba-ssm[causal-conv1d] to CUDA requirements by @RobotSail in #663
- Removes Numpy version cap by @RobotSail in #664
- fix(torchrun): Omit empty arguments and correct nproc_per_node type by @szaher in #661
Full Changelog: v0.12.0...v0.12.1
v0.12.0 - GPT-OSS Support
Full fine-tuning now supports gpt-oss models, alongside minor bugfixes to ensure correct loss calculations with higher gradient accumulation.
What's Changed
- Disable workflow runs on forks by default by @fynnsu in #632
- Adding GPT OSS Support by @Maxusmusti in #646
- Update numpy from <2.0 to <2.3 by @Maxusmusti in #656
- Add kernels>0.9.0 to CUDA requirements by @Maxusmusti in #658
Full Changelog: v0.11.1...v0.12.0
v0.11.1
What's Changed
- Add general logging implementation by @fynnsu in #500
- docs: add CI documentation by @nathan-weinberg in #555
- fix: Use default torch timeout for nccl watchdog unless overridden by @booxter in #521
- fix: Fix markdown-lint violations by @booxter in #559
- ci: add 3.12 smoke workflow flavor by @booxter in #535
- adds barriers after checkpoint saving by @JamesKunstle in #566
- ci: Fix smoke failures due to `pre` not available in local actions by @booxter in #565
- Checkout correct branch on `pull_request_target` trigger by @fynnsu in #549
- Logging Fixes & Enhancements by @RobotSail in #571
- docs: Remove badge for a no longer existing job by @booxter in #542
- uses `__name__` in logging.getLogger by @JamesKunstle in #573
- ci: stop reporting results to slack by @ktdreyer in #574
- CI: Constrain all dependencies; introduce a Monday workflow to update pins by @booxter in #558
- ci: Run jobs on constraints-dev.txt change by @booxter in #580
- chore: update constraints-dev.txt (2025-05-30) by @courtneypacheco in #579
- remove old Deepspeed-native code by @JamesKunstle in #567
- add DCO.txt by @ktdreyer in #588
- ci: Disable dependabot for pip dependencies by @booxter in #587
- feat: refactor main_ds.py (1/n) Model class by @cdoern in #572
- ci: do not require DCO job by @ktdreyer in #595
- 'granite-3.3-2b-instruct' for smoketest; smaller smoke dataset by @JamesKunstle in #590
- fixes unit tests requiring cuda by @JamesKunstle in #586
- chore: update constraints-dev.txt (2025-06-02) by @courtneypacheco in #584
- ci: Cover more test dependencies with pins by @booxter in #581
- ci: Introduce python 3.12 e2e large job flavor by @booxter in #563
- Implicit distributed backend selection by @booxter in #516
- ci: Fix incorrect indent in workflow steps by @booxter in #599
- feat: refactor main_ds.py (2/n) Accelerator class by @cdoern in #594
- chore: update constraints-dev.txt (2025-06-09) by @courtneypacheco in #602
- feat: add medium e2e CI job for each PR by @cdoern in #551
- test: fix e2e target by @cdoern in #610
- chore: update constraints-dev.txt (2025-06-16) by @courtneypacheco in #612
- Remove Dolomite support by @booxter in #616
- Revert "test: fix e2e target" by @bbrowning in #620
- ci: Remove harden-runner steps from jobs by @booxter in #617
- test: disable per-PR test by @cdoern in #631
- fix edge case for qwen3 data processing by @RobotSail in #626
- uncap accelerate in `requirements-cuda.txt` by @ktdreyer in #628
- chore: update constraints-dev.txt (2025-06-30) by @courtneypacheco in #623
- Fix a mistake in formatting a floating-point value by @mtake in #639
- Add a tutorial for fine-tuning and interpolation by @mtake in #640
New Contributors
- @bbrowning made their first contribution in #620
- @mtake made their first contribution in #639
Full Changelog: v0.11...v0.11.1