Releases: vllm-project/llm-compressor
v0.7.1
What's Changed
- [Examples] Create qwen_2_5_vl_example.py by @Zhao-Dongyu in #1752
- [fix] Fix visual layer ignore pattern for Qwen2.5-VL models by @Zhao-Dongyu in #1766
- [Transform] Fix QuIP targets by @kylesayrs in #1770
New Contributors
- @Zhao-Dongyu made their first contribution in #1752
Full Changelog: 0.7.0...0.7.1
v0.7.0

LLM Compressor v0.7.0 release notes
This release introduces the following new features and enhancements:
- Transforms support, including QuIP and SpinQuant algorithms
- Apply multiple compressors to a single model for mixed-precision quantization
- Support for DeepSeekV3-style block FP8 quantization
- Expanded Mixture of Experts (MoE) calibration support, including support with NVFP4 quantization
- Llama4 quantization support with vLLM compatibility
- Configurable observer arguments
- Simplified and unified Recipe classes for easier usage and debugging
Introducing Transforms ✨
LLM Compressor now supports transforms. Transforms inject additional matrix operations into a model to improve accuracy recovery under quantization: they rotate weights or activations into spaces with smaller dynamic ranges, reducing quantization error.
Two algorithms are supported in this release:
- QuIP injects transforms before and after weight matrices to assist with weight-only quantization
- SpinQuant injects transforms whose inverses span multiple weights, assisting with both weight and activation quantization. In this release, fused R1 and R2 (i.e., offline) transforms are available. The full lifecycle has been validated to confirm that models produced by LLM Compressor match the performance reported in the original SpinQuant paper. Learned rotations and online R3 and R4 rotations will be added in a future release.
The functionality for both algorithms is available through the new QuIPModifier and SpinQuantModifier classes.
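For illustration, a minimal recipe pairing QuIP transforms with weight-only quantization might look like the sketch below; the QuIPModifier import path and argument values are assumptions rather than the confirmed API:

```python
# Hedged sketch: inject QuIP rotations, then quantize the rotated weights.
# The QuIPModifier import path and arguments are assumptions for illustration.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # assumed path

recipe = [
    # Rotate weights into a space with a smaller dynamic range
    QuIPModifier(transform_type="random-hadamard", targets="Linear"),
    # Quantize the rotated weights (W4A16 chosen for illustration)
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(model="meta-llama/Llama-3.2-1B-Instruct", recipe=recipe)
```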
Applying multiple compressors to a single model
LLM Compressor now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization. This provides finer control over per-layer quantization, allowing more precise handling of layers that are especially sensitive to certain quantization types.
Models with more than one compressor applied have their format set to mixed-precision in the config.json file. Additionally, each config_group now includes a format key that specifies the format used for the layers targeted by that group.
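As a quick check, the recorded formats can be read straight out of the saved checkpoint. The sketch below assumes the fields live under a quantization_config key; the format and config_groups names come from the description above:

```python
# Minimal sketch: inspect the compressor formats recorded in config.json.
# Nesting under "quantization_config" is an assumption for illustration.
import json

with open("my-mixed-precision-model/config.json") as f:
    config = json.load(f)

quant_cfg = config["quantization_config"]
print(quant_cfg["format"])  # "mixed-precision" when multiple compressors are applied

# Each config group now carries its own format key
for name, group in quant_cfg["config_groups"].items():
    print(name, group["format"])
```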
Support for DeepSeekV3-style block FP8 quantization
You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference. This release includes the core block-wise quantization implementation, robust handling of quantization parameters, updated documentation, and a practical example to guide you in applying the new scheme.
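A hedged sketch of applying the scheme is shown below; the preset name FP8_BLOCK and the model stub are assumptions for illustration:

```python
# Hedged sketch: DeepSeekV3-style block FP8 quantization via a preset scheme.
# The scheme name "FP8_BLOCK" is an assumption, not a confirmed identifier.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",  # block-wise FP8 quantization
    ignore=["lm_head"],
)

oneshot(model="Qwen/Qwen3-0.6B", recipe=recipe)
```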
Mixture of Experts support
LLM Compressor now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization. Forward passes of MoE models can be controlled during calibration by adding custom modules to the replace_modules_for_calibration function, which permanently replaces the MoE modules, or to the moe_calibration_context function, which temporarily updates modules for the duration of calibration.
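For example, the temporary path might be used as in the sketch below; the import path and context-manager form are assumptions based on the function names above:

```python
# Hedged sketch: temporarily swap MoE modules only while calibrating.
# Import path and context-manager usage are assumptions for illustration.
from llmcompressor import oneshot
from llmcompressor.modeling import moe_calibration_context  # assumed path
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")

with moe_calibration_context(model):
    # MoE modules are updated only while this context is active
    oneshot(model=model, dataset="open_platypus", recipe="recipe.yaml")  # placeholder recipe path
```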
Llama4 quantization
Llama4 quantization is now supported in LLM Compressor. To be quantized and runnable in vLLM, Llama4TextMoe modules are permanently replaced using the replace_modules_for_calibration method, which linearizes the modules. This allows the model to be quantized to schemes including WN16 with GPTQ and NVFP4.
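A hedged end-to-end sketch follows; the model stub, dataset, and import paths are illustrative assumptions, not a confirmed workflow:

```python
# Hedged sketch: linearize Llama4TextMoe modules, then quantize with GPTQ.
# Model stub, dataset, and import paths are assumptions for illustration.
from llmcompressor import oneshot
from llmcompressor.modeling import replace_modules_for_calibration  # assumed path
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
model = replace_modules_for_calibration(model)  # permanently linearizes the MoE modules

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
)
```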
Simplified and updated Recipe classes
Recipe classes have been updated with the following features:
- Merged multiple recipe-related classes into a single, unified Recipe class
- Simplified modifier creation, lifecycle management, and parsing logic
- Improved serialization and deserialization for clarity and maintainability
- Reduced redundant stage and argument handling for easier debugging and usage
Configurable Observer arguments
Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
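A hedged example of such a recipe appears below; the option names inside observer_kwargs (e.g. maxshrink) are illustrative MSE-observer knobs rather than a confirmed schema:

```python
# Hedged sketch: configuring observer arguments through a oneshot recipe.
# The observer_kwargs option names are assumptions for illustration.
from llmcompressor import oneshot

recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: channel
            observer: mse
            observer_kwargs:
              maxshrink: 0.2
"""

oneshot(model="Qwen/Qwen3-0.6B", dataset="open_platypus", recipe=recipe)
```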
v0.6.0.1
v0.6.0
What's Changed
- [Experimental] Mistral-format FP8 quantization by @mgoin in #1359
- [Examples] [Bugfix] skip sparsity stats when saving checkpoints by @kylesayrs in #1528
- [Examples] [Bugfix] Fix debug message by @kylesayrs in #1529
- [Tests][NVFP4] No longer skip NVFP4A16 e2e test by @dsikka in #1538
- [AWQ] Support for Calibration Datasets of varying feature dimension by @brian-dellabetta in #1536
- fix qwen 2.5 VL multimodal example by @brian-dellabetta in #1541
- [Example] [Bugfix] Fix Gemma ignore list by @kylesayrs in #1531
- [Tests][NVFP4] Add e2e nvfp4 test by @dsikka in #1543
- [Examples] Use more robust splits by @kylesayrs in #1544
- [Bugfix] [Autowrapper] Fix visit_Delete by @kylesayrs in #1532
- [Example] Fix Qwen VL ignore list by @arunmadhusud in #1545
- [Tests] Fix Qwen2.5-VL-7B-Instruct Recipe by @dsikka in #1548
- [Bugfix] Fix gemma2 generation by @kylesayrs in #1552
- fix skipif check on tests involving gated HF models by @brian-dellabetta in #1553
- [NVFP4] Fix global scale update when dealing with offloaded layers by @dsikka in #1554
- oneshot entrypoint update by @ved1beta in #1445
- LM Eval tests -- ignore vision tower for VL fp8 test by @brian-dellabetta in #1562
- [Performance] Sequential onloading by @kylesayrs in #1263
- [BugFix] Explicitly set gpu_memory_utilization by @rahul-tuli in #1560
- Add Axolotl blog link by @rahul-tuli in #1563
- [Bugfix] Fix multigpu dispatch_for_generation by @kylesayrs in #1567
- [Testing] Set VLLM_WORKER_MULTIPROC_METHOD for e2e testing by @dsikka in #1569
- [BugFix] Fix quantization_2of4_sparse_w4a16 example by @shanjiaz in #1565
- [Pipelines] infer model device with optional override by @kylesayrs in #1572
- bump up requirement for compressed-tensors to 0.10.2 by @dhuangnm in #1581
New Contributors
- @arunmadhusud made their first contribution in #1545
Full Changelog: 0.5.2...0.6.0
v0.5.2
What's Changed
- Exclude images from package by @kylesayrs in #1397
- [Tracing] Skip non-ancestors of sequential targets by @kylesayrs in #1389
- Consolidate build config by @dbarbuzzi in #1398
- [Tests] Disable silently failing kv cache test by @kylesayrs in #1371
- Drop flash_attn skip for quantizing_moe example tests by @dbarbuzzi in #1396
- [VLM] Fix mllama targets by @kylesayrs in #1402
- [Tests] Use requires_gpu, fix missing gpu test skip, add explicit test for gpu from gha by @kylesayrs in #1264
- Implement QuantizationMixin by @kylesayrs in #1351
- Add new-features section by @rahul-tuli in #1408
- [Tracing] Support tracing of Gemma3 [#1248] by @kelkelcheng in #1373
- bugfix kv cache quantization with ignored layers by @brian-dellabetta in #1312
- AWQ sanitize_kwargs minor cleanup by @brian-dellabetta in #1405
- [Tracing][Testing] Add tracing tests by @kylesayrs in #1335
- fix lm eval test reproducibility issues by @brian-dellabetta in #1260
- Pipeline Extraction by @kylesayrs in #1279
- Add pull_request trigger to base tests workflow by @dbarbuzzi in #1417
- removing RecipeMetadata and references by @shanjiaz in #1414
- Update examples to only load required number of samples from dataset by @kylesayrs in #1118
- [Tracing] Reinstate ignore functionality by @kylesayrs in #1423
- [Typo] overriden by @kylesayrs in #1420
- Rename SparsityModifierMixin to SparsityModifierBase by @kylesayrs in #1416
- Remove RecipeArgs class & its references by @shanjiaz in #1429
- [Examples] Standardize AWQ example by @kylesayrs in #1412
- [Logging] Support logging once by @kylesayrs in #1431
- Add: deepseekv2 smoothquant mappings by @rahul-tuli in #1433
- AWQ QuantizationMixin + SequentialPipeline by @brian-dellabetta in #1426
- patch awq tests/readme after QuantizationMixin refactor by @brian-dellabetta in #1439
- Added more tests for Quantization24SparseW4A16 by @shanjiaz in #1434
- [GPTQ] Add actorder option to modifier by @kylesayrs in #1424
- [Bugfix][Tracing] Fix qwen2_5_vl by @kylesayrs in #1448
- [Tests] Use proper offloading utils in test_compress_tensor_utils by @kylesayrs in #1449
- [Tracing] Fix Traceable Imports by @kylesayrs in #1452
- [NVFP4] Enable FP4 Weight-Only Quantization by @dsikka in #1309
- Pin transformers to <4.52.0 by @brian-dellabetta in #1459
- AWQ Apply Scales Bugfix when smooth layer output length doesn't match balance layer input length by @brian-dellabetta in #1451
- Fix #1344 Extend e2e tests to add asym support for W8A8-Int8 by @ved1beta in #1345
- [Tests] Fix activation recipe for w8a8 asym by @dsikka in #1461
- AWQ Qwen and Phi mappings by @brian-dellabetta in #1440
- [Observer] Optimize mse observer by @shanjiaz in #1450
- Fix: Improve SmoothQuant Support for Mixture of Experts (MoE) Models by @rahul-tuli in #1455
- [Tests] Add nvfp4a16 e2e test case by @dsikka in #1463
- [Docs] Update README to list fp4 by @dsikka in #1462
- Remove duplicate model id var from awq example recipe by @AndrewMead10 in #1467
- Added observer type for test_min_max by @shanjiaz in #1466
- Disable kernels during calibration (and tracing) by @kylesayrs in #1454
- [GPTQ] Fix actorder resolution, add sentinel by @kylesayrs in #1453
- Set show_progress to True by @dsikka in #1471
- Remove compress by @dsikka in #1470
- raise error if block quantization is used, as it is not yet supported by @brian-dellabetta in #1476
- [Tests] Increase max seq length for tracing tests by @kylesayrs in #1478
- [Tests] Fix dynamic field to be a bool, not string by @dsikka in #1480
- [Examples] Fix qwen vision examples by @kylesayrs in #1481
- [NVFP4] Update to use tensor_group strategy; update observers by @dsikka in #1484
- loosen lmeval assertions to upper or lower bound by @brian-dellabetta in #1477
- Revert "expand observers to calculate gparams, add example for activa… by @dsikka in #1486
- fix rest of the minmax tests by @shanjiaz in #1469
- Add warning for non-divisible group quantization by @kylesayrs in #1401
- [AWQ] Support accumulation for reduced memory usage by @kylesayrs in #1435
- [Tracing] Code AutoWrapper by @kylesayrs in #1411
- Removed RecipeTuple & RecipeContainer class by @shanjiaz in #1460
- Unpin to support transformers==4.52.3 by @kylesayrs in #1479
- [Tests] GPTQ Actorder Resolution Tests by @kylesayrs in #1468
- [Testing] Skip FP4 Test by @dsikka in #1499
- [Bugfix] Remove tracing imports from tests by @kylesayrs in #1498
- [Testing] Use a slightly larger model that works with group_size 128 by @dsikka in #1502
- skip tracing tests if token unavailable by @brian-dellabetta in #1493
- Fix missing logs when calling oneshot by @kelkelcheng in #1446
- [NVFP4] Expand observers to calculate gparam, support NVFP4 Activations by @dsikka in #1487
- [Tests] Remove duplicate test by @kylesayrs in #1500
- [Model] Mistral3 example and test by @kylesayrs in #1490
- [NVFP4] Use observers to generate global weight scales by @dsikka in #1504
- Revert "[NVFP4] Use observers to generate global weight scales " by @dsikka in #1507
- [NVFP4] Update global scale generation by @dsikka in #1508
- [NVFP4] Fix onloading of fused layers by @dsikka in #1512
- Pin pandas to <2.3 by @dbarbuzzi in #1515
- AWQModifier fast resolve mappings, better logging, MoE support by @brian-dellabetta in #1444
- Update setup.py by @dsikka in #1516
- Use model compression pathways by @kylesayrs in #1419
- [Example] [Bugfix] Fix Gemma3 Generation by @kylesayrs in #1517
- [Docs] Update ReadME details for FP4 by @dsikka in #1519
- [Examples] [Bugfix] Perform sample generation before saving as compressed by @kylesayrs in #1530
- Add citation information both in README as well as native GitHub file support by @markurtz in #1527
- update compress...
v0.5.1
What's Changed
- Update nm-actions/changed-files to v1.16.0 by @dbarbuzzi in #1311
- docs: fix missing git clone command and repo name typos in DEVELOPING.md by @gattshjott in #1325
- Update e2e/lm-eval test infrastructure by @dbarbuzzi in #1323
- fix(logger): normalize log_file_level input for consistency by @gattshjott in #1324
- [Utils] Replace preserve_attr with patch_attr by @kylesayrs in #1187
- Fix cut off log in entrypoints/utils.py post_process() by @mgoin in #1336
- [Tests] Update condition for sparsity check to be more robust by @dsikka in #1337
- [Utils] Add skip_weights_download for developers and testing by @kylesayrs in #1334
- replace custom version handling with setuptools-scm by @dhellmann in #1322
- [Compression] Update sparsity calculation lifecycle when fetching the compressor by @dsikka in #1332
- [Sequential] Support models with nested _no_split_modules by @kylesayrs in #1329
- [Tracing] Remove TraceableWhisperForConditionalGeneration by @kylesayrs in #1310
- Add torch device to list of offloadable types by @kylesayrs in #1348
- Reduce SmoothQuant Repr by @kylesayrs in #1289
- Use align_module_device util by @kylesayrs in #1298
- Fix project URL in setup.py by @tiran in #1353
- Update trigger on PR comment workflow by @dbarbuzzi in #1357
- Add timing functionality to lm-eval tests by @ved1beta in #1346
- [Callbacks][Docs] Add docstrings to saving functions by @kylesayrs in #1201
- Move: recipe parsing test from e2e/ to main test suite by @rahul-tuli in #1360
- Smoothquant typehinting by @kylesayrs in #1285
- AWQ Modifier by @brian-dellabetta in #1177
- [Tests] Update transformers tests to run kv_cache tests by @dsikka in #1364
- [Transformers] Support latest transformers by @dsikka in #1352
- Update test_consecutive_runs.py by @dsikka in #1366
- [Docs] Mention AWQ, some clean-up by @dsikka in #1367
- Fix versioning for source installs by @dbarbuzzi in #1370
- [Testing] Reduce error verbosity of cleanup by @kylesayrs in #1365
- Update test_oneshot_and_finetune.py to use pytest.approx by @markurtz in #1339
- [Tracing] Better runtime error messages by @kylesayrs in #1307
- [Tests] Fix test case; update structure by @dsikka in #1375
- fix: Make Recipe.model_dump() output compatible with model_validate() by @ved1beta in #1328
- Add: documentation for enhanced save_pretrained parameters by @rahul-tuli in #1377
- Revert "fix: Make Recipe.model_dump() output compatible .... by @rahul-tuli in #1378
- AWQ resolved mappings -- ensure shapes align by @brian-dellabetta in #1372
- Update w4a16_actorder_weight.yaml lmeval config by @dbarbuzzi in #1380
- [WIP] Add AWQ Asym e2e test case by @dsikka in #1374
- Bump version; set ct version by @dsikka in #1381
- bugfix AWQ with Llama models and python 3.9 by @brian-dellabetta in #1384
- awq -- hotfix to missing kwargs by @brian-dellabetta in #1395
New Contributors
- @gattshjott made their first contribution in #1325
- @dhellmann made their first contribution in #1322
- @tiran made their first contribution in #1353
- @ved1beta made their first contribution in #1346
Full Changelog: 0.5.0...0.5.1
v0.5.0
What's Changed
- re-add vllm e2e test now that bug is fixed by @brian-dellabetta in #1162
- Fix Readme Imports by @kylesayrs in #1165
- Remove event_called by @kylesayrs in #1155
- Update: Test name by @rahul-tuli in #1172
- Remove lifecycle initialized_structure attribute by @kylesayrs in #1156
- [VLM] Qwen 2.5 VL by @kylesayrs in #1113
- Revert bump by @dsikka in #1178
- Remove CLI by @dsikka in #1144
- Add group act order case to lm_eval test by @dsikka in #1080
- Update e2e test timings outputs by @dsikka in #1179
- [Oneshot Refactor] Main refactor by @horheynm in #1110
- [StageRunner Removal] Remove Evaluate / validate pathway by @horheynm in #1145
- [StageRemoval] Remove Predict pathway by @horheynm in #1146
- Fix 2of4 Apply Example by @dsikka in #1181
- Fix Sparse2of4 Example by @dsikka in #1182
- Add qwen moe w4a16 example by @mgoin in #1186
- [Callbacks] Consolidate Saving Methods by @kylesayrs in #1168
- lmeval tests multimodal by @brian-dellabetta in #1150
- [Dataset Performance] Add num workers on dataset processing - labels, tokenization by @horheynm in #1189
- Fix a minor typo by @eldarkurtic in #1191
- [Callbacks] Remove pre_initialize_structure by @kylesayrs in #1160
- Make transformers-tests job conditional on files changed by @dbarbuzzi in #1197
- Update finetune tests to decrease execution time by @dsikka in #1208
- Update transformers tests to speed-up execution by @dsikka in #1211
- Fix logging bug in oneshot.py by @aman2304 in #1213
- [Training] Decouple Argument parser by @horheynm in #1207
- Remove MonkeyPatch for GPUs by @dsikka in #1227
- [Cosmetic] Rename data_args to dataset_args by @horheynm in #1206
- [Training] Datasets - update Module by @horheynm in #1209
- [BugFix] Fix logging disabling bug and add tests by @aman2304 in #1218
- [Training] Unifying Preprocess + Postprocessing logic for Train/Oneshot by @horheynm in #1212
- [Docs] Add info on when to use which PTQ/Sparsification by @horheynm in #1157
- [Callbacks] Remove MagnitudePruningModifier.leave_enabled by @kylesayrs in #1198
- Replace Xenova model stub with nm-testing model stub by @kylesayrs in #1239
- Offload Cache Support torch.dtype by @kylesayrs in #1141
- Remove unused/duplicated/non-applicable utils from pytorch/utils/helpers by @kylesayrs in #1174
- [Bugfix] Staged 2of4 example by @kylesayrs in #1238
- wandb/tensorboard loggers set default init to False by @brian-dellabetta in #1235
- fixing reproducibility of lmeval tests by @brian-dellabetta in #1220
- [Audio] People's Speech dataset and tracer tool by @kylesayrs in #1086
- Use KV cache constant names provided by compressed tensors by @kylesayrs in #1200
- [Bugfix] Raise error for processor remote code by @kylesayrs in #1184
- Remove missing weights silencers in favor of HFQuantizer solution by @kylesayrs in #1017
- Fix run_compressed tests by @dsikka in #1246
- [Train] Training Pipeline by @horheynm in #1214
- [Tests] Increase maximum quantization error by @kylesayrs in #1245
- [Callbacks] Remove EventLifecycle and on_start event by @kylesayrs in #1170
- [Bugfix] Disable generation of deepseek models with transformers>=4.48 by @kylesayrs in #1259
- Remove clear_ml by @dsikka in #1261
- [Tests] Remove clear_ml test from GHA by @kylesayrs in #1265
- Remove click by @dsikka in #1262
- [Bugfix] Remove constant pruning from 2of4 examples by @kylesayrs in #1267
- Addback: ConstantPruningModifier for finetuning cases by @rahul-tuli in #1272
- Remove docker by @kylesayrs in #1255
- move failing multimodal lmeval tests to skipped folder by @brian-dellabetta in #1273
- Replace tj-action/changed-files by @dbarbuzzi in #1270
- [BugFix]: Sparse2of4 example sparsity-only case by @rahul-tuli in #1282
- Revert "update" by @dsikka in #1296
- Fix Multi-Context Manager Syntax for Python 3.9 Compatibility by @rahul-tuli in #1287
- Revert "Fix Multi-Context Manager Syntax for Python 3.9 Compatibility… by @dsikka in #1300
- [StageRunner] Stage Runner entrypoint and pipeline by @horheynm in #1202
- Bump: Min python version to 3.9 by @rahul-tuli in #1288
- Keep quantization enabled during calibration by @kylesayrs in #1299
- [BugFix] TRL distillation bug fix by @horheynm in #1278
- Update: Readme for fp8 support by @rahul-tuli in #1304
- [GPTQ] Add inversion fallback by @kylesayrs in #1283
- fix typo by @eldarkurtic in #1290
- [Tests] Fix oneshot + finetune test by passing splits to oneshot by @kylesayrs in #1316
- [Tests] Remove the compress entrypoint by @dsikka in #1317
- Fix Multi-Context Manager Syntax for Python 3.9 Compatibility by @rahul-tuli in #1313
- [BugFix] Directly Convert Modifiers to Recipe Instance by @rahul-tuli in #1271
- bump version, tag ct by @dsikka in #1318
Full Changelog: 0.4.1...0.5.0
v0.4.1
What's Changed
- Remove version by @dsikka in #1077
- Require 'ready' label for transformers tests by @dbarbuzzi in #1079
- GPTQModifier Nits and Code Clarity by @kylesayrs in #1068
- Also run on pushes to main by @dbarbuzzi in #1083
- VLM: Phi3 Vision Example by @kylesayrs in #1032
- VLM: Qwen2_VL Example by @kylesayrs in #1027
- Composability with sparse and quantization compressors by @rahul-tuli in #948
- Remove TraceableMistralForCausalLM by @kylesayrs in #1052
- [Fix Test Failure]: Propagate name change to test by @rahul-tuli in #1088
- [Audio] Support Audio Datasets by @kylesayrs in #1085
- [Test Fix] Add Quantization then finetune tests by @horheynm in #964
- [Smoothquant] Phi3 Vision Mappings by @kylesayrs in #1089
- [VLM] Multimodal Data Collator by @kylesayrs in #1087
- VLM: Model Tracing Guide by @kylesayrs in #1030
- Turn off 2:4 sparse compression until supported in vllm by @rahul-tuli in #1092
- [Test Fix] Fix Consecutive oneshot by @horheynm in #971
- [Bug Fix] Fix test that requires GPU by @horheynm in #1096
- Add Idefics3/SmolVLM quant support via traceable class by @leon-seidel in #1095
- Traceability Guide: Clarity and typo by @kylesayrs in #1099
- [VLM] Examples README by @kylesayrs in #1057
- Raise warning for 2:4 compressed sparse-only models by @rahul-tuli in #1107
- Remove log_model_load by @kylesayrs in #1016
- Return empty sparsity config if targets and ignores are empty by @rahul-tuli in #1115
- Remove uses of get_observer by @kylesayrs in #939
- FSDP utils cleanup by @kylesayrs in #854
- Update maintainers, add notice by @kylesayrs in #1091
- Replace readme paths with urls by @kylesayrs in #1097
- GPTQ add arXiv link, move file location by @kylesayrs in #1100
- Extend remove_hooks to remove subsets by @kylesayrs in #1021
- [Audio] Whisper Example and Readme by @kylesayrs in #1106
- [Audio] Add whisper fp8 dynamic example by @kylesayrs in #1111
- [VLM] Update pixtral data collator to reflect latest transformers changes by @kylesayrs in #1116
- Use unique test names in TestvLLM by @dbarbuzzi in #1124
- Remove smoothquant from examples by @kylesayrs in #1121
- Extend disable_hooks to keep subsets by @kylesayrs in #1023
- Unpin pynvml to fix e2e test failures with vLLM by @dsikka in #1125
- Replace LayerCompressor with HooksMixin by @kylesayrs in #1038
- [Oneshot Refactor] Rename get_shared_processor_src to get_processor_name_from_model by @horheynm in #1108
- Allow Shortcutting Min-max Observer by @kylesayrs in #887
- [Polish] Remove unused code by @horheynm in #1128
- Properly restore training mode with eval_context by @kylesayrs in #1126
- SQ and QM: Remove torch.cuda.empty_cache, use calibration_forward_context by @kylesayrs in #1114
- [Oneshot Refactor] dataclass Arguments by @horheynm in #1103
- [Bugfix] SparseGPT, Pipelines by @kylesayrs in #1130
- [Oneshot refactor] Refactor initialize_model_from_path by @horheynm in #1109
- [e2e] Update vllm tests with additional datasets by @brian-dellabetta in #1131
- Update: SparseGPT recipes by @rahul-tuli in #1142
- Add timer support for testing by @dsikka in #1137
- [Audio] Support Whisper V3 by @kylesayrs in #1147
- Fix: Re-enable Sparse Compression for 2of4 Examples by @rahul-tuli in #1153
- [VLM] Add caption to flickr dataset by @kylesayrs in #1138
- [VLM] Update mllama traceable definition by @kylesayrs in #1140
- Fix CPU Offloading by @dsikka in #1159
- [TRL_SFT_Trainer] Fix and Update Examples code by @horheynm in #1161
- [TRL_SFT_Trainer] Fix TRL-SFT Distillation Training by @horheynm in #1163
- Bump version for patch release by @dsikka in #1166
- Update DeepSeek Examples by @dsikka in #1175
- Update gemma2 examples with a note about sample generation by @dsikka in #1176
New Contributors
- @leon-seidel made their first contribution in #1095
Full Changelog: 0.4.0...0.4.1
v0.4.0
What's Changed
- Record config file name as test suite property by @dbarbuzzi in #947
- Update setup.py by @dsikka in #975
- Deprecate OBCQ Helpers by @kylesayrs in #977
- KV Cache, E2E Tests by @horheynm in #742
- Use 1 GPU for offloading examples by @dsikka in #979
- Replace tokenizer with processor by @kylesayrs in #955
- Revert "KV Cache, E2E Tests (#742)" by @dsikka in #989
- Fix SmoothQuant offload bug by @dsikka in #978
- Add LM Eval Configs by @dsikka in #980
- Fix test_model_reload test by @kylesayrs in #1005
- Calibration and Compression Contexts by @kylesayrs in #998
- Add info for clarity by @dsikka in #1009
- [Bugfix] Pass trust_remote_code_model=True for deepseek examples by @dsikka in #1012
- Vision Datasets by @kylesayrs in #943
- Add example for fp8 kv cache of phi3.5 and gemma2 by @mgoin in #991
- Update ReadMe and test for cpu_offloading by @dsikka in #1013
- Adding amdsmi for AMD gpus by @citrix123 in #1018
- CompressionLogger add time units by @kylesayrs in #1026
- patch_tied_tensors_bug: support malformed model definitions by @kylesayrs in #1014
- Add: 2of4 example with/without fp8 quantization by @rahul-tuli in #1033
- Remove unnecessary step in 2of4 Example by @dsikka in #1034
- Remove Neural Magic copyright from files by @kylesayrs in #992
- VLM Support via GPTQ Hooks and Data Pipelines by @kylesayrs in #914
- [E2E Testing] KV-Cache by @horheynm in #1004
- [E2E Testing] Add recipe check vllm e2e by @horheynm in #929
- [MoE] GPTQ compress using callback not hook by @kylesayrs in #1049
- Explicit dataset tokenizer text kwarg by @kylesayrs in #1031
- Fix smoothquant ignore, Fix typing, Add glm mappings by @kylesayrs in #1015
- [Test Fix] Quant model reload by @horheynm in #974
- Remove old examples by @dsikka in #1062
- VLM: Fix typo bug in TraceableLlavaForConditionalGeneration by @kylesayrs in #1065
- Add tests for "examples/sparse_2of4_[...]" by @dbarbuzzi in #1067
- VLM Image Examples by @kylesayrs in #1064
- Add quick warning for DeepSeek with transformers 4.48.0 by @dsikka in #1066
- [KV Cache] kv-cache end to end unit tests by @horheynm in #141
- [E2E Testing] Fix HF upload by @horheynm in #1061
- [Test Fix] Fix/update test_run_compressed by @horheynm in #970
- Revert "[Test Fix] Fix/update test_run_compressed" by @mgoin in #1071
- Sparse 2:4 + FP8 Quantization e2e vLLM tests by @dsikka in #1073
- [Test Patch] Remove redundant code for "Fix/update test_run_compressed" by @horheynm in #1072
- bump; set ct version by @dsikka in #1076
New Contributors
- @citrix123 made their first contribution in #1018
Full Changelog: 0.3.1...0.4.0
v0.3.1
What's Changed
- BLOOM Default Smoothquant Mappings by @kylesayrs in #906
- [SparseAutoModelForCausalLM Deprecation] Feature change by @horheynm in #881
- Correct "dyanmic" typo by @kylesayrs in #888
- Explicit defaults for QuantizationModifier targets by @kylesayrs in #889
- [SparseAutoModelForCausalLM Deprecation] Update examples by @horheynm in #880
- Support pack_quantized format for nonuniform mixed-precision by @mgoin in #913
- Actually make the run_compressed test useful by @dsikka in #920
- Fix for e2e tests by @horheynm in #927
- [Bugfix] Correct metrics calculations by @kylesayrs in #878
- Update kv_cache example by @dsikka in #921
- [1/2] Expand e2e testing to prepare for lm-eval by @dsikka in #922
- Update pytest command to capture results to file by @dbarbuzzi in #932
- [Bugfix] DisableKVCache Context by @kylesayrs in #834
- Add helpful info to the marlin-24 example by @dsikka in #946
- Remove requires_torch by @kylesayrs in #949
- Remove unused sparseml.export utilities by @kylesayrs in #950
- Implement HooksMixin by @kylesayrs in #917
- Add LM Eval Testing by @dsikka in #945
- update version by @dsikka in #969
Full Changelog: 0.3.0...0.3.1