Releases: vllm-project/compressed-tensors
Compressed Tensors v0.14.0.1
What's Changed
- Patch release to add support for the distributed disk cache and fix bugs related to file writing. For more information, see #617
Full Changelog: 0.14.0...0.14.0.1
Compressed Tensors v0.14.0
What's Changed
- [decompression] Added qparam decompression by @shanjiaz in #537
- Use standard E8M0 scale format for MXFP4 by @mgoin in #538
- Add W4AFP8 preset scheme by @Etelis in #542
- [MXFP4] Fix scale offset by @dsikka in #541
- [CI] Add mergify and stale PR rules by @dsikka in #543
- [Offload] Fix `delete_offload_parameter`, add `clear_quantization` by @kylesayrs in #539
- [Deprecation] Add deprecation warning for marlin24 format by @Etelis in #544
- [Offload] Add offloading logic by @kylesayrs in #529
- Pin torch by @dsikka in #546
- [Testing] Pin transformers by @kylesayrs in #549
- limit transformers to <5.0.0 by @dhuangnm in #550
- Modernize python310 type hints in quantization forward by @LudovicoYIN in #548
- [Offload] Remove Accelerate by @kylesayrs in #530
- KV Cache Quantization support deepseek v3 by @zkl-ai in #533
- [Bugfix] Remove assert when dispatched to device by @kylesayrs in #554
- Modernize python310 type hints in quantization by @LudovicoYIN in #553
- [Observers] Change default weight observer to "memoryless_minmax" by @kylesayrs in #540
- Update pytest command to include report options by @dsikka in #557
- Change runner from IBM to GCP for Python tests by @dsikka in #561
- Modernize python310 type hints in utils by @LudovicoYIN in #560
- Update quantization strategy validation for actorder by @dsikka in #556
- [Offload] `DistributedCPUCache` by @kylesayrs in #534
- [MXFP4][GPTQ] Extend rounding to support FP32 by @dsikka in #551
- [Tests] Fix typo, prepare for meta offload tensors by @kylesayrs in #562
- Modernize python310 type hints in compressors by @LudovicoYIN in #563
- Modernize python310 type hints in transform/offload/registry by @LudovicoYIN in #565
- [Offload] [Bugfix] Fix distributed cpu tensor reconstruction by @kylesayrs in #567
- [Offload] [Bugfix] Reserve extra dispatch memory for fragmentation by @kylesayrs in #566
- Remove Neural Magic copyright by @Etelis in #559
- [Transforms] Support loading transforms in transformers by @kylesayrs in #528
- [Offload] `DistributedDeviceCache` by @kylesayrs in #568
- Revert "[Transforms] Support loading transforms in transformers" by @HDCharles in #578
- [Offload] `DiskCache`, `DistributedDiskCache` by @kylesayrs in #535
- [Offload] Make `update_offload_parameter` more async and direct (2) by @kylesayrs in #576
- [Copyright] Add vLLM copyright enforcement by @kylesayrs in #575
- [Bugfix] Handle updating tensors with gradients by @kylesayrs in #580
- [bugfix] get_device_memory rank>0 fix by @HDCharles in #582
- Remove upper limit for torch dependency to support 2.10 by @dsikka in #583
- Set seed to fix flaky test by @dsikka in #584
- [Offload] Convert accelerate for loading/saving by @kylesayrs in #572
- [Bugfix] Allow parameter overwrite if shapes do not match by @kylesayrs in #586
- [Bugfix] [Offloading] Even more reserved memory, scaling with model size by @kylesayrs in #587
- Implement init_dist for distributed setup by @HDCharles in #589
- FP8 Block Quantization: Non-Divisible Shape Support by @Etelis in #547
- [Bugfix]: Reduce memory usage when load device does not match dispatch device by @kylesayrs in #592
- [bugfix] load_offloaded_model qwen3vl8b by @HDCharles in #591
- [Offload] clean up deprecation warnings, which can accumulate to 100k+ warnings by @brian-dellabetta in #593
- [Offload] Deprecate `update_parameter_data` by @kylesayrs in #588
- [Bugfix] Fix `clear_quantization` by @kylesayrs in #596
- Allow broadcasting fp8 by @HDCharles in #603
- fix ruff for release by @HDCharles in #604
- [Offload] Fully invertible conversion functions by @kylesayrs in #601
- [Offload] Better device/cpu memory estimates when loading with `load_offloaded_model` by @kylesayrs in #605
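A recurring theme in this release is MXFP4 and its E8M0 scale format (#538, #541, #551). As a rough illustration only, not the library's implementation: an E8M0 scale is a pure power of two, chosen per 32-element block so the block's magnitudes land in the E2M1 (FP4) range. The sketch below follows the OCP MX convention of subtracting the element format's maximum exponent (2 for FP4) from the block's amax exponent:

```python
import math

FP4_MAX = 6.0    # largest magnitude representable in E2M1 (FP4)
E8M0_BIAS = 127  # E8M0 stores only a biased power-of-two exponent, no mantissa

def e8m0_scale(block):
    """Pick a shared power-of-two scale for one 32-element MXFP4 block.

    Illustrative sketch of the OCP MX convention: shared exponent =
    floor(log2(amax)) minus the element format's max exponent (2 for E2M1).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        exp = -E8M0_BIAS  # all-zero block: use the smallest representable scale
    else:
        exp = math.floor(math.log2(amax)) - 2
    exp = max(-E8M0_BIAS, min(exp, E8M0_BIAS))  # clamp to the E8M0 exponent range
    return 2.0 ** exp

block = [0.1 * i for i in range(32)]  # amax ≈ 3.1
scale = e8m0_scale(block)             # → 0.5, so amax/scale ≈ 6.2, near FP4_MAX
```

Because the scale is a bare power of two, dividing by it only shifts exponents and never perturbs mantissas, which is the main appeal of E8M0 over a general FP scale.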
New Contributors
- @LudovicoYIN made their first contribution in #548
Full Changelog: 0.13.0...0.14.0
Compressed Tensors v0.13.0
What's Changed
- [Quantization] Refactor initialize for activation shape inference by @kylesayrs in #476
- Add block strategy and structure validation by @kylesayrs in #483
- [Tests] Mock Observers, Static Lifecycle Tests by @kylesayrs in #482
- [Attention] Attention head quantization strategy by @kylesayrs in #481
- drop python 3.9 and add 3.13 to testing by @dhuangnm in #486
- Remove static token quantization by @kylesayrs in #487
- [Misc] Remove unused config name definitions by @kylesayrs in #332
- Remove unused
find_name_or_class_matchesutil by @kylesayrs in #488 - Update NVFP4 default observer by @dsikka in #493
- Switch test runners to use the vllm runners by @dhuangnm in #496
- Tensor Group Validation by @kylesayrs in #490
- Update neuralmagic --> vllm-project for links by @mgoin in #495
- one more place to update the runner by @dhuangnm in #497
- update to allow READ only access by @andy-neuma in #499
- Update workflows to use new vllm infra by @dhuangnm in #500
- Remove FP8_DTYPE; use FP8_E4M3_DATA instead by @dsikka in #501
- [MXFP4] Add MXFP4 Compressor by @dsikka in #502
- [Transform] Attention/Cache transforms by @kylesayrs in #436
- [MXFP4] Add scale generation utils by @dsikka in #503
- Update error message for column/group_size mismatch by @HDCharles in #505
- Fixing bug in matrix_multiply.py by @HDCharles in #507
- feat: support zero-point decompression for asymmetric quantization (packed) by @Etelis in #463
- [Attention] R3 Attention Transform by @kylesayrs in #485
- [Quantization Args] Add scale and zp dtype by @dsikka in #508
- Switch to use h100 runner and remove nightly related workflows by @dhuangnm in #515
- [Quant Args] Clean-up by @dsikka in #513
- [Tests] Small Fixes by @dsikka in #516
- Fix dtype by @dsikka in #517
- patch_attrs helper by @brian-dellabetta in #519
- fix match_modules_set to work with MoE by @HDCharles in #524
- [MXFP4] Add calibration support by @dsikka in #509
- fix qparams decompression by @shanjiaz in #514
- Revert "fix qparams decompression (#514)" by @dsikka in #527
- Update quantize_and_pack_int4.ipynb to use compress_model; remove compress_quantized_weights by @zkl-ai in #526
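Among the changes above, #463 adds zero-point decompression for asymmetric packed quantization. As a minimal sketch of the general idea (the nibble order and helper names here are assumptions, not the library's layout): two unsigned 4-bit values are packed per byte, and asymmetric dequantization recovers `x ≈ (q - zero_point) * scale`:

```python
def pack_int4(values):
    """Pack pairs of unsigned 4-bit ints into bytes (low nibble first; assumed layout)."""
    assert len(values) % 2 == 0
    return bytes(((values[i + 1] & 0xF) << 4) | (values[i] & 0xF)
                 for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Invert pack_int4: split each byte back into its two nibbles."""
    out = []
    for b in packed:
        out.append(b & 0xF)   # low nibble was packed first
        out.append(b >> 4)
    return out

def dequantize(q, scale, zero_point):
    """Asymmetric dequantization: x ≈ (q - zp) * scale."""
    return [(v - zero_point) * scale for v in q]

q = unpack_int4(pack_int4([1, 15, 0, 7]))  # round-trips to [1, 15, 0, 7]
x = dequantize(q, scale=0.5, zero_point=4)
```

The zero point is what lets an asymmetric range (e.g. activations that are mostly positive) use all 16 codes; decompression must apply it before scaling, which is the step #463 addresses.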
New Contributors
- @andy-neuma made their first contribution in #499
- @HDCharles made their first contribution in #505
- @Etelis made their first contribution in #463
- @zkl-ai made their first contribution in #526
Full Changelog: 0.12.2...0.13.0
Compressed Tensors v0.12.2
What's Changed
- remove deprecated safe_permute by @brian-dellabetta in #471
- [Transform] Fix accelerate import to keep it as optional dependency by @tboerstad in #480
- [Bugfix] Fix Per-Token Dynamic Activation Quantization by @max410011 in #393
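For context on #393: per-token dynamic activation quantization computes one scale per token (row) at runtime rather than a single static scale. A toy symmetric int8 version, purely illustrative and not the library's code path:

```python
def quantize_per_token(activations, qmax=127):
    """Symmetric per-token int8 quantization: each row gets its own scale = amax/qmax."""
    out_q, scales = [], []
    for row in activations:
        amax = max(abs(x) for x in row) or 1.0  # avoid div-by-zero on all-zero rows
        scale = amax / qmax
        scales.append(scale)
        out_q.append([round(x / scale) for x in row])
    return out_q, scales

q, scales = quantize_per_token([[1.0, -2.0, 0.5]])  # one scale for this token
```

Because scales are derived from each batch's actual values, there is nothing to calibrate offline, at the cost of computing an amax reduction per token at inference time.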
New Contributors
- @tboerstad made their first contribution in #480
- @max410011 made their first contribution in #393
Full Changelog: 0.12.1...0.12.2
Compressed Tensors v0.12.1
What's Changed
- [Patch Fix] Add `get_missing_module_keys` to support transformers lower bound by @dsikka in #479
- [cicd] Add post-release run to push next nightly by @dbarbuzzi in #478
Full Changelog: 0.12.0...0.12.1
Compressed Tensors v0.12.0
What's Changed
- Refactor module / parameter matching logic by @fynnsu in #406
- Revert "Refactor module / parameter matching logic (#406)" by @kylesayrs in #429
- Refactor module / parameter matching logic by @fynnsu in #431
- Add quality check to CI and fix existing errors by @fynnsu in #408
- Speed up nvfp4 pack/unpack w/ torch.compile by @fynnsu in #400
- Simplify `apply_quantization_config` by @kylesayrs in #433
- Install compilers etc to fix nightly test failure by @dhuangnm in #435
- Fix a minor bug on GH hosted runners by @dhuangnm in #438
- remove references to `llmcompressor.transformers.oneshot` in examples by @brian-dellabetta in #422
- [Tests] Combine quantization and dequantization tests by @kylesayrs in #443
- fix compress on meta device issue by @shanjiaz in #444
- Throw error for unsupported activation strategies by @kylesayrs in #446
- [Transform] Better dispatch support for offloaded and multi-gpu by @kylesayrs in #423
- [Quantization Format] Add functionality to infer format by @dsikka in #441
- Revert "[Quantization Format] Add functionality to infer format (#441)" by @dsikka in #451
- Raise ValueError when nvfp4 pack tensor has odd number of columns by @fynnsu in #402
- [Quantization] Allow dynamic group activation quantization by @kylesayrs in #450
- Fix lint error on main by @fynnsu in #460
- [Accelerate] Remove `is_module_offloaded` and `update_prefix_dict` by @kylesayrs in #366
- [Decompression] Clean-up and some fixes by @dsikka in #461
- [ModelCompressor] Remove missing keys and missing modules by @dsikka in #462
- [Logging] Support use of loguru by @kylesayrs in #454
- [Utils] Deprecate `safe_permute` by @kylesayrs in #464
- [Quantization Format] Add functionality to infer format by @dsikka in #452
- [licensing refactor] remove `frozendict` dependency, use `types.MappingProxyType` instead by @brian-dellabetta in #469
- [Transform] Support loading random hadamards on meta device by @kylesayrs in #445
- [transforms] TransformScheme.block_size, deprecate head_dim by @brian-dellabetta in #466
- [Multi-Modifier] Scoped apply quantization config by @brian-dellabetta in #432
- [Model Compressor] Move infer call to from_pretrained_model method by @dsikka in #470
- Always save g_idx when initialized in quantization compressor by @rahul-tuli in #467
- Add back get unexpected keys to support transformers lower bound by @dsikka in #475
- Improve Hugging Face API utilization in tests by @dbarbuzzi in #473
- [Transform] Revert deprecation of `TransformScheme.head_dim` for compatibility with vllm by @brian-dellabetta in #472
- [cicd] Include Python version in artifact name by @dbarbuzzi in #477
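Several entries above (#406, #429, #431) refactor the module/parameter matching logic. A simplified sketch of the general idea, not the library's actual utilities: a quantization config lists targets that are matched against each module either by class name or, with a `re:` prefix, by regex on the module's dotted name. The function name here is hypothetical:

```python
import re

def match_name_or_class(module_name, module_cls, targets):
    """Toy target matcher: a target is either a class name (e.g. "Linear")
    or a "re:"-prefixed regex applied to the module's dotted name.
    """
    for target in targets:
        if target.startswith("re:"):
            if re.match(target[3:], module_name):
                return True
        elif target == module_cls:
            return True
    return False

# Match every q_proj by name, or any Linear by class:
hit = match_name_or_class("model.layers.0.self_attn.q_proj",
                          "Linear", ["re:.*q_proj"])
```

Centralizing this matching is what makes scoped configs (see #432, "Scoped apply quantization config") tractable: each scheme owns a target list and the matcher decides which modules it claims.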
Full Changelog: 0.11.0...0.12.0
Compressed Tensors v0.11.0
What's Changed
- Fix nightly issues with python 3.11 by @dhuangnm in #371
- Clean up disk space and restore ubuntu 22.04 runner by @dhuangnm in #373
- [Transform] Update tests to use conftest file by @kylesayrs in #367
- [Transform] Hadamard Permutations by @kylesayrs in #329
- [Transform] Construct on GPU, cache on CPU by @kylesayrs in #352
- enable code coverage collection and reporting INFERENG-1049 by @derekk-nm in #382
- Deprecate `iter_named_leaf_modules` and `iter_named_quantizable_modules` by @kylesayrs in #381
- Added support for compression on meta device by @shanjiaz in #376
- Add torch.float64 as a viable dtype for scales by @eldarkurtic in #379
- [Transform] `apply_transform_config` by @kylesayrs in #348
- [Compression] Fix compression device movement in cases of indexed devices by @kylesayrs in #384
- Enable code coverage report for nightly tests by @dhuangnm in #388
- [Bugfix] Only quant-compress modules with weight quantization by @kylesayrs in #387
- [Transform] Fix config serialization by @kylesayrs in #396
- [Transform] Do not fuse div operation into hadamard matrices by @kylesayrs in #395
- [Transform] Implement multi-headed transforms by @kylesayrs in #383
- Support DeepSeekV3-style block FP8 quantization by @mgoin in #372
- [Transform] [Utils] Canonical matching utilities by @kylesayrs in #392
- [Bugfix] Safeguard against submodule parameter deletion in decompress_model by @kylesayrs in #347
- fix block quantization initialization by @shanjiaz in #403
- [Utils] Skip internal modules when matching by @kylesayrs in #404
- [Quantization][Decompression] Fix QDQ for dynamic quant; Update NVFP4 Compression Params by @dsikka in #407
- [Utils] Support matching vLLM modules by @kylesayrs in #413
- Fix block size inference logic by @shanjiaz in #411
- [Transform] Serialize with tied weights by @kylesayrs in #370
- [Transform] [Utils] Support precision, add torch dtype validation by @kylesayrs in #414
- [Transform] Serialize transforms config by @kylesayrs in #412
- Error when configs are created with unrecognized fields by @kylesayrs in #386
- revert forbid constraint on QuantizationConfig by @brian-dellabetta in #418
- Revert "[Transform] Serialize transforms config (#412)" by @dsikka in #419
- added wrapper for execution device by @shanjiaz in #417
- [Transform] Serialize config (include format) by @dsikka in #420
- exclude transform_config from quantization_config parse by @brian-dellabetta in #421
- [Quantization] Support more than one quant-compressor by @dsikka in #415
- [QuantizationScheme] Validate format by @dsikka in #424
- [Utils] Expand `is_match` by @kylesayrs in #416
- fix match.py syntax by @shanjiaz in #426
- [Offload] Fully remove dispatch by @kylesayrs in #427
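On #372 (DeepSeekV3-style block FP8): the weight matrix is tiled into fixed-size blocks (128×128 in DeepSeekV3) and each tile gets its own scale of `amax(tile) / FP8_E4M3_MAX`. A pure-Python sketch under that assumption, shown on nested lists rather than tensors:

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def block_scales(weight, block=128):
    """One scale per (block x block) tile: amax(tile) / FP8 max.

    Illustrative only; real implementations vectorize this over tensors.
    """
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(abs(weight[r][c])
                       for r in range(r0, min(r0 + block, rows))
                       for c in range(c0, min(c0 + block, cols)))
            row_scales.append((amax or 1.0) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales

# A 2x2 weight with one 2x2 block: the single scale is set by amax = 448.
scales = block_scales([[1.0, 2.0], [3.0, 448.0]], block=2)
```

Non-divisible shapes (where `rows` or `cols` is not a multiple of `block`) fall out of the `min(...)` clamps here; handling them robustly in the real compressor is the subject of #547 in the v0.14.0 notes.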
Full Changelog: 0.10.2...0.11.0
Compressed Tensors v0.10.2
What's Changed
- [Hotfix] Implement quantization compressor methods on dense compressor by @kylesayrs in #344
- [Hotfix] Implement method on dense compressor by @kylesayrs in #345
- [Transform] Factory classes with shared memory and offloading by @kylesayrs in #316
- [Transform] [Bugfix] Fix enum value serialization in python>=3.11 by @kylesayrs in #350
- Remove redundant call by @eldarkurtic in #349
- [Accelerate] Rename and simplify `force_cpu_offload` by @kylesayrs in #354
- [Transform] Extend set of known Hadamard matrices by @kylesayrs in #351
- [Accelerate] Fix `offloaded_dispatch`, implement `disable_offloading` by @kylesayrs in #355
- [Accelerate] Extend functionality of `register_offload_parameter` by @kylesayrs in #356
- [Bugfix] Fix saving of models dispatched by `offloaded_dispatch` by @kylesayrs in #357
- [Bugfix] Only update direct params in `disable_offloading` by @kylesayrs in #360
- reference updated reportportal_submit_execution_results action by @derekk-nm in #362
- [Accelerate] Expand `get_execution_device` to support models by @kylesayrs in #363
- [Accelerate] Fix typos in `get_execution_device` by @kylesayrs in #365
New Contributors
- @derekk-nm made their first contribution in #362
Full Changelog: 0.10.1...0.10.2
Compressed Tensors v0.10.1
What's Changed
- [Transform] Hadamard and Matrix Transform Utils by @kylesayrs in #330
- Fix error on import whenever accelerate is absent by @maresb in #342
Full Changelog: 0.10.0...0.10.1
Compressed Tensors v0.10.0
What's Changed
- Updates to build system by @dbarbuzzi in #304
- [Utils] add align_modules by @kylesayrs in #282
- Enable module state_dict compression, simplify compression logic by @kylesayrs in #302
- Fix `_initialize_scale_zero_point` initializing on the wrong device by @mgoin in #295
- Revert "Enable module state_dict compression, simplify compression lo… by @kylesayrs in #306
- [Bugfix] Fix shape calculation for group quantization by @kylesayrs in #308
- Enable module state_dict compression, simplify compression logic by @kylesayrs in #307
- Clarify decompression return type by @kylesayrs in #310
- Clarify `match_param_name` return type by @kylesayrs in #312
- [Compressor][NVFP4] Support FP4 Compression by @dsikka in #311
- [NVFP4] Update FloatArgs and NVFP4 by @dsikka in #313
- fix signatures on model_validator functions by @brian-dellabetta in #314
- [Performance] Add memory compression and decompression pathways by @kylesayrs in #301
- Model Compression: Set compression status by @kylesayrs in #318
- [NVFP4] Enable Fp4 Quantization; introduce / apply global_scales by @dsikka in #315
- [NVFP4] Skip fused global scale calculation if already fused by @dsikka in #322
- Update default observer to be `MSE` by @shanjiaz in #300
- [Misc] Generics typehinting for `RegistryMixin` by @kylesayrs in #320
- Revert "Update default observer to be `MSE` (#300)" by @dsikka in #323
- [NVFP4] Add `tensor_group` strategy; enable NVFP4 Activations by @dsikka in #317
- [Transforms] Transform Args, Scheme, and Config by @kylesayrs in #321
- [NVFP4] Expand dynamic types, clean-up conditions by @dsikka in #325
- Use different runner for UPLOAD job by @dbarbuzzi in #327
- [NVFP4] Use torch.compile when rounding to NVFP4 by @dsikka in #331
- [Tests] Update test_fp8_quant.py by @dsikka in #337
- [Tests] Fix test scale init for group quant by @dsikka in #338
- [Quantization] Update group quantization by @dsikka in #336
- [NVFP4] update global scale generation by @dsikka in #339
- [Transform] Accelerate Utilities by @kylesayrs in #328
- Model Compression: Delete offload by @kylesayrs in #319
- [Decompression] Keep unused parameters when decompressing from memory by @kylesayrs in #340
- [NVFP4] Small Nits by @dsikka in #341
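Much of the NVFP4 work above (#311, #315, #331) ultimately rounds values into the E2M1 (FP4) value set {0, 0.5, 1, 1.5, 2, 3, 4, 6} per sign. A toy round-to-nearest, purely illustrative (ties here break toward zero; hardware typically rounds ties to even mantissa):

```python
# The eight non-negative magnitudes representable in E2M1 (FP4).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round x to the nearest representable E2M1 magnitude, keeping its sign.

    Values beyond +/-6 saturate to the max magnitude.
    """
    mag = min(E2M1_VALUES, key=lambda v: (abs(abs(x) - v), v))
    return -mag if x < 0 else mag

rounded = [round_to_fp4(x) for x in (2.4, -7.0, 0.2)]  # saturates -7.0 to -6.0
```

In the real scheme each 16-element group is first divided by its FP8 local scale (itself scaled by a per-tensor global scale, the `global_scales` of #315) so that the scaled values land in this range before rounding.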
Full Changelog: 0.9.4...0.10.0