Releases: bitsandbytes-foundation/bitsandbytes
8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub
This release brings 8-bit Lion to bitsandbytes. Compared to standard 32-bit Adam, it is 8x more memory efficient.
Furthermore, models can now be serialized in 8-bit and pushed to the HuggingFace Hub. This means you can also load them from the Hub in 8-bit, making big models much easier to download and load into CPU memory.
To use this feature, you need the newest transformers release (this will likely be integrated into the HF transformers release tomorrow).
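As a rough sketch of how the new optimizer slots into a training loop (the class name bnb.optim.Lion8bit and its keyword arguments are assumptions here; check the documentation for the exact API):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Hypothetical usage: 8-bit Lion keeps a single 8-bit momentum buffer per parameter,
# compared to two 32-bit buffers for standard Adam.
optimizer = bnb.optim.Lion8bit(model.parameters(), lr=1e-4, weight_decay=1e-2)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()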
In this release, CUDA 10.2 and GTX 700/K10 GPUs are deprecated in order to allow for broad support of bfloat16 in release 0.39.0.
Features:
- Support for 32-bit and 8-bit Lion has been added. Thank you @lucidrains
- Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub; see the sketch after this list. Thank you @mryab
- New bug report features
python -m bitsandbytes now gives extensive details for debugging CUDA setup failures.
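For the Hub integration, here is a minimal sketch of saving a model in 8-bit via transformers (the checkpoint and repository names are placeholders, and load_in_8bit/device_map require a recent transformers release):

from transformers import AutoModelForCausalLM

# Load a model with LLM.int8() quantization applied.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",  # example checkpoint
    device_map="auto",
    load_in_8bit=True,
)

# Serialize the 8-bit weights locally, or push them to the HuggingFace Hub
# so they can be downloaded and loaded directly in 8-bit.
model.save_pretrained("bloom-560m-8bit")
# model.push_to_hub("my-username/bloom-560m-8bit")  # placeholder repo name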
Bug fixes:
- Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins
- Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.
Improvements:
- Improved the CUDA Setup procedure by doing a more extensive search for CUDA libraries
Deprecated:
- Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
- Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0
Int8 Matmul backward for all GPUs
This release changed the default bitsandbytes matrix multiplication (bnb.matmul) to support memory-efficient backward by default. Additionally, matrix multiplication with 8-bit weights is now supported for all GPUs.
During the backward pass, the Int8 weights are converted back to a row-major layout through an inverse index. The general matmul for all GPUs with Int8 weights works by casting the weights from Int8 to the input's data type (TF32/FP32/BF16/FP16) and then doing standard matrix multiplication. As such, the matrix multiplication during the backward pass and on non-tensor-core devices is memory efficient, but slow.
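As a hedged sketch of what this enables (layer sizes and the outlier threshold are illustrative; the exact conversion recipe may differ):

import torch
import bitsandbytes as bnb

# Build an fp16 linear layer, then swap in an Int8 layer with frozen 8-bit weights.
fp16_linear = torch.nn.Linear(1024, 1024).half()
int8_linear = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False, threshold=6.0)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # weights are quantized to Int8 when moved to the GPU

x = torch.randn(4, 1024, device="cuda", dtype=torch.float16, requires_grad=True)
out = int8_linear(x)
out.float().sum().backward()  # backward through the Int8 weights: memory efficient, but slow
print(x.grad.shape)           # input gradients are now available on any GPU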
These contributions were the work of Alexander Borzunov and Yozh, thank you!
Features:
- Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
- Int8 is now supported on all GPUs. On devices with compute capability < 7.5, the Int8 weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov
Improvements:
- Improved logging for the CUDA detection mechanism.
Ada/Hopper+fake k-bit quantization
The 0.36.0 release brings a lot of bug fixes, improvements, and new features:
- better automatic CUDA detection & setup
- better automatic compilation instruction generation in the case of failures
- CUDA 11.8 and 12.0 support
- Ada (RTX 40s series) and Hopper (H100) support
- Added fake k-bit float, int, and quantile quantization (2 <= k <= 8, Int8 storage)
Additional features include fake k-bit quantization and smaller block sizes for block-wise quantization, which are used in our k-bit Inference Scaling Laws work. Fake k-bit quantization is useful for simulating k-bit data types, but it does not provide memory or runtime benefits. Here is how to use these features.
Faster block-wise quantization now allows for very small block sizes, down to 64:
from bitsandbytes import functional as F

# X is the torch tensor to quantize; block sizes down to 64 are now supported
q, state = F.quantize_blockwise(X, blocksize=64)
X = F.dequantize_blockwise(q, state, blocksize=64)

k-bit fake quantization via block-wise quantization:
# 4-bit float quantization stored as Int8
from bitsandbytes import functional as F
# 4-bit float with 2 exponent bits
code = F.create_fp8_map(signed=True, exponent_bits=2, precision_bits=1, total_bits=4).cuda()
q, state = F.quantize_blockwise(X, code=code) # q has 4-bit indices which represent values in the codebook
X = F.dequantize_blockwise(q, state)

0.36.0: Improvements, Ada/Hopper support, fake k-bit quantization
Features:
- CUDA 11.8 and 12.0 support added
- support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0)
- support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added
- Added CUDA instruction generator to fix some installations.
- Added additional block sizes for quantization {64, 128, 256, 512, 1024}
- Added SRAM Quantile algorithm to quickly estimate less than 256 quantiles
- Added option to suppress the bitsandbytes welcome message (@Cyberes)
Regression:
- Compute capability 3.0 removed: the GTX 600 and 700 series are no longer supported (except GTX 780 and GTX 780 Ti)
Bug fixes:
- fixed a bug where overly long directory names would crash the CUDA setup #35 (@tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error #34 (@tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
- fixed a bug where the CUDA setup failed due to a wrong function call.
- fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
- fixed a bug where the CUDA setup failed when the CUDA runtime was found but not the CUDA library.
- fixed a bug where not finding the CUDA runtime led to an incomprehensible error.
- fixed a bug where, with CUDA missing, the default was an error instead of loading the CPU library
- fixed a bug where the CC version of the GPU was not detected appropriately (@BlackHC)
- fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements
Improvements:
- multiple improvements in formatting, removal of unused imports, and slight performance improvements (@tomaarsen)
- StableEmbedding layer now has device and dtype parameters to make it 1:1 replaceable with regular Embedding layers (@lostmsu); see the sketch after this list
- runtime performance of block-wise quantization slightly improved
- added an error message for the case where multiple libcudart.so libraries are installed and bitsandbytes picks the wrong one
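For the StableEmbedding change, a small sketch of the drop-in replacement (the keyword arguments mirror torch.nn.Embedding; treat the exact signature as an assumption):

import torch
import bitsandbytes as bnb

# StableEmbedding as a 1:1 replacement for torch.nn.Embedding,
# now accepting the same device/dtype keyword arguments.
emb = bnb.nn.StableEmbedding(30522, 768, device="cuda", dtype=torch.float32)
tokens = torch.randint(0, 30522, (2, 16), device="cuda")
hidden = emb(tokens)  # shape: (2, 16, 768)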
CUDA 11.8 Support for Dreambooth finetuning
0.35.0
CUDA 11.8 support and bug fixes
Features:
- CUDA 11.8 support added and binaries added to the PyPI release.
Bug fixes:
- fixed a bug where overly long directory names would crash the CUDA setup #35 (thank you @tomaarsen)
- fixed a bug where CPU installations on Colab would run into an error #34 (thank you @tomaarsen)
- fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
Memory efficient backprop
This release introduces memory-efficient backprop through frozen weights, where the gradient is calculated from the 8-bit weights but is computed in fp16. This is useful for creating low-rank adapters (LoRA) for fine-tuning large models.
This is a feature contributed by @dbaranchuk and @justheuristic.
0.34.0
Bug fixes and memory-efficient backprop
Features:
- Linear8bitLt layer now supports memory_efficient_backward=True, which enables backprop of gradients through frozen weights; see the sketch below.
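A minimal sketch of the intended use, assuming the memory_efficient_backward keyword on Linear8bitLt and an illustrative trainable layer in front of the frozen one:

import torch
import bitsandbytes as bnb

# Frozen Int8 layer; gradients can flow through it to earlier trainable modules.
frozen = bnb.nn.Linear8bitLt(
    1024, 1024,
    has_fp16_weights=False,          # keep the weights frozen in Int8
    memory_efficient_backward=True,  # enable backprop through the frozen weights
).cuda()

adapter = torch.nn.Linear(1024, 1024).half().cuda()  # trainable, e.g. a low-rank adapter path

x = torch.randn(2, 1024, device="cuda", dtype=torch.float16)
out = frozen(adapter(x))
out.float().mean().backward()
print(adapter.weight.grad is not None)  # True: the gradient passed through the frozen Int8 layer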
Bug fixes:
- fixed an issue where too many threads were created in blockwise quantization on the CPU for large tensors
0.33.0: Various bug fixes
0.33.0
Various bug fixes
Features:
- CPU quantization now supports a variable blocksize to enhance quantization speed or precision; see the sketch after this list. 19a7adc
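A small sketch of the new option (the chosen blocksize is illustrative; check which block sizes the CPU path accepts):

import torch
from bitsandbytes import functional as F

X_cpu = torch.randn(1 << 20)  # a CPU tensor
# Larger blocks quantize faster; smaller blocks are more precise.
q, state = F.quantize_blockwise(X_cpu, blocksize=2048)
X_restored = F.dequantize_blockwise(q, state, blocksize=2048)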
Bug fixes:
- fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail 19a7adc
- fixed a bug where CPU binaries would fail if no GPU was detected eab4d82
- fixed an issue where CPU binaries caused additional stdout messages 92a3363
- fixed an import of bnb.utils 2e630b5
We thank @mryab, @mbrukman, @chessgecko, and @dbaranchuk for pull requests with bug fixes and new features.