This repository hosts my solutions to the NF4 dequantization challenge (using Triton) and the Memory Efficient Backprop challenge. So far I have completed Puzzle A, converting an NF4-quantized tensor to FP16/BF16 in a single Triton kernel, and Puzzle E, implementing memory-efficient backpropagation.
Puzzle A: NF4 Dequantization

- Difficulty: Hard
- Max Points: 14
- Single Triton Kernel: Implement a kernel that performs the double dequantization (of both the absmax values and the weights) in one go; a sketch of the approach follows this list.
- Performance: Achieve at least a 1.15× speedup over Unsloth's fast_dequantize, without using large intermediate memory buffers.
- Compatibility: Must work on a Tesla T4. Should not use `torch.compile` (using `trace.enabled` is allowed).
- Implementation Constraints: No CUDA allowed (although custom ASM inside Triton is acceptable).
- Testing: Use the provided `test_dequantize_function` to validate the implementation.
- Additional Guidance: References include Unsloth's fast_dequantize, bitsandbytes' dequantize_blockwise, and Tim Dettmers' YouTube videos on 8-bit optimizers.
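The sketch below shows how a single-kernel double dequantization might look. It assumes the default bitsandbytes NF4 layout (two 4-bit codes packed per byte, 64 weights per quantized absmax, and a nested blocksize of 256 for the absmax quantization); the pointer names, the launcher, and the nibble ordering are illustrative assumptions, not the repository's actual kernel.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _nf4_dequant_kernel(
    qweight_ptr,   # packed uint8 weights, two 4-bit NF4 codes per byte
    absmax8_ptr,   # uint8 quantized absmax, one per block of 64 weights
    code_ptr,      # 256-entry fp32 codebook used to dequantize absmax8
    absmax32_ptr,  # fp32 scale, one per 256 absmax8 entries (nested quantization)
    offset,        # scalar fp32 offset added back to every dequantized absmax
    nf4_lut_ptr,   # 16-entry fp32 NF4 codebook
    out_ptr,       # fp16/bf16 output with 2 * n_bytes elements
    n_bytes,
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(0)
    byte_offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = byte_offs < n_bytes

    # Streaming load: the packed weights are read exactly once, so hint eviction.
    packed = tl.load(qweight_ptr + byte_offs, mask=mask, other=0,
                     eviction_policy="evict_first").to(tl.int32)
    hi = (packed >> 4) & 0x0F
    lo = packed & 0x0F

    # 64 weights per absmax block -> 32 packed bytes per block.
    block_idx = byte_offs // 32
    absmax8 = tl.load(absmax8_ptr + block_idx, mask=mask, other=0).to(tl.int32)
    scale32 = tl.load(absmax32_ptr + block_idx // 256, mask=mask, other=0.0)
    absmax = tl.load(code_ptr + absmax8, mask=mask, other=0.0) * scale32 + offset

    w_hi = tl.load(nf4_lut_ptr + hi, mask=mask, other=0.0) * absmax
    w_lo = tl.load(nf4_lut_ptr + lo, mask=mask, other=0.0) * absmax

    # bitsandbytes packs the first element of each pair into the high nibble;
    # swap the two stores if your layout differs.
    out_ty = out_ptr.dtype.element_ty
    tl.store(out_ptr + 2 * byte_offs, w_hi.to(out_ty), mask=mask)
    tl.store(out_ptr + 2 * byte_offs + 1, w_lo.to(out_ty), mask=mask)


def dequant_nf4(qweight, absmax8, code, absmax32, offset, nf4_lut, out, BLOCK=1024):
    # One program per BLOCK packed bytes; no intermediate buffers are allocated.
    # offset is assumed to be a plain Python float here.
    n_bytes = qweight.numel()
    grid = (triton.cdiv(n_bytes, BLOCK),)
    _nf4_dequant_kernel[grid](qweight, absmax8, code, absmax32, offset,
                              nf4_lut, out, n_bytes, BLOCK=BLOCK)
    return out
```

The output dtype follows `out.dtype` (FP16 or BF16), and the `eviction_policy` hint on the streaming load is the kind of cache-eviction optimization referred to in the checklist below.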
Highlights of the implementation:

- Single Triton Kernel: The entire dequantization process runs in one kernel.
- Speedup Requirement: The solution is designed to meet or exceed the 1.15× speedup over Unsloth's fast_dequantize.
- No `torch.compile`: The implementation avoids `torch.compile` and relies only on allowed techniques.
- Cache Eviction: The kernel uses cache eviction hints to optimize memory accesses.
- FP16 and BF16 Support: The implementation supports both FP16 and BF16 outputs.
- Custom ASM (if applicable): Works on the T4 without BF16; BF16 has been tested on an A100.

Tolerances must be relaxed for BF16 tests because of the drop in mantissa precision: adjust `atol` and `rtol` to 0.01 for BF16 (they are currently set to 0.001), as in the sketch below.
Puzzle E: Memory Efficient Backprop

- Difficulty: Medium to Hard
- Max Points: 10
- Custom Autograd Function: Implement a memory-efficient backpropagation algorithm using a custom PyTorch autograd function; a sketch of the pattern follows this list.
- Memory Efficiency: Reduce memory usage by splitting the computation into chunks and recomputing intermediate activations during the backward pass.
- Gradient Accuracy: Ensure that the gradients computed by the memory-efficient approach match those of standard backpropagation.
- Testing: Validate the implementation with a comparison test between the standard and memory-efficient approaches.
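The sketch below illustrates the general pattern, assuming the usual setup for this puzzle: a final projection `X @ W.t()` followed by a loss that takes raw logits and integer labels with mean reduction (e.g. `F.cross_entropy`). The class and argument names are illustrative, not the repository's actual code.

```python
import torch


class MemoryEfficientLinearLoss(torch.autograd.Function):
    """Chunked projection + loss that never materializes the full logits matrix."""

    @staticmethod
    def forward(ctx, X, W, labels, loss_fn, n_chunks):
        # X: (N, hidden), W: (vocab, hidden), labels: (N,)
        ctx.save_for_backward(X, W, labels)
        ctx.loss_fn, ctx.n_chunks = loss_fn, n_chunks
        total = torch.zeros((), device=X.device, dtype=torch.float32)
        for x_c, y_c in zip(X.chunk(n_chunks), labels.chunk(n_chunks)):
            logits = (x_c @ W.t()).float()            # upcast kept
            total += loss_fn(logits, y_c) * y_c.numel()
        return total / labels.numel()

    @staticmethod
    def backward(ctx, grad_output):
        X, W, labels = ctx.saved_tensors
        loss_fn, n_chunks = ctx.loss_fn, ctx.n_chunks
        dX, dW = torch.zeros_like(X), torch.zeros_like(W)
        N = labels.numel()
        for x_c, y_c, dx_c in zip(X.chunk(n_chunks), labels.chunk(n_chunks),
                                  dX.chunk(n_chunks)):
            x_c = x_c.detach().requires_grad_(True)
            w_c = W.detach().requires_grad_(True)
            with torch.enable_grad():
                logits = (x_c @ w_c.t()).float()      # recompute activations per chunk
                loss = loss_fn(logits, y_c) * (y_c.numel() / N)
            # grad_outputs applies the upstream gradient to the local gradients.
            gx, gw = torch.autograd.grad(loss, (x_c, w_c), grad_outputs=grad_output)
            dx_c.copy_(gx)
            dW += gw
        return dX, dW, None, None, None
```

Under the same assumptions, usage would look like `loss = MemoryEfficientLinearLoss.apply(X, W, labels, F.cross_entropy, 4)` followed by `loss.backward()`, with `X` and `W` requiring grad. The checklist that follows records what the final implementation covers.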
- Chunked Forward/Backward Pass: The computation is split into chunks to avoid storing all activations.
- Upstream Gradient Handling: Upstream gradients are correctly multiplied with the local gradients.
- Loss Consistency: The training loss from the memory-efficient backprop matches the standard approach within a small tolerance.
- Don't remove upcast: The upcast is still present.
- 50% reduction of VRAM: VRAM usage is reduced by around 50%, with even larger savings at smaller chunk sizes (a measurement sketch follows this list).
- Show cross-entropy loss: The loss remains consistent between the standard and memory-efficient approaches.
- Show other functions: Tested with both CrossEntropyLoss and NLLLoss.
- No hardcoded gradients: Gradients are computed with autograd.
- Works with Llama 1B: Compatible with Llama 1B tensor sizes.
- Dynamic chunk sizes: Tested with different chunk sizes, allowing dynamic adjustment.
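One way to back the VRAM numbers is to compare peak allocation with and without chunking. This is a hypothetical measurement helper (including the `run_training_step` runner in the usage comment), not the repository's test code:

```python
import torch


def peak_vram_mib(fn, *args, **kwargs):
    """Run fn once and report the peak GPU memory it allocated, in MiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 2**20


# Example usage, assuming a run_training_step(n_chunks=...) helper exists:
# for n_chunks in (1, 2, 4, 8):
#     _, mib = peak_vram_mib(run_training_step, n_chunks=n_chunks)
#     print(f"{n_chunks} chunks: {mib:.0f} MiB peak")
```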
I'm aiming for an ML intern position.
For a detailed walkthrough and to interact with the implementation, check out the Dequantization Notebook.
I will be adding more puzzles and enhancements to this project in the coming days. Stay tuned for updates to both the repository and the notebook.
This project is licensed under the Apache 2.0 License.