This repository hosts my solutions to the NF4 dequantization challenge (using Triton) and the Memory Efficient Backprop challenge. So far I have completed Puzzle A, converting an NF4-quantized tensor to FP16/BF16 in a single Triton kernel, and Puzzle E, implementing memory-efficient backpropagation.
Puzzle A: NF4 Dequantization

- Difficulty: Hard
- Max Points: 14
- Single Triton Kernel: Implement a kernel that performs the double dequantization (of both the absmax values and the weights) in one go; a sketch of the approach follows this list.
- Performance: Achieve at least a 1.15× speedup over Unsloth's fast_dequantize, without using large intermediate memory buffers.
- Compatibility: Must work on a Tesla T4. Should not use `torch.compile` (using `trace.enabled` is allowed).
- Implementation Constraints: No CUDA allowed (although custom ASM inside Triton is acceptable).
- Testing: Use the provided `test_dequantize_function` to validate the implementation.
- Additional Guidance: References include Unsloth's fast_dequantize, bitsandbytes' dequantize_blockwise, and Tim Dettmers' YouTube videos on 8-bit optimizers.
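The sketch below shows how a single-kernel double dequantization might look. It assumes the default bitsandbytes NF4 layout (two 4-bit codes packed per byte, 64 weights per quantized absmax, and a nested blocksize of 256 for the absmax quantization); the pointer names, the launcher, and the nibble ordering are illustrative assumptions, not the repository's actual kernel.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _nf4_dequant_kernel(
    qweight_ptr,   # packed uint8 weights, two 4-bit NF4 codes per byte
    absmax8_ptr,   # uint8 quantized absmax, one per block of 64 weights
    code_ptr,      # 256-entry fp32 codebook used to dequantize absmax8
    absmax32_ptr,  # fp32 scale, one per 256 absmax8 entries (nested quantization)
    offset,        # scalar fp32 offset added back to every dequantized absmax
    nf4_lut_ptr,   # 16-entry fp32 NF4 codebook
    out_ptr,       # fp16/bf16 output with 2 * n_bytes elements
    n_bytes,
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(0)
    byte_offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = byte_offs < n_bytes

    # Streaming load: the packed weights are read exactly once, so hint eviction.
    packed = tl.load(qweight_ptr + byte_offs, mask=mask, other=0,
                     eviction_policy="evict_first").to(tl.int32)
    hi = (packed >> 4) & 0x0F
    lo = packed & 0x0F

    # 64 weights per absmax block -> 32 packed bytes per block.
    block_idx = byte_offs // 32
    absmax8 = tl.load(absmax8_ptr + block_idx, mask=mask, other=0).to(tl.int32)
    scale32 = tl.load(absmax32_ptr + block_idx // 256, mask=mask, other=0.0)
    absmax = tl.load(code_ptr + absmax8, mask=mask, other=0.0) * scale32 + offset

    w_hi = tl.load(nf4_lut_ptr + hi, mask=mask, other=0.0) * absmax
    w_lo = tl.load(nf4_lut_ptr + lo, mask=mask, other=0.0) * absmax

    # bitsandbytes packs the first element of each pair into the high nibble;
    # swap the two stores if your layout differs.
    out_ty = out_ptr.dtype.element_ty
    tl.store(out_ptr + 2 * byte_offs, w_hi.to(out_ty), mask=mask)
    tl.store(out_ptr + 2 * byte_offs + 1, w_lo.to(out_ty), mask=mask)


def dequant_nf4(qweight, absmax8, code, absmax32, offset, nf4_lut, out, BLOCK=1024):
    # One program per BLOCK packed bytes; no intermediate buffers are allocated.
    # offset is assumed to be a plain Python float here.
    n_bytes = qweight.numel()
    grid = (triton.cdiv(n_bytes, BLOCK),)
    _nf4_dequant_kernel[grid](qweight, absmax8, code, absmax32, offset,
                              nf4_lut, out, n_bytes, BLOCK=BLOCK)
    return out
```

The output dtype follows `out.dtype` (FP16 or BF16), and the `eviction_policy` hint on the streaming load is the kind of cache-eviction optimization referred to in the checklist below.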
Highlights of the implementation:

- Single Triton Kernel: The entire dequantization process runs in one kernel.
- Speedup Requirement: The solution is designed to meet or exceed the 1.15× speedup over Unsloth's fast_dequantize.
- No `torch.compile`: The implementation avoids `torch.compile` and relies only on allowed techniques.
- Cache Eviction: The kernel uses cache eviction hints to optimize memory accesses.
- FP16 and BF16 Support: The implementation supports both FP16 and BF16 outputs.
- Custom ASM (if applicable): Works on the T4 without BF16; BF16 has been tested on an A100.

Tolerances must be relaxed for BF16 tests because of the drop in mantissa precision: adjust `atol` and `rtol` to 0.01 for BF16 (they are currently set to 0.001), as in the sketch below.
Puzzle E: Memory Efficient Backprop

- Difficulty: Medium to Hard
- Max Points: 10
- Custom Autograd Function: Implement a memory-efficient backpropagation algorithm using a custom PyTorch autograd function; a sketch of the pattern follows this list.
- Memory Efficiency: Reduce memory usage by splitting the computation into chunks and recomputing intermediate activations during the backward pass.
- Gradient Accuracy: Ensure that the gradients computed by the memory-efficient approach match those of standard backpropagation.
- Testing: Validate the implementation with a comparison test between the standard and memory-efficient approaches.
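The sketch below illustrates the general pattern, assuming the usual setup for this puzzle: a final projection `X @ W.t()` followed by a loss that takes raw logits and integer labels with mean reduction (e.g. `F.cross_entropy`). The class and argument names are illustrative, not the repository's actual code.

```python
import torch


class MemoryEfficientLinearLoss(torch.autograd.Function):
    """Chunked projection + loss that never materializes the full logits matrix."""

    @staticmethod
    def forward(ctx, X, W, labels, loss_fn, n_chunks):
        # X: (N, hidden), W: (vocab, hidden), labels: (N,)
        ctx.save_for_backward(X, W, labels)
        ctx.loss_fn, ctx.n_chunks = loss_fn, n_chunks
        total = torch.zeros((), device=X.device, dtype=torch.float32)
        for x_c, y_c in zip(X.chunk(n_chunks), labels.chunk(n_chunks)):
            logits = (x_c @ W.t()).float()            # upcast kept
            total += loss_fn(logits, y_c) * y_c.numel()
        return total / labels.numel()

    @staticmethod
    def backward(ctx, grad_output):
        X, W, labels = ctx.saved_tensors
        loss_fn, n_chunks = ctx.loss_fn, ctx.n_chunks
        dX, dW = torch.zeros_like(X), torch.zeros_like(W)
        N = labels.numel()
        for x_c, y_c, dx_c in zip(X.chunk(n_chunks), labels.chunk(n_chunks),
                                  dX.chunk(n_chunks)):
            x_c = x_c.detach().requires_grad_(True)
            w_c = W.detach().requires_grad_(True)
            with torch.enable_grad():
                logits = (x_c @ w_c.t()).float()      # recompute activations per chunk
                loss = loss_fn(logits, y_c) * (y_c.numel() / N)
            # grad_outputs applies the upstream gradient to the local gradients.
            gx, gw = torch.autograd.grad(loss, (x_c, w_c), grad_outputs=grad_output)
            dx_c.copy_(gx)
            dW += gw
        return dX, dW, None, None, None
```

Under the same assumptions, usage would look like `loss = MemoryEfficientLinearLoss.apply(X, W, labels, F.cross_entropy, 4)` followed by `loss.backward()`, with `X` and `W` requiring grad. The checklist that follows records what the final implementation covers.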
- Chunked Forward/Backward Pass: The computation is split into chunks to avoid storing all activations.
- Upstream Gradient Handling: Upstream gradients are correctly multiplied with the local gradients.
- Loss Consistency: The training loss from the memory-efficient backprop matches the standard approach within a small tolerance.
- Don't remove upcast: The upcast is still present.
- 50% reduction of VRAM: VRAM usage is reduced by around 50%, with even larger savings at smaller chunk sizes (a measurement sketch follows this list).
- Show cross-entropy loss: The loss remains consistent between the standard and memory-efficient approaches.
- Show other functions: Tested with both CrossEntropyLoss and NLLLoss.
- No hardcoded gradients: Gradients are computed with autograd.
- Works with Llama 1B: Compatible with Llama 1B tensor sizes.
- Dynamic chunk sizes: Tested with different chunk sizes, allowing dynamic adjustment.
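One way to back the VRAM numbers is to compare peak allocation with and without chunking. This is a hypothetical measurement helper (including the `run_training_step` runner in the usage comment), not the repository's test code:

```python
import torch


def peak_vram_mib(fn, *args, **kwargs):
    """Run fn once and report the peak GPU memory it allocated, in MiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 2**20


# Example usage, assuming a run_training_step(n_chunks=...) helper exists:
# for n_chunks in (1, 2, 4, 8):
#     _, mib = peak_vram_mib(run_training_step, n_chunks=n_chunks)
#     print(f"{n_chunks} chunks: {mib:.0f} MiB peak")
```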
I'm aiming for an ML intern position.
For a detailed walkthrough and to interact with the implementation, check out the Dequantization Notebook.
I will be adding more puzzles and enhancements to this project in the coming days. Stay tuned for updates to both the repository and the notebook.
This project is licensed under the Apache 2.0 License.