Unsloth_puzzle

NF4 Triton Dequantization Challenge

Overview

This repository hosts my solutions to the Unsloth challenge puzzles, written in Triton and PyTorch. So far I have completed Puzzle A, which converts an NF4-quantized tensor to FP16/BF16 in a single Triton kernel, and Puzzle E, which implements memory-efficient backpropagation.

Puzzle A: Convert NF4 to Triton

  • Difficulty: Hard
  • Max Points: 14

Challenge Requirements

  • Single Triton Kernel:
    Implement a kernel that performs the double dequantization (of both the absmax values and the weights) in one pass; a pure-PyTorch reference of this math follows the list.

  • Performance:
    Achieve at least a 1.15× speedup compared to Unsloth's fast_dequantize, without using large intermediate memory buffers.

  • Compatibility:
    Must work on Tesla T4.
    Should not use torch.compile (using trace.enabled is allowed).

  • Implementation Constraints:
    No CUDA allowed (although custom CUDA code inside Triton kernels is acceptable).

  • Testing:
    Use the provided test_dequantize_function to validate the implementation.

  • Additional Guidance:
    References include Unsloth’s fast_dequantize function, bitsandbytes’ dequantize_blockwise, and insights from Tim Dettmers’ YouTube videos on 8-bit optimizers.
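For orientation, here is a minimal pure-PyTorch sketch of the double dequantization that a single fused Triton kernel is meant to perform. It is not the repository's kernel: it assumes a bitsandbytes-style NF4 layout (two 4-bit codes per byte with the first value in the high nibble, a weight blocksize of 64, and absmax values stored as uint8 codes alongside a 256-entry lookup table, a per-group scale, and an offset), and all argument names are illustrative.

```python
import torch

# The 16 NF4 levels used by bitsandbytes (QLoRA).
NF4_TABLE = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_dequant_reference(packed_u8, absmax_u8, absmax_code, absmax_scale,
                          absmax_offset, blocksize=64, blocksize2=256,
                          out_dtype=torch.float16):
    """Two-level dequantization: first the absmax values, then the weights."""
    # Level 1: recover one float32 absmax per weight block from its uint8 code.
    absmax = absmax_code[absmax_u8.long()]
    absmax = absmax * absmax_scale.repeat_interleave(blocksize2)[: absmax.numel()]
    absmax = absmax + absmax_offset
    # Level 2: unpack two 4-bit indices per byte (high nibble first),
    # look them up in the NF4 table, and scale by the block's absmax.
    high = (packed_u8 >> 4).long()
    low = (packed_u8 & 0x0F).long()
    idx = torch.stack((high, low), dim=-1).reshape(-1)
    table = NF4_TABLE.to(packed_u8.device)
    weights = table[idx] * absmax.repeat_interleave(blocksize)[: idx.numel()]
    return weights.to(out_dtype)
```

A fused kernel can perform both levels per block inside one program, which is how the single-kernel requirement avoids materializing a large intermediate float32 absmax buffer.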

Checklist of Criteria Completed

  • Single Triton Kernel: The entire dequantization process is implemented in one kernel.
  • Speedup Requirement: The solution is designed to meet/exceed a 1.15× speedup over Unsloth's fast_dequantize.
  • No Usage of torch.compile: The implementation avoids torch.compile and leverages allowed techniques.
  • Cache Eviction: The kernel uses cache-eviction hints to optimize memory accesses; a minimal illustration of such a hint follows this list.
  • FP16 and BF16 Support: The implementation supports both FP16 and BF16 outputs.
  • Custom ASM (if applicable).
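The snippet below is not the repository's kernel; it is only a minimal illustration of how a Triton load can carry a cache-eviction hint via the eviction_policy argument, which is the mechanism the cache-eviction item refers to.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # "evict_first" hints that this streamed data will not be reused,
    # so it should be the first candidate for eviction from cache.
    x = tl.load(src_ptr + offsets, mask=mask, eviction_policy="evict_first")
    tl.store(dst_ptr + offsets, x, mask=mask)
```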

Precision Notes

The solution works on a T4, which has no native BF16 support; the BF16 path has been tested on an A100. Tolerances must be loosened for BF16 tests because of its reduced mantissa precision: adjust atol and rtol to 0.01 for BF16 (currently set to 0.001).
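For example, a BF16 comparison could relax the tolerances like this (the tensor shape and values are arbitrary placeholders, not the repository's test data):

```python
import torch

reference = torch.randn(1024, 1024)
candidate = reference.to(torch.bfloat16).float()  # stand-in for a BF16 kernel output

# BF16 keeps only 8 mantissa bits, so 0.001 is too strict; 0.01 is a safer bound.
torch.testing.assert_close(candidate, reference, atol=0.01, rtol=0.01)
```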

Puzzle E: Memory Efficient Backprop

  • Difficulty: Medium to Hard
  • Max Points: 10

Challenge Requirements

  • Custom Autograd Function:
    Implement a memory-efficient backpropagation algorithm using a custom PyTorch autograd function.

  • Memory Efficiency:
    Reduce memory usage by splitting the computation into chunks and recomputing intermediate activations during the backward pass (a sketch of this approach follows the list).

  • Gradient Accuracy:
    Ensure that the gradients computed in the efficient approach match those of the standard backpropagation method.

  • Testing:
    Validate the implementation using a comparison test between the standard and memory-efficient approaches.
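As a sketch of the technique (not the repository's exact implementation), the custom autograd function below fuses the final linear projection with cross-entropy, computes the loss chunk by chunk so the full logits matrix is never materialized, and recomputes each chunk's logits during the backward pass. The class and argument names are illustrative, and both X and W are assumed to require gradients.

```python
import torch
import torch.nn.functional as F

class ChunkedLinearCE(torch.autograd.Function):
    """Fused final projection + cross-entropy, evaluated chunk by chunk."""

    @staticmethod
    def forward(ctx, X, W, labels, n_chunks):
        # X: (N, d) hidden states, W: (vocab, d) projection, labels: (N,)
        ctx.save_for_backward(X, W, labels)
        ctx.n_chunks = n_chunks
        N = X.shape[0]
        loss = X.new_zeros((), dtype=torch.float32)
        for x_c, y_c in zip(X.chunk(n_chunks), labels.chunk(n_chunks)):
            logits = (x_c @ W.t()).float()            # upcast kept (see checklist)
            loss = loss + F.cross_entropy(logits, y_c, reduction="sum")
        return loss / N

    @staticmethod
    def backward(ctx, grad_output):
        X, W, labels = ctx.saved_tensors
        N = X.shape[0]
        dX = torch.zeros_like(X)
        dW = torch.zeros_like(W)
        for x_c, y_c, dx_c in zip(X.chunk(ctx.n_chunks),
                                  labels.chunk(ctx.n_chunks),
                                  dX.chunk(ctx.n_chunks)):
            x_c = x_c.detach().requires_grad_()
            with torch.enable_grad():
                # Recompute this chunk's logits instead of having stored them.
                logits = (x_c @ W.t()).float()
                chunk_loss = F.cross_entropy(logits, y_c, reduction="sum") / N
            # Let autograd derive the chunk's gradients (no hardcoded formulas).
            gx, gw = torch.autograd.grad(chunk_loss, (x_c, W),
                                         grad_outputs=grad_output)
            dx_c.copy_(gx)    # dX chunks are views, so this fills dX in place
            dW += gw
        return dX, dW, None, None
```

Because each chunk's logits are freed right after use and recomputed during backward, peak activation memory scales with the chunk size rather than with the full (N, vocab) logits matrix.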

Checklist of Criteria Completed

  • Chunked Forward/Backward Pass: The computation is split into chunks to avoid storing all activations.
  • Upstream Gradient Handling: Upstream gradients are correctly multiplied with the local gradients.
  • Loss Consistency: The training loss from the memory-efficient backprop matches the standard approach within a small tolerance.
  • Don't remove upcast: The upcast is still present.
  • 50% VRAM reduction: VRAM usage is reduced by roughly 50%, and by more with smaller chunk sizes.
  • Show cross-entropy loss: Loss remains consistent between the standard and memory-efficient approaches (a comparison harness follows this list).
  • Show other functions: Tested with both CrossEntropyLoss and NLLLoss.
  • No hardcoded gradients: Gradients are computed using autograd rather than hand-written formulas.
  • Works with LLaMA 1B: Compatible with LLaMA 1B tensor sizes.
  • Dynamic chunk sizes: Tested with different chunk sizes, allowing dynamic adjustment.
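One way to check the loss-consistency and gradient-accuracy claims is to compare the chunked function against a naive full-logits computation. The harness below uses the hypothetical ChunkedLinearCE sketch from earlier and deliberately small stand-in shapes rather than real LLaMA 1B dimensions.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

torch.manual_seed(0)
N, d, V = 1024, 512, 8000                        # small stand-in sizes
X = (torch.randn(N, d, device=device, dtype=dtype) * 0.1).requires_grad_()
W = (torch.randn(V, d, device=device, dtype=dtype) * 0.02).requires_grad_()
y = torch.randint(0, V, (N,), device=device)

# Reference: materialize the full (N, V) logits matrix.
loss_ref = F.cross_entropy((X @ W.t()).float(), y)
loss_ref.backward()
gX, gW = X.grad.clone(), W.grad.clone()
X.grad = None
W.grad = None

# Chunked version: never holds more than one chunk of logits at a time.
loss_chunked = ChunkedLinearCE.apply(X, W, y, 8)
loss_chunked.backward()

print(loss_ref.item(), loss_chunked.item())
assert torch.allclose(loss_ref, loss_chunked, atol=1e-3, rtol=1e-3)
assert torch.allclose(gX, X.grad, atol=1e-2, rtol=1e-2)
assert torch.allclose(gW, W.grad, atol=1e-2, rtol=1e-2)
```

Varying the final argument of apply exercises the dynamic chunk sizes mentioned above; smaller chunks trade extra recomputation for lower peak memory.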

Target

I'm aiming for an ML intern position.

Notebook

For a detailed walkthrough and to interact with the implementation, check out the Dequantization Notebook.

Future Updates

I will be adding more puzzles and enhancements to this project in the coming days. Stay tuned for updates to both the repository and the notebook.

License

This project is licensed under the Apache 2.0 License.
