UnslothPuzzles

This repository contains my open-source solutions to Daniel Han's Unsloth challenges, a set of five tasks focused on advanced deep learning optimizations and efficiency improvements. Each challenge has its own folder with a detailed README explaining the approach and providing the source code:

  • Convert nf4 to Triton: Converting 4-bit quantization (nf4) operations into efficient Triton kernels (a minimal illustrative sketch follows this list).
  • Make QLoRA work with FSDP2: Enabling QLoRA fine-tuning with Fully Sharded Data Parallel (FSDP2).
  • Torch.compile without Graph Breaks for QLoRA: Removing graph breaks when compiling QLoRA models.
  • Memory Efficient Backprop: Implementing backpropagation algorithms that drastically reduce memory usage (a chunked cross-entropy sketch appears after the Challenge Progress section below).
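
To make the first item above concrete, here is a minimal sketch of what a single-kernel nf4 dequantization in Triton can look like. This is an illustration only, not the kernel shipped in the Convert nf4 to Triton folder: it assumes a simplified layout (two 4-bit codes per byte, one absmax scale per quantization block, even elements in the high nibble) and ignores BitsAndBytes' double quantization of the absmax scales.

```python
import torch
import triton
import triton.language as tl

# The 16 NF4 code values (normalized 4-bit float levels used by BitsAndBytes).
NF4_LUT = torch.tensor(
    [-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
     -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
     0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
     0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
     0.7229568362236023, 1.0], dtype=torch.float32)

@triton.jit
def nf4_dequant_kernel(code_ptr, absmax_ptr, lut_ptr, out_ptr, n_elements,
                       QUANT_BLOCK: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)        # output element indices
    mask = offs < n_elements
    # Each byte packs two 4-bit codes; evict_first hints that the packed bytes
    # are streamed once and need not stay cached (the "cache eviction" item).
    packed = tl.load(code_ptr + offs // 2, mask=mask, other=0,
                     eviction_policy="evict_first")
    # Assumed packing: even elements in the high nibble, odd in the low nibble.
    nibble = tl.where(offs % 2 == 0, packed >> 4, packed & 0xF).to(tl.int32)
    value = tl.load(lut_ptr + nibble)               # LUT lookup: code -> float
    scale = tl.load(absmax_ptr + offs // QUANT_BLOCK, mask=mask, other=0.0)
    tl.store(out_ptr + offs, (value * scale).to(tl.float16), mask=mask)

def dequant_nf4(codes, absmax, n_elements, quant_block=64):
    """codes: packed uint8 [ceil(n/2)], absmax: float32 [ceil(n/quant_block)]."""
    out = torch.empty(n_elements, dtype=torch.float16, device=codes.device)
    lut = NF4_LUT.to(codes.device)
    grid = (triton.cdiv(n_elements, 1024),)
    nf4_dequant_kernel[grid](codes, absmax, lut, out, n_elements,
                             QUANT_BLOCK=quant_block, BLOCK=1024)
    return out
```

The real kernel additionally has to match BitsAndBytes' exact packing and scale layout, support both f16 and bf16 outputs, and stay torch.compile friendly, which the sketch above does not attempt.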

Working through these challenges has deepened my understanding of PyTorch, quantization (BitsAndBytes), FSDP2, and torch.compile, and it even led me to write my first Triton code!

Feel free to explore and improve upon these solutions.

A Note for Unsloth People

I’m Ramesh Babu, a third-year B.Tech CSE student at the Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, India, with a strong passion for Artificial Intelligence. Please note that in my implementation, the Torch Compile task is closely tied to the FSDP2 task. For the best understanding, I recommend reviewing the tasks in the following order:

  1. Torch Compile
  2. FSDP2
  3. Triton
  4. Backpropagation

This sequence will help you follow the evolution and integration of the features.

Challenge Progress:

  • Part A:

    • Single triton kernel (+3)
    • Speedup checks:
      • If speedup <= 1.00 (-3)
      • If speedup >= 1.05 (+1)
      • If speedup >= 1.10 (+2)
      • If speedup >= 1.15 (+2)
    • Kernel works in torch compile (+1)
      • If not (-1)
    • Custom ASM works (+3)
    • Uses cache eviction (+1)
    • Tested in f16 and bf16 (+1) (bf16 needs GPUs with SM >= 80; f16 works on Colab & Kaggle free GPUs as well)
      • If not (-1)
  • Part B:

    • FSDP2 works with QLoRA:
      • With torch compile (+5)
      • Without torch compile (+3)
      • Uses Part A's single kernel and is faster (+3)
      • Uses torchAO:
        • If torchAO slower than BnB (-3)
    • TP or PP with QLoRA:
      • With zero bubble (+3)
      • Without zero bubble (+2)
    • FSDP1 works with QLoRA (+1)
    • Kaggle notebook example with 2x Tesla T4 (+2)
      • If not (score = 0)
    • If not attempted (-2)
  • Part C:

    • Uses flex attention:
      • Dynamic sequence length works (+3)
      • If not (+1)
    • No torch compile BnB (-2)
    • Use part A (+1)
    • Torch compile BnB (+1)
    • Attention compiled:
      • With excessive recompilation (-3)
      • Without excessive recompilation (+2)
    • MLP compiled:
      • With excessive recompilation (-3)
      • Without excessive recompilation (+1)
    • Loss not compiled (-1)
    • Layernorms not compiled (-3)
    • Max autotune triton matmul:
      • With excessive recompilation (-2)
      • Without excessive recompilation (+2)
    • If not attempted (-1)
  • Part E:

    • VRAM 50% reduction (+2)
    • Remove float32 upcast (score = 0)
    • Show CE loss works (+1)
    • Show other functions work (+1)
    • Hardcoded gradients (score = 0)
    • Allows dynamic chunk sizes (+1)
    • Llama 1B training loss matches (+1)
      • If not (score = 0)
    • GRPO memory efficient linear works (+4)
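
As a companion to the Part E items above (and the Memory Efficient Backprop bullet in the overview), here is a simplified sketch of the chunking idea: compute logits and the cross-entropy loss one slice of tokens at a time, recomputing each chunk's logits during backward so the full [tokens, vocab] logits matrix never has to be materialized. It is an illustration under stated assumptions (a Llama-style lm_head weight and standard PyTorch checkpointing), not the exact implementation in the Memory Efficient Backprop folder, which may use a hand-written autograd.Function and also cover the GRPO case.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def chunked_ce_loss(hidden, lm_head_weight, labels, chunk_size=4096):
    """hidden: [tokens, dim], lm_head_weight: [vocab, dim], labels: [tokens]."""
    def chunk_loss_sum(h, w, y):
        # Logits exist only for this chunk and are recomputed in backward,
        # so the full [tokens, vocab] matrix is never stored.
        logits = h @ w.t()
        return F.cross_entropy(logits.float(), y, reduction="sum",
                               ignore_index=-100)

    total = hidden.new_zeros((), dtype=torch.float32)
    n_valid = (labels != -100).sum()
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        # use_reentrant=False: autograd recomputes this chunk's logits on backward.
        total = total + checkpoint(chunk_loss_sum, h, lm_head_weight, y,
                                   use_reentrant=False)
    return total / n_valid.clamp(min=1)
```

Because chunk_size is an ordinary argument, the chunk size can be varied per call (the "dynamic chunk sizes" item), and the float32 upcast inside each chunk is deliberately kept, since removing it zeroes the Part E score per the rubric above.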

Thanks to @parnox/unsloth-notes for the template.
