Commit 794ec3e

Merge pull request #94 from NYU-RTS/llm_finetune
Add LLaMA2 + LoRA Fine-Tuning Documentation for Greene HPC
# Fine-tune LLMs on HPC

## Model and Dataset Selection Rationale

|Component|Configuration|
|---|---|
|Base Model|`google/gemma-3-4b-pt` (pretrained)|
|Comparison Model|`google/gemma-3-4b-it` (instruction-tuned)|
|Dataset|`timdettmers/openassistant-guanaco`|
|Justification|Using Gemma-3 allows direct comparison between the base pretrained, our LoRA fine-tuned, and the official instruction-tuned variants. The OpenAssistant Guanaco dataset provides high-quality instruction-following examples.|

### Dataset Overview

The `timdettmers/openassistant-guanaco` dataset contains high-quality, instruction-following conversational exchanges between humans and AI assistants. It includes diverse question-answer pairs covering topics such as creative writing, problem solving, factual queries, and technical explanations, and it is designed to train models to follow instructions and provide helpful, harmless, and honest responses.
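
For a quick look at what the model will be trained on, the dataset can be loaded and inspected with the `datasets` library installed later in this guide. The sketch below is illustrative: the single `text` column and the `train[:1%]` slice are convenient choices for inspection, not requirements of the training setup.

```python
from datasets import load_dataset

# Load a small slice first to keep downloads and debugging fast;
# use the full "train" split for the actual fine-tuning run.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1%]")

print(dataset)             # row count and column names
print(dataset[0]["text"])  # one "### Human: ... ### Assistant: ..." exchange
```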

### Fine-tuning Benefits

Fine-tuning Gemma-3-4B-PT on this dataset significantly improves the model's ability to:

- **Follow complex instructions**: Better understanding of multi-step requests and nuanced prompts
- **Maintain conversational context**: Improved coherence in longer dialogues
- **Provide structured responses**: Enhanced ability to format answers appropriately
- **Reduce harmful outputs**: Training on carefully curated data reduces problematic responses

By using Gemma-3, we can compare three variants:

1. **Base pretrained** (`google/gemma-3-4b-pt`) - Raw language modeling capabilities
2. **Our LoRA fine-tuned** - Custom instruction following on the Guanaco dataset
3. **Official instruction-tuned** (`google/gemma-3-4b-it`) - Google's instruction tuning baseline

This provides a comprehensive evaluation of fine-tuning effectiveness.

---
## Three-Way Model Comparison

### Example Prompt: "Explain quantum computing in simple terms for a beginner"

#### Base Pretrained Model (`google/gemma-3-4b-pt`):

```
Quantum computing uses quantum bits or qubits instead of classical bits. Qubits can be in superposition states and can be entangled. This allows quantum computers to perform certain calculations exponentially faster than classical computers for specific problems like factoring large numbers and simulating quantum systems.
```

#### Our LoRA Fine-tuned Model (Gemma-3-4B-PT + LoRA on Guanaco):

```
Great question! Let me break down quantum computing in simple terms:

**Think of it like this**: Regular computers use bits that are either 0 or 1 (like light switches that are either ON or OFF). Quantum computers use "quantum bits" or "qubits" that can be 0, 1, or both at the same time - imagine a coin that's spinning in the air, being both heads and tails until it lands.

**Why this matters**: This allows quantum computers to:
- Try many solutions to a problem simultaneously
- Solve certain types of problems much faster than regular computers
- Handle complex calculations that would take classical computers thousands of years

**Real applications**: Password breaking, drug discovery, weather prediction, and financial modeling.
```

The fine-tuned model provides clearer explanations with better structure, analogies, and practical context!

#### Official Instruction-tuned Model (`google/gemma-3-4b-it`):

```
Quantum computing is a revolutionary approach to computation that leverages quantum mechanics principles. Here's a beginner-friendly explanation:

**Basic Concept**: While classical computers use bits (0 or 1), quantum computers use quantum bits (qubits) that can exist in multiple states simultaneously through "superposition."

**Key Advantages**:
- Parallel processing of multiple possibilities
- Exponential speedup for specific problem types
- Superior performance in cryptography, optimization, and simulation

**Current Applications**: Drug discovery, financial modeling, cryptography, and artificial intelligence research.
```

This comparison demonstrates how our custom fine-tuning can achieve similar or better instruction-following capabilities compared to the official instruction-tuned variant.
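
A sketch of how such a comparison can be produced is shown below. The decoding settings, dtype, and adapter path are assumptions rather than the exact evaluation script, and depending on the `transformers` version the multimodal Gemma-3 checkpoints may need `Gemma3ForConditionalGeneration` instead of `AutoModelForCausalLM`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

prompt = "Explain quantum computing in simple terms for a beginner"

def generate(model, tokenizer, prompt, max_new_tokens=256):
    """Greedy generation, returning only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 1. Base pretrained model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-pt", torch_dtype=torch.bfloat16, device_map="auto"
)
print(generate(base, tokenizer, prompt))

# 2. Base model + our LoRA adapter (checkpoint path from the training run below)
lora = PeftModel.from_pretrained(base, "./gemma3_output/checkpoint-13")
print(generate(lora, tokenizer, prompt))

# 3. Official instruction-tuned model
it_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
it_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
print(generate(it_model, it_tokenizer, prompt))
```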
---
:::tip
Complete scripts used are available here: https://github.com/NYU-RTS/rts-docs-examples/tree/main/hpc/llm_fine_tuning
:::

## System Environment Setup

### Singularity Container & Overlay Configuration

|Component|Configuration|
|---|---|
|Singularity Image|`/scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif`|
|Overlay|Created using `singularity overlay create --size 25000 overlay-25GB-conda.ext3`|
|Conda Path|`/ext3/miniconda3` within overlay|
|Singularity Shell Command|See below|

```bash
singularity shell --nv \
    --overlay /scratch/<NetID>/fine-tune/overlay-25GB-conda.ext3:rw \
    /scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif
```
### Python Environment and Dependency Installation

```bash
# Download the installer first if needed:
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
source /ext3/miniconda3/bin/activate
pip install torch transformers datasets accelerate peft trl
```
### Model Cache Configuration for Hugging Face

To avoid exceeding home directory quotas during large model downloads:

```bash
export HF_HOME=/scratch/<NetID>/.cache/huggingface
```

Ensure this is set both interactively and within sbatch scripts.
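
Inside the training script itself, a small guard can catch a missing or misconfigured cache path early. This is an optional sketch, not part of the published script:

```python
import os
from pathlib import Path

# Warn if the Hugging Face cache would land in the quota-limited home directory.
hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
if Path.home() in hf_home.parents:
    print(f"WARNING: HF_HOME={hf_home} is inside $HOME; "
          "large model downloads may exceed the home directory quota.")
```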
---
## Operational Troubleshooting: Common Errors and Recommended Fixes

This section provides a comprehensive overview of the environment-related issues encountered during the fine-tuning of `google/gemma-3-4b-pt` on the NYU Greene HPC cluster. Each entry includes the error symptom, root cause, and resolution strategy, categorized for clarity.

### 1. Filesystem and Path Setup Issues

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /scratch/work/public/overlay-fs-ext3/` to verify the correct file: `overlay-50G-10M.ext3.gz`|
|Compressed overlay used directly|`FATAL: while loading overlay images...`|Attempted to use the `.gz` file directly with Singularity|Run `gunzip overlay-50G-10M.ext3.gz` before using the file|
|Overlay missing in working directory|sbatch cannot find the overlay file|Overlay not copied to the training directory|Ensure the overlay file is placed in `/scratch/<NetID>/fine-tune/` where sbatch accesses it|
|Invalid overlay structure|`FATAL: could not create upper dir`|Overlay created via `fallocate` + `mkfs.ext3`, missing necessary internal structure|Always use `singularity overlay create --size 25000` to create overlays|

### 2. Container Runtime and Overlay Mounting Errors

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|GPU warning on login node|`WARNING: Could not find any nv files`|`--nv` flag used outside a GPU-enabled session|Ignore the warning, or only use `--nv` within a `srun --gres=gpu:1` session|
|Overlay locked by another process|`overlay in use by another process`|An interactive container shell using the overlay was still active|Run `lsof` or `ps aux` and terminate the blocking process|

### 3. Python Package Installation and Environment Setup Errors

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|`which pip` returns `Illegal option --`|Unexpected error when checking pip|Uses `/usr/bin/which` instead of the Bash built-in|Use `command -v pip` or simply run `pip --version`|
|`xformers` install fails due to missing torch|`No module named torch` during install|PyTorch not installed before building `xformers`|Install torch first: `pip install torch`, then `pip install xformers`|
|Missing `transformers` in sbatch|`ImportError: No module named transformers`|Conda not activated in the job script|Add `source /ext3/miniconda3/bin/activate` before executing the training script|
|Installed pip packages not found|Training job fails to locate modules|pip used outside the overlay context|Only install packages while the overlay is mounted with `:rw` in an active container session|

### 4. Disk Quota and Cache Management Issues

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|Quota exceeded on home|`OSError: [Errno 122] Disk quota exceeded: ~/.cache/huggingface`|Default Hugging Face cache path inside `/home`|Set `HF_HOME=/scratch/$USER/.cache/huggingface`|
|Cache redownloading on each sbatch|Hugging Face cache not shared|`HF_HOME` not consistently defined|Persist and reuse the same `HF_HOME` path across runs|

### 5. Slurm Job Submission and Runtime Failures

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|Invalid Slurm account|`sbatch: Invalid account`|`--account` flag not set or invalid|Use `--account=pr_100_tandon_priority`|
|Conda environment not recognized|`No module named transformers`|Activation missing in sbatch|Add `source /ext3/miniconda3/bin/activate` in sbatch|
|Overlay not found during job|sbatch fails to locate file|Overlay not placed in the expected directory|Ensure all relevant files are in `/scratch/<NetID>/fine-tune/` or update paths accordingly|

---
## Recommended Best Practices for Stable Execution

| Recommendation | Rationale |
| --- | --- |
| Use `singularity overlay create` for overlay creation | Ensures `upper/` and `work/` directories are properly set up |
| Install pip packages only after mounting the overlay | Ensures packages persist and are isolated inside the overlay |
| Activate Conda explicitly in sbatch | Slurm jobs do not inherit interactive shell environments |
| Set `HF_HOME` to `/scratch` | Prevents hitting disk quota limits in home directories |
| Avoid `return_tensors="pt"` in tokenizer mapping | Leads to shape mismatch errors in batched training (see the sketch below) |
| Use subset sampling (e.g., `train[:1%]`) for testing | Minimizes resource consumption and enables fast debugging |
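
To make the last two rows concrete, here is a sketch of the kind of tokenization mapping they refer to. The `text` column name and `max_length` are assumptions; the published script may differ.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")

# Subset sampling: a small slice for quick end-to-end tests before a full run.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train[:1%]")

def tokenize(batch):
    # Return plain Python lists; the data collator pads and batches later.
    # Using return_tensors="pt" inside map() yields ragged/nested tensors
    # and causes shape mismatches during batched training.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```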
---
## LoRA Configuration Parameters

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large models at reduced computational cost. It freezes the original model weights and injects small trainable low-rank matrices into selected layers, so only a tiny fraction of parameters is updated during training.

Learn more about LoRA [here](https://huggingface.co/learn/llm-course/en/chapter11/4).

Here are the configuration parameters used for LoRA in this fine-tuning setup:

```python
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    r=8,                    # rank of the low-rank update matrices
    lora_alpha=16,          # scaling factor applied to the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",            # do not train bias terms
    task_type=TaskType.CAUSAL_LM
)
```
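
In the training script this config is then attached to the loaded base model via PEFT's `get_peft_model`; a minimal sketch follows, where the model-loading call is an assumption rather than the exact script:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model

# peft_config is the LoraConfig defined above
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```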
---
## sbatch Job Script for Model Training

### **Training Script: `train_gemma3.py`**

The complete training script is available [here](https://github.com/NYU-RTS/rts-docs-examples/tree/main/hpc/llm_fine_tuning). Below are the key configuration snippets:

**Model and Dataset Configuration:**

```python
# Model and dataset configuration
model_name = "google/gemma-3-4b-pt"  # Base pretrained model
dataset_name = "timdettmers/openassistant-guanaco"
output_dir = "./gemma3_output"
```

**LoRA Configuration:**

```python
# LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
```

**Training Arguments:**

```python
# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,
    remove_unused_columns=False,
    dataloader_pin_memory=False,
)
```
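
The snippets above are wired together roughly as follows. This is a sketch built on the plain `Trainer` with a causal-LM collator; the actual `train_gemma3.py` may organize things differently (for example, with `trl`'s `SFTTrainer`), and the `text` column name, `max_length`, and `train[:1%]` slice are assumptions.

```python
from datasets import load_dataset
from peft import get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
)

# model_name, dataset_name, output_dir, peft_config, and training_args
# are the objects defined in the snippets above.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(model_name), peft_config)

dataset = load_dataset(dataset_name, split="train[:1%]")  # subset for quick tests
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language-modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model(output_dir)         # writes the LoRA adapter + adapter_config.json
tokenizer.save_pretrained(output_dir)
```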
### **sbatch Script**

```bash
#!/bin/bash
#SBATCH --job-name=gemma3-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=40GB
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=/scratch/<NetID>/fine-tune/gemma3_train_%j.out
#SBATCH --error=/scratch/<NetID>/fine-tune/gemma3_train_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<NetID>@nyu.edu

export HF_HOME=/scratch/<NetID>/.cache/huggingface

singularity exec --nv \
    --overlay /scratch/<NetID>/fine-tune/overlay-25GB-conda.ext3:rw \
    /scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif \
    /bin/bash -c "
source /ext3/miniconda3/bin/activate
cd /scratch/<NetID>/fine-tune
python train_gemma3.py
"
```
---

## Generated Output Artifacts

|File|Description|
|---|---|
|`adapter_model.safetensors`|LoRA adapter weights|
|`adapter_config.json`|Adapter architecture definition|
|`trainer_state.json`|Training metadata|
|`training_args.bin`|Saved training configuration|
|`tokenizer_config.json`, `tokenizer.json`|Tokenizer data|

Location: `/scratch/<NetID>/fine-tune/gemma3_output/checkpoint-13/`
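
These files are all that is needed to reuse the fine-tuned behavior later. A sketch of reloading the adapter on top of the base model is shown below (the loading-class caveat from the comparison sketch above applies, and the merged output path is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

checkpoint = "/scratch/<NetID>/fine-tune/gemma3_output/checkpoint-13"

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt", device_map="auto")
model = PeftModel.from_pretrained(base, checkpoint)    # reads adapter_config.json + adapter_model.safetensors
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # tokenizer files are saved alongside the adapter

# Optionally fold the LoRA weights into the base model to get a standalone checkpoint:
merged = model.merge_and_unload()
merged.save_pretrained("/scratch/<NetID>/fine-tune/gemma3_merged")
```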
---

## Training Completion Summary

|Epochs|Steps|Status|
|---|---|---|
|1|13|Completed successfully|
