Run the following commands to create a virtual environment and install all dependencies:
```bash
uv sync
source .venv/bin/activate
```
## 🧮 Dataset Construction – `Sky()` Function Overview (`preprocess_dataset.py`)
The `Sky()` function prepares a **pairwise preference dataset** used for **Direct Preference Optimization (DPO)** fine-tuning.
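The actual fields emitted by `Sky()` are not shown in this excerpt; as a rough sketch, a pairwise preference record couples one prompt with a preferred and a dispreferred completion. The helper and field names below are illustrative assumptions, not the real `preprocess_dataset.py` API:

```python
# Illustrative only: the helper and the field names are assumptions,
# not taken from the real Sky() implementation.
def build_preference_pairs(samples):
    """Turn (prompt, responses, scores) triples into DPO-style pairs.

    For each prompt, the highest-scoring response becomes `chosen`
    and the lowest-scoring one becomes `rejected`.
    """
    pairs = []
    for prompt, responses, scores in samples:
        ranked = sorted(zip(responses, scores), key=lambda rs: rs[1])
        worst, best = ranked[0][0], ranked[-1][0]
        if best != worst:  # skip prompts with no usable preference signal
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs

demo = [("2+2?", ["4", "5", "four"], [0.9, 0.1, 0.6])]
print(build_preference_pairs(demo))
```

The key design point of a pairwise dataset is that each record carries a relative judgment (chosen over rejected) rather than an absolute score, which is exactly what the DPO objective consumes.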
Once your environment is set up:
```bash
uv run inference_best_of_n.py
```
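The script itself is not reproduced in this excerpt, but the general Best-of-N idea its name refers to can be sketched as follows; `generate` and `score` are hypothetical stand-ins for the model call and the ranking criterion:

```python
# Sketch of Best-of-N sampling: draw N candidates for a prompt and keep
# the highest-scoring one. `generate` and `score` are stand-ins, not the
# actual functions used in inference_best_of_n.py.
def best_of_n(prompt, generate, score, n=8):
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

# Toy demonstration with deterministic stand-ins:
gen = lambda prompt, seed: f"{prompt}-answer-{seed}"
score_fn = lambda text: int(text.rsplit("-", 1)[1])  # prefer later seeds
print(best_of_n("q1", gen, score_fn, n=4))
```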
## 🧩 Inference – All Template Generation (Hint Sampling)
Once `inference_best_of_n.py` has generated the multi-sample outputs, this script generates responses for every hint template.
It also includes built-in checkpointing and resume capabilities for long runs.
```bash
uv run inference_all.py
```
## 🧹 Post-Processing
After generating the structured outputs (`guide.jsonl`, `guide_reverse.jsonl`, and `output.jsonl`) from the inference scripts,
this script performs **final cleanup and reindexing** to ensure that all generated records stay consistently indexed.
```bash
uv run post_processing.py
```
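As a rough illustration of that cleanup-and-reindex step (the real `post_processing.py` may differ in both filtering rules and field names), dropping malformed JSONL lines and reassigning gapless indices could look like:

```python
import json

# Sketch: discard unparsable JSONL lines, then renumber the survivors so
# the "id" field forms a gapless 0..N-1 sequence. Field names are
# illustrative, not confirmed by post_processing.py.
def clean_and_reindex(lines):
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # drop malformed output lines
    for new_id, rec in enumerate(records):
        rec["id"] = new_id
    return records

demo = ['{"id": 7, "text": "a"}', "not json", '{"id": 9, "text": "b"}']
print(clean_and_reindex(demo))
```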
## 🧱 DPO Dataset Construction
After cleaning and aligning the inference outputs, this script **constructs the final dataset** required for **Direct Preference Optimization (DPO)** fine-tuning.
It parses the model’s reasoning outputs, identifies preference-aligned samples, and assembles them into **(chosen, rejected)** pairs.
```bash
uv run construct_dpo_dataset.py
```
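The parsing details live in `construct_dpo_dataset.py`; in spirit, joining the guided outputs with their reverse-guided counterparts into DPO records looks roughly like the sketch below. The `"prompt"`/`"chosen"`/`"rejected"` keys follow common DPO dataset conventions and are assumptions here, as is the pairing rule:

```python
# Sketch: pair guided ("chosen") and reverse-guided ("rejected") outputs
# by prompt. The record keys and the join logic are assumptions based on
# common DPO dataset conventions, not confirmed by the repository.
def build_dpo_records(guided, reverse_guided):
    rejected_by_prompt = {r["prompt"]: r["response"] for r in reverse_guided}
    records = []
    for r in guided:
        rej = rejected_by_prompt.get(r["prompt"])
        if rej is not None and rej != r["response"]:
            records.append(
                {"prompt": r["prompt"], "chosen": r["response"], "rejected": rej}
            )
    return records
```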
## 📊 DPO Dataset Inspection
After constructing the DPO dataset using `construct_dpo_dataset.py`,
this script provides a **quick inspection and summary** of the saved dataset.
```bash
uv run dpo_dataset.py
```
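A summary of the kind described can be very small; the sketch below reports record count, field names, and one example (the output format is illustrative, not what `dpo_dataset.py` actually prints):

```python
# Sketch: summarize a list of DPO records by size, field names, and one
# example entry. Output layout is illustrative only.
def summarize(records):
    fields = sorted(records[0]) if records else []
    summary = {"num_records": len(records), "fields": fields}
    print(summary)
    if records:
        print("example:", records[0])
    return summary
```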
## 🧠 Direct Preference Optimization Training
This script performs **Direct Preference Optimization (DPO)** fine-tuning on the constructed dataset using **Qwen2-7B-Instruct**.
It aligns the model’s responses with human-preferred outputs by learning from **(chosen, rejected)** pairs generated earlier.
```bash
uv run dpo_training.py
```
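The training script's internals are not shown in this excerpt, but the core DPO objective it optimizes scores each **(chosen, rejected)** pair by comparing policy and reference log-probabilities. A scalar sketch of the per-pair loss (in practice this runs over batched tensors):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))

    The margin rewards the policy for increasing the chosen answer's
    log-probability relative to the reference more than the rejected one's.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a margin of zero the loss is log 2, and it decreases monotonically as the policy widens the chosen-over-rejected gap; `beta` controls how strongly the policy may deviate from the reference model.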
## 📊 Model Evaluation Script
This script evaluates the **Direct Preference Optimization (DPO)** fine-tuned model against the **base model (Qwen2-7B-Instruct)** on the validation split of the dataset.
It measures how well each model predicts the **preferred (chosen)** answers from the validation pairs.
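One common way to score such a comparison (the script's actual metric is not shown here) is preference accuracy: the fraction of pairs where a model assigns higher log-probability to the chosen answer than to the rejected one:

```python
# Sketch: preference accuracy over (chosen_logprob, rejected_logprob)
# pairs. This is a standard DPO evaluation metric, assumed rather than
# confirmed to be what the evaluation script computes.
def preference_accuracy(pairs):
    if not pairs:
        return 0.0
    hits = sum(1 for chosen_lp, rejected_lp in pairs if chosen_lp > rejected_lp)
    return hits / len(pairs)

# Toy numbers: the model prefers the chosen answer in 2 of 3 pairs.
print(preference_accuracy([(-1.0, -2.0), (-3.0, -0.5), (-0.2, -0.9)]))
```

Comparing this number between the DPO-tuned model and the Qwen2-7B-Instruct baseline shows how much the fine-tuning shifted probability mass toward preferred answers.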