
Commit b5dcb25

Merge pull request #6 from VectorInstitute/feature/my-updates

Data Construction Pipeline for Original DPO & Modified (Factual-Aware) DPO

2 parents f883870 + 06bbb1b · commit b5dcb25

26 files changed: +1707 −2 lines

.github/workflows/code_checks.yml
Lines changed: 4 additions & 2 deletions

```diff
@@ -55,5 +55,7 @@ jobs:
       uses: pypa/[email protected]
       with:
         virtual-environment: .venv
-        additional-args: "--ignore PYSEC-2024-161"
-        strict: false
+        ignore-vulns: |
+          PYSEC-2024-161
+          GHSA-gm62-xv2j-4w53
+          GHSA-2xpw-w6gg-jr37
```
README.md
Lines changed: 246 additions & 0 deletions

# Skywork → Factual-DPO Data Construction Pipeline

This repository contains a complete, modular, and type-safe data-construction pipeline for generating **factual-aware DPO datasets** from the **Skywork Reward-Preference-80K** dataset.

The pipeline supports:

- Direct Preference Optimization (DPO)
- Factual-DPO
- Synthetic hallucination inversion pairs
- Balanced and flipped datasets
## Configuration

All configuration is centralized in:

```bash
src/aixpert/config/config.yaml
```

Loaded dynamically using:

```python
utils/config_loader.load_config()
```
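
For reference, a minimal sketch of what `load_config()` could look like, assuming a plain PyYAML loader with a hard-coded default path (the real helper may resolve paths differently):

```python
# Hypothetical sketch of utils/config_loader.py; the default path and
# return type are assumptions based on this README, not the real code.
from pathlib import Path
from typing import Any

import yaml

DEFAULT_CONFIG = Path("src/aixpert/config/config.yaml")


def load_config(path: Path = DEFAULT_CONFIG) -> dict[str, Any]:
    """Read the central YAML configuration into a plain dict."""
    with path.open("r", encoding="utf-8") as f:
        return yaml.safe_load(f)
```

Every stage then reads its inputs via `cfg["paths"]` and `cfg["hyperparams"]`, as the extraction script at the bottom of this diff shows.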
## Configuration Summary (`config.yaml`)

### Model Settings
- **model.name:** `gpt-4o-mini`
- **model.temperature:** `0.8`

---

### Paths (All datasets + intermediate outputs)

The configuration tracks every stage of the data pipeline, including:

- Cleaned **train / eval / test** splits
- **Preference pairs** (DPO-style)
- **Factual-scored** outputs
- **Synthetic inversion** samples (train + eval)
- **Merged** intermediate datasets
- **Balanced** final datasets
- **Flipped** datasets for ablation

**Examples:**
```yaml
skywork_train_cleaned: "src/.../skywork_extracted_77k.jsonl"
skywork_train_pairs: "src/.../skywork_preference_pairs_77k.jsonl"
skywork_train_factual: "src/.../skywork_binary_factual_train.jsonl"
final_train_out: "src/.../train_balanced.jsonl"
```

## Pipeline Stages — Summary

Below is a concise overview of all eight stages in the Skywork → Factual-DPO data pipeline.

---
### **Stage 1 — Skywork Extraction**
**Scripts:**
- `dataextraction_train.py`
- `dataextraction_eval.py`
- `dataextraction_test.py` (these samples are used directly in evaluation)

**Tasks:**
- Load slices from the Skywork preference dataset
- Extract:
  - **prompt** (first user message)
  - **chosen** (assistant reply)
  - **rejected** (assistant reply)
- Remove exact duplicates
- Save cleaned JSONL files
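
A minimal sketch of the two extraction helpers, assuming each Skywork row stores `chosen`/`rejected` as chat-message lists (the actual implementations live in `utils/data_utils.py` and may differ):

```python
# Hypothetical sketch; assumes transcripts of the form
# [{"role": "user", "content": ...}, {"role": "assistant", "content": ...}].
def extract_prompt(messages: list[dict]) -> str:
    """Return the content of the first user message."""
    return next(m["content"] for m in messages if m["role"] == "user")


def extract_answer(messages: list[dict]) -> str:
    """Return the content of the first assistant reply."""
    return next(m["content"] for m in messages if m["role"] == "assistant")
```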

---

### **Stage 2 — Preference Pair Conversion**
**Scripts:**
- `dataconversion_train.py`
- `dataconversion_eval.py`

**Tasks:**
- Convert `(prompt, chosen, rejected)` → **DPO-style preference pairs**
- Produce:
  - `response_0`, `response_1`
  - `better_response_id`
- Randomly and symmetrically assign responses for unbiased supervision (a sketch follows)
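
A rough sketch of that symmetric assignment, assuming a per-row helper rather than the repository's exact `create_preference_pairs()` signature:

```python
import random


def to_preference_pair(row: dict, rng: random.Random) -> dict:
    """Place chosen/rejected into response_0/response_1 in random order,
    so better_response_id is 0 or 1 with equal probability."""
    flip = rng.random() < 0.5
    return {
        "prompt": row["prompt"],
        "response_0": row["rejected"] if flip else row["chosen"],
        "response_1": row["chosen"] if flip else row["rejected"],
        "better_response_id": 1 if flip else 0,
    }
```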
---

### **Stage 3 — Binary Factuality Evaluation**
**Scripts:**
- `dataset_train.py`
- `dataset_val.py`

**Components:**
Uses `utils.factual_utils` to evaluate factual correctness with **GPT-4o-mini**.

**Outputs:**
- Binary hallucination flags:
  - `h0`, `h1` (aliases for `factual_flag_0`, `factual_flag_1`)

**Features:**
- Resume-safe incremental scoring
- Async concurrency
- Retry logic
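
The concurrency and retry behaviour might be structured roughly as below. This is a sketch only: the judge prompt, model call, and answer parsing are assumptions, with the limits taken from `config.yaml`:

```python
import asyncio

from openai import AsyncOpenAI  # assumes the OpenAI v1 async client

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(25)   # concurrency_limit from config.yaml
MAX_RETRIES = 5                     # max_retries from config.yaml


async def score_response(prompt: str, response: str) -> int:
    """Return 1 if the judge flags the response as hallucinated, else 0."""
    judge = (  # hypothetical judge prompt; the real one lives in prompt_templates.py
        f"Question:\n{prompt}\n\nAnswer:\n{response}\n\n"
        "Reply with a single digit: 0 if factual, 1 if hallucinated."
    )
    async with semaphore:
        for attempt in range(MAX_RETRIES):
            try:
                out = await client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": judge}],
                )
                return int(out.choices[0].message.content.strip()[0])
            except Exception:
                await asyncio.sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError("Factuality scoring failed after all retries")
```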

---

### **Stage 4 — DPO Transformation**
**Scripts:**
- `data_transform_train.py`
- `data_transform_val.py`

**Tasks:**
Transform factual-scored items into canonical DPO format:

- `prompt`, `chosen`, `rejected`
- `h_w`, `h_l`
- `response_0`, `response_1`
- `flipped=False`
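
For illustration, a record in this canonical format could look like the following (field names come from this README; the values are invented):

```python
# Hypothetical example record after Stage 4; values are made up.
record = {
    "prompt": "Who wrote 'The Selfish Gene'?",
    "chosen": "Richard Dawkins, first published in 1976.",
    "rejected": "It was written by Stephen Jay Gould in 1981.",
    "h_w": 0,            # winner (chosen) judged factual
    "h_l": 1,            # loser (rejected) judged hallucinated
    "response_0": "Richard Dawkins, first published in 1976.",
    "response_1": "It was written by Stephen Jay Gould in 1981.",
    "flipped": False,    # no chosen/rejected swap applied yet
}
```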

---

### **Stage 5 — Synthetic Hallucination Generation**
**Scripts:**
- `data_synthetic_train.py`
- `data_synthetic_val.py`

**Tasks:**
- Select samples where the winner is factual (`h_w=0`) and the loser is incorrect (`h_l=1`)
- Use **GPT-4o-mini** to generate hallucinated corruptions
- Build synthetic inversion pairs

**Outputs:**
- **10,000** synthetic train samples
- **400** synthetic eval samples
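
Candidate selection reduces to a one-line filter; a minimal sketch (the helper name is hypothetical):

```python
def select_inversion_candidates(rows: list[dict]) -> list[dict]:
    """Keep pairs whose winner is factual (h_w=0) and loser is not (h_l=1)."""
    return [r for r in rows if r["h_w"] == 0 and r["h_l"] == 1]
```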

---

### **Stage 6 — Merging**
**Scripts:**
- `merge_train.py`
- `merge_eval.py`

**Tasks:**
- Merge Skywork transformed data with the synthetic inversion pairs
- Bucket by `(h_w, h_l)`
- Sample subsets
- Shuffle and save merged datasets
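
Bucketing by `(h_w, h_l)` is a plain grouping step; a sketch of what `bucket_by_flags()` plausibly does:

```python
from collections import defaultdict


def bucket_by_flags(rows: list[dict]) -> dict[tuple[int, int], list[dict]]:
    """Group records by their (h_w, h_l) hallucination-flag combination."""
    buckets: dict[tuple[int, int], list[dict]] = defaultdict(list)
    for r in rows:
        buckets[(r["h_w"], r["h_l"])].append(r)
    return dict(buckets)
```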

---

### **Stage 7 — Balanced Dataset Construction**
**Scripts:**
- `balance_train.py`
- `build_final_eval.py`

**Train Balancing:**
Use `balance_targets` to create balanced buckets (a sketch follows this stage):

- `(0,1)` — 10,000
- `(1,0)` — 10,000
- `(0,0)` — 15,000
- `(1,1)` — 10,000

**Eval Construction:**
Combine:
- Skywork eval transformed
- 400 synthetic eval inversion samples
- 1500 clean `(1,1)` samples (unused in train)
- 1500 clean `(0,0)` samples (unused in train)
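
A sketch of the balancing step, assuming it simply downsamples each `(h_w, h_l)` bucket to the targets above (the real `balance_train.py` may differ):

```python
import random


def balance(buckets: dict, targets: dict, seed: int = 0) -> list[dict]:
    """Downsample each (h_w, h_l) bucket to its configured target size."""
    # Assumes bucket keys and target keys use the same representation,
    # e.g. the tuple (0, 1) rather than the string "(0,1)" from config.yaml.
    rng = random.Random(seed)
    out: list[dict] = []
    for key, target in targets.items():
        pool = buckets.get(key, [])
        out.extend(rng.sample(pool, min(target, len(pool))))
    rng.shuffle(out)
    return out
```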

---

### **Stage 8 — Flipping (Optional)**
**Scripts:**
- `data_flipped_train.py`
- `data_flipped_val.py`

**Tasks:**
- Flip all `(1,0)` samples → `(0,1)`
- Swap `chosen` ↔ `rejected`
- Produce an alternate dataset for inversion or ablation studies
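
The flip itself is a pure record rewrite; a sketch of what `flip_sample()` plausibly does:

```python
def flip_sample(r: dict) -> dict:
    """Swap chosen/rejected (and their flags), turning (1,0) into (0,1)."""
    return {
        **r,
        "chosen": r["rejected"],
        "rejected": r["chosen"],
        "h_w": r["h_l"],
        "h_l": r["h_w"],
        "flipped": True,
    }
```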

---

This overview provides a high-level map of the complete Factual-DPO data-construction workflow.

## Utilities Summary

### `utils/config_loader.py`
- Centralized configuration loader
- All stages call `load_config()` to read `config.yaml`

---
### `utils/data_utils.py`
Core data-processing helpers:
- `extract_prompt()` — first user message
- `extract_answer()` — first assistant reply
- `filter_duplicates()` — removes exact matches
- `create_preference_pairs()` — builds DPO response pairs
- `bucket_by_flags()` — groups by `(h_w, h_l)`
- `flip_sample()` — converts `(1,0)` → `(0,1)`
- JSONL read/write utilities (sketched below)
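
Sketches of the generic JSONL helpers (names as used by the extraction script at the bottom of this diff; exact signatures are assumptions):

```python
import json
from pathlib import Path


def save_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def filter_duplicates(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into (cleaned, removed), dropping exact repeats."""
    seen: set[str] = set()
    cleaned: list[dict] = []
    removed: list[dict] = []
    for row in rows:
        key = json.dumps(row, sort_keys=True)
        (removed if key in seen else cleaned).append(row)
        seen.add(key)
    return cleaned, removed
```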

---

### `utils/factual_utils.py`
- Async binary factuality scoring using GPT-4o-mini
- Concurrency + retry logic
- Resume-safe checkpointing
- Produces `h0`, `h1` hallucination flags

---

### `utils/dpo_transform_utils.py`
- Converts factual-scored items into the final DPO format:
  - `prompt`, `chosen`, `rejected`, `h_w`, `h_l`, `response_0`, `response_1`, `flipped=False`

---

### `utils/synthetic_utils.py`
- GPT-based corruption generator
- Creates synthetic inversion pairs (hallucinated → correct)

---

### `utils/prompt_templates.py`
Provides all system/user prompts:
- Strict factuality judge prompt
- Hallucination corruption prompts

---
## Running the Pipeline

Example sequence for the **training pipeline**:

```bash
python src/aixpert/data_construction/stage_1_extraction/dataextraction_train.py
python src/aixpert/data_construction/stage_2_conversion/dataconversion_train.py
python src/aixpert/data_construction/stage_3_factuality/dataset_train.py
python src/aixpert/data_construction/stage_4_transformation/data_transform_train.py
python src/aixpert/data_construction/stage_5_syntheticdata/data_synthetic_train.py
python src/aixpert/data_construction/stage_6_merging/merge_train.py
python src/aixpert/data_construction/stage_7_balancing/balance_train.py
python src/aixpert/data_construction/stage_8_flipping/data_flipped_train.py
```
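
The evaluation pipeline follows the same stage order with the corresponding eval scripts listed above (e.g. `dataconversion_eval.py`, `build_final_eval.py`).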
src/aixpert/config/config.yaml
Lines changed: 61 additions & 0 deletions

```yaml
repository: /projects/aixpert/users/sindhu/Loss_Test

model:
  name: gpt-4o-mini  # or gpt-4o
  temperature: 0.8

paths:
  skywork_train_cleaned: "src/aixpert/data_construction/data/skywork_extracted_77k.jsonl"
  skywork_train_removed: "src/aixpert/data_construction/data/skywork_removed_77k.jsonl"

  skywork_eval_cleaned: "src/aixpert/data_construction/data/skywork_extracted_eval.jsonl"
  skywork_eval_removed: "src/aixpert/data_construction/data/skywork_eval_removed.jsonl"

  skywork_test_cleaned: "src/aixpert/data_construction/data/skywork_extracted_test.jsonl"
  skywork_test_removed: "src/aixpert/data_construction/data/skywork_test_removed.jsonl"

  skywork_train_pairs: "src/aixpert/data_construction/data/skywork_preference_pairs_77k.jsonl"
  skywork_eval_pairs: "src/aixpert/data_construction/data/skywork_preference_pairs_eval.jsonl"

  skywork_train_factual: "src/aixpert/data_construction/data/skywork_binary_factual_train.jsonl"
  skywork_eval_factual: "src/aixpert/data_construction/data/skywork_binary_factual_eval.jsonl"

  skywork_train_transformed: "src/aixpert/data_construction/data/skywork_first_transformed_train.jsonl"
  skywork_eval_transformed: "src/aixpert/data_construction/data/skywork_first_transformed_eval.jsonl"

  synthetic_train_out: "src/aixpert/data_construction/data/synthetic_llm_inversion_train_10k.jsonl"
  synthetic_eval_out: "src/aixpert/data_construction/data/synthetic_llm_inversion_eval_400.jsonl"

  final_train_merged: "src/aixpert/data_construction/data/skywork_final_train.jsonl"
  final_eval_merged: "src/aixpert/data_construction/data/skywork_final_eval.jsonl"

  final_train_out: "src/aixpert/data_construction/data/train_balanced.jsonl"
  final_eval_out: "src/aixpert/data_construction/data/eval_final.jsonl"

  train_flipped_out: "src/aixpert/data_construction/data/train_balanced_flipped.jsonl"
  eval_flipped_out: "src/aixpert/data_construction/data/eval_final_flipped.jsonl"

  skywork_file: "Skywork/Skywork-Reward-Preference-80K-v0.1"

hyperparams:
  subset_size: 80000
  eval_start: 80001
  eval_end: 81000
  test_start: 81001
  test_end: 81500
  concurrency_limit: 25
  max_retries: 5
  corruption_concurrency: 20
  synthetic_train_samples: 10000
  synthetic_eval_samples: 400

balance_targets:
  "(0,1)": 10000
  "(1,0)": 10000
  "(0,0)": 15000
  "(1,1)": 10000

eval_additional_clean_samples: 1500
```
dataextraction_test.py
Lines changed: 55 additions & 0 deletions

```python
"""
Extract the test slice of the Skywork preference dataset.

This script extracts rows 81001–81500, removes exact duplicates,
and saves the cleaned dataset into JSONL files under the local data folder.
Only the prompts from this test set will be used in evaluation.
"""

from __future__ import annotations

from pathlib import Path

from datasets import load_dataset
from utils.config_loader import load_config
from utils.data_utils import (
    extract_answer,
    extract_prompt,
    filter_duplicates,
    save_jsonl,
)


def main() -> None:
    """Run test-split extraction and save cleaned JSONL outputs."""
    cfg = load_config()
    hp = cfg["hyperparams"]
    paths = cfg["paths"]

    start, end = hp["test_start"], hp["test_end"]

    print(f"Extracting test slice {start}-{end}")

    ds = load_dataset(
        paths["skywork_file"],
        split=f"train[{start}:{end + 1}]",
    )
    df = ds.to_pandas()

    # Each Skywork row stores full chat transcripts; reduce them to
    # (prompt, chosen, rejected) strings.
    df["prompt"] = df["chosen"].apply(extract_prompt)
    df["chosen"] = df["chosen"].apply(extract_answer)
    df["rejected"] = df["rejected"].apply(extract_answer)

    rows = df[["prompt", "chosen", "rejected"]].to_dict(orient="records")
    cleaned, removed = filter_duplicates(rows)

    save_jsonl(Path(paths["skywork_test_cleaned"]), cleaned)
    save_jsonl(Path(paths["skywork_test_removed"]), removed)

    print(f"Removed duplicates: {len(removed)}")
    print(f"Clean samples: {len(cleaned)}")
    print("Test extraction completed.")


if __name__ == "__main__":
    main()
```
