
Commit d6a63b0

add GRPO iterative training pipeline

1 parent d01ac0c · commit d6a63b0
19 files changed, +3503 -14 lines

.gitignore

Lines changed: 47 additions & 0 deletions
```diff
@@ -207,3 +207,50 @@ marimo/_lsp/
 __marimo__/
 
 redis-data/*
+
+# Project-specific files to ignore
+RESEARCH_PROGRESS_REPORT_*.md
+PROGRESS_REPORT_*.md
+CODE_CLEANUP_ANALYSIS.md
+CLEANUP_RECOMMENDATIONS.md
+FINAL_CLEANUP_SUMMARY.md
+
+# Training artifacts
+**/checkpoints/
+**/grpo_checkpoints/
+**/rm_checkpoints*/
+**/sft_checkpoints*/
+training_results*/
+fresh_training_results*/
+archive/
+
+# WandB artifacts
+wandb/
+**/wandb/
+
+# Training logs and outputs
+*.log
+training_log.out
+*_training.log
+training_*.log
+
+# Data files (scenario databases can be regenerated)
+scenarios.db
+*.rdb
+temp-*.rdb
+
+# Model artifacts and large files
+*.bin
+*.safetensors
+*.pt
+*.pth
+
+# Configuration files with potentially sensitive data
+training_config.json
+*_config.json
+
+# Temporary and cache files
+*.tmp
+*.bak
+*.old
+*~
```

.gitmodules

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+[submodule "sotopia-rl"]
+	path = sotopia-rl
+	url = git@github.com:Keyu-He/sotopia-rl.git
```

README.md

Lines changed: 139 additions & 14 deletions
Previous README (removed):

# sotopia-verifiable
A collection of verifiable games in sotopia format

- scenario_generator.py
  * Defines social-interaction "axes" as code->label mappings
  * Samples random combinations of those axes
  * Prompts the OpenAI API to produce JSON-structured scenarios
  * Inserts each scenario (with UUID and axis codes) into SQLite via DBHelper

- scenario_runner.py
  * Takes a scenario UUID as CLI argument
  * Loads that row from SQLite and parses the richer agents_json field
  * Ensures AgentProfile entries (with full profile fields) and an EnvironmentProfile in Redis
  * Builds a UniformSampler and invokes run_async_server to execute the scenario as a multi-agent game

New README (added):

# Sotopia-Verifiable

Training social AI agents through self-play with verifiable rewards. Instead of relying on subjective human ratings or LLM judges, we use scenarios with explicit rules and binary win/loss conditions that can be formally verified.

## The Problem

Most social AI training uses feedback that's either subjective (human preferences) or gameable (LLM judges). This leads to reward hacking and inconsistent evaluation. We need a better way.

## Our Approach

We train agents on social scenarios where outcomes are objectively verifiable: negotiation games with clear rules, resource allocation with defined success criteria, cooperative tasks with measurable goals. The agent learns through self-play against a strategic opponent (currently GPT-4o), receiving a clean binary reward based on whether it achieved the scenario's win condition.

## Quick Start

### Setup
```bash
conda activate sotopia-rl
cd /home/keyuh/sotopia-verifiable
```

### Run Complete Training Pipeline
```bash
python self_play_training.py \
    --num_iterations 3 \
    --games_per_scenario 10 \
    --output_dir training_results
```

This will:
1. Generate self-play games between the trainee (Qwen2.5-7B) and the partner (GPT-4o)
2. Convert the games into training data with binary rewards
3. Train the trainee with GRPO (Group Relative Policy Optimization)
4. Iterate to improve performance

### Or Run Steps Manually

Collect training data:
```bash
python training_data_collector.py \
    --trainee_model_path None \
    --num_games 50 \
    --output_dir training_data
```
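
Each collected game is reduced to prompt/response examples that carry the verified outcome as a scalar reward. The record below is purely illustrative; the real schema is whatever training_data_collector.py writes, and every field name here is an assumption.

```python
# Purely illustrative training example (all field names are assumptions;
# see training_data_collector.py for the format actually written to disk).
example = {
    "scenario_id": "example-uuid",  # scenario the game was played on
    "prompt": "You are Agent A. Negotiate the laptop price; you can pay at most $450.",
    "response": "I can stretch to $440, final offer. FINAL_BID: 440",
    "reward": 1.0,  # +1 win, -1 loss, 0 draw, as verified from the transcript
}
```
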
Train the model:
```bash
cd fresh_training_results/iteration_1
bash train_grpo.sh
```

Evaluate performance:
```bash
python self_play_evaluator.py \
    --trainee_model_path checkpoints/policy_adapter \
    --num_games 20 \
    --output_path results/evaluation.json
```
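
For multiple iterations, the three manual steps can be chained in a small driver script. The sketch below only illustrates that wiring; it is not the contents of self_play_training.py, and it assumes each iteration directory carries its own train_grpo.sh the way iteration_1 does.

```python
# Illustrative wiring of the manual steps into an iterative loop.
import subprocess
from pathlib import Path

NUM_ITERATIONS = 3
trainee_path = "None"  # iteration 1: no adapter yet, start from the base model

for i in range(1, NUM_ITERATIONS + 1):
    iter_dir = Path("fresh_training_results") / f"iteration_{i}"

    # 1. Collect self-play games against the GPT-4o partner
    subprocess.run(
        ["python", "training_data_collector.py",
         "--trainee_model_path", trainee_path,
         "--num_games", "50",
         "--output_dir", str(iter_dir / "training_data")],
        check=True,
    )

    # 2. GRPO training with LoRA adapters (per-iteration script, as in iteration_1)
    subprocess.run(["bash", "train_grpo.sh"], cwd=iter_dir, check=True)

    # 3. Evaluate the freshly trained adapter; it seeds the next iteration
    trainee_path = str(iter_dir / "checkpoints" / "policy_adapter")
    subprocess.run(
        ["python", "self_play_evaluator.py",
         "--trainee_model_path", trainee_path,
         "--num_games", "20",
         "--output_path", str(iter_dir / "evaluation.json")],
        check=True,
    )
```
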
## How It Works

### Scenarios
We generate social interaction scenarios based on established social science theories. Each scenario has:
- A clear context (negotiation, resource allocation, cooperation task, etc.)
- Explicit win conditions that can be verified through pattern matching (see the example below)
- Strategic depth that requires actual reasoning, not just following a script
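
Concretely, a scenario can be pictured as a record whose win condition is a machine-checkable rule rather than a judge's opinion. The record below is a hypothetical illustration; the field names and the regex are assumptions, not the schema actually stored in scenarios.db.

```python
# Hypothetical scenario record with a machine-checkable win condition.
# Field names and the regex are illustrative, not the project's actual schema
# (real records live in scenarios.db and come from scenario_generator.py).
scenario = {
    "uuid": "example-uuid",  # assigned by the generator in practice
    "context": "Two agents negotiate the sale price of a used laptop.",
    "agent_goals": {
        "trainee": "Buy the laptop for at most $450.",
        "partner": "Sell the laptop for at least $500.",
    },
    # The outcome must be declared in the transcript in a fixed format,
    # so it can be checked by pattern matching instead of a judge:
    "win_pattern": r"FINAL_BID:\s*\$?(\d+)",
    "trainee_wins_if": "int(final_bid) <= 450",
}
```
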
### Training Loop
1. Load a scenario from the database
2. Run the conversation between trainee and partner
3. Verify the outcome using formal patterns (FINAL_BID, ALLOCATION, etc.)
4. Assign the reward: +1 win, -1 loss, 0 draw (verification and reward assignment are sketched below)
5. Update the model using GRPO with LoRA adapters
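
A minimal sketch of steps 3 and 4, assuming the verifier scans the final transcript for the scenario's declared pattern and maps the result to a reward. The project's real logic lives in structured_social_verifier.py; the function and arguments below are illustrative.

```python
# Illustrative outcome verification and reward assignment (steps 3-4 above).
import re
from typing import Callable

def assign_reward(transcript: str, win_pattern: str,
                  trainee_wins: Callable[[re.Match], bool]) -> float:
    """+1 if the trainee met the win condition, -1 if it resolved against it,
    0 if no verifiable outcome appears in the transcript."""
    match = re.search(win_pattern, transcript)
    if match is None:
        return 0.0          # draw: nothing to verify
    return 1.0 if trainee_wins(match) else -1.0

transcript = "Partner: deal. Trainee: great, confirming. FINAL_BID: 430"
print(assign_reward(
    transcript,
    win_pattern=r"FINAL_BID:\s*(\d+)",
    trainee_wins=lambda m: int(m.group(1)) <= 450,
))  # -> 1.0
```
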
### Technical Stack
- **Base Model:** Qwen2.5-7B with LoRA (392M trainable params)
- **Partner Model:** GPT-4o (fixed, provides strategic opposition)
- **Training:** GRPO with binary rewards (see the sketch below)
- **Infrastructure:** Multi-GPU support, WandB tracking
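
For context, GRPO (Group Relative Policy Optimization) scores each completion relative to the other completions sampled for the same prompt, so no separate value model is needed: with group rewards r_1..r_G, the advantage of completion i is roughly (r_i - mean(r)) / std(r). The actual update is handled by the sotopia-rl trainer; the snippet below only illustrates that group-relative normalization on this project's win/loss/draw rewards.

```python
# Minimal sketch of GRPO's group-relative advantage on win/loss/draw rewards.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group = several games played from the same scenario prompt.
print(group_advantages([1.0, -1.0, 1.0, 0.0]))  # wins > 0, losses < 0
```
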
## Project Structure

```
sotopia-verifiable/
├── scenarios/                       # Scenario generation and database
│   ├── scenario_generator.py
│   ├── scenarios.db
│   └── db_helper.py
├── self_play_evaluator.py           # Core self-play framework
├── training_data_collector.py       # Convert games to training data
├── structured_social_verifier.py    # Outcome verification
├── fresh_training_results/          # Training experiments
│   └── iteration_1/
│       ├── train_grpo.sh
│       ├── training_data/
│       └── checkpoints/
└── sotopia-rl/                      # Training infrastructure
```

## Monitoring Progress

Training metrics are automatically tracked on WandB: https://wandb.ai/keyuhe/grpo-model-training
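
If you want to log additional metrics to the same dashboard, the standard wandb client calls are enough; the snippet below is a generic example (the entity and project names follow the URL above, while the metric names are made up).

```python
import wandb

# Generic example of logging an extra metric to the tracking project above.
# Entity/project follow the dashboard URL; the metric names are illustrative.
run = wandb.init(entity="keyuhe", project="grpo-model-training")
run.log({"iteration": 1, "win_rate_vs_gpt4o": 0.45})
run.finish()
```
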
Expected progression:
- Iteration 1: ~45% win rate vs GPT-4o
- Iteration 2: ~60% win rate with better strategic understanding
- Iteration 3: ~70% win rate with improved social awareness

## Current Status

The training pipeline is operational and we're running initial experiments. First results suggest the approach is working: agents are learning to win scenarios through strategic interaction rather than just mimicking patterns.

### What's Working
- Scenario generation from social science theories
- Self-play game execution with GPT-4o
- Formal verification of outcomes
- GRPO training with LoRA adapters

### What We're Improving
- Scenario diversity and complexity
- Partner model selection (considering curriculum learning)
- Evaluation metrics beyond win rate
- Transfer to open-ended social interaction

## For Contributors

### Prerequisites
- CUDA-capable GPU (tested on an RTX A6000)
- Python 3.10+ with PyTorch
- OpenAI API access
- WandB account

### Development
1. Test basic functionality: `python test_self_play.py`
2. Generate new scenarios: `cd scenarios && python scenario_generator.py`
3. Run training experiments
4. Monitor on WandB
5. Evaluate and iterate

The codebase is actively being developed. Feel free to explore and experiment with different approaches to scenario design, reward structures, and training algorithms.
