This project fine-tunes a small language model (Qwen3-0.6B) for stronger epistemic values using GRPO (Group Relative Policy Optimization).
This project targets three specific epistemic virtues:
- Intellectual curiosity — genuine engagement with ideas, not rote answers.
- Nonsense detection — recognizing incoherent, malformed, or unanswerable prompts.
- Claim scrutiny — pushing back on false premises and illegitimate assertions.
This repository is a streamlined training-only codebase. All inference (rollouts and judging) is offloaded to a separate Tome service. This allows for:
- Efficient gradient accumulation on the local GPU.
- High-throughput parallel sampling on remote nodes.
- Decoupled model architectures for judging and training.
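As a sketch of the decoupled contract this implies, the exchange between the trainer and Tome might carry payloads shaped like the following (the field names and `group_size` default here are illustrative assumptions, not the actual Tome API):

```python
from dataclasses import dataclass, field


@dataclass
class RolloutRequest:
    # Prompts sampled locally by the training loop.
    prompts: list[str]
    # Completions Tome generates per prompt (the GRPO group size G) -- assumed default.
    group_size: int = 8
    # Sampling temperature used on the remote nodes.
    temperature: float = 1.0


@dataclass
class RolloutResponse:
    # Flattened to len(prompts) * group_size entries, grouped by prompt.
    completions: list[str] = field(default_factory=list)
    # Per-completion token logprobs under the current policy weights.
    policy_logprobs: list[list[float]] = field(default_factory=list)
    # Per-completion token logprobs under the frozen reference model.
    reference_logprobs: list[list[float]] = field(default_factory=list)
```

Because Tome returns both policy and reference logprobs, the local GPU never has to run the reference model at all.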
| Directory / File | Purpose |
|---|---|
| `model/` | Policy architecture (Qwen3), LoRA adapters, and logprob extraction. |
| `grpo.py` | GRPO loss, group-relative advantage estimation, and training step. |
| `rubric.py` | Multi-criteria scoring client via Tome. |
| `tome_client.py` | REST client for Tome inference and weight synchronization. |
| `train.py` | Training entry point: data sampling, loop, and checkpointing. |
| `tests/` | Correctness checks for gradients and logprob parity. |
1. **Install dependencies:**

   ```bash
   uv sync
   ```

2. **Start Tome:** ensure you have a Tome scheduler running. By default, it's expected at `http://localhost:8080`.

3. **Run training:**

   ```bash
   uv run train.py --tome-url http://localhost:8080
   ```

   Requires Apple Silicon (Metal GPU) for the backward pass. Weights are fetched automatically from Hugging Face (`Qwen/Qwen3-0.6B`).
1. **Rollout:** `train.py` samples a batch of prompts and sends them to Tome.
2. **Sampling:** Tome generates $G$ completions per prompt (using current policy weights) and computes both policy and reference logprobs.
3. **Scoring:** Tome judges each completion using a multi-criteria rubric.
4. **Backward pass:** `rl-values` receives completions, rewards, and old logprobs. It performs a differentiable forward pass to compute new logprobs and updates the policy via the GRPO loss.
5. **Sync:** updated LoRA weights are pushed back to Tome for the next rollout.
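The group-relative advantage estimation at the heart of the backward pass can be sketched as follows, assuming rewards arrive as a flat list grouped by prompt (a simplified stand-in for what `grpo.py` implements, not its actual code):

```python
import math


def group_relative_advantages(rewards: list[float], group_size: int,
                              eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own prompt group's statistics.

    rewards: flat list of length num_prompts * group_size, where each
    consecutive block of group_size entries shares one prompt.
    Returns one advantage per completion: (r - group mean) / (group std + eps).
    """
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        std = math.sqrt(sum((r - mean) ** 2 for r in group) / len(group))
        # eps guards against zero std when all completions score identically.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

Because advantages are normalized within each group rather than against a learned value baseline, GRPO needs no critic network, which keeps the local backward pass to just the LoRA-adapted policy.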