rl-values

Fine-tuning a small language model (Qwen3-0.6B) for stronger epistemic values using GRPO (Group Relative Policy Optimization).

This project targets three specific epistemic virtues:

Intellectual curiosity — genuine engagement with ideas, not rote answers.
Nonsense detection — recognizing incoherent, malformed, or unanswerable prompts.
Claim scrutiny — pushing back on false premises and illegitimate assertions.

Architecture

This repository is a streamlined training-only codebase. All inference (rollouts and judging) is offloaded to a separate Tome service. This allows for:

Efficient gradient accumulation on the local GPU.
High-throughput parallel sampling on remote nodes.
Decoupled model architectures for judging and training.

Core Components

| Directory / File | Purpose |
| --- | --- |
| `model/` | Policy architecture (Qwen3), LoRA adapters, and logprob extraction. |
| `grpo.py` | GRPO loss, group-relative advantage estimation, and training step. |
| `rubric.py` | Multi-criteria scoring client via Tome. |
| `tome_client.py` | REST client for Tome inference and weight synchronization. |
| `train.py` | Training entry point: data sampling, loop, and checkpointing. |
| `tests/` | Correctness checks for gradients and logprob parity. |
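
To make "logprob extraction" and "logprob parity" concrete, here is a minimal sketch in plain Python. It is not the code in `model/` or `tests/`; the function names `token_logprobs` and `max_abs_diff` are hypothetical, and the real implementation presumably works on GPU tensors rather than lists.

```python
import math

def token_logprobs(logits, token_ids):
    """Return log p(token_t) for each position t.

    logits:    per-position score lists, shape [T][V]
    token_ids: the T token ids that were actually sampled
    (Hypothetical helper for illustration only.)
    """
    out = []
    for row, tok in zip(logits, token_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[tok] - log_z)  # log-softmax at the sampled token
    return out

def max_abs_diff(a, b):
    """Largest elementwise gap between two logprob sequences."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A parity check in this spirit would compare the locally computed logprobs with the ones returned by the inference service for the same weights and tokens, and assert that `max_abs_diff` stays below a small tolerance.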

Setup

Install dependencies:
```
uv sync
```
Start Tome: Ensure you have a Tome scheduler running. By default, it's expected at http://localhost:8080.

Run Training:

uv run train.py --tome-url http://localhost:8080

Requires Apple Silicon (Metal GPU) for the backward pass. Weights are fetched automatically from Hugging Face (Qwen/Qwen3-0.6B).

How it Works

Rollout: train.py samples a batch of prompts and sends them to Tome.
Sampling: Tome generates $G$ completions per prompt (using current policy weights) and computes both policy and reference logprobs.
Scoring: Tome judges each completion using a multi-criteria rubric.
Backward Pass: rl-values receives the completions, rewards, and old (sampling-time) logprobs. It runs a differentiable forward pass to compute new logprobs and updates the policy with the GRPO loss.
Sync: Updated LoRA weights are pushed back to Tome for the next rollout.
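
The arithmetic behind the backward-pass step can be sketched as follows. This is an illustration of the standard GRPO formulation, not the contents of `grpo.py`: the function names, the clipping range `clip_eps`, the KL weight `beta`, and the choice of KL estimator are all assumptions, and the plain-Python version shows the numbers only (the real loss would be computed on tensors so gradients can flow).

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of G completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_token_loss(new_lp, old_lp, ref_lp, advantage,
                    clip_eps=0.2, beta=0.04):
    """Clipped surrogate plus a KL penalty for a single token.

    clip_eps and beta are assumed defaults, not values from this repo.
    """
    ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative estimator of KL(pi_new || pi_ref), an assumed choice.
    log_r = ref_lp - new_lp
    kl = math.exp(log_r) - log_r - 1
    return -(surrogate - beta * kl)

def grpo_loss(group, rewards, clip_eps=0.2, beta=0.04):
    """Mean over completions of the mean per-token loss.

    group: one dict per completion with equal-length lists of
           'new', 'old', and 'ref' per-token logprobs.
    """
    advantages = group_advantages(rewards)
    per_completion = []
    for comp, adv in zip(group, advantages):
        token_losses = [
            grpo_token_loss(n, o, r, adv, clip_eps, beta)
            for n, o, r in zip(comp["new"], comp["old"], comp["ref"])
        ]
        per_completion.append(mean(token_losses))
    return mean(per_completion)
```

Because advantages are centered within each group, a completion is only rewarded for scoring better than its siblings on the same prompt, so no separate value network is needed.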
