This directory contains an automated evaluation suite that tests the real-world effectiveness of models trained with the three OpenClaw-RL optimization methods:
| Method | Directory | Signal Type |
|---|---|---|
| Combined (RL + OPD) | openclaw-combine/ |
Weighted combination of both signals |
| Binary RL (GRPO) | openclaw-rl/ |
Scalar reward (+1/−1/0) via PRM majority voting |
| On-Policy Distillation (OPD) | openclaw-opd/ |
Token-level directional signal from hindsight hints |
We recommend using Combined (RL + OPD).
The evaluation simulates a realistic multi-turn agentic workflow using GSM8K math problems as the task domain. An external LLM (the "user") interacts with the OpenClaw agent (your trained model) through the OpenClaw gateway API, testing whether the agent can:
- Read files from its workspace
- Solve math problems with complete step-by-step reasoning
- Follow stylistic instructions (e.g., rewrite in a more natural tone)
- Write results back to files
- Grade existing solutions against ground truth
- Produce detailed, friendly feedback
The test consists of two sequential phases:
An external LLM role-plays as a lazy student who asks the OpenClaw agent to do their homework. For each GSM8K problem:
- The problem is written to
homework/i.txtin the OpenClaw workspace. - The "student" asks the agent to read the file and solve it.
- If the agent's answer looks too AI-like (bold text, numbered lists, etc.), the student tells it to rewrite in a more natural style.
- Once satisfied, the student asks the agent to append the answer to the homework file.
- The student says
HOMEWORK_DONEto end the session.
This phase tests the agent's instruction following, math reasoning, file I/O, and style adaptation abilities.
An external LLM role-plays as a teacher who grades the student's submissions. For each problem:
- The teacher provides the original question and ground truth answer to the agent.
- The agent reads the student's submission from
homework/i.txt, compares it with the correct answer, and writes grading comments. - If the comments are too brief or not friendly enough, the teacher asks for a rewrite.
- Once satisfied, the teacher asks the agent to append the comments to the file.
- The teacher says
GRADING_DONEto end the session.
This phase tests the agent's reading comprehension, evaluation accuracy, feedback quality, and multi-step file operations.
Run order matters: Run
student_chat.pyfirst so the homework files contain student solutions, then runteacher_chat.pyto grade them.
┌─────────────────────┐ ┌───────────────────────────────┐
│ External LLM │ │ OpenClaw RL Server │
│ (Student/Teacher) │ │ (your trained model) │
│ Port 30001 │ │ Port 30000 │
│ via launch_user_ │ │ via openclaw-rl/opd/combine │
│ llm.sh or closed- │ │ shell scripts │
│ source API │ │ │
└────────┬────────────┘ └──────────┬────────────────────┘
│ │
│ student/teacher messages │ agent responses
│ │
└──────────┐ ┌────────────────┘
▼ ▼
┌──────────────────┐
│ student_chat.py │
│ teacher_chat.py │
│ (orchestrator) │
└──────────────────┘
- A running OpenClaw environment (see the main README)
- Python 3.12 with
requestsandopenaipackages installed - A
GSM8K.jsondataset file (JSON array withquestionandground_truth_answerfields per entry)
The external LLM acts as the "user" (student or teacher) that drives the conversation. You have two options:
export MODEL_PATH="/path/to/your/model" # required: path to model weights
export SGLANG_API_KEY="your-api-key" # optional: API key for auth
export MODEL_NAME="qwen3-4b-user-llm" # optional: served model name
export TP_SIZE=8 # optional: tensor parallelism (default: 8)
export PORT=30001 # optional: port (default: 30001)
cd openclaw-test
bash launch_user_llm.shNo need to run launch_user_llm.sh. Just set the environment variables directly when running the test scripts (see Step 3).
Launch the RL server with the trained model you want to evaluate. Choose the method you want to test:
cd slimeCombined (RL + OPD):
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.shCombined with LoRA (parameter-efficient, fewer GPUs):
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.shBinary RL:
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.shBinary RL with LoRA (parameter-efficient, fewer GPUs):
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl_lora.shOn-Policy Distillation (OPD):
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.shOPD with LoRA (parameter-efficient, fewer GPUs):
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd_topk_lora.shEval mode: To enable evaluation logging with W&B, set
OPENCLAW_EVAL_MODE=1and provide your W&B key viaWANDB_KEYbefore launching. This is already the default in the OPD and Combine scripts.
The RL server will be available at http://0.0.0.0:30000/v1 by default.
Set the required environment variables and run:
# Required
export OPENCLAW_GATEWAY_TOKEN="your-gateway-token"
export OPENAI_API_KEY="your-external-llm-api-key"
# Optional (defaults shown)
export OPENCLAW_GATEWAY_URL="http://localhost:18789"
export OPENCLAW_WORKSPACE="$HOME/.openclaw/workspace"
export OPENAI_BASE_URL="http://localhost:30001/v1" # point to your external LLM
export EXTERNAL_MODEL="qwen3-4b-user-llm" # model name for the external LLM
# Run
python student_chat.py \
--dataset GSM8K.json \
--num-problems 36 \
--max-turns 8This will:
- Write 36 GSM8K problems to
homework/0.txtthroughhomework/35.txtin the workspace. - For each problem, run a multi-turn conversation where the student LLM asks the OpenClaw agent to solve it.
- Print a summary of how many problems were completed within the turn limit.
After all student submissions are done, run the teacher to grade them:
# Same environment variables as above
python teacher_chat.py \
--dataset GSM8K.json \
--num-problems 36 \
--max-turns 8This will:
- For each problem, the teacher LLM asks the OpenClaw agent to read the student's submission, compare with the ground truth, and write grading comments.
- Print a summary of how many problems were graded within the turn limit.
Both student_chat.py and teacher_chat.py accept the same arguments:
| Argument | Default | Description |
|---|---|---|
--dataset |
(required) | Path to the GSM8K JSON file |
--num-problems |
5 |
Number of problems to process |
--max-turns |
8 |
Maximum conversation turns per problem |
| Variable | Required | Default | Description |
|---|---|---|---|
OPENCLAW_GATEWAY_TOKEN |
Yes | — | Auth token for the OpenClaw gateway |
OPENAI_API_KEY |
Yes | — | API key for the external LLM (student/teacher) |
OPENCLAW_GATEWAY_URL |
No | http://localhost:18789 |
OpenClaw gateway base URL |
OPENCLAW_WORKSPACE |
No | ~/.openclaw/workspace |
Path to the OpenClaw workspace directory |
OPENAI_BASE_URL |
No | (OpenAI default) | Base URL for the external LLM API |
EXTERNAL_MODEL |
No | gpt-4o |
Model name for the external LLM |
openclaw-test/
├── README.md # This file
├── launch_user_llm.sh # Script to host the external LLM via SGLang
├── student_chat.py # Phase 1: Student asks agent to solve homework
├── teacher_chat.py # Phase 2: Teacher asks agent to grade homework
└── GSM8K.json # Dataset (to be placed here)