This directory trains and evaluates vision-language models (Qwen3-VL) to operate real desktop environments through screenshots and pyautogui actions. The agent observes the screen, reasons about the task, and emits keyboard/mouse actions — all inside cloud VMs managed by a pool server.
┌────────────────────┐ ┌────────────────────────────────────┐
│ Env Pool Server │ │ Ray Cluster │
│ (Flask, port │ HTTP │ ┌──────────┐ ┌───────────────┐ │
│ 18080) │◄──────►│ │ Rollout │ │ Actor / │ │
│ │ │ │ (SGLang) │ │ Trainer │ │
│ allocate / reset │ │ └──────────┘ └───────────────┘ │
│ step / evaluate │ │ ▲ ▲ │
│ close │ │ │ screenshots │ weights │
└────────┬───────────┘ └───────┼───────────────┼────────────┘
│ │ │
│ VM lifecycle │ │
▼ │ │
┌────────────────────┐ ┌───────┴───────────────┴───┐
│ Cloud VMs │ │ generate_with_gui.py │
│ (Volcengine / │ │ (trajectory loop: │
│ AWS / Aliyun) │ │ screenshot → agent → │
│ Ubuntu desktops │ │ action → step → reward)│
└────────────────────┘ └───────────────────────────┘
Env Pool Server — A Flask service that manages a pool of cloud VM instances. It handles allocation, task reset (snapshot restore + setup scripts), action execution, evaluation, and teardown. VMs are pre-warmed at launch to eliminate cold-start latency during training.
Agent (Qwen3VLAgentLocal) — Converts screenshot observations into multi-turn VLM conversations (system prompt + image history + action output), and parses the model's response into executable pyautogui calls.
Reward Agent (Qwen3VLRewardAgent, PRM mode only) — A separate VLM that scores each agent step by comparing the action's intent against the next observation, producing +1/−1 per-step rewards for step-wise advantage estimation.
Trajectory Loop (generate_with_gui.py) — Orchestrates multi-step episodes: allocate a VM, reset to a task, loop (screenshot → model inference → parse action → execute → observe), compute final reward via OSWorld evaluators, and return the trajectory for training.
| Mode | Script | Description |
|---|---|---|
| Eval-only | gui_qwen3vl_4b_eval.sh |
Load a checkpoint, run all eval tasks, report accuracy. No training. |
| RL Training | gui_qwen3vl_8b_rl.sh |
GRPO-based RL with sequence-level binary rewards from OSWorld evaluators. |
| PRM + RL Training | gui_qwen3vl_8b_prm_rl.sh |
RL with an additional Process Reward Model that provides per-step rewards via a separate VLM judge. |
- 8 GPUs (default; configurable via
NUM_GPUS) - A cloud provider account (Volcengine by default) with VM image and network configured
- Volcengine secrets (
VOLC_ACCESSKEY,VOLC_SECRETKEY) exported in your shell - Python environment with
requirements.txtinstalled - SLIME framework and Megatron-LM available in sibling directories
Install the GUI-specific dependencies:
pip install -r requirements.txt
playwright install chromiumRun a checkpoint on the full eval task suite without any training:
# Optional: point to your checkpoint (default: Qwen3-VL-4B-Thinking)
export HF_CKPT="/path/to/checkpoint"
# Optional: W&B logging
export WANDB_KEY="your-wandb-key"
cd slime
bash ../gui-rl/gui_qwen3vl_4b_eval.shThe script:
- Starts the env pool server and pre-warms VMs.
- Launches a Ray cluster with all GPUs dedicated to rollout (no actor).
- Runs every task in
evaluation_examples/test_nochrome.json. - Reports accuracy per domain.
Train with sequence-level binary rewards — the agent gets +1 if the OSWorld evaluator marks the task as successful, −1 otherwise:
export HF_CKPT="/path/to/Qwen3-VL-8B-Thinking"
export WANDB_KEY="your-wandb-key"
# Optional GPU split (default: 4 actor + 4 rollout)
export ACTOR_GPUS=4
export ROLLOUT_GPUS=4
cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.shThe script partitions GPUs between the training actor and the SGLang rollout engine, then alternates between:
- Rollout: generate trajectories across
n_samples_per_promptparallel attempts per task. - Train: update the policy using GRPO with KL regularization.
- Eval: periodically run the eval task set and log accuracy to W&B.
Adds a Process Reward Model that scores each step, enabling step-wise advantage estimation instead of sequence-level:
export HF_CKPT="/path/to/Qwen3-VL-8B-Thinking"
export PRM_MODEL_PATH="/path/to/Qwen3-VL-4B-Thinking"
export WANDB_KEY="your-wandb-key"
# GPU partition: actor + rollout + PRM must equal NUM_GPUS
export ACTOR_GPUS=4
export ROLLOUT_GPUS=3
export PRM_GPUS=1
cd slime
bash ../gui-rl/gui_qwen3vl_8b_prm_rl.shThe PRM agent takes the interaction history plus the next observation after each step, and assigns +1 (progress) or −1 (no effect / regression). These per-step rewards feed into a step_wise advantage estimator, providing denser gradient signal than the final-outcome-only GRPO.
| Variable | Default | Description |
|---|---|---|
NUM_GPUS |
8 |
Total GPUs on the node |
ACTOR_GPUS |
4 |
GPUs for the Megatron training actor |
ROLLOUT_GPUS |
4 (3 for PRM) |
GPUs for the SGLang rollout engine |
PRM_GPUS |
1 |
GPUs for the PRM judge engine (PRM mode) |
HF_CKPT |
(model-specific) | HuggingFace checkpoint path |
PRM_MODEL_PATH |
— | Checkpoint for the PRM reward model |
ROLLOUT_BATCH_SIZE |
8 |
Tasks per rollout batch |
N_SAMPLES_PER_PROMPT |
8 |
Parallel attempts per task |
| Variable | Default | Description |
|---|---|---|
GUI_POOL_MAX_ENVS |
64 |
Max concurrent VM instances |
GUI_PREWARM_ENVS |
64 |
VMs to pre-warm before training starts |
GUI_PROVIDER_NAME |
volcengine |
Cloud provider (volcengine / aws / aliyun) |
GUI_REGION |
cn-beijing |
Cloud region for VM provisioning |
GUI_ACTION_SPACE |
pyautogui |
Action space for the agent |
GUI_OBSERVATION_TYPE |
screenshot |
Observation modality |
GUI_SCREEN_WIDTH |
1920 |
VM screen resolution width |
GUI_SCREEN_HEIGHT |
1080 |
VM screen resolution height |
| Variable | Default | Description |
|---|---|---|
WANDB_KEY |
— | Weights & Biases API key |
WANDB_PROJECT |
slime_gui |
W&B project name |
GUI_RESULT_DIR |
./results |
Directory for trajectory logs and eval results |
Secrets must be exported before running:
export VOLC_ACCESSKEY="your-access-key"
export VOLC_SECRETKEY="your-secret-key"Non-secret configs (VOLCENGINE_IMAGE_ID, VOLCENGINE_SUBNET_ID, etc.) have defaults in the shell scripts and can be overridden.
gui-rl/
├── README.md
├── requirements.txt
│
├── gui_qwen3vl_4b_eval.sh # Eval-only (4B, single node)
├── gui_qwen3vl_8b_rl.sh # RL training (8B, GRPO)
├── gui_qwen3vl_8b_prm_rl.sh # PRM + RL training (8B, step-wise)
│
├── generate_with_gui.py # Trajectory loop: screenshot → action → step → reward
├── env_pool_server.py # Flask server managing VM pool lifecycle
├── env_client.py # Async HTTP client for the env pool server
├── gui_data_source.py # DataSource that loads tasks from meta JSON files
├── build_gui_prompt_dataset.py # Utility to convert meta JSON → JSONL prompt dataset
│
├── agents/
│ ├── qwen3vl_agent.py # Policy agent: screenshot → VLM → pyautogui action
│ ├── qwen3vl_reward_agent.py # PRM reward agent: step scoring via separate VLM
│ └── utils/
│ └── qwen_vl_utils.py # Image resizing utilities for Qwen VL
│
├── desktop_env/ # VM environment abstraction (forked from OSWorld)
│ ├── desktop_env.py # DesktopEnv class: VM connection, screenshot, action execution
│ ├── actions.py # Action parsing and execution
│ ├── controllers/ # Remote execution controllers (Python, setup scripts)
│ ├── evaluators/ # OSWorld task evaluators (getters + metrics)
│ ├── providers/ # Cloud VM providers (Volcengine, AWS, Aliyun)
│ └── server/ # VM-side server components
│
├── evaluation_examples/ # Task definitions
│ ├── examples/ # Per-domain task configs (chrome, os, vlc, word, ...)
│ ├── test_nochrome.json # Eval task meta (excludes Chrome tasks)
│ ├── train_nochrome.json # Train task meta
│ ├── test_all.json # Full eval set
│ └── settings/ # App-specific config templates
│
└── results/ # Runtime output (auto-created, not tracked in git)