13 changes: 13 additions & 0 deletions slime/README.md
@@ -12,3 +12,16 @@ Then you can run:
```
modal run slime/modal_train.py::train_multi_node --config qwen-8b-multi
```

## Code-golf RL example (SLIME + Harbor + Modal sandboxes)

A dedicated example lives in:

`slime/code_golf_harbor_modal`

It adds:

- MBPP -> Harbor task conversion
- custom `--custom-rm-path` reward that runs Harbor-style verification in Modal sandboxes
- size-aware code-golf reward shaping
- checkpoint serving + Harbor eval utilities
96 changes: 96 additions & 0 deletions slime/code_golf_harbor_modal/README.md
@@ -0,0 +1,96 @@
# SLIME + Harbor + Modal code-golf example (Qwen 8B)

This example trains **Qwen3-8B** with SLIME on Modal for Python code golf.

The reward combines two signals:

1. **Correctness** via Harbor task verification.
2. **Code-size pressure** via a length bonus on top of correctness.

## What this example adds

- MBPP (`Muennighoff/mbpp`) conversion into:
- **Harbor task project format** (`tasks/<task_name>/...`)
- **SLIME parquet prompt format** (`slime/train.parquet`, `slime/test.parquet`)
- Regex-based extraction of the function name from the MBPP `code` field, injected into the prompt.
- Custom SLIME reward model (`--custom-rm-path`) that runs Harbor tests in **Modal sandboxes**.
- Harbor eval path with:
- a custom single-shot code agent (`harbor_litellm_agent.py`)
- a checkpoint server (SGLang) that serves the latest checkpoint from Modal volume
- high-concurrency Harbor runs (`-n` with `--env modal`)

## Files

- `modal_train.py` – Modal entrypoints for dataset prep, training, serving, and eval.
- `dataset_to_harbor.py` – MBPP -> Harbor + SLIME conversion function.
- `custom_rm.py` – SLIME custom reward function (Harbor+Modal sandbox scoring + length bonus).
- `harbor_litellm_agent.py` – Harbor custom agent that calls an OpenAI-compatible endpoint.
- `configs/qwen_8b_multi.py` – Qwen3-8B multi-node config based on the Modal SLIME config style.

## 1) Prepare dataset in Modal volume

Run once:

`modal run slime/code_golf_harbor_modal/modal_train.py::prepare_dataset`

Optional small smoke conversion:

`modal run slime/code_golf_harbor_modal/modal_train.py::prepare_dataset --limit 32 --train-size 24`

Data lands in Modal volume `slime-code-golf-harbor-data` under:

- `/data/mbpp_harbor/tasks`
- `/data/mbpp_harbor/slime/train.parquet`
- `/data/mbpp_harbor/slime/test.parquet`

Each Harbor task uses `environment/Dockerfile` with:

`FROM python:3.11-slim`

## 2) Download base model

Run once:

`modal run slime/code_golf_harbor_modal/modal_train.py::download_model --config qwen-8b-multi`

## 3) Train with SLIME

`modal run slime/code_golf_harbor_modal/modal_train.py::train_multi_node --config qwen-8b-multi`

The config uses:

- `--custom-rm-path custom_rm.batched_custom_rm`
- MBPP parquet generated by `prepare_dataset`
- Multi-node Qwen3-8B settings in the same style as the existing SLIME Modal configs

## 4) Serve latest checkpoint

Start a long-lived server endpoint from latest checkpoint:

`modal serve slime/code_golf_harbor_modal/modal_train.py::serve_latest_checkpoint`

Or run a one-shot eval (starts a local server inside the function, runs Harbor, then exits):

`modal run slime/code_golf_harbor_modal/modal_train.py::eval_latest_checkpoint --n-concurrent 256 --n-tasks 500`

## 5) Run Harbor eval against server (Modal sandboxes)

Given a running server URL:

`modal run slime/code_golf_harbor_modal/modal_train.py::run_harbor_eval --server-base-url https://<your-modal-url> --n-concurrent 256 --n-tasks 500`

This uses Harbor with:

- `--env modal` (trial environments are Modal sandboxes)
- custom agent import path `harbor_litellm_agent.SingleShotCodeAgent`
- OpenAI-compatible API calls to the served checkpoint endpoint

## Reward behavior

The custom RM in `custom_rm.py` computes:

- `pass_rate` from Harbor task verifier output.
- `length_bonus` from `reference_bytes / candidate_bytes` (capped).
- Final reward: correctness scaled by a positive length bonus.

So passing tests remains primary; shorter passing programs get higher reward.
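A minimal sketch of this shaping (not the actual `custom_rm.py`; the 0.2 weight default mirrors `harbor_length_bonus_weight` in `configs/base.py`, but the cap value and exact formula here are assumptions):

```python
def shaped_reward(
    pass_rate: float,
    reference_bytes: int,
    candidate_bytes: int,
    weight: float = 0.2,  # cf. harbor_length_bonus_weight in configs/base.py
    cap: float = 2.0,     # illustrative cap; the real value is not shown here
) -> float:
    """Correctness-first code-golf reward: failing code earns nothing,
    and shorter-than-reference passing code earns a bounded bonus."""
    if candidate_bytes <= 0:
        return 0.0
    # Bonus grows as the candidate shrinks below the reference, up to `cap`.
    length_bonus = min(reference_bytes / candidate_bytes, cap)
    return pass_rate * (1.0 + weight * length_bonus)
```

Under these assumptions, a fully passing solution at reference length scores 1.2, halving the size raises it toward 1.4 (where the cap binds), and any failing solution stays at 0 regardless of size.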
34 changes: 34 additions & 0 deletions slime/code_golf_harbor_modal/code_utils.py
@@ -0,0 +1,34 @@
from __future__ import annotations

import re

_CODE_FENCE_RE = re.compile(r"```(?:python)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
_FUNCTION_RE = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)


def normalize_code(text: str) -> str:
return text.replace("\r\n", "\n").replace("\r", "\n")


def extract_function_name_from_code(code: str) -> str:
match = _FUNCTION_RE.search(normalize_code(code))
if match is None:
raise ValueError("No Python function definition found in code field.")
return match.group(1)


def extract_python_code(text: str) -> str:
normalized = normalize_code(text).strip()
fenced = _CODE_FENCE_RE.findall(normalized)
if fenced:
candidates = [block.strip() for block in fenced if block.strip()]
if candidates:
for candidate in reversed(candidates):
if _FUNCTION_RE.search(candidate):
return candidate
return candidates[-1]
return normalized


def code_size_bytes(code: str) -> int:
return len(normalize_code(code).encode("utf-8"))
32 changes: 32 additions & 0 deletions slime/code_golf_harbor_modal/configs/__init__.py
@@ -0,0 +1,32 @@
from __future__ import annotations

import importlib
from pathlib import Path
from typing import Callable

from .base import RLConfig

_CONFIGS: dict[str, Callable[[], RLConfig]] = {}
_CONFIGS_DIR = Path(__file__).parent
_EXCLUDE = {"__init__.py", "base.py"}


def get_config(name: str) -> RLConfig:
if name not in _CONFIGS:
available = ", ".join(sorted(_CONFIGS.keys()))
raise ValueError(f"Unknown config: {name}. Available configs: {available}")
return _CONFIGS[name]()


def list_configs() -> list[str]:
return sorted(_CONFIGS.keys())


for file_path in _CONFIGS_DIR.glob("*.py"):
if file_path.name in _EXCLUDE:
continue
module_name = file_path.stem
config_name = module_name.replace("_", "-")
    module = importlib.import_module(f".{module_name}", package=__package__)  # resolve relative to this package, not a hard-coded "configs"
if hasattr(module, "get_config"):
_CONFIGS[config_name] = module.get_config
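The discovery loop registers each module under a hyphenated name, which is why `configs/qwen_8b_multi.py` is addressed as `--config qwen-8b-multi` on the CLI. A standalone sketch of just that naming convention (hypothetical helper, same logic as the loop above):

```python
from pathlib import Path

def config_name_for(filename: str) -> str:
    # Strip the ".py" suffix, turn underscores into hyphens,
    # matching the registry loop in configs/__init__.py.
    return Path(filename).stem.replace("_", "-")

print(config_name_for("qwen_8b_multi.py"))  # -> qwen-8b-multi
```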
88 changes: 88 additions & 0 deletions slime/code_golf_harbor_modal/configs/base.py
@@ -0,0 +1,88 @@
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RLConfig:
model_name: str
model_id: str

app_name: str = "slime-code-golf"
n_nodes: int = 4
gpu: str = "H100:8"
sync: bool = True

wandb_project: str = "slime-code-golf"
wandb_run_name_prefix: str = "qwen8b-mbpp"

slime_args: str = ""
extra_args: list[str] = field(default_factory=list)

# Custom RM runtime settings
harbor_task_root: str = "/data/mbpp_harbor/tasks"
harbor_data_volume_name: str = "slime-code-golf-harbor-data"
harbor_rm_modal_app: str = "slime-harbor-rm"
harbor_rm_max_concurrency: int = 64
harbor_rm_timeout_sec: int = 120
harbor_length_bonus_weight: float = 0.2

@property
def train_script(self) -> str:
return "slime/train.py" if self.sync else "slime/train_async.py"

def _clean_args(self, args: str) -> str:
lines: list[str] = []
for raw_line in args.splitlines():
line = raw_line.split("#", 1)[0].strip()
if line:
lines.append(line)
return " ".join(lines)

def generate_train_args(
self,
hf_model_path: str,
checkpoints_path: Path,
data_path: Path,
) -> str:
base_args = f"--hf-checkpoint {hf_model_path} --ref-load {hf_model_path}"

cleaned = self._clean_args(self.slime_args)
cleaned = cleaned.replace("{data_path}", str(data_path))
cleaned = cleaned.replace("{checkpoints_path}", str(checkpoints_path))

extra = " ".join(self.extra_args) if self.extra_args else ""
return f"{base_args} {cleaned} {extra}".strip()


QWEN3_8B_MODEL_ARGS = """
--num-layers 36 --hidden-size 4096 --ffn-hidden-size 12288
--num-attention-heads 32 --group-query-attention --num-query-groups 8
--kv-channels 128 --vocab-size 151936
--normalization RMSNorm --norm-epsilon 1e-6 --swiglu
--disable-bias-linear --qk-layernorm
--use-rotary-position-embeddings --rotary-base 1000000
"""

DEFAULT_TRAINING_ARGS = """
--tensor-model-parallel-size 2 --sequence-parallel
--recompute-granularity full --recompute-method uniform --recompute-num-layers 1
--use-dynamic-batch-size --max-tokens-per-gpu 9216
--megatron-to-hf-mode bridge
--attention-dropout 0.0 --hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32 --attention-softmax-in-fp32
"""

DEFAULT_OPTIMIZER_ARGS = """
--optimizer adam
--lr 1e-6 --lr-decay-style constant
--weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.98
"""

DEFAULT_GRPO_ARGS = """
--advantage-estimator grpo
--use-kl-loss --kl-loss-coef 0.00 --kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2 --eps-clip-high 0.28
"""
68 changes: 68 additions & 0 deletions slime/code_golf_harbor_modal/configs/qwen_8b_multi.py
@@ -0,0 +1,68 @@
from __future__ import annotations

from .base import (
DEFAULT_GRPO_ARGS,
DEFAULT_OPTIMIZER_ARGS,
DEFAULT_TRAINING_ARGS,
QWEN3_8B_MODEL_ARGS,
RLConfig,
)


def get_config() -> RLConfig:
return RLConfig(
model_name="Qwen3-8B",
model_id="Qwen/Qwen3-8B",
app_name="slime-qwen8b-code-golf",
n_nodes=4,
gpu="H100:8",
sync=True,
wandb_project="slime-code-golf",
wandb_run_name_prefix="qwen8b-mbpp-harbor",
slime_args=f"""
# Model
{QWEN3_8B_MODEL_ARGS}

# Training + optimizer + GRPO
{DEFAULT_TRAINING_ARGS}
{DEFAULT_OPTIMIZER_ARGS}
{DEFAULT_GRPO_ARGS}

# Dataset format
--input-key messages
--label-key label
--apply-chat-template
--prompt-data {{data_path}}/mbpp_harbor/slime/train.parquet
--eval-prompt-data mbpp {{data_path}}/mbpp_harbor/slime/test.parquet

# Rollout / batching
--num-rollout 2000
--rollout-batch-size 128
--n-samples-per-prompt 8
--global-batch-size 1024
--rollout-max-response-len 1024
--rollout-temperature 0.9
--eval-max-response-len 1024
--n-samples-per-eval-prompt 8

# Custom reward model (Harbor + Modal sandbox scoring)
--rm-type math
--custom-rm-path custom_rm.batched_custom_rm

# SGLang rollout engines
--rollout-num-gpus-per-engine 2
--sglang-mem-fraction-static 0.7

# Distributed orchestration
--actor-num-nodes 4
--actor-num-gpus-per-node 8
--colocate

# Eval cadence
--eval-interval 20
--eval-top-p 1

# Save checkpoints to volume
--save {{checkpoints_path}}/qwen8b_code_golf
""",
)