13 changes: 13 additions & 0 deletions slime/README.md
@@ -12,3 +12,16 @@ Then you can run:
```
modal run slime/modal_train.py::train_multi_node --config qwen-8b-multi
```

## Code-golf RL example (SLIME + Harbor + Modal sandboxes)

A dedicated example lives in:

`slime/code_golf_harbor_modal`

It adds:

- MBPP -> Harbor task conversion
- custom `--custom-rm-path` reward that runs Harbor-style verification in Modal sandboxes
- size-aware code-golf reward shaping
- checkpoint serving + Harbor eval utilities
96 changes: 96 additions & 0 deletions slime/code_golf_harbor_modal/README.md
@@ -0,0 +1,96 @@
# SLIME + Harbor + Modal code-golf example (Qwen 8B)

This example trains **Qwen3-8B** with SLIME on Modal for Python code golf.

The reward combines two signals:

1. **Correctness** via Harbor task verification.
2. **Code-size pressure** via a length bonus on top of correctness.

## What this example adds

- MBPP (`Muennighoff/mbpp`) conversion into:
- **Harbor task project format** (`tasks/<task_name>/...`)
- **SLIME parquet prompt format** (`slime/train.parquet`, `slime/test.parquet`)
- Regex-based extraction of the function name from the MBPP `code` field, injected into the prompt.
- Custom SLIME reward model (`--custom-rm-path`) that runs Harbor tests in **Modal sandboxes**.
- Harbor eval path with:
- a custom single-shot code agent (`harbor_litellm_agent.py`)
- a checkpoint server (SGLang) that serves the latest checkpoint from Modal volume
- high-concurrency Harbor runs (`-n` with `--env modal`)

## Files

- `modal_train.py` – Modal entrypoints for dataset prep, training, serving, and eval.
- `dataset_to_harbor.py` – MBPP -> Harbor + SLIME conversion function.
- `custom_rm.py` – SLIME custom reward function (Harbor+Modal sandbox scoring + length bonus).
- `harbor_litellm_agent.py` – Harbor custom agent that calls an OpenAI-compatible endpoint.
- `configs/qwen_8b_multi.py` – Qwen3-8B multi-node config based on the Modal SLIME config style.

## 1) Prepare dataset in Modal volume

Run once:

`modal run slime/code_golf_harbor_modal/modal_train.py::prepare_dataset`

Optional small smoke conversion:

`modal run slime/code_golf_harbor_modal/modal_train.py::prepare_dataset --limit 32 --train-size 24`

Data lands in Modal volume `slime-code-golf-harbor-data` under:

- `/data/mbpp_harbor/tasks`
- `/data/mbpp_harbor/slime/train.parquet`
- `/data/mbpp_harbor/slime/test.parquet`

Each Harbor task uses `environment/Dockerfile` with:

`FROM python:3.11-slim`

## 2) Download base model

Run once:

`modal run slime/code_golf_harbor_modal/modal_train.py::download_model --config qwen-8b-multi`

## 3) Train with SLIME

`modal run slime/code_golf_harbor_modal/modal_train.py::train_multi_node --config qwen-8b-multi`

The config uses:

- `--custom-rm-path custom_rm.batched_custom_rm`
- MBPP parquet generated by `prepare_dataset`
- Multi-node Qwen3-8B settings in the same style as the existing SLIME Modal configs

## 4) Serve latest checkpoint

Start a long-lived server endpoint from latest checkpoint:

`modal serve slime/code_golf_harbor_modal/modal_train.py::serve_latest_checkpoint`

Or run a one-shot eval (starts a local server inside the function, runs Harbor, then exits):

`modal run slime/code_golf_harbor_modal/modal_train.py::eval_latest_checkpoint --n-concurrent 256 --n-tasks 500`

## 5) Run Harbor eval against server (Modal sandboxes)

Given a running server URL:

`modal run slime/code_golf_harbor_modal/modal_train.py::run_harbor_eval --server-base-url https://<your-modal-url> --n-concurrent 256 --n-tasks 500`

This uses Harbor with:

- `--env modal` (trial environments are Modal sandboxes)
- custom agent import path `harbor_litellm_agent.SingleShotCodeAgent`
- OpenAI-compatible API calls to the served checkpoint endpoint

## Reward behavior

The custom RM in `custom_rm.py` computes:

- `pass_rate` from Harbor task verifier output.
- `length_bonus` from `reference_bytes / candidate_bytes` (capped).
- Final reward: correctness scaled by a positive length bonus.

So passing tests remains primary; shorter passing programs get higher reward.
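A minimal sketch of this shaping (not the actual `custom_rm.py`; the 0.2 weight default mirrors `harbor_length_bonus_weight` in `configs/base.py`, but the cap value and exact formula here are assumptions):

```python
def shaped_reward(
    pass_rate: float,
    reference_bytes: int,
    candidate_bytes: int,
    weight: float = 0.2,  # cf. harbor_length_bonus_weight in configs/base.py
    cap: float = 2.0,     # illustrative cap; the real value is not shown here
) -> float:
    """Correctness-first code-golf reward: failing code earns nothing,
    and shorter-than-reference passing code earns a bounded bonus."""
    if candidate_bytes <= 0:
        return 0.0
    # Bonus grows as the candidate shrinks below the reference, up to `cap`.
    length_bonus = min(reference_bytes / candidate_bytes, cap)
    return pass_rate * (1.0 + weight * length_bonus)
```

Under these assumptions, a fully passing solution at reference length scores 1.2, halving the size raises it toward 1.4 (where the cap binds), and any failing solution stays at 0 regardless of size.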
34 changes: 34 additions & 0 deletions slime/code_golf_harbor_modal/code_utils.py
@@ -0,0 +1,34 @@
from __future__ import annotations

import re

_CODE_FENCE_RE = re.compile(r"```(?:python)?\s*(.*?)```", re.IGNORECASE | re.DOTALL)
_FUNCTION_RE = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)


def normalize_code(text: str) -> str:
return text.replace("\r\n", "\n").replace("\r", "\n")


def extract_function_name_from_code(code: str) -> str:
match = _FUNCTION_RE.search(normalize_code(code))
if match is None:
raise ValueError("No Python function definition found in code field.")
return match.group(1)


def extract_python_code(text: str) -> str:
normalized = normalize_code(text).strip()
fenced = _CODE_FENCE_RE.findall(normalized)
if fenced:
candidates = [block.strip() for block in fenced if block.strip()]
if candidates:
for candidate in reversed(candidates):
if _FUNCTION_RE.search(candidate):
return candidate
return candidates[-1]
return normalized


def code_size_bytes(code: str) -> int:
return len(normalize_code(code).encode("utf-8"))
32 changes: 32 additions & 0 deletions slime/code_golf_harbor_modal/configs/__init__.py
@@ -0,0 +1,32 @@
from __future__ import annotations

import importlib
from pathlib import Path
from typing import Callable

from .base import RLConfig

_CONFIGS: dict[str, Callable[[], RLConfig]] = {}
_CONFIGS_DIR = Path(__file__).parent
_EXCLUDE = {"__init__.py", "base.py"}


def get_config(name: str) -> RLConfig:
if name not in _CONFIGS:
available = ", ".join(sorted(_CONFIGS.keys()))
raise ValueError(f"Unknown config: {name}. Available configs: {available}")
return _CONFIGS[name]()


def list_configs() -> list[str]:
return sorted(_CONFIGS.keys())


for file_path in _CONFIGS_DIR.glob("*.py"):
if file_path.name in _EXCLUDE:
continue
module_name = file_path.stem
config_name = module_name.replace("_", "-")
    module = importlib.import_module(f".{module_name}", package=__package__)  # resolve relative to this package, not a hard-coded "configs"
if hasattr(module, "get_config"):
_CONFIGS[config_name] = module.get_config
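The discovery loop registers each module under a hyphenated name, which is why `configs/qwen_8b_multi.py` is addressed as `--config qwen-8b-multi` on the CLI. A standalone sketch of just that naming convention (hypothetical helper, same logic as the loop above):

```python
from pathlib import Path

def config_name_for(filename: str) -> str:
    # Strip the ".py" suffix, turn underscores into hyphens,
    # matching the registry loop in configs/__init__.py.
    return Path(filename).stem.replace("_", "-")

print(config_name_for("qwen_8b_multi.py"))  # -> qwen-8b-multi
```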
88 changes: 88 additions & 0 deletions slime/code_golf_harbor_modal/configs/base.py
@@ -0,0 +1,88 @@
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RLConfig:
model_name: str
model_id: str

app_name: str = "slime-code-golf"
n_nodes: int = 4
gpu: str = "H100:8"
sync: bool = True

wandb_project: str = "slime-code-golf"
wandb_run_name_prefix: str = "qwen8b-mbpp"

slime_args: str = ""
extra_args: list[str] = field(default_factory=list)

# Custom RM runtime settings
harbor_task_root: str = "/data/mbpp_harbor/tasks"
harbor_data_volume_name: str = "slime-code-golf-harbor-data"
harbor_rm_modal_app: str = "slime-harbor-rm"
harbor_rm_max_concurrency: int = 64
harbor_rm_timeout_sec: int = 120
harbor_length_bonus_weight: float = 0.2

@property
def train_script(self) -> str:
return "slime/train.py" if self.sync else "slime/train_async.py"

def _clean_args(self, args: str) -> str:
lines: list[str] = []
for raw_line in args.splitlines():
line = raw_line.split("#", 1)[0].strip()
if line:
lines.append(line)
return " ".join(lines)

def generate_train_args(
self,
hf_model_path: str,
checkpoints_path: Path,
data_path: Path,
) -> str:
base_args = f"--hf-checkpoint {hf_model_path} --ref-load {hf_model_path}"

cleaned = self._clean_args(self.slime_args)
cleaned = cleaned.replace("{data_path}", str(data_path))
cleaned = cleaned.replace("{checkpoints_path}", str(checkpoints_path))

extra = " ".join(self.extra_args) if self.extra_args else ""
return f"{base_args} {cleaned} {extra}".strip()


QWEN3_8B_MODEL_ARGS = """
--num-layers 36 --hidden-size 4096 --ffn-hidden-size 12288
--num-attention-heads 32 --group-query-attention --num-query-groups 8
--kv-channels 128 --vocab-size 151936
--normalization RMSNorm --norm-epsilon 1e-6 --swiglu
--disable-bias-linear --qk-layernorm
--use-rotary-position-embeddings --rotary-base 1000000
"""

DEFAULT_TRAINING_ARGS = """
--tensor-model-parallel-size 2 --sequence-parallel
--recompute-granularity full --recompute-method uniform --recompute-num-layers 1
--use-dynamic-batch-size --max-tokens-per-gpu 9216
--megatron-to-hf-mode bridge
--attention-dropout 0.0 --hidden-dropout 0.0
--accumulate-allreduce-grads-in-fp32 --attention-softmax-in-fp32
"""

DEFAULT_OPTIMIZER_ARGS = """
--optimizer adam
--lr 1e-6 --lr-decay-style constant
--weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.98
"""

DEFAULT_GRPO_ARGS = """
--advantage-estimator grpo
--use-kl-loss --kl-loss-coef 0.00 --kl-loss-type low_var_kl
--entropy-coef 0.00
--eps-clip 0.2 --eps-clip-high 0.28
"""
68 changes: 68 additions & 0 deletions slime/code_golf_harbor_modal/configs/qwen_8b_multi.py
@@ -0,0 +1,68 @@
from __future__ import annotations

from .base import (
DEFAULT_GRPO_ARGS,
DEFAULT_OPTIMIZER_ARGS,
DEFAULT_TRAINING_ARGS,
QWEN3_8B_MODEL_ARGS,
RLConfig,
)


def get_config() -> RLConfig:
return RLConfig(
model_name="Qwen3-8B",
model_id="Qwen/Qwen3-8B",
app_name="slime-qwen8b-code-golf",
n_nodes=4,
gpu="H100:8",
sync=True,
wandb_project="slime-code-golf",
wandb_run_name_prefix="qwen8b-mbpp-harbor",
slime_args=f"""
# Model
{QWEN3_8B_MODEL_ARGS}

# Training + optimizer + GRPO
{DEFAULT_TRAINING_ARGS}
{DEFAULT_OPTIMIZER_ARGS}
{DEFAULT_GRPO_ARGS}

# Dataset format
--input-key messages
--label-key label
--apply-chat-template
--prompt-data {{data_path}}/mbpp_harbor/slime/train.parquet
--eval-prompt-data mbpp {{data_path}}/mbpp_harbor/slime/test.parquet

# Rollout / batching
--num-rollout 2000
--rollout-batch-size 128
--n-samples-per-prompt 8
--global-batch-size 1024
--rollout-max-response-len 1024
--rollout-temperature 0.9
--eval-max-response-len 1024
--n-samples-per-eval-prompt 8

# Custom reward model (Harbor + Modal sandbox scoring)
--rm-type math
--custom-rm-path custom_rm.batched_custom_rm

# SGLang rollout engines
--rollout-num-gpus-per-engine 2
--sglang-mem-fraction-static 0.7

# Distributed orchestration
--actor-num-nodes 4
--actor-num-gpus-per-node 8
--colocate

# Eval cadence
--eval-interval 20
--eval-top-p 1

# Save checkpoints to volume
--save {{checkpoints_path}}/qwen8b_code_golf
""",
)