
Commit 791ebdc

XieLipeng0830 authored and ployts committed
feat(cookbooks): add reward model training recipes for Bradley-Terry and SFT
* feat(cookbooks): add reward model training recipes for Bradley-Terry and SFT
* rename: bradley-terry package
* fix: use relative import for BTDataset in trainer.py
* docs: use "judge model" instead of "reward model" in training cookbooks
1 parent ac0a508 commit 791ebdc

File tree

9 files changed: +1470 −0 lines changed
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# Bradley-Terry Training

Train judge models using Bradley-Terry loss on preference pairs. This approach learns to rank responses by modeling the probability that one response is preferred over another.

## Overview

Bradley-Terry training is the **simplest and most widely used** method for judge model training. It works with binary preference data (chosen vs. rejected) and optimizes the model to predict which response humans prefer.

> **Tip:** Use Bradley-Terry when you have preference pairs (e.g., from RLHF annotation), binary comparison data, or need a model that outputs scalar scores.

**Training objective:**

The model minimizes the negative log-likelihood of the observed preference:

$$\mathcal{L} = -\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$$

where $r$ is the scalar score the model assigns to a response and $\sigma$ is the sigmoid function.
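For intuition, here is a minimal PyTorch sketch of this loss and the paired accuracy it is typically monitored with. It assumes only a model that emits one scalar score per sequence; the function name and tensors are illustrative, not verl's API:

```python
import torch
import torch.nn.functional as F

def bt_loss_and_accuracy(scores_chosen: torch.Tensor, scores_rejected: torch.Tensor):
    """Bradley-Terry loss and preference accuracy over a batch of scalar scores."""
    margin = scores_chosen - scores_rejected
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    loss = -F.logsigmoid(margin).mean()
    # A pair counts as correct when the chosen response outscores the rejected one
    accuracy = (margin > 0).float().mean()
    return loss, accuracy

# Scores for four preference pairs
loss, acc = bt_loss_and_accuracy(
    torch.tensor([1.2, 0.3, 2.0, -0.5]),
    torch.tensor([0.4, 0.9, 1.1, -1.0]),
)
print(loss.item(), acc.item())  # lower loss / higher accuracy is better
```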
## Quick Start

```bash
# 1. Install dependencies
pip install verl==0.6.1

# 2. Run training
cd cookbooks/training_judge_model/bradley-terry
bash run_bt_rm.sh
```
## Dataset

We provide pre-processed datasets on HuggingFace:

| Dataset | Description | Link |
|---------|-------------|------|
| `agentscope-ai/OpenJudge` | HelpSteer2 preference pairs for BT training | [🔗 HuggingFace](https://huggingface.co/datasets/agentscope-ai/OpenJudge/tree/main/train_rm/bradley_terry) |

**Source:** [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2) preference subset.

**Processing** (conversion sketch below this list):

- Input: HelpSteer2 preference JSONL with a `preference_strength` field (range: -3 to 3)
- Filter: keep pairs with `|preference_strength| >= 1` (clear preference)
- Positive strength → `response_2` is chosen; negative → `response_1` is chosen
- Convert to chat-messages format for Instruct models
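A minimal sketch of that conversion, assuming the HelpSteer2 preference JSONL exposes `prompt`, `response_1`, `response_2`, and `preference_strength` fields (verify the exact column names against the dataset card):

```python
import json
import pandas as pd

def to_messages(prompt: str, response: str) -> str:
    """Serialize a single-turn conversation as a JSON string of chat messages."""
    return json.dumps([
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ])

rows = []
with open("helpsteer2_preference.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        strength = ex["preference_strength"]  # ranges from -3 to 3
        if abs(strength) < 1:
            continue  # drop ties and weak preferences
        # Positive strength prefers response_2; negative prefers response_1
        chosen, rejected = (
            (ex["response_2"], ex["response_1"]) if strength > 0
            else (ex["response_1"], ex["response_2"])
        )
        rows.append({
            "chosen": to_messages(ex["prompt"], chosen),
            "rejected": to_messages(ex["prompt"], rejected),
        })

pd.DataFrame(rows).to_parquet("train.parquet")
```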
## Data Format

Bradley-Terry training expects Parquet files with two columns:

| Column | Type | Description |
|--------|------|-------------|
| `chosen` | string | JSON string of a messages list (preferred response) |
| `rejected` | string | JSON string of a messages list (rejected response) |

**Example data structure:**

```python
import json
import pandas as pd

# Messages format (compatible with tokenizer.apply_chat_template)
chosen = json.dumps([
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Regular exercise improves cardiovascular health, boosts mood, and increases energy levels."}
])
rejected = json.dumps([
    {"role": "user", "content": "What are the benefits of exercise?"},
    {"role": "assistant", "content": "Exercise is good for you."}
])

df = pd.DataFrame({"chosen": [chosen], "rejected": [rejected]})
df.to_parquet("train.parquet")
```

> **Note:** Multi-turn conversations are supported. Include all turns in the messages list.
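To sanity-check a Parquet file before training, you can round-trip one row through the same chat template the dataset loader applies. The tokenizer name below is illustrative; any chat model's tokenizer works:

```python
import json
import pandas as pd
from transformers import AutoTokenizer

df = pd.read_parquet("train.parquet")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = json.loads(df.iloc[0]["chosen"])  # parse the JSON string back to a list
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(text)  # the exact string the judge model will be trained on
```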
## Configuration

### Training Script (`run_bt_rm.sh`)

Key parameters to customize:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `MODEL_PATH` | Base model for initialization | `qwen3-32b` |
| `TRAIN_FILE` | Training data path | Parquet file |
| `VAL_FILE` | Validation data path | Parquet file |
| `TRAIN_BATCH_SIZE` | Global batch size | 256 |
| `MICRO_BATCH_SIZE` | Per-GPU micro batch size | 1 |
| `LR` | Learning rate | 5e-7 |
| `TOTAL_EPOCHS` | Training epochs | 3 |

### Hydra Config (`trainer.yaml`)

**Data:**

```yaml
data:
  train_batch_size: 256          # Global batch size (distributed across GPUs)
  micro_batch_size_per_gpu: 1    # Per-GPU micro batch size
  max_length: 4096               # Maximum sequence length
  truncation: left               # Truncation: left/right/error
```

**Model:**

```yaml
model:
  partial_pretrain: qwen3-32b            # Base model path
  strategy: fsdp2                        # fsdp or fsdp2
  enable_gradient_checkpointing: true    # Save memory
```

**Optimizer:**

```yaml
optim:
  lr: 5e-7                   # Learning rate
  weight_decay: 0.001        # Weight decay
  warmup_steps_ratio: 0.03   # Warmup steps ratio
  clip_grad: 2.0             # Gradient clipping
  lr_scheduler: cosine       # Scheduler: cosine/wsd/constant
```
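A common point of confusion is how the batch parameters interact: the global batch is sharded across GPUs, and each GPU accumulates gradients over micro batches. A quick sanity check, assuming the standard `global = n_gpus × micro × accumulation` relationship (verl's trainer computes this internally; the helper below is only illustrative):

```python
def grad_accum_steps(train_batch_size: int, n_gpus: int, micro_batch_size_per_gpu: int) -> int:
    """Gradient accumulation steps implied by the batch settings."""
    per_gpu_batch = train_batch_size // n_gpus
    assert per_gpu_batch % micro_batch_size_per_gpu == 0, "batch sizes must divide evenly"
    return per_gpu_batch // micro_batch_size_per_gpu

# With the defaults above on a single 8-GPU node:
print(grad_accum_steps(256, 8, 1))  # 32 micro steps per optimizer step
```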
## Monitoring Training

### Metrics

| Metric | Description |
|--------|-------------|
| `train/loss` | Bradley-Terry loss |
| `train/accuracy` | Preference prediction accuracy (chosen scored above rejected) |
| `train/lr(1e-3)` | Current learning rate (×1e3) |
| `val/loss` | Validation loss |
| `val/accuracy` | Validation accuracy |

### Train/Loss Curve

![BT Training Curve](./bt_train.png)
## Troubleshooting

### OOM (Out of Memory)

- Reduce `micro_batch_size_per_gpu`
- Enable `enable_gradient_checkpointing`
- Reduce `max_length`
- Use the `fsdp2` strategy

### Unstable Training / Loss Explosion

- Lower the learning rate
- Tighten gradient clipping (lower `clip_grad`)
- Check data quality

### Accuracy Not Improving

- Verify data labeling quality
- Check the chosen/rejected mapping
- Increase the learning rate
- Train for more epochs
## Next Steps

- [SFT for Judge Models](../sft/README.md) — Warm-start the judge with supervised fine-tuning
Binary image file added (94.4 KB)
Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
# -*- coding: utf-8 -*-
"""
Bradley-Terry Dataset for Judge Model Training
- Loads preference data from parquet files
- Each sample contains chosen and rejected responses
- Returns data in format suitable for Bradley-Terry loss
- Uses chat template for Instruct models
"""

import json
from typing import Any, Dict, List, Union

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer
from verl.utils import hf_tokenizer
from verl.utils.fs import copy_to_local


class BTDataset(Dataset):
    """
    Bradley-Terry Dataset for preference learning

    Expected data format in parquet:
    - chosen: messages list [{"role": "user", "content": "xxx"}, {"role": "assistant", "content": "yyy"}]
    - rejected: messages list [{"role": "user", "content": "xxx"}, {"role": "assistant", "content": "yyy"}]

    Data is processed with tokenizer.apply_chat_template() for Instruct models.
    """

    def __init__(
        self,
        parquet_files: Union[str, List[str]],
        tokenizer: Union[str, PreTrainedTokenizer],
        config: Dict[str, Any],
    ) -> None:
        self.max_length = config.get("max_length", 4096)
        self.truncation = config.get("truncation", "left")
        self.use_shm = config.get("use_shm", False)

        # Keys for data columns
        self.chosen_key = config.get("chosen_key", "chosen")
        self.rejected_key = config.get("rejected_key", "rejected")

        assert self.truncation in ["error", "left", "right"]

        if not isinstance(parquet_files, list):
            parquet_files = [parquet_files]

        self.parquet_files = parquet_files
        if isinstance(tokenizer, str):
            tokenizer = hf_tokenizer(tokenizer)
        self.tokenizer: PreTrainedTokenizer = tokenizer

        self._download()
        self._read_files_and_process()

    def _download(self) -> None:
        """Download parquet files to local if needed"""
        for i, parquet_file in enumerate(self.parquet_files):
            self.parquet_files[i] = copy_to_local(parquet_file, verbose=True)

    def _read_files_and_process(self) -> None:
        """Read and concatenate all parquet files"""
        dataframes = []
        for parquet_file in self.parquet_files:
            dataframe = pd.read_parquet(parquet_file)
            dataframes.append(dataframe)

        self.dataframe = pd.concat(dataframes, ignore_index=True)

        # Extract chosen and rejected fields (JSON string format, parse to messages list)
        self.chosen_messages = [json.loads(msg) for msg in self.dataframe[self.chosen_key].tolist()]
        self.rejected_messages = [json.loads(msg) for msg in self.dataframe[self.rejected_key].tolist()]

        print(
            f"Loaded {len(self.chosen_messages)} preference pairs from {len(self.parquet_files)} files",
        )

    def __len__(self) -> int:
        return len(self.chosen_messages)

    def _apply_chat_template(self, messages: List[Dict[str, str]]) -> str:
        """
        Apply chat template to convert messages to model-expected format.

        Args:
            messages: List of message dicts [{"role": "user", "content": "..."}, ...]
        """
        formatted = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        # Remove BOS token if present
        if self.tokenizer.bos_token and formatted.startswith(self.tokenizer.bos_token):
            formatted = formatted[len(self.tokenizer.bos_token) :]
        return formatted

    def _tokenize_messages(self, messages: List[Dict[str, str]]) -> Dict[str, torch.Tensor]:
        """Tokenize messages and handle truncation/padding to fixed length"""
        # Apply chat template
        text = self._apply_chat_template(messages)

        # Tokenize
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            return_tensors="pt",
            padding=False,
            truncation=False,
        )

        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)

        sequence_length = input_ids.shape[0]

        # Handle sequence length like SFT dataset
        if sequence_length < self.max_length:
            # Pad sequences
            pad_token_id = self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else 0
            padded_input_ids = (
                torch.ones(
                    size=(self.max_length - sequence_length,),
                    dtype=input_ids.dtype,
                )
                * pad_token_id
            )
            padded_attention_mask = torch.zeros(
                size=(self.max_length - sequence_length,),
                dtype=attention_mask.dtype,
            )

            input_ids = torch.cat((input_ids, padded_input_ids))
            attention_mask = torch.cat((attention_mask, padded_attention_mask))
        elif sequence_length > self.max_length:
            if self.truncation == "left":
                # Keep the end of the conversation (including conclusion)
                input_ids = input_ids[-self.max_length :]
                attention_mask = attention_mask[-self.max_length :]
            elif self.truncation == "right":
                input_ids = input_ids[: self.max_length]
                attention_mask = attention_mask[: self.max_length]
            elif self.truncation == "error":
                raise ValueError(
                    f"Sequence length {sequence_length} > max_length {self.max_length}",
                )

        return {"input_ids": input_ids, "attention_mask": attention_mask}

    def __getitem__(self, item: int) -> Dict[str, Any]:
        """
        Get a preference pair

        Returns:
            dict with keys:
            - input_ids_j: chosen response tokens
            - attention_mask_j: chosen response attention mask
            - input_ids_k: rejected response tokens
            - attention_mask_k: rejected response attention mask
        """
        chosen_messages = self.chosen_messages[item]
        rejected_messages = self.rejected_messages[item]

        # Tokenize both responses
        chosen_tokens = self._tokenize_messages(chosen_messages)
        rejected_tokens = self._tokenize_messages(rejected_messages)

        return {
            "input_ids_j": chosen_tokens["input_ids"],
            "attention_mask_j": chosen_tokens["attention_mask"],
            "input_ids_k": rejected_tokens["input_ids"],
            "attention_mask_k": rejected_tokens["attention_mask"],
        }
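# ---------------------------------------------------------------------------
# Usage sketch (not part of the committed file). How BTDataset might be wired
# into a DataLoader; the tokenizer path and batch size are illustrative.
#
# from torch.utils.data import DataLoader
#
# dataset = BTDataset(
#     parquet_files="train.parquet",
#     tokenizer="Qwen/Qwen3-32B",
#     config={"max_length": 4096, "truncation": "left"},
# )
# loader = DataLoader(dataset, batch_size=2, shuffle=True)
# batch = next(iter(loader))
# # Fixed-length chosen ("_j") and rejected ("_k") tensors, ready for the
# # Bradley-Terry loss: batch["input_ids_j"].shape == (2, 4096)
# ---------------------------------------------------------------------------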
