Commit 04afb21

docs(building_graders): add training reward models guide and update integrations
- Replace training_grpo.md with comprehensive training_reward_models.md
- Add LangSmith integration to mkdocs navigation
- Update overview.md links to new training documentation
- Refactor langfuse.md and langsmith.md integration docs
- Minor fix in sft/README.md
1 parent a784b0f commit 04afb21

7 files changed

+768
-591
lines changed

cookbooks/training_judge_model/sft/README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ SFT training is the **foundation for building judge models**. It works with conv
 
 The model learns to minimize cross-entropy loss:
 
-$$\mathcal{L} = -\sum_{t} \log P(y_t | y_{<t}, x)$$
+$$\mathcal{L} = -\sum_{t} \log P(y_t \mid y_{\lt t}, x)$$
 
 Where $x$ is the input and $y$ is the target response.
 
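
For reviewers who want to sanity-check the formula in this hunk, the loss $-\sum_{t} \log P(y_t \mid y_{\lt t}, x)$ can be sketched in plain Python. This is a minimal illustration only, not the cookbook's training code: the helper names `log_softmax` and `sft_loss` are hypothetical, and the per-step logits are assumed to have already been produced by a model conditioned on $x$ and $y_{\lt t}$.

```python
import math


def log_softmax(logits):
    """Convert one step's raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(step_logits, target_ids):
    """Sum of negative log-likelihoods of the target tokens.

    step_logits[t] holds the logits over the vocabulary at position t
    (hypothetical input: already conditioned on x and y_<t).
    target_ids[t] is the index of the reference token y_t.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total
```

With uniform logits over a four-token vocabulary, every target token has probability 1/4, so a three-token response costs `3 * math.log(4)`. Training frameworks usually average this sum over tokens rather than reporting it raw.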

docs/building_graders/overview.md

Lines changed: 2 additions & 2 deletions
@@ -79,15 +79,15 @@ Automatically analyze evaluation data to create structured scoring rubrics. Prov
 
 Train neural networks on preference data to learn evaluation criteria automatically. Supports Bradley-Terry (preference pairs), Generative Pointwise (absolute scores), and Generative Pairwise (comparison decisions). Requires 1K-100K examples and 1-3 days but delivers highly consistent evaluation at 10x lower per-query cost—ideal for high-volume scenarios exceeding 1M queries per month.
 
-**Learn more:** [Train with GRPO](training_grpo.md) | [Bradley-Terry Training](https://github.com/modelscope/OpenJudge/tree/main/cookbooks/training_judge_model/bradley-terry) | [SFT Training](https://github.com/modelscope/OpenJudge/tree/main/cookbooks/training_judge_model/sft)
+**Learn more:** [Train Reward Models](training_reward_models.md)
 
 
 
 ## Next Steps
 
 - [Create Custom Graders](create_custom_graders.md) — Build graders using LLM or code-based logic
 - [Generate Graders from Data](generate_graders_from_data.md) — Auto-generate rubrics from labeled data
-- [Train with GRPO](training_grpo.md) — Train generative judge models with reinforcement learning
+- [Train Reward Models](training_reward_models.md) — Train SFT, Bradley-Terry, or GRPO judge models
 - [Built-in Graders](../built_in_graders/overview.md) — Explore pre-built graders to customize
 - [Run Grading Tasks](../running_graders/run_tasks.md) — Deploy graders at scale with batch workflows
 
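
The overview text in this hunk names Bradley-Terry training on preference pairs without showing the objective. As a rough sketch, the standard Bradley-Terry pairwise loss is $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, where the two $r$ values are scalar scores for the preferred and rejected responses. The function name `bradley_terry_loss` is hypothetical, and the cookbook's actual implementation is not shown in this commit and may differ.

```python
import math


def bradley_terry_loss(chosen_score, rejected_score):
    """Standard Bradley-Terry pairwise loss: -log(sigmoid(chosen - rejected)).

    Both arguments are scalar reward-model scores (assumed inputs).
    Written as softplus(-margin) and branched on the sign of the margin
    so that math.exp never receives a large positive argument.
    """
    margin = chosen_score - rejected_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals `math.log(2)` when both responses score the same, shrinks toward zero as the preferred response pulls ahead, and grows roughly linearly with the margin when the ranking is inverted.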

docs/building_graders/training_grpo.md

Lines changed: 0 additions & 129 deletions
This file was deleted.
