@Jack-Khuu Jack-Khuu commented Sep 19, 2025

Update (10/10): This is a much-simplified version of the earlier iterations of this PR. It provides only a basic example of an LLM judge for GRPO.

Judges can be used as either "Verifiers" or "Graders". This PR adds a CorrectnessJudge example to the sandbox, showing how an LLM judge can be used in GRPO (note that this PR does not integrate the judge into GRPO).

It takes as input a (prompt, response) pair generated by a model and returns whether the judge model thinks the response accurately answers the prompt. The results can then be used to make decisions during GRPO whitening.
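As a rough illustration of the flow described above, here is a minimal sketch of how judge verdicts for a group of responses could be turned into whitened rewards. The function names, the prompt template, and the 0/1 reward mapping are assumptions made for illustration; they are not the API added by this PR.

```python
import statistics
from typing import List


def build_judge_prompt(prompt: str, response: str) -> str:
    """Hypothetical template asking a judge model to grade one response."""
    return (
        "You are grading a model's answer.\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def verdicts_to_rewards(verdicts: List[bool]) -> List[float]:
    """Assumed mapping: a 'correct' verdict earns 1.0, anything else 0.0."""
    return [1.0 if v else 0.0 for v in verdicts]


def whiten(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verdicts for the responses ['Aardvark', 'Durian', 'Tokyo'].
advantages = whiten(verdicts_to_rewards([False, False, True]))
```

With these hypothetical verdicts, the response judged correct gets a positive whitened value and the other two get negative values. If the judge gives every response in the group the same verdict, all whitened values are zero, so that group carries no learning signal.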


python -m tests.sandbox.vllm.judge --config tests/sandbox/vllm/qwen3_4b.yaml

@meta-cla meta-cla bot added the CLA Signed label Sep 19, 2025

@allenwang28 allenwang28 left a comment

Prompt: What is the capital of Japan?
Responses: ['Aardvark', 'Durian', 'Tokyo']

Generation Results:
================================================================================
Sample 1
Evaluation: 3
--------------------------------------------------------------------------------
Sample 2
Evaluation: 3
--------------------------------------------------------------------------------
Sample 3
Evaluation: 3
--------------------------------------------------------------------------------
Sample 4
Evaluation: 3
--------------------------------------------------------------------------------

lol is this working correctly?

Jack-Khuu commented Sep 23, 2025

> lol is this working correctly?

I wrote this prompt from the deep archives of my mind and I'm also shocked that the prompting worked.

@Jack-Khuu Jack-Khuu changed the title Creates GenerativeJudge as an interface for LLM Judges [WIP] Creates GenerativeJudge as an interface for LLM Judges Sep 24, 2025
@Jack-Khuu Jack-Khuu changed the title [WIP] Creates GenerativeJudge as an interface for LLM Judges [WIP] Creates Judges as a wrapper on Policy Oct 4, 2025
@Jack-Khuu Jack-Khuu changed the title [WIP] Creates Judges as a wrapper on Policy Creates Judge Example as a wrapper on Policy Oct 11, 2025
@Jack-Khuu

Update (10/10): This is a much-simplified version of the earlier iterations of this PR. It provides only a basic example of an LLM judge for GRPO.

Future work that leverages structured decoding is out of scope here; it requires additional investigation into how to configure it with CoT models (which need to think).
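Until structured decoding is sorted out, one possible stopgap is to let the judge think freely and parse the verdict from the raw text afterwards. The sketch below assumes the judge wraps its reasoning in `<think>...</think>` tags and ends with the word "correct" or "incorrect"; both the output format and the `extract_verdict` helper are assumptions for illustration, not something this PR implements.

```python
import re
from typing import Optional

# Assumed format: the reasoning is enclosed in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
VERDICT = re.compile(r"\b(correct|incorrect)\b")


def extract_verdict(raw_output: str) -> Optional[bool]:
    """Return True/False for a parsed verdict, or None if none is found."""
    # Drop the reasoning first so verdict words mentioned while thinking are ignored.
    answer = THINK_BLOCK.sub("", raw_output).strip().lower()
    match = VERDICT.search(answer)
    if match is None:
        return None
    return match.group(1) == "correct"
```

Returning `None` for unparseable output lets the caller decide whether to retry, skip the sample, or treat it as incorrect, rather than silently guessing.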

@Jack-Khuu Jack-Khuu requested a review from joecummings October 11, 2025 00:10