Qwen 8B code golf RL #57

Draft
pawalt wants to merge 1 commit into main from cursor/qwen-8b-code-golf-rl-4ece

pawalt (Member) commented Feb 24, 2026

Summary

This PR introduces a new end-to-end example that uses reinforcement learning (RL) to train a Qwen 8B model to excel at "code golf", using Slime and Harbor on Modal.

Key features include:

  • MBPP Dataset Integration: Converts the MBPP dataset to Harbor format, dynamically extracting function names for prompts.
  • Custom Code Golf Reward Model: Implements a Slime custom reward model that scores completions based on correctness (verified by Harbor in Modal Sandboxes) and code brevity.
  • Modal Sandbox Rollouts & Evals: Leverages Modal Sandboxes for isolated and highly concurrent execution of both training rollouts and evaluation tasks.
  • SGLang Serving for Evaluation: Provides an SGLang server on Modal to serve the latest checkpoint, enabling high-concurrency evaluations via a custom Harbor agent.
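
As a rough illustration of the function-name extraction mentioned in the MBPP bullet, here is a minimal sketch. It assumes the name is recovered from an MBPP-style test assertion (for example `assert remove_Occ("hello","l") == "heo"`); the helper name `extract_function_name` is hypothetical, and the PR's actual conversion code may work differently.

```python
import ast
import builtins
from typing import Optional


def extract_function_name(test_line: str) -> Optional[str]:
    """Return the first non-builtin function called by name in a test assertion.

    Hypothetical helper: assumes MBPP-style tests such as
    `assert remove_Occ("hello","l") == "heo"`.
    """
    tree = ast.parse(test_line.strip())
    # ast.walk visits nodes breadth-first, so outer calls are seen before inner ones.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)  # skips attribute calls like math.isclose
            and not hasattr(builtins, node.func.id)  # skips wrappers like set(...) or len(...)
        ):
            return node.func.id
    return None
```

Skipping builtins and attribute calls means assertions that wrap the target, such as `assert set(f(x)) == set(...)` or `assert math.isclose(f(x), ...)`, still resolve to `f`.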

This example demonstrates a complete pipeline for fine-tuning a code generation model with complex, custom reward functions and scalable evaluation infrastructure.
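
To make the "correctness plus brevity" idea concrete, here is a minimal sketch of one possible scoring rule. The formula, the `max_chars` cap, and the function name `code_golf_reward` are all assumptions for illustration; this is a plain Python function, not Slime's reward-model interface, and the PR's actual weighting may differ.

```python
def code_golf_reward(passed: bool, code: str, max_chars: int = 1000) -> float:
    """Hypothetical reward: 0 for failing code, otherwise 1 plus a brevity bonus in [0, 1].

    `passed` stands in for the verdict of the sandboxed test run;
    `max_chars` is an assumed cap beyond which no brevity bonus is given.
    """
    if not passed:
        # Incorrect solutions earn nothing, no matter how short they are.
        return 0.0
    length = min(len(code.strip()), max_chars)
    # Shorter correct solutions earn a larger bonus, linear in length.
    return 1.0 + (1.0 - length / max_chars)
```

Gating the brevity bonus on correctness keeps the policy from being rewarded for short but wrong programs.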

Checklist

  • Example is documented with comments throughout, in a Literate Programming style.
  • Example does not require third-party dependencies to be installed locally.
  • Example follows the style guide
  • Example pins its dependencies
    • Example pins container images to a stable tag, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
    • Example dependencies with version < 1 are pinned to patch version, ==0.y.z

(Modal's internal guide page for this repo is Multi-node examples guidance.)

Outside contributors

You're great! Thanks for your contribution.


Co-authored-by: Peyton Walters <pawalt@hey.com>
cursor bot commented Feb 24, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.


2 participants