
Diverse Group Relative Policy Optimization (DGRPO): Upweighting Rare, Accurate Solutions for STEM and Art

Language models trained with reinforcement learning often converge to common solution patterns, limiting their creative problem-solving capabilities. This paper introduces Diverse Group Relative Policy Optimization (DGRPO), an extension of Group Relative Policy Optimization (GRPO) that specifically addresses this limitation. While GRPO normalizes rewards within groups of responses to promote accuracy, DGRPO incorporates solution diversity into the advantage calculation through two novel approaches: (1) upweighting less likely but correct tokens to incentivize rare solutions, and (2) quantifying solution uniqueness using cosine similarity of neural embeddings. By introducing a configurable diversity weight parameter, DGRPO allows practitioners to balance accuracy with exploration of diverse solution strategies. My approach encourages language models to discover multiple valid approaches to problems, a critical capability for applications in scientific discovery, mathematical problem-solving, creative coding, and art. DGRPO demonstrates how reinforcement learning can be adapted to reward not just correctness but also the novelty and diversity of solutions in generative AI systems.
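The exact objective and hyperparameters are described in the paper and implemented in the notebook. As a rough, non-authoritative sketch of approach (1), the snippet below shows how a group-normalized GRPO advantage could be combined with a rarity bonus for correct responses; the function name `dgrpo_advantages`, the default `diversity_weight`, and the use of mean token log-probability as the rarity signal are illustrative assumptions, not the repository's exact code.

```python
import numpy as np

def dgrpo_advantages(rewards, token_logprobs, is_correct, diversity_weight=0.5):
    """Illustrative sketch (not the repository's exact code) of a DGRPO-style
    advantage for one group of sampled responses to the same prompt.

    rewards          : per-response scalar rewards (e.g. 1.0 if the answer is correct)
    token_logprobs   : one array of per-token log-probabilities per response
    is_correct       : boolean mask marking which responses solved the task
    diversity_weight : assumed knob balancing accuracy against solution rarity
    """
    rewards = np.asarray(rewards, dtype=float)

    # GRPO baseline: normalize rewards within the group of responses.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Diversity term (approach 1): responses whose tokens were unlikely under
    # the current policy get a bonus, but only when they are correct, so rare
    # yet valid solutions are incentivized rather than random noise.
    rarity = np.array([-lp.mean() for lp in token_logprobs])       # higher = less likely
    rarity = (rarity - rarity.mean()) / (rarity.std() + 1e-8)      # normalize within group
    advantages += diversity_weight * rarity * np.asarray(is_correct, dtype=float)
    return advantages

# Toy example: four sampled responses, three correct; the second correct response
# used unusually low-probability tokens and therefore receives the largest advantage.
adv = dgrpo_advantages(
    rewards=[1.0, 1.0, 0.0, 1.0],
    token_logprobs=[np.array([-0.2, -0.1]), np.array([-1.8, -2.3]),
                    np.array([-0.3, -0.4]), np.array([-0.9, -1.1])],
    is_correct=[True, True, False, True],
)
```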

Read the Diverse_Group_Relative_Policy_Optimization_(DGRPO).pdf paper for more details.

Usage

Run Diverse_Group_Relative_Policy_Optimization_(DGRPO).ipynb locally or in Google Colab.

Disclaimer

Work in progress

Summary

The project includes:

  • Implementations of the GRPO and DGRPO training algorithms, including reward functions (a minimal illustrative reward sketch follows this list)
  • Training examples on math reasoning tasks (GSM8K)
  • Comparative model evaluation tools
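
For illustration only, here is a minimal sketch of the kind of correctness reward used on GSM8K-style answers; the function name and the answer-extraction rule (take the last number in the completion) are assumptions and may differ from the notebook's reward functions.

```python
import re

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Illustrative GSM8K-style reward: 1.0 if the last number in the model's
    completion matches the reference answer, else 0.0. The extraction rule is
    an assumption; the notebook's actual reward functions may differ."""
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1].rstrip(".") == gold_answer.strip() else 0.0

# Example
print(correctness_reward("Adding 4 + 14 gives 18. The answer is 18", "18"))  # 1.0
```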

To do

  • Fix saving to Google Drive
  • Double check code for errors
  • Fix broken plotting
  • Fix broken testing
  • Dedupe and refactor code
  • Benchmark on GSM8K and other datasets, comparing the original model, the model trained with GRPO, and the model trained with DGRPO
  • Implement the cosine similarity method in code and benchmark it (a rough sketch of the idea follows this list)
  • Train bigger models
  • Fix broken formatting in paper
  • Add citation information
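
Because the cosine similarity method is still on the to-do list, the following is only a sketch of how approach (2) could quantify solution uniqueness: embed each response, compute its average cosine similarity to the other responses in the group, and treat low similarity as high uniqueness that can then be scaled by the diversity weight. The function name, the toy embeddings, and the suggestion of a sentence-embedding model are assumptions for illustration.

```python
import numpy as np

def uniqueness_scores(embeddings: np.ndarray) -> np.ndarray:
    """Illustrative sketch of approach (2): for each response embedding,
    uniqueness = 1 - mean cosine similarity to the other responses in the group.
    A response unlike its peers scores close to 1; a near-duplicate scores near 0."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                               # pairwise cosine similarities
    n = len(embeddings)
    mean_sim_to_others = (sims.sum(axis=1) - 1.0) / (n - 1)  # drop self-similarity (= 1)
    return 1.0 - mean_sim_to_others

# Toy 3-dimensional "embeddings"; in practice these could come from a
# sentence-embedding model (an assumption) applied to the model's solutions
# before adding diversity_weight * uniqueness to the advantage.
emb = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 1.0, 0.0]])
print(uniqueness_scores(emb))
```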
