GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators



🌟 Introduction

GenEnv is a novel co-training framework that simultaneously trains an Agent LLM and an Environment LLM. The key insight is that the Environment LLM learns to generate training tasks at the boundary of the Agent's capability (neither too easy nor too hard), creating an adaptive curriculum that maximizes learning efficiency.

Key Features

  • 🔄 Co-Training Loop: Agent and Environment LLMs are trained alternately, each improving the other
  • 📊 Adaptive Curriculum: Environment generates tasks calibrated to the Agent's current skill level
  • 🎯 Boundary Learning: Focus on tasks where the Agent has ~50% success rate for maximum gradient signal (see the sketch below)
  • ⚡ Built on veRL: Leverages the efficient veRL framework for distributed GRPO training
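
The sketch below illustrates the boundary-learning idea: given empirical success rates from a batch of rollouts, drop the fraction filtering_k of the easiest and hardest prompts and train on the rest. This is an illustration of the idea only, not the exact filtering code in genenv_trainer.py.

# Illustrative sketch of difficulty-aligned filtering, not the repository's exact code.
# Assumes each prompt was rolled out several times (num_generations_per_prompt) so a
# per-prompt success rate can be estimated.
from typing import Dict, List

def filter_by_difficulty(success_rates: Dict[str, float], k: float = 0.1) -> List[str]:
    """Keep prompts near the agent's capability boundary.

    success_rates: prompt id -> fraction of successful rollouts.
    k: fraction of the easiest and hardest prompts to drop (cf. genenv.filtering_k).
    """
    ranked = sorted(success_rates, key=success_rates.get)  # lowest success rate first
    n_drop = int(len(ranked) * k)
    # Drop the n_drop hardest (success ~ 0) and n_drop easiest (success ~ 1) prompts.
    return ranked[n_drop: len(ranked) - n_drop]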

🚀 Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/Gen-Verse/GenEnv.git
cd GenEnv

# Install dependencies
pip install -r requirements.txt

Dependencies

GenEnv is built on top of veRL. Please follow veRL's installation instructions first.


📋 Usage

⚠️ Important: Customization Required

This codebase provides the training framework for GenEnv. To use it for your specific task, you need to customize:

  1. Reward Function (genenv/utils/reward_functions.py)

    • Replace RewardManager.compute_reward() with your domain-specific reward logic
    • Examples provided for math reasoning, tool calling, and action-based tasks
  2. Environment Prompt Template (genenv/trainer/genenv_trainer.py)

    • Modify _generate_new_tasks() to customize how the Env LLM generates new tasks
    • Adjust the prompt template based on your task format
  3. Task Parsing (genenv/trainer/genenv_trainer.py)

    • Update the parsing logic in _generate_new_tasks() to extract tasks from Env LLM outputs (see the parsing sketch after this list)
  4. Initial Training Data (configs/genenv_config.yaml)

    • Prepare your training data in parquet format with prompts and ground truth answers
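
As referenced in item 3, the hypothetical sketch below shows one way the parsing step could look. It assumes the Env LLM is prompted to emit each new task as a JSON object on its own line with "prompt" and "ground_truth" fields; adapt it to whatever format your prompt template actually requests.

# Hypothetical parsing sketch; the output format is an assumption, not the repo's fixed contract.
import json
from typing import Any, Dict, List

def parse_env_output(text: str) -> List[Dict[str, Any]]:
    tasks = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip any explanation the Env LLM adds around the JSON
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed lines rather than crashing the training loop
        if "prompt" in obj and "ground_truth" in obj:
            tasks.append({
                "prompt": [{"role": "user", "content": obj["prompt"]}],
                "reward_model": {"ground_truth": obj["ground_truth"]},
            })
    return tasks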

Configuration

Edit configs/genenv_config.yaml:

# Key paths to customize
env_model_path: /path/to/your/env/model        # Environment LLM
actor_rollout_ref.model.path: /path/to/agent   # Agent LLM
data.train_files: /path/to/train.parquet       # Training data
data.val_files: /path/to/val.parquet           # Validation data
trainer.default_local_dir: /path/to/checkpoints

# GenEnv specific parameters
genenv:
  enable: True
  filtering_k: 0.1           # Filter top/bottom 10% of prompts
  num_generations_per_prompt: 4

Training

# Using the provided script
bash scripts/run_genenv.sh --model /path/to/model --env-model /path/to/env/model

# Or directly with Python
python -m genenv.train \
    genenv.enable=True \
    env_model_path=/path/to/env/model \
    actor_rollout_ref.model.path=/path/to/agent \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/val.parquet

πŸ“ Project Structure

GenEnv/
├── genenv/
│   ├── __init__.py
│   ├── train.py                    # Main training entry point
│   ├── trainer/
│   │   ├── __init__.py
│   │   └── genenv_trainer.py       # Core GenEnv training loop
│   └── utils/
│       ├── __init__.py
│       └── reward_functions.py     # Reward function implementations
├── configs/
│   └── genenv_config.yaml          # Training configuration
├── scripts/
│   └── run_genenv.sh               # Training launch script
├── requirements.txt
└── README.md

🔧 Reward Function Examples

Math Reasoning (Default)

def compute_reward(self, generated_text: str, ground_truth: Any) -> float:
    # Extract the \boxed{...} answer from the agent's output, compare it to the
    # gold answer, and return a binary reward.
    pred_answer = self._extract_boxed_answer(generated_text)
    gold_answer = self._get_gold_answer(ground_truth)
    return 1.0 if pred_answer == gold_answer else 0.0
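
For reference, one possible implementation of the boxed-answer extraction used above could look like the following. It assumes the agent is instructed to put its final answer in \boxed{...} and only handles answers without nested braces.

import re

def extract_boxed_answer(text: str):
    # Find all \boxed{...} occurrences (simple case: no nested braces) and take the last one.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None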

Tool Calling

from genenv.utils import ToolCallingRewardManager

reward_fn = ToolCallingRewardManager(tokenizer=tokenizer)
# Checks if <tool_call>{"name": ..., "parameters": ...}</tool_call> matches ground truth
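
A rough sketch of the kind of check described above; the actual ToolCallingRewardManager may differ in details such as argument normalization.

import json
import re

def tool_call_reward(generated_text: str, ground_truth: dict) -> float:
    # Pull the first <tool_call>...</tool_call> block out of the generation.
    match = re.search(r"<tool_call>(.*?)</tool_call>", generated_text, re.DOTALL)
    if match is None:
        return 0.0
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.0
    # Reward 1.0 only when both the tool name and its parameters match the ground truth.
    same_name = call.get("name") == ground_truth.get("name")
    same_params = call.get("parameters") == ground_truth.get("parameters")
    return 1.0 if same_name and same_params else 0.0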

Custom Domain

class MyRewardManager(RewardManager):
    def compute_reward(self, generated_text: str, ground_truth: Any) -> float:
        # Your custom reward logic here
        return score
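
As a concrete, purely illustrative example, an exact-string-match reward could be implemented like this (assuming RewardManager is imported from genenv.utils):

class ExactMatchRewardManager(RewardManager):
    def compute_reward(self, generated_text: str, ground_truth: Any) -> float:
        # Reward 1.0 when the normalized generation equals the normalized gold answer.
        pred = generated_text.strip().lower()
        gold = str(ground_truth).strip().lower()
        return 1.0 if pred == gold else 0.0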

📊 Training Data Format

Your training data should be in parquet format with at least these columns:

Column        | Description
prompt        | The task prompt (can be a string or a list of chat messages)
reward_model  | Dict containing {"ground_truth": <answer>}

Example:

import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 2 + 2?"}],
        "reward_model": {"ground_truth": "4"}
    },
    # ... more examples
]
pd.DataFrame(data).to_parquet("train.parquet")
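
An optional sanity check to confirm the file round-trips with the expected columns before launching training:

import pandas as pd

# Read the file back and verify the required columns exist.
df = pd.read_parquet("train.parquet")
assert {"prompt", "reward_model"}.issubset(df.columns), "missing required columns"
print(df.iloc[0]["prompt"], df.iloc[0]["reward_model"])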

πŸ™ Acknowledgements

This project is built upon the excellent work of:

  • veRL - Volcano Engine Reinforcement Learning for LLMs
  • vLLM - High-throughput LLM serving

We thank the authors for making their code publicly available.


📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


📖 Citation

If you find GenEnv useful for your research, please consider citing:

@misc{guo2025genenvdifficultyalignedcoevolutionllm,
      title={GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators}, 
      author={Jiacheng Guo and Ling Yang and Peter Chen and Qixin Xiao and Yinjie Wang and Xinzhe Juan and Jiahao Qiu and Ke Shen and Mengdi Wang},
      year={2025},
      eprint={2512.19682},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.19682}, 
}

Princeton AI Lab | Gen-Verse
