MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering
Official implementation of MEnvAgent, an automated framework for constructing executable environments across 10 programming languages to enable scalable generation of verifiable software engineering data.
- Stay tuned for future updates.
- [Jan. 2026]: We release MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, comprising 3,005 task instances from 942 repositories across 10 programming languages, with 3,872 agent trajectories.
- [Jan. 2026]: MEnvBench is now available - an execution-based environment construction benchmark with rigorous quality assurance, covering 1,000 tasks across 10 programming languages from diverse domains.
- Multi-Language Support: Automated environment construction for 10 mainstream programming languages (Python, Java, TypeScript, JavaScript, Rust, Go, C++, Ruby, PHP, C)
- Multi-Agent Architecture: Planning-Execution-Verification closed-loop system with specialized agents for different tasks
- Environment Reuse Mechanism: Novel approach that reduces overhead by incrementally patching retrieved environments instead of building from scratch
- Docker/Kubernetes Backend: Flexible container orchestration supporting both Docker and Kubernetes (K8s backend enables thousand-scale concurrent environment construction)
- Fail2Pass Validation: Rigorous validation ensuring tests fail before fix and pass after applying patches
- High Performance: Achieves 8.6% improvement in F2P rates while reducing time costs by 43% compared to baselines
MEnvAgent employs a multi-agent system structured into three iterative stages:
Overview of MEnvAgent: (Top) Environment Reuse Mechanism retrieves and adapts historical environments. (Bottom) Planning-Execution-Verification loop with autonomous agents.
- Repository Analysis Agent: Explores repository structure and generates comprehensive summary
- Environment Setup Agent: Determines suitable base image and generates installation scripts
- Test Configuration Agent: Synthesizes compatible test configuration scripts
- Environment Execution Agent: Instantiates containers and executes build commands with real-time monitoring
- Dynamic error resolution and automatic retry mechanisms
- Verification Agent: Executes tests and performs error attribution
- Iterative refinement loop with diagnostic feedback to planning stage
Instead of building environments from scratch, MEnvAgent:
- Retrieves similar historical environments from the Environment Pool
- Uses EnvPatchAgent to generate incremental patches
- Applies patches to adapt the retrieved environment to target requirements
- Significantly reduces computational overhead and improves success rates
- Python 3.10+
- Docker (version 27.0.3+ recommended)
- Kubernetes (optional, for K8s backend)
- Git
MEnvBench is a comprehensive benchmark for evaluating multi-language environment building and test execution, comprising 1,000 tasks (10 languages Γ 20 repositories Γ 5 instances) selected from 200 high-quality open-source repositories.
- Quality Assurance: Multi-stage filtering pipeline yielding 8,000 high-quality candidate repositories (>1,000 stars, >200 forks/issues/PRs) and 213,766 Issue-PR pairs after initial quality screening, followed by strict criteria including closed issues with test patches and LLM-based quality assessment.
- Repository Selection: Strategic sampling considering domain diversity (AI, System, Web, etc.) and difficulty levels (five project scale bands from <10MB to >500MB) to ensure broad coverage across diverse software ecosystems.
We evaluate performance across two critical dimensions: success rate and efficiency. As shown in the visualizations below, MEnvAgent consistently occupies the top-left corner, achieving the best balance of being both faster and more accurate ("better & faster").
![]() |
![]() |
|---|---|
| Fail-to-Pass (F2P) vs. Time | Pass Rate vs. Time |
Compared to state-of-the-art baselines, MEnvAgent delivers:
- π 8.6% Increase in Fail-to-Pass (F2P) rates.
- β±οΈ 43% Reduction in total time costs.
- π Polyglot Stability: Consistent performance gains across all 10 supported programming languages.
| Field | Type | Description |
|---|---|---|
| repo | str |
The full GitHub repository name (e.g., "home-assistant/core"). |
| pull_number | int |
The pull request number associated with the fix (e.g., 807). |
| instance_id | str |
A unique identifier for the task instance (e.g., "home-assistant__core-807"). |
| issue_numbers | list |
A list of linked issue numbers (e.g., [103876]). |
| base_commit | str |
The commit SHA of the repository prior to the fix. |
| version | str |
The version of the dataset (e.g., "0.10"). |
| patch | str |
The ground-truth patch (git diff) that resolves the issue. |
| test_patch | str |
The test patch (git diff) containing new tests to reproduce the issue. |
| problem_statement | str |
The natural language description of the issue. |
| hints_text | str |
Hints extracted from the issue discussion to aid resolution. |
| all_hints_text | str |
Comprehensive context including all comments and code review details. |
| commit_urls | list |
A list of URLs pointing to the relevant commits. |
| created_at | str |
The creation timestamp (e.g., "2015-12-27T19:33:55Z"). |
| language | str |
The programming language of the repository (e.g., "Python"). |
We release MEnvData, a high-quality polyglot SWE dataset comprising 3,005 task instances from 942 repositories across 10 programming languages:
- 3,005 Docker Images: Pre-built environment images with verified dependencies.
- 3,872 Agent Trajectories: Complete execution trajectories for training and fine-tuning, collected using OpenHands agent scaffolding with Claude Sonnet 4.5 model.
- Universal Format: Fully compatible with SWE-factory format.
- Efficient Verification: Validation is based on concise exit codes (Success/Failure), eliminating the need for complex
log_parserlogic. - Verified Quality: Each instance includes a fully executable environment with pre-verified test cases.
π€ Hugging Face Dataset: https://huggingface.co/datasets/ernie-research/MEnvData-SWE
# Load the dataset using Hugging Face
from datasets import load_dataset
# MEnvData-SWE
dataset = load_dataset("ernie-research/MEnvData-SWE")
# MEnvData-SWE-Trajectory
dataset = load_dataset("ernie-research/MEnvData-SWE-Trajectory")MEnvData-SWE extends the standard schema with executable environment configurations. Each instance contains the following fields:
| Field | Type | Description |
|---|---|---|
| repo | str |
The full GitHub repository name (e.g., "home-assistant/core"). |
| pull_number | int |
The pull request number associated with the fix. |
| instance_id | str |
A unique identifier for the task instance. |
| issue_numbers | list |
A list of linked issue numbers. |
| base_commit | str |
The commit SHA of the repository prior to the fix. |
| version | str |
The version of the dataset. |
| patch | str |
The ground-truth patch (git diff) that resolves the issue. |
| test_patch | str |
The test patch (git diff) containing new tests to reproduce the issue. |
| problem_statement | str |
The natural language description of the issue. |
| hints_text | str |
Hints extracted from the issue discussion. |
| all_hints_text | str |
Comprehensive context including all comments and reviews. |
| commit_urls | list |
A list of URLs pointing to the relevant commits. |
| created_at | str |
The creation timestamp (e.g., "2015-12-27T19:33:55Z"). |
| language | str |
The programming language (e.g., "Python"). |
| env_setup_script | str |
Incremental bash commands used to configure the environment (for reuse scenarios). |
| original_env_setup_script | str |
The foundational setup commands. Represents the reused base image's setup or the full build script if built from scratch. |
| eval_script | str |
The complete verification script that applies the test_patch and executes the test commands. |
| image_name | str |
The specific Docker image name/tag available for this instance. |
Note: We are actively expanding MEnvData-SWE with additional repositories and task instances. Stay tuned for potential future releases.
git clone https://github.com/your-org/MEnvAgent.git
cd MEnvAgentconda create --name menvagent python=3.10 -y
conda activate menvagent
pip install -r requirements.txt# Set your LLM API configuration
export OPENAI_API_BASE_URL=<your_base_url>
export OPENAI_KEY=<your_api_key>
# (Optional) Configure proxy settings if needed
export http_proxy="http://your-proxy:port"
export https_proxy="http://your-proxy:port"
export no_proxy="localhost,127.0.0.1"If MEnvAgent helps your research or projects, please cite our paper:
@misc{guo2026menvagent,
title={MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering},
author={Chuanzhe Guo and Jingjing Wu and Sijun He and Yang Chen and Zhaoqi Kuang and Shilong Fan and Bingjin Chen and Siqi Bao and Jing Liu and Hua Wu and Qingfu Zhu and Wanxiang Che and Haifeng Wang},
year={2026},
url={https://arxiv.org/abs/2601.22859},
}MEnvAgent builds upon foundational work in software engineering and LLM agents:
- SWE-Bench-Live: A dynamic, continuously-updated benchmark for evaluating language models on real-world software tasks with fresh, uncontaminated GitHub issues
- SWE-Factory: Automated factory for GitHub Issue Resolution Training Data and Evaluation Benchmarks.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For questions, issues, or collaborations, please:
- Open an issue on GitHub
- Contact: czguo@ir.hit.edu.cn


