feat: add Volcengine sandbox support #1160
Summary of Changes
Hello @XChen-Zero, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates Volcengine Code Sandbox (SandboxFusion) into EvalScope, providing a remote and stateless environment for code execution. This significantly reduces the dependency on local Docker or enclave-based sandboxes, offering a more flexible and scalable solution for code evaluation. The changes include a new architecture for managing diverse sandbox types, ensuring compatibility with existing benchmark configurations while expanding the range of available execution environments.
Code Review
This PR adds support for Volcengine sandbox, which is a great addition for providing a remote, stateless sandbox option. The implementation is mostly solid, introducing a new SandboxFusionClient and integrating it via the SandboxMixin. The code is well-structured to handle this new external sandbox type.
My review focuses on improving maintainability, robustness, and adherence to best practices. Key areas of feedback include:
- Replacing broad except Exception clauses with more specific ones.
- Reducing code duplication, particularly for sandbox type aliases.
- Improving the design to be more extensible by avoiding hardcoded imports of specific sandbox implementations.
- Enhancing readability and logging.
Overall, this is a good contribution. Addressing the feedback will make the code more robust and easier to maintain in the long run.
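As an illustration of the first point in the review summary, here is a minimal sketch of narrowing a broad handler around JSON parsing. The helper name parse_tool_input and the fallback key 'code' are hypothetical and are not taken from the PR.

```python
import json
from typing import Any, Dict


def parse_tool_input(raw: str) -> Dict[str, Any]:
    """Parse raw as a JSON object; otherwise wrap the raw text (assumed fallback)."""
    trimmed = raw.strip()
    try:
        parsed = json.loads(trimmed)
    except json.JSONDecodeError:
        # Only malformed JSON is expected here; any other error should propagate.
        return {'code': raw}
    # Valid JSON that is not an object (e.g. a list) is treated like plain text.
    return parsed if isinstance(parsed, dict) else {'code': raw}
```

Catching json.JSONDecodeError instead of Exception keeps unrelated failures, such as a programming error inside the try block, from being silently absorbed.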
evalscope/api/mixin/sandbox_mixin.py
Outdated
| manager_class = manager_config.get('manager_class') | ||
| if not manager_class: | ||
| if self._is_external_sandbox(): | ||
| from evalscope.sandbox.volcengine import SandboxFusionSandboxManager |
Hardcoding the import of SandboxFusionSandboxManager inside SandboxMixin creates a tight coupling and violates the Open/Closed Principle. This makes it difficult to add other external sandbox types in the future without modifying this mixin. A more extensible approach would be to use a factory pattern or a registry to map sandbox_type strings to their corresponding manager classes.
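A registry along the lines suggested here might look like the following sketch. All names in it (the registry dict, the decorator, and the demo class) are hypothetical; the real manager class lives in evalscope.sandbox.volcengine and is not reproduced here.

```python
from typing import Callable, Dict

# Hypothetical registry: lowercase sandbox_type string -> manager class.
_MANAGER_REGISTRY: Dict[str, type] = {}


def register_sandbox_manager(*names: str) -> Callable[[type], type]:
    """Class decorator that registers a manager under one or more type names."""
    def decorator(cls: type) -> type:
        for name in names:
            _MANAGER_REGISTRY[name.lower()] = cls
        return cls
    return decorator


def resolve_manager_class(sandbox_type: str) -> type:
    """Look up the manager class for a sandbox type, failing loudly if unknown."""
    try:
        return _MANAGER_REGISTRY[str(sandbox_type).lower()]
    except KeyError:
        raise ValueError(f'No sandbox manager registered for {sandbox_type!r}') from None


# Stand-in for the real manager; used only to demonstrate registration.
@register_sandbox_manager('volcengine', 'volcano', 'volc')
class DemoSandboxFusionManager:
    pass
```

With this shape, a new external backend registers itself in its own module and the mixin only calls resolve_manager_class, so the mixin does not need to change when a backend is added.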
| try: | ||
| return sandbox_type_enum(sandbox_type) | ||
| except Exception: |
evalscope/api/mixin/sandbox_mixin.py
Outdated
| sandbox_type = self._task_config.sandbox_type if self._task_config else None | ||
| if not sandbox_type: | ||
| return False | ||
| return str(sandbox_type).lower() in {'volcengine', 'volcano', 'volc'} |
The set of aliases {'volcengine', 'volcano', 'volc'} is also used in _resolve_sandbox_config on line 179 and in evalscope/config.py on line 292. Duplicating this set makes it harder to maintain and can lead to inconsistencies. Consider defining it as a constant at the module or class level and reusing it in all these places to improve maintainability.
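One way to apply this suggestion is sketched below. The constant and function names are hypothetical; only the three alias strings come from the snippet under review.

```python
from typing import Optional

# Single source of truth for the accepted spellings of the Volcengine backend.
VOLCENGINE_SANDBOX_ALIASES = frozenset({'volcengine', 'volcano', 'volc'})


def is_volcengine_sandbox(sandbox_type: Optional[object]) -> bool:
    """Return True when sandbox_type names the Volcengine backend (case-insensitive)."""
    if not sandbox_type:
        return False
    return str(sandbox_type).lower() in VOLCENGINE_SANDBOX_ALIASES
```

Each call site would then import the constant (or the helper) instead of repeating the literal set, so adding or removing an alias is a one-line change.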
| return lambda tool_name, tool_input, timeout=None: sandbox.execute( # noqa: E731 | ||
| tool_name, tool_input, timeout=timeout | ||
| ) |
Using lambda here requires a noqa to suppress a linting error (E731). While concise, this can be less readable. For better clarity and to adhere to style guidelines without needing noqa, consider using a nested function (def). This pattern is repeated for other executors in this method.
def executor(tool_name, tool_input, timeout=None):
    return sandbox.execute(tool_name, tool_input, timeout=timeout)
return executor
| parsed = json.loads(trimmed) | ||
| if isinstance(parsed, dict): | ||
| input_dict = parsed | ||
| except Exception: |
| if stderr: | ||
| merged_output += ('' if merged_output.endswith('\n') or not merged_output else '\n') + stderr |
| except Exception: | ||
| pass |
Silently swallowing all exceptions with except Exception: pass is risky, as it can hide important problems that occur during resource cleanup. At a minimum, you should log the exception as a warning so that issues can be diagnosed. Note: you will need to import and initialize a logger at the top of the file for this to work.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.warning('Failed to close sandbox client session.', exc_info=True) |
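For context, a self-contained version of this suggestion could look like the sketch below, using the standard logging module. DemoClient and its session attribute are hypothetical stand-ins for the PR's client class.

```python
import logging

logger = logging.getLogger(__name__)


class DemoClient:
    """Hypothetical client wrapper; session is any object with a close() method."""

    def __init__(self, session):
        self._session = session

    def close(self) -> None:
        try:
            self._session.close()
        except Exception:
            # Cleanup stays best-effort, but the failure is now visible in the logs.
            logger.warning('Failed to close sandbox client session.', exc_info=True)
```

Passing exc_info=True attaches the active traceback to the log record, so the exception does not need to be bound to a name in the except clause.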
Improve sandbox language compatibility and add more test results
This PR extends the Volcengine sandbox support with configurable dataset-to-sandbox language mapping, fixing compatibility issues where dataset language identifiers differ from sandbox requirements (e.g.
What’s added
Example
sandbox_type="volcengine",
use_sandbox=True,
sandbox_manager_config={
    "base_url": "http://<IP>:<PORT>",
    "dataset_language_map": {
        "r": "R",
        "d_ut": "D_ut",
        "ts": "typescript"
    }
}
Additional results
These results validate correct language dispatch and stable sandbox execution across multiple datasets and languages.
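To make the mapping concrete, here is a minimal sketch of how a dataset_language_map could be applied before dispatching code to the sandbox. The function name and the pass-through behaviour for unmapped identifiers are assumptions, not the PR's actual implementation.

```python
from typing import Mapping, Optional


def resolve_sandbox_language(dataset_language: str,
                             language_map: Optional[Mapping[str, str]] = None) -> str:
    """Translate a dataset language id into the name the sandbox expects.

    Unmapped identifiers pass through unchanged (an assumed fallback).
    """
    if not language_map:
        return dataset_language
    return language_map.get(dataset_language, dataset_language)


# The mapping from the configuration example in this comment.
dataset_language_map = {'r': 'R', 'd_ut': 'D_ut', 'ts': 'typescript'}
```

For instance, resolve_sandbox_language('ts', dataset_language_map) yields 'typescript', while an identifier with no entry, such as 'go', is returned as-is.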
Yunnglin
left a comment
LGTM! Thanks for the contribution. I've updated sandbox_mixin code and documentation. Merging now.
If possible, we also welcome PRs to ms-enclave to natively support Volcengine's sandbox. It would be more reasonable for ms-enclave to support more backends.
Add Volcengine (SandboxFusion) sandbox backend
This PR adds support for Volcengine Code Sandbox (SandboxFusion) as a new EvalScope sandbox backend.
- … sandbox_type="volcengine".
- … /run_code HTTP API.
- … (python_executor, shell_executor) without changing benchmark logic.
- … sandbox_type.
This provides a remote, stateless sandbox option and removes the dependency on local Docker or enclave-based sandboxes for code evaluation.
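As a rough illustration of what a stateless call to the /run_code endpoint involves, the sketch below builds and sends a request using only the standard library. The payload fields code and language follow the public SandboxFusion examples as I understand them and should be treated as an assumption; the function names are hypothetical and this is not the PR's SandboxFusionClient.

```python
import json
import urllib.request
from typing import Any, Dict, Tuple


def build_run_code_request(base_url: str, code: str,
                           language: str) -> Tuple[str, bytes, Dict[str, str]]:
    """Build the URL, JSON body and headers for one /run_code call."""
    url = base_url.rstrip('/') + '/run_code'
    # Assumed payload shape: source code plus the sandbox language name.
    body = json.dumps({'code': code, 'language': language}).encode('utf-8')
    headers = {'Content-Type': 'application/json'}
    return url, body, headers


def run_code(base_url: str, code: str, language: str,
             timeout: float = 30.0) -> Dict[str, Any]:
    """POST the request and return the decoded JSON response."""
    url, body, headers = build_run_code_request(base_url, code, language)
    request = urllib.request.Request(url, data=body, headers=headers, method='POST')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Because every call carries the full program and language, no session or container handle has to be kept between evaluations, which is what makes the backend stateless from the caller's side.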
Tested with a SandboxFusion Docker deployment on the multiple_humaneval, multiple_mbpp, and live_code_bench benchmarks.
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+====================+=================+=================+=======+=========+=========+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-cpp | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-ts | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-sh | 10 | 0.2 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-cs | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-go | 10 | 0.8 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-java | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-lua | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-js | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-php | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-pl | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rkt | 10 | 0.6 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-r | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rs | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-scala | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-swift | 10 | 0.7 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rb | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-d | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-jl | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | OVERALL | 180 | 0.6444 | - |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-cpp | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-ts | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-sh | 10 | 0.2 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-cs | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-go | 10 | 0.8 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-java | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-lua | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-js | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-php | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-pl | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rkt | 10 | 0.6 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-r | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rs | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-scala | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-swift | 10 | 0.7 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rb | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-d | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-jl | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+===============+=================+============+=======+=========+=========+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-cpp | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-ts | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-sh | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-cs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-go | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-java | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-lua | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-js | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-php | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-pl | 10 | 0.2 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rkt | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-r | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-scala | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-swift | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rb | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-d | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-jl | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | OVERALL | 180 | 0.5889 | - |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-cpp | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-ts | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-sh | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-cs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-go | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-java | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-lua | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-js | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-php | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-pl | 10 | 0.2 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rkt | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-r | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-scala | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-swift | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rb | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-d | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-jl | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
+------------------------+-----------------+-----------------+----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+=================+=================+================+=======+=========+=========+
| qwen3-4b-instruct-2507 | live_code_bench | mean_acc | release_latest | 100 | 0.67 | default |
+------------------------+-----------------+-----------------+----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | live_code_bench | mean_acc_pass@1 | release_latest | 100 | 0.67 | default |
+------------------------+-----------------+-----------------+----------------+-------+---------+---------+
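As a quick sanity check on the tables above, the OVERALL rows are consistent with a plain mean of the 18 per-subset scores (each subset has Num = 10, so weighting by sample count gives the same value). The lists below are copied from the mean_acc rows in table order; the variable names are only for this sketch.

```python
from statistics import mean

# Per-subset mean_acc scores in table order (cpp, ts, sh, cs, go, java, lua, js,
# php, pl, rkt, r, rs, scala, swift, rb, d, jl).
humaneval_scores = [1, 0, 0.2, 0.5, 0.8, 0.9, 1, 1, 1, 0.5, 0.6, 0, 1, 0.5, 0.7, 1, 0, 0.9]
mbpp_scores = [0.9, 0, 0.7, 0.6, 0.7, 0.9, 0.6, 1, 1, 0.2, 0.9, 0, 0.6, 0.7, 0.6, 0.6, 0, 0.6]

humaneval_overall = round(mean(humaneval_scores), 4)
mbpp_overall = round(mean(mbpp_scores), 4)

print(humaneval_overall)  # 0.6444
print(mbpp_overall)       # 0.5889
```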