feat: add Volcengine sandbox support#1160

Merged
Yunnglin merged 6 commits into modelscope:main from XChen-Zero:feat/add_volcengine_sandbox
Jan 20, 2026

@XChen-Zero
Contributor

Add Volcengine (SandboxFusion) sandbox backend

This PR adds support for Volcengine Code Sandbox (SandboxFusion) as a new EvalScope sandbox backend.

  • Introduces sandbox_type="volcengine".
  • Executes code via SandboxFusion’s /run_code HTTP API.
  • Maps existing tools (python_executor, shell_executor) without changing benchmark logic.
  • Requires no changes to existing task or benchmark configurations beyond sandbox_type.

This provides a remote, stateless sandbox option and removes the dependency on local Docker or enclave-based sandboxes for code evaluation.
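
For orientation, a stateless request to the /run_code endpoint might look roughly like the sketch below. It is an illustration only, not the client added in this PR: the helper names (build_run_code_payload, summarize_run_result, run_code) are hypothetical, and the request and response fields (code, language, run_timeout, status, run_result with stdout, stderr, return_code) are assumptions about the SandboxFusion API that should be checked against its documentation.

```python
import json
import urllib.request


def build_run_code_payload(code, language="python", run_timeout=10):
    """Build the JSON body for POST /run_code (field names are assumed)."""
    return {"code": code, "language": language, "run_timeout": run_timeout}


def summarize_run_result(response):
    """Flatten an assumed /run_code response into a small dict."""
    run_result = response.get("run_result") or {}
    return {
        "ok": response.get("status") == "Success" and run_result.get("return_code") == 0,
        "stdout": run_result.get("stdout") or "",
        "stderr": run_result.get("stderr") or "",
    }


def run_code(base_url, code, language="python", timeout=30):
    """Send one execution request; no state is kept between calls."""
    body = json.dumps(build_run_code_payload(code, language)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/run_code",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return summarize_run_result(json.loads(resp.read().decode("utf-8")))
```

Because every call carries the full program and nothing persists on the server side, retries and parallel requests need no session handling, which is what makes the backend usable without a local Docker daemon.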

Tested against a SandboxFusion Docker deployment on the multiple_humaneval, multiple_mbpp, and live_code_bench benchmarks.

+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+====================+=================+=================+=======+=========+=========+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-cpp | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-ts | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-sh | 10 | 0.2 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-cs | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-go | 10 | 0.8 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-java | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-lua | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-js | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-php | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-pl | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rkt | 10 | 0.6 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-r | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rs | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-scala | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-swift | 10 | 0.7 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-rb | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-d | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | humaneval-jl | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc | OVERALL | 180 | 0.6444 | - |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-cpp | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-ts | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-sh | 10 | 0.2 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-cs | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-go | 10 | 0.8 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-java | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-lua | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-js | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-php | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-pl | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rkt | 10 | 0.6 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-r | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rs | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-scala | 10 | 0.5 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-swift | 10 | 0.7 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-rb | 10 | 1 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-d | 10 | 0 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_humaneval | mean_acc_pass@1 | humaneval-jl | 10 | 0.9 | default |
+------------------------+--------------------+-----------------+-----------------+-------+---------+---------+

+------------------------+---------------+-----------------+------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+===============+=================+============+=======+=========+=========+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-cpp | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-ts | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-sh | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-cs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-go | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-java | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-lua | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-js | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-php | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-pl | 10 | 0.2 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rkt | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-r | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-scala | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-swift | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-rb | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-d | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | mbpp-jl | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc | OVERALL | 180 | 0.5889 | - |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-cpp | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-ts | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-sh | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-cs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-go | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-java | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-lua | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-js | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-php | 10 | 1 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-pl | 10 | 0.2 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rkt | 10 | 0.9 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-r | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rs | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-scala | 10 | 0.7 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-swift | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-rb | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-d | 10 | 0 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | multiple_mbpp | mean_acc_pass@1 | mbpp-jl | 10 | 0.6 | default |
+------------------------+---------------+-----------------+------------+-------+---------+---------+

+------------------------+-----------------+-----------------+----------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+========================+=================+=================+================+=======+=========+=========+
| qwen3-4b-instruct-2507 | live_code_bench | mean_acc | release_latest | 100 | 0.67 | default |
+------------------------+-----------------+-----------------+----------------+-------+---------+---------+
| qwen3-4b-instruct-2507 | live_code_bench | mean_acc_pass@1 | release_latest | 100 | 0.67 | default |
+------------------------+-----------------+-----------------+----------------+-------+---------+---------+

@gemini-code-assist
Contributor

Summary of Changes

Hello @XChen-Zero, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Volcengine Code Sandbox (SandboxFusion) into EvalScope, providing a remote and stateless environment for code execution. This significantly reduces the dependency on local Docker or enclave-based sandboxes, offering a more flexible and scalable solution for code evaluation. The changes include a new architecture for managing diverse sandbox types, ensuring compatibility with existing benchmark configurations while expanding the range of available execution environments.

Highlights

  • New Volcengine Sandbox Backend: Adds support for Volcengine Code Sandbox (SandboxFusion) as a remote, stateless sandbox option, enhancing EvalScope's code evaluation capabilities.
  • Flexible Sandbox Management: Introduces an _external_manager attribute and associated logic to handle external sandbox integrations, allowing EvalScope to work seamlessly with both local (Docker/enclave) and remote sandboxes.
  • Tool Mapping and Configuration: Existing tools like python_executor and shell_executor are automatically mapped to the Volcengine sandbox, requiring no changes to existing task or benchmark configurations.
  • Dynamic Sandbox Initialization: The sandbox initialization process is updated to dynamically differentiate between internal and external sandbox types, ensuring the correct manager and configuration are utilized.
  • Enhanced Argument and Configuration: Command-line arguments and configuration handling are updated to support the new volcengine sandbox type, providing clear options for users.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR adds support for Volcengine sandbox, which is a great addition for providing a remote, stateless sandbox option. The implementation is mostly solid, introducing a new SandboxFusionClient and integrating it via the SandboxMixin. The code is well-structured to handle this new external sandbox type.

My review focuses on improving maintainability, robustness, and adherence to best practices. Key areas of feedback include:

  • Replacing broad except Exception clauses with more specific ones.
  • Reducing code duplication, particularly for sandbox type aliases.
  • Improving the design to be more extensible by avoiding hardcoded imports of specific sandbox implementations.
  • Enhancing readability and logging.

Overall, this is a good contribution. Addressing the feedback will make the code more robust and easier to maintain in the long run.

manager_class = manager_config.get('manager_class')
if not manager_class:
    if self._is_external_sandbox():
        from evalscope.sandbox.volcengine import SandboxFusionSandboxManager

high

Hardcoding the import of SandboxFusionSandboxManager inside SandboxMixin creates a tight coupling and violates the Open/Closed Principle. This makes it difficult to add other external sandbox types in the future without modifying this mixin. A more extensible approach would be to use a factory pattern or a registry to map sandbox_type strings to their corresponding manager classes.
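
A registry along the lines suggested here could be sketched as follows. All names (register_sandbox_manager, get_sandbox_manager_class, DemoVolcengineManager) are hypothetical and chosen for illustration; this is not the structure that was merged.

```python
from typing import Callable, Dict, Type

# Maps lower-cased sandbox_type strings to manager classes.
_MANAGER_REGISTRY: Dict[str, Type] = {}


def register_sandbox_manager(*names: str) -> Callable[[Type], Type]:
    """Class decorator that registers a manager under one or more names."""
    def decorator(cls: Type) -> Type:
        for name in names:
            _MANAGER_REGISTRY[name.lower()] = cls
        return cls
    return decorator


def get_sandbox_manager_class(sandbox_type: str) -> Type:
    """Look up the manager class for a sandbox_type, case-insensitively."""
    try:
        return _MANAGER_REGISTRY[str(sandbox_type).lower()]
    except KeyError:
        raise ValueError(f"Unknown sandbox type: {sandbox_type!r}") from None


@register_sandbox_manager("volcengine", "volcano", "volc")
class DemoVolcengineManager:
    """Stand-in for a real manager class; not part of EvalScope."""
```

With this shape, adding another external backend means registering one more class, and the mixin only ever calls the lookup function.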

Comment on lines +167 to +169
try:
    return sandbox_type_enum(sandbox_type)
except Exception:

medium

Using except Exception: is too broad and can mask unexpected errors. It's better to catch the specific exception you expect. In this case, creating an enum member from an invalid value raises ValueError.

            except ValueError:
                return sandbox_type
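
The behaviour this suggestion relies on can be checked in isolation. The enum below (DemoSandboxType) and the helper name are made up for the demonstration and do not mirror the real sandbox type enum.

```python
from enum import Enum


class DemoSandboxType(Enum):
    """Hypothetical enum standing in for the real sandbox type enum."""
    DOCKER = "docker"


def coerce_sandbox_type(value):
    """Return the enum member for known values, else the raw value unchanged."""
    try:
        return DemoSandboxType(value)
    except ValueError:
        # Raised by Enum lookup when no member has this value.
        return value
```

Narrowing the handler this way keeps unrelated failures, such as a typo inside the try body, from being silently converted into a fallback.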

sandbox_type = self._task_config.sandbox_type if self._task_config else None
if not sandbox_type:
    return False
return str(sandbox_type).lower() in {'volcengine', 'volcano', 'volc'}

medium

The set of aliases {'volcengine', 'volcano', 'volc'} is also used in _resolve_sandbox_config on line 179 and in evalscope/config.py on line 292. Duplicating this set makes it harder to maintain and can lead to inconsistencies. Consider defining it as a constant at the module or class level and reusing it in all these places to improve maintainability.
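
One way to apply this is a single module-level constant plus a small predicate that the other call sites import. The names VOLCENGINE_SANDBOX_ALIASES and is_volcengine_sandbox are hypothetical.

```python
# Single source of truth for the accepted spellings of the sandbox type.
VOLCENGINE_SANDBOX_ALIASES = frozenset({"volcengine", "volcano", "volc"})


def is_volcengine_sandbox(sandbox_type) -> bool:
    """Return True if sandbox_type names the Volcengine backend."""
    if not sandbox_type:
        return False
    return str(sandbox_type).lower() in VOLCENGINE_SANDBOX_ALIASES
```

A frozenset is used so the shared constant cannot be mutated by accident from one of the importing modules.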

Comment on lines +253 to +255
return lambda tool_name, tool_input, timeout=None: sandbox.execute(  # noqa: E731
    tool_name, tool_input, timeout=timeout
)

medium

Using lambda here requires a noqa to suppress a linting error (E731). While concise, this can be less readable. For better clarity and to adhere to style guidelines without needing noqa, consider using a nested function (def). This pattern is repeated for other executors in this method.

                def executor(tool_name, tool_input, timeout=None):
                    return sandbox.execute(tool_name, tool_input, timeout=timeout)
                return executor

    parsed = json.loads(trimmed)
    if isinstance(parsed, dict):
        input_dict = parsed
except Exception:

medium

Catching a broad Exception can hide unexpected errors. It's better to catch only the specific exception you expect. When parsing JSON, json.loads raises json.JSONDecodeError on failure.

Suggested change
-    except Exception:
+    except json.JSONDecodeError:
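
As a self-contained illustration of the narrowed handler, assuming the input is always a string: the function name parse_tool_input is hypothetical and the surrounding logic is simplified from the diff context above.

```python
import json


def parse_tool_input(raw):
    """Return the dict encoded in raw JSON text, or None if it is not a JSON object."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Not valid JSON, e.g. plain source code passed as the tool input.
        return None
    return parsed if isinstance(parsed, dict) else None
```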

Comment on lines +200 to +201
if stderr:
    merged_output += ('' if merged_output.endswith('\n') or not merged_output else '\n') + stderr

medium

This pattern to conditionally add a newline is repeated multiple times below. To improve readability and reduce duplication, consider extracting it into a small helper function.
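
Such a helper could be as small as the sketch below. The name append_block is hypothetical; the separator rule is the one from the quoted line.

```python
def append_block(output: str, extra: str) -> str:
    """Append extra to output, inserting a newline only when one is needed."""
    if not extra:
        return output
    separator = "" if not output or output.endswith("\n") else "\n"
    return output + separator + extra
```

The quoted statement would then read merged_output = append_block(merged_output, stderr), and the same call could serve the other places where the pattern repeats.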

Comment on lines +322 to +323
except Exception:
    pass

medium

Silently swallowing all exceptions with except Exception: pass is risky, as it can hide important problems that occur during resource cleanup. At a minimum, you should log the exception as a warning so that issues can be diagnosed. Note: you will need to import and initialize a logger at the top of the file for this to work.

Suggested change
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning('Failed to close sandbox client session.', exc_info=True)
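
A standalone version of this cleanup pattern, using the standard logging module as an assumption (the project may use its own logger helper), with close_quietly as a hypothetical name:

```python
import logging

logger = logging.getLogger(__name__)


def close_quietly(session) -> bool:
    """Close session; log instead of raising so cleanup never masks the real result."""
    try:
        session.close()
        return True
    except Exception:
        # exc_info=True attaches the traceback to the warning record.
        logger.warning("Failed to close sandbox client session.", exc_info=True)
        return False
```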

@XChen-Zero
Contributor Author

Improve sandbox language compatibility and add more test results

This PR extends the Volcengine sandbox support with a configurable dataset-to-sandbox language mapping. It fixes compatibility issues where a dataset's language identifier differs from the one the sandbox expects (e.g. r → R).

What’s added

  • Introduced dataset_language_map in sandbox_manager_config to normalize dataset language names before execution.
  • Works for Volcengine Sandbox and remains compatible with existing MS sandbox implementations.
  • Keeps backward compatibility with existing configs.

Example

sandbox_type="volcengine",
use_sandbox=True,
sandbox_manager_config={
    "base_url": "http://<IP>:<PORT>",
    "dataset_language_map": {
        "r": "R",
        "d_ut": "D_ut",
        "ts": "typescript"
    }
}
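
The effect of dataset_language_map can be pictured as a lookup with a pass-through default. The function below is a hypothetical sketch of that behaviour, not the code in this PR.

```python
def resolve_sandbox_language(dataset_language, language_map=None):
    """Translate a dataset language id to the sandbox's id, if a mapping exists."""
    if not language_map:
        return dataset_language
    # Unmapped languages pass through unchanged, so existing configs keep working.
    return language_map.get(dataset_language, dataset_language)


# Mapping copied from the example configuration above.
example_map = {"r": "R", "d_ut": "D_ut", "ts": "typescript"}
resolved = [resolve_sandbox_language(name, example_map) for name in ("r", "ts", "go")]
```

Here "go" has no entry and is forwarded as-is, which is the backward-compatible case.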

Additional results

  • LiveCodeBench (release_latest)
    qwen3-4b-instruct-2507: mean_acc / pass@1 = 0.4474 (1055 samples)
  • multiple_mbpp (multi-language)
    Overall mean_acc = 0.6778 across 18 languages

These results validate correct language dispatch and stable sandbox execution across multiple datasets and languages.

Collaborator

@Yunnglin Yunnglin left a comment

LGTM! Thanks for the contribution. I've updated sandbox_mixin code and documentation. Merging now.

@Yunnglin
Collaborator

If possible, we also welcome PRs to ms-enclave to natively support Volcengine's sandbox. It would be more reasonable for ms-enclave to support more backends.

@Yunnglin Yunnglin linked an issue Jan 20, 2026 that may be closed by this pull request
@Yunnglin Yunnglin merged commit a336e04 into modelscope:main Jan 20, 2026
3 checks passed

Development

Successfully merging this pull request may close these issues.

Support for Vol-Sandbox as Code Execution Environment?
