A guide to integrating mature, independently developed benchmarks into the framework, using SysMoBench as an example.
Git Subtree vs Submodule:
When porting an existing benchmark, you need to decide how to integrate the upstream benchmark code into the framework repository (i.e., system-intelligence-benchmark). While both Git Subtree and Git Submodule can work, we recommend Git Subtree for most benchmark porting scenarios.
Why Subtree over Submodule for Benchmarks:
- Atomic commits and consistency: Subtree keeps all code in the main repository's Git object database, avoiding state synchronization issues between the parent repo and submodule HEAD. You can modify framework code and benchmark code in a single atomic commit, ensuring consistency across the entire codebase.
- Bidirectional sync flexibility: `git subtree pull --squash` cleanly integrates upstream updates while keeping the repository history manageable, and `git subtree push` enables contributing patches back to upstream.
- Fewer gotchas: Submodules have many edge cases that can confuse contributors (see this detailed analysis).
When to use Subtree:
- Benchmark is relatively stable (not updated daily)
- Repository size is acceptable (most benchmarks are <100MB)
- You want contributors to have a smooth onboarding experience
When Submodule might be acceptable:
- Upstream updates extremely frequently
- Benchmark codebase is very large (>500MB)
- You need strict separation between upstream and integration code
```bash
# Add remote
git remote add benchmark-upstream https://github.com/upstream/repo.git

# Add as subtree
git subtree add --prefix benchmarks/your_benchmark/benchmark_core \
    benchmark-upstream main --squash
```

The resulting directory layout:

```
benchmarks/your_benchmark/
├── benchmark_core/       # Git Subtree (DO NOT manually edit)
├── src/                  # Bridge layer
│   ├── main.py           # Wrapper around your benchmark's entry point, including the benchmark-driving logic
│   ├── executor.py
│   └── evaluator.py
├── data/benchmark/
│   └── tasks.jsonl
├── env.toml              # Config template with "XXX" placeholders
├── requirements.txt      # -r benchmark_core/requirements.txt
├── install.sh
├── run.sh
└── README.md
```
Mature benchmarks already have end-to-end execution pipelines and SDKs. However, to unify LLM/agent configuration management across the framework and improve maintainability (see Benchmark Abstraction), we need an adapter layer in benchmarks/your_benchmark/src/ to bridge the upstream benchmark with the framework.
The framework provides a centralized model configuration manager. There are two options for integrating it:
Option 1: Replace upstream config manager (Recommended)
Directly inject the framework's config into the upstream benchmark's initialization. This is the simplest and most stable approach.
```python
# src/main.py
import sys
from pathlib import Path

# Add paths
SDK_ROOT = Path(__file__).parent.parent.parent.parent
BENCHMARK_CORE = Path(__file__).parent.parent / "benchmark_core"
sys.path.insert(0, str(SDK_ROOT))
sys.path.insert(0, str(BENCHMARK_CORE))

# Inject framework config
from sdk.utils import set_llm_endpoint_from_config
set_llm_endpoint_from_config(str(Path(__file__).parent.parent / 'env.toml'))

# Now import upstream - it will use the framework's LLM config
from tla_eval.config import get_configured_model
```

SysMoBench example: it directly replaces upstream's `models.yaml` by setting environment variables before importing upstream modules.
Option 2: Map framework config to upstream config
If the upstream config system cannot be replaced, map the framework's config to upstream's format at runtime. Implementation depends on your specific benchmark.
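One possible shape for this mapping, as a rough sketch: read the framework's `env.toml` and emit the upstream's expected config file before the upstream code loads it. The file layout and key names below are assumptions for illustration, not the framework's actual schema.

```python
# Hypothetical sketch: translate the framework's env.toml into an
# upstream-style models.yaml before the upstream benchmark reads it.
# All key names here are illustrative assumptions.
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # fallback for older Pythons

def map_framework_config(env_toml: Path, upstream_yaml: Path) -> None:
    """Write the upstream config file from the framework's TOML config."""
    with env_toml.open("rb") as f:
        cfg = tomllib.load(f)

    # Assumed [model] table in env.toml; adjust to your actual schema.
    model = cfg.get("model", {})
    lines = [
        f"model_name: {model.get('name', 'XXX')}",
        f"api_base: {model.get('endpoint', 'XXX')}",
        f"api_key: {model.get('api_key', 'XXX')}",
    ]
    upstream_yaml.write_text("\n".join(lines) + "\n")
```

Call this at the top of `src/main.py`, before any upstream imports, so the upstream code sees the generated file on startup.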
Any benchmark can be abstracted into two sequential modules: Executor (generation/interaction) and Evaluator (scoring). Separating them improves code clarity and extensibility, and enables integrating more sophisticated executors without modifying evaluation logic.
Executor: Handles the generation or iterative correction workflow
- Example: SysMoBench runs multi-phase generation with iterative error correction
- Encapsulates retry logic, model calls, and intermediate outputs
Evaluator: Performs the final evaluation
- Example: SysMoBench runs TLC checks and verification
- Returns standardized scores and diagnostic information
If your benchmark already has a decoupled design, you can skip this step; clarify in the README.md how the framework's agent/environment/evaluator substitutes for the upstream components, and add a thin wrapper in `main.py` to expose the model/agent as parameters.
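As a rough illustration of the split, a minimal Executor/Evaluator pair might look like this. The class names, fields, and driving loop are illustrative, not the framework's actual interfaces:

```python
# Minimal sketch of the Executor/Evaluator separation; names are
# illustrative assumptions, not the framework's real API.
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    output: str                          # artifact produced by the executor
    score: float = 0.0
    diagnostics: dict = field(default_factory=dict)

class Executor:
    """Generation / iterative correction (e.g. multi-phase generation
    with retries), producing one artifact per task."""
    def run(self, task: dict) -> TaskResult:
        output = f"generated artifact for {task['task_id']}"  # model calls go here
        return TaskResult(task_id=task["task_id"], output=output)

class Evaluator:
    """Final scoring (e.g. TLC checks and verification), returning
    standardized scores and diagnostics."""
    def evaluate(self, result: TaskResult) -> TaskResult:
        result.score = 1.0 if result.output else 0.0
        result.diagnostics["checked"] = True
        return result

# Driving loop in main.py: iterate tasks, execute, then evaluate.
def run_benchmark(tasks, executor=Executor(), evaluator=Evaluator()):
    return [evaluator.evaluate(executor.run(t)) for t in tasks]
```

Because the two stages only share the `TaskResult` interface, a more sophisticated executor (e.g. an agentic loop) can be swapped in without touching evaluation logic.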
Convert the upstream task format to the framework's standard tasks.jsonl schema. This decouples task definitions from execution logic, enabling main.py to iterate over tasks programmatically without hardcoding task-specific details.
```jsonl
{"task_id": "task_1", "description": "...", "metadata": {}}
{"task_id": "task_2", "description": "...", "metadata": {}}
```

Most remaining steps (testing, documentation, root-level integration) are identical to creating a custom benchmark. See Creating New Benchmarks for detailed guidelines.
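A conversion script for this step might look like the following sketch. The upstream field names (`id`, `prompt`) are assumptions about a hypothetical upstream format; substitute your benchmark's actual fields.

```python
# Hypothetical sketch: convert an upstream task list into the
# framework's tasks.jsonl schema. Upstream field names are assumptions.
import json
from pathlib import Path

def convert_tasks(upstream_json: Path, out_jsonl: Path) -> int:
    """Flatten upstream task entries into one JSON object per line."""
    upstream = json.loads(upstream_json.read_text())
    n = 0
    with out_jsonl.open("w") as f:
        for task in upstream:
            n += 1
            record = {
                "task_id": task.get("id", f"task_{n}"),
                "description": task.get("prompt", ""),
                # Preserve remaining fields so the executor can still
                # reach upstream-specific details via metadata.
                "metadata": {k: v for k, v in task.items()
                             if k not in ("id", "prompt")},
            }
            f.write(json.dumps(record) + "\n")
    return n
```

Keeping unknown fields under `metadata` means `main.py` can iterate tasks uniformly while the executor still has access to any upstream-specific details.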
Porting-specific considerations:
Reference upstream dependencies in `requirements.txt`:

```
# requirements.txt
-r benchmark_core/requirements.txt
```

Install upstream dependencies in `install.sh`:
```bash
#!/bin/bash
set -e

# Install upstream system dependencies (e.g., Java for SysMoBench)
# ...

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run upstream setup scripts
python3 benchmark_core/scripts/setup.py
deactivate
```

Exclude upstream-generated files in `.gitignore`:

```
# Exclude upstream runtime artifacts
benchmark_core/lib/
benchmark_core/output/
benchmark_core/.venv/
```

- Tests: See creating_benchmark.md - Testing
- README: Document upstream source, version, and attribution
- Root integration: Update `cli/run_all_local.sh` and the root `README.md`
Make sure the benchmark can be run locally in at least the two ways below:

```bash
./run.sh <model_name>
```

```bash
cd cli
./run_all_local.sh <model_name>
```

Be careful with path configuration when porting the benchmark.
Update:

```bash
git subtree pull --prefix benchmarks/your_benchmark/benchmark_core \
    benchmark-upstream main --squash
```

Contribute back:

```bash
git subtree push --prefix benchmarks/your_benchmark/benchmark_core \
    benchmark-upstream feature-branch
```