This directory is used for evaluating patch predictions on the OmniGIRL benchmark.
After setting up the environment, do the following to run an evaluation:
- **Prepare the prediction file**: a file in JSONL format, each item containing:
  - `model_name_or_path`: model name
  - `instance_id`: task instance ID
  - `model_patch`: predicted patch content

  Example:

  ```json
  { "model_name_or_path": "agentless-v1", "instance_id": "prettier__prettier-12260", "model_patch": "diff --git ...." }
  ```
- Move to `omnigirl/harness`, then run the evaluation with the following command:

  ```shell
  # required
  cd omnigirl/harness
  python run_evaluation.py --predictions_path <path of your prediction results> \
      --max_workers <number of workers> \
      --run_id <unique id of this evaluation>
  ```

- By default, your evaluation results will be generated in `omnigirl/harness/reports`.
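The prediction file from the first step can be produced with a short script. A minimal sketch, assuming your patches are already in memory (the output path and patch text below are placeholders, not harness requirements):

```python
import json

# Placeholder predictions; in practice these come from your model or agent.
predictions = [
    {
        "model_name_or_path": "agentless-v1",
        "instance_id": "prettier__prettier-12260",
        "model_patch": "diff --git a/src/x.js b/src/x.js\n...",
    },
]

# JSONL: one JSON object per line, which is the format the harness expects.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```

The resulting `predictions.jsonl` is what you pass to `--predictions_path`.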
Parameters of `run_evaluation.py`:
| Parameter | Description |
|---|---|
| `--dataset_name` | Name of the dataset or path to a JSON file. Default: `"Deep-Software-Analytics/OmniGIRL"` |
| `--predictions_path` | Path to the prediction file |
| `--from_hub` | Whether to pull Docker images from DockerHub (`True`) or build locally (`False`). Default: `True` |
| `--max_workers` | Max number of parallel workers. Default: `4` |
| `--run_id` | Unique ID to identify this evaluation run (required) |
| `--instance_ids` | Specific instance IDs to evaluate (space-separated list) |
| `--timeout` | Timeout (in seconds) for each instance. Default: `3600` (60 min) |
| `--reports_dir` | Directory to save the output reports. Default: `"reports"` |
| `--force_rebuild` | Whether to force rebuild all Docker images. Default: `False` |
| `--split` | Dataset split to evaluate. Default: `"test"` |
| `--open_file_limit` | Limit on the number of open files. Default: `4096` |
| `--cache_level` | Cache strategy: `"none"`, `"base"`, `"env"`, or `"instance"`. Default: `"env"` |
| `--clean` | If `True`, remove all cached images above the cache level. Default: `False` |
| `--version_spec` | Version specifier for filtering tasks. Default: `"all"` |
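Before launching a run, it can help to sanity-check that every record in the prediction file carries the fields described above. A minimal sketch of such a check; this validator is an illustration, not part of the harness:

```python
import json

# Fields each JSONL record should carry, per the prediction-file format above.
REQUIRED_KEYS = {"model_name_or_path", "instance_id", "model_patch"}

def validate_predictions(path):
    """Return a list of (line_number, missing_keys) for malformed records."""
    problems = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append((i, sorted(missing)))
    return problems
```

An empty return value means every record has the expected keys; otherwise each entry points at an offending line.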
- **Use the amd64 architecture**: this ensures full compatibility with the Docker images and matches the evaluation setup in the paper.
- **`--from_hub` defaults to `True`**: images are pulled automatically from DockerHub, allowing evaluation without requiring local image builds.
- **No need to set `--dataset_name` manually**: by default, the dataset is pulled from Hugging Face. You can also specify a local file such as `benchmark/OmniGIRL.json`.
- **The recommended `--cache_level` is `env`**: using `instance` may cache large amounts of image data, occupying excessive local disk space.
- **Adjust `--max_workers` based on your hardware**: over-parallelization may degrade performance or lead to unstable evaluations.
- **The `--run_id` parameter creates a unique directory**: it stores logs and evaluation artifacts, while `--reports_dir` saves the final report.