Versioned Metadata for Reproducibility#23

Merged
geoalgo merged 11 commits into main from feat/versioned-metadata on Mar 17, 2026

Conversation

@kargibora
Collaborator

Summary

This PR introduces a versioned run-metadata artifact for OpenJury to make evaluation runs easier to reproduce, inspect, and compare.

The new metadata is intentionally lightweight and follows the same general idea as lm-evaluation-harness: save the effective run configuration, compact reproducibility signals, and produced artifacts, without logging unnecessary local machine context.

What this PR introduces

Each evaluation run now writes a run-metadata.v1.json file alongside the existing output artifacts.

The metadata includes:

  • schema/version information
  • timestamps and run duration
  • entrypoint name
  • a single consolidated run configuration block
  • a compacted results summary
  • dataset_statistics
  • dependency/runtime information
  • optional git_hash
  • reproducibility hashes for:
    • the effective judge system prompt
    • the effective judge user prompt template
    • the selected instruction-index set
  • produced artifacts and explicit file references

How it is implemented

The main changes are:

  • add a versioned metadata writer in openjury/repro.py
  • call the metadata writer from both evaluation entrypoints
  • resolve the effective judge prompts and hash those exact prompts for provenance
  • hash the normalized set of selected instruction indices so the subset is tracked independently of ordering
  • update tests to validate the new metadata shape

An important implementation detail is that the metadata hashes are derived from the effective prompts used by the pipeline, but computing them does not change the evaluation logic itself.
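The hashing scheme above can be sketched as follows. This is a minimal illustration, not the exact code in openjury/repro.py; the function names and the comma-joined normalization are assumptions.

```python
import hashlib

def sha256_of_text(text: str) -> str:
    # Hash the exact effective prompt text (system prompt or user
    # prompt template) for provenance. Hypothetical helper name.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sha256_of_indices(indices) -> str:
    # Sort and deduplicate before hashing so the digest depends only
    # on the selected subset, not on selection order.
    normalized = ",".join(str(i) for i in sorted(set(indices)))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this normalization, two runs that select the same instructions in a different order produce identical `instruction_indices_sha256` values.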

Resulting metadata shape

At a high level, the saved metadata now looks like:

  • schema_version
  • timestamps
  • entrypoint
  • run
  • results
  • dataset_statistics
  • environment
  • dependencies
  • git_hash (optional)
  • instruction_indices_sha256 (optional)
  • judge_system_prompt_sha256 (optional)
  • judge_user_prompt_template_sha256 (optional)
  • extras (optional)
  • artifacts

Example

Example run-metadata.v1.json:

{
  "schema_version": "openjury-run-metadata/v1",
  "timestamps": {
    "started_at_utc": "2026-03-12T11:32:06.178466+00:00",
    "finished_at_utc": "2026-03-12T11:36:03.592860+00:00",
    "duration_sec": 237.414394
  },
  "entrypoint": "openjury.generate_and_evaluate.main",
  "run": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "provide_explanation": false,
    "swap_mode": "fixed",
    "n_instructions": 5,
    "ignore_cache": false,
    "use_tqdm": false,
    "truncate_all_input_chars": 8192,
    "max_out_tokens_models": 1024,
    "max_out_tokens_judge": 1024,
    "max_model_len": 8192,
    "chat_template": null
  },
  "results": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "num_battles": 5,
    "winrate": 0.0,
    "num_wins": 0,
    "num_losses": 4,
    "num_ties": 0,
    "num_missing": 1,
    "preferences_count": 5
  },
  "dataset_statistics": {
    "instruction_index_count": 5,
    "instructions_count": 5,
    "completions_A_count": 5,
    "completions_B_count": 5
  },
  "environment": {
    "python_version": "3.12.12",
    "platform": "Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28"
  },
  "dependencies": {
    "datasets": "4.5.0",
    "fast-langdetect": "1.0.0",
    "huggingface-hub": "0.36.2",
    "ipython": "9.10.0",
    "joblib": "1.5.3",
    "jupyter": "1.1.1",
    "langchain": "0.3.27",
    "langchain-community": "0.3.31",
    "langchain-openai": "0.3.35",
    "langchain-together": "0.3.1",
    "llama-cpp-python": null,
    "matplotlib": "3.10.8",
    "pandas": "2.3.3",
    "pyyaml": "6.0.3",
    "seaborn": "0.13.2",
    "tqdm": "4.67.3",
    "transformers": "4.57.6",
    "vllm": "0.10.2"
  },
  "git_hash": "a8ee895491613371583aea4b26407314aed9502b",
  "instruction_indices_sha256": "cf2445f7783b200d0290dee72cedcb2fb75774f80277f2b9458690922031d610",
  "judge_system_prompt_sha256": "0912ced093a83ed7f01e169c96824ac510da30b29c112ac998a31854e5c2239c",
  "judge_user_prompt_template_sha256": "ca30b32626266397b0361f911871d76dbfbf36e4dd790bc0ffe5a893a540ffa1",
  "extras": {
    "files": {
      "annotations": {
        "relative_path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv"
      },
      "results": {
        "relative_path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json"
      }
    }
  },
  "artifacts": [
    {
      "path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv",
      "size_bytes": 26599
    },
    {
      "path": "args-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 565
    },
    {
      "path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 405
    }
  ]
}
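With this file in place, two runs can be checked for the same reproducibility signature by comparing the hash fields. The snippet below is illustrative (the helper name is made up); the field names match the example above.

```python
import json
from pathlib import Path

# Fields that together pin down the judge prompts and instruction subset.
REPRO_KEYS = (
    "instruction_indices_sha256",
    "judge_system_prompt_sha256",
    "judge_user_prompt_template_sha256",
)

def same_repro_signature(run_dir_a, run_dir_b) -> bool:
    # True if both runs used the same effective prompts and the same
    # instruction subset, according to their saved metadata.
    a = json.loads((Path(run_dir_a) / "run-metadata.v1.json").read_text())
    b = json.loads((Path(run_dir_b) / "run-metadata.v1.json").read_text())
    return all(a.get(k) == b.get(k) for k in REPRO_KEYS)
```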

@kargibora kargibora self-assigned this Mar 12, 2026
@kargibora kargibora added the enhancement New feature or request label Mar 12, 2026
Collaborator

@geoalgo geoalgo left a comment
LGTM, just have small comments.

eval_completions_A = completions_A.head(n_instructions).tolist()
eval_completions_B = completions_B.head(n_instructions).tolist()

write_run_metadata(
Collaborator

can we test this function instead of the whole entrypoint?
Ideally with possible edge cases.

Collaborator Author

Fixed it with 35f9d4f. Now there is a test_repro function that checks write_run_metadata only.

write_run_metadata(
output_dir=res_folder,
entrypoint="openjury.generate_and_evaluate.main",
run={
Collaborator

can we rather just pass args?

Collaborator Author

Fixed it with 35f9d4f. Now I pass args instead.

Comment on lines +94 to +123
def test_generate_and_evaluate_writes_run_metadata(tmp_path):
_ = main_generate_and_eval(
CliArgs(
dataset="alpaca-eval",
model_A="Dummy/no answer",
model_B="Dummy/open is better than close isnt'it",
judge_model="Dummy/score A: 10 score B: 0",
n_instructions=3,
result_folder=str(tmp_path),
)
)

metadata_files = list(tmp_path.rglob("run-metadata.v1.json"))
assert len(metadata_files) == 1

metadata = json.loads(metadata_files[0].read_text())
assert metadata["schema_version"] == "openjury-run-metadata/v1"
assert metadata["entrypoint"] == "openjury.generate_and_evaluate.main"
assert metadata["run"]["dataset"] == "alpaca-eval"
assert metadata["dataset_statistics"]["instruction_index_count"] == 3
assert metadata["results"]["num_battles"] == 3
assert metadata["results"]["preferences_count"] == 3
assert "instruction_indices_sha256" in metadata
assert "judge_system_prompt_sha256" in metadata
assert "judge_user_prompt_template_sha256" in metadata
artifact_paths = {artifact["path"] for artifact in metadata["artifacts"]}
assert (
metadata["extras"]["files"]["results"]["relative_path"] in artifact_paths
)
assert metadata["extras"]["files"]["annotations"]["relative_path"] in artifact_paths
Collaborator

can we rather test the function that generates the metadata, ideally with possible edge cases?

Collaborator Author

Fixed it with 35f9d4f. Now there is a test_repro function that checks write_run_metadata only.

Comment on lines +536 to +537
except Exception as e:
print(f"Warning: failed to write run metadata: {e}")
Collaborator

We should not catch all exceptions here, as that is an anti-pattern that can silently swallow unexpected errors. Better to restrict the exception class and let anything else propagate.

Collaborator Author

Fixed it with 35f9d4f. Now it only catches OSError.
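The resulting pattern, sketched below, tolerates filesystem errors only, so programming errors still surface. This is an illustration of the reviewed change, not the exact call site; the wrapper name is made up.

```python
def try_write_run_metadata(write_fn, *args, **kwargs):
    # Writing metadata is best-effort: an unwritable output directory or
    # a full disk (OSError) should not kill an otherwise successful run,
    # but bugs in the writer itself must propagate.
    try:
        write_fn(*args, **kwargs)
    except OSError as e:
        print(f"Warning: failed to write run metadata: {e}")
```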

@geoalgo geoalgo merged commit 6cc9640 into main Mar 17, 2026
1 check passed
