Versioned Metadata for Reproducibility#23

Merged
geoalgo merged 11 commits into main from feat/versioned-metadata on Mar 17, 2026

Conversation

@kargibora
Collaborator

Summary

This PR introduces a versioned run-metadata artifact for OpenJury to make evaluation runs easier to reproduce, inspect, and compare.

The new metadata is intentionally lightweight and follows the same general idea as lm-evaluation-harness: save the effective run configuration, compact reproducibility signals, and produced artifacts, without logging unnecessary local machine context.

What this PR introduces

Each evaluation run now writes a run-metadata.v1.json file alongside the existing output artifacts.

The metadata includes:

  • schema/version information
  • timestamps and run duration
  • entrypoint name
  • a single consolidated run configuration block
  • a compacted results summary
  • dataset_statistics
  • dependency/runtime information
  • optional git_hash
  • reproducibility hashes for:
    • the effective judge system prompt
    • the effective judge user prompt template
    • the selected instruction-index set
  • produced artifacts and explicit file references

How it is implemented

The main changes are:

  • add a versioned metadata writer in openjury/repro.py
  • call the metadata writer from both evaluation entrypoints
  • resolve the effective judge prompts and hash those exact prompts for provenance
  • hash the normalized set of selected instruction indices so the subset is tracked independently of ordering
  • update tests to validate the new metadata shape

An important implementation detail is that the metadata hashes are derived from the effective prompts used by the pipeline, but computing them does not change the evaluation logic itself.
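The hashing scheme above can be sketched as follows. This is a minimal illustration, not the exact code in openjury/repro.py; the function names and the comma-joined normalization are assumptions.

```python
import hashlib

def sha256_of_text(text: str) -> str:
    # Hash the exact effective prompt text (system prompt or user
    # prompt template) for provenance. Hypothetical helper name.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sha256_of_indices(indices) -> str:
    # Sort and deduplicate before hashing so the digest depends only
    # on the selected subset, not on selection order.
    normalized = ",".join(str(i) for i in sorted(set(indices)))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With this normalization, two runs that select the same instructions in a different order produce identical `instruction_indices_sha256` values.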

Resulting metadata shape

At a high level, the saved metadata now looks like:

  • schema_version
  • timestamps
  • entrypoint
  • run
  • results
  • dataset_statistics
  • environment
  • dependencies
  • git_hash (optional)
  • instruction_indices_sha256 (optional)
  • judge_system_prompt_sha256 (optional)
  • judge_user_prompt_template_sha256 (optional)
  • extras (optional)
  • artifacts

Example

Example run-metadata.v1.json:

{
  "schema_version": "openjury-run-metadata/v1",
  "timestamps": {
    "started_at_utc": "2026-03-12T11:32:06.178466+00:00",
    "finished_at_utc": "2026-03-12T11:36:03.592860+00:00",
    "duration_sec": 237.414394
  },
  "entrypoint": "openjury.generate_and_evaluate.main",
  "run": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "provide_explanation": false,
    "swap_mode": "fixed",
    "n_instructions": 5,
    "ignore_cache": false,
    "use_tqdm": false,
    "truncate_all_input_chars": 8192,
    "max_out_tokens_models": 1024,
    "max_out_tokens_judge": 1024,
    "max_model_len": 8192,
    "chat_template": null
  },
  "results": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "num_battles": 5,
    "winrate": 0.0,
    "num_wins": 0,
    "num_losses": 4,
    "num_ties": 0,
    "num_missing": 1,
    "preferences_count": 5
  },
  "dataset_statistics": {
    "instruction_index_count": 5,
    "instructions_count": 5,
    "completions_A_count": 5,
    "completions_B_count": 5
  },
  "environment": {
    "python_version": "3.12.12",
    "platform": "Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28"
  },
  "dependencies": {
    "datasets": "4.5.0",
    "fast-langdetect": "1.0.0",
    "huggingface-hub": "0.36.2",
    "ipython": "9.10.0",
    "joblib": "1.5.3",
    "jupyter": "1.1.1",
    "langchain": "0.3.27",
    "langchain-community": "0.3.31",
    "langchain-openai": "0.3.35",
    "langchain-together": "0.3.1",
    "llama-cpp-python": null,
    "matplotlib": "3.10.8",
    "pandas": "2.3.3",
    "pyyaml": "6.0.3",
    "seaborn": "0.13.2",
    "tqdm": "4.67.3",
    "transformers": "4.57.6",
    "vllm": "0.10.2"
  },
  "git_hash": "a8ee895491613371583aea4b26407314aed9502b",
  "instruction_indices_sha256": "cf2445f7783b200d0290dee72cedcb2fb75774f80277f2b9458690922031d610",
  "judge_system_prompt_sha256": "0912ced093a83ed7f01e169c96824ac510da30b29c112ac998a31854e5c2239c",
  "judge_user_prompt_template_sha256": "ca30b32626266397b0361f911871d76dbfbf36e4dd790bc0ffe5a893a540ffa1",
  "extras": {
    "files": {
      "annotations": {
        "relative_path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv"
      },
      "results": {
        "relative_path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json"
      }
    }
  },
  "artifacts": [
    {
      "path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv",
      "size_bytes": 26599
    },
    {
      "path": "args-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 565
    },
    {
      "path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 405
    }
  ]
}
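With this file in place, two runs can be checked for the same reproducibility signature by comparing the hash fields. The snippet below is illustrative (the helper name is made up); the field names match the example above.

```python
import json
from pathlib import Path

# Fields that together pin down the judge prompts and instruction subset.
REPRO_KEYS = (
    "instruction_indices_sha256",
    "judge_system_prompt_sha256",
    "judge_user_prompt_template_sha256",
)

def same_repro_signature(run_dir_a, run_dir_b) -> bool:
    # True if both runs used the same effective prompts and the same
    # instruction subset, according to their saved metadata.
    a = json.loads((Path(run_dir_a) / "run-metadata.v1.json").read_text())
    b = json.loads((Path(run_dir_b) / "run-metadata.v1.json").read_text())
    return all(a.get(k) == b.get(k) for k in REPRO_KEYS)
```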

@kargibora kargibora self-assigned this Mar 12, 2026
@kargibora kargibora added the enhancement New feature or request label Mar 12, 2026
Collaborator

@geoalgo geoalgo left a comment
LGTM, just have small comments.

eval_completions_A = completions_A.head(n_instructions).tolist()
eval_completions_B = completions_B.head(n_instructions).tolist()

write_run_metadata(
Collaborator

can we test this function instead of the whole entrypoint?
Ideally with possible edge cases.

Collaborator Author

Fixed it with 35f9d4f. Now there is a test_repro function that checks write_run_metadata only.

write_run_metadata(
output_dir=res_folder,
entrypoint="openjury.generate_and_evaluate.main",
run={
Collaborator

can we rather just pass args?

Collaborator Author

Fixed it with 35f9d4f. Now I pass args instead.

Comment on lines +94 to +123
def test_generate_and_evaluate_writes_run_metadata(tmp_path):
_ = main_generate_and_eval(
CliArgs(
dataset="alpaca-eval",
model_A="Dummy/no answer",
model_B="Dummy/open is better than close isnt'it",
judge_model="Dummy/score A: 10 score B: 0",
n_instructions=3,
result_folder=str(tmp_path),
)
)

metadata_files = list(tmp_path.rglob("run-metadata.v1.json"))
assert len(metadata_files) == 1

metadata = json.loads(metadata_files[0].read_text())
assert metadata["schema_version"] == "openjury-run-metadata/v1"
assert metadata["entrypoint"] == "openjury.generate_and_evaluate.main"
assert metadata["run"]["dataset"] == "alpaca-eval"
assert metadata["dataset_statistics"]["instruction_index_count"] == 3
assert metadata["results"]["num_battles"] == 3
assert metadata["results"]["preferences_count"] == 3
assert "instruction_indices_sha256" in metadata
assert "judge_system_prompt_sha256" in metadata
assert "judge_user_prompt_template_sha256" in metadata
artifact_paths = {artifact["path"] for artifact in metadata["artifacts"]}
assert (
metadata["extras"]["files"]["results"]["relative_path"] in artifact_paths
)
assert metadata["extras"]["files"]["annotations"]["relative_path"] in artifact_paths
Collaborator

can we rather test the function that generates the metadata, ideally with possible edge cases?

Collaborator Author

Fixed it with 35f9d4f. Now there is a test_repro function that checks write_run_metadata only.

Comment on lines +536 to +537
except Exception as e:
print(f"Warning: failed to write run metadata: {e}")
Collaborator

We should not catch all exceptions here, as that is an anti-pattern that can silently swallow unexpected errors. Better to restrict the exception class and let anything else propagate.

Collaborator Author

Fixed it with 35f9d4f. Now it only catches OSError.
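The resulting pattern, sketched below, tolerates filesystem errors only, so programming errors still surface. This is an illustration of the reviewed change, not the exact call site; the wrapper name is made up.

```python
def try_write_run_metadata(write_fn, *args, **kwargs):
    # Writing metadata is best-effort: an unwritable output directory or
    # a full disk (OSError) should not kill an otherwise successful run,
    # but bugs in the writer itself must propagate.
    try:
        write_fn(*args, **kwargs)
    except OSError as e:
        print(f"Warning: failed to write run metadata: {e}")
```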

@geoalgo geoalgo merged commit 6cc9640 into main Mar 17, 2026
1 check passed
