Conversation
**geoalgo** left a comment:
LGTM, just a few small comments.
```python
eval_completions_A = completions_A.head(n_instructions).tolist()
eval_completions_B = completions_B.head(n_instructions).tolist()

write_run_metadata(
```
can we test this function instead of the whole entrypoint?
Ideally with possible edge cases.
Fixed it with 35f9d4f. Now I have a `test_repro` function that checks `write_run_metadata` only.
openjury/generate_and_evaluate.py
Outdated
```python
write_run_metadata(
    output_dir=res_folder,
    entrypoint="openjury.generate_and_evaluate.main",
    run={
```
can we rather just pass args?
Fixed it with 35f9d4f. Now I pass `args` instead.
tests/test_generate_and_evaluate.py
Outdated
```python
def test_generate_and_evaluate_writes_run_metadata(tmp_path):
    _ = main_generate_and_eval(
        CliArgs(
            dataset="alpaca-eval",
            model_A="Dummy/no answer",
            model_B="Dummy/open is better than close isnt'it",
            judge_model="Dummy/score A: 10 score B: 0",
            n_instructions=3,
            result_folder=str(tmp_path),
        )
    )

    metadata_files = list(tmp_path.rglob("run-metadata.v1.json"))
    assert len(metadata_files) == 1

    metadata = json.loads(metadata_files[0].read_text())
    assert metadata["schema_version"] == "openjury-run-metadata/v1"
    assert metadata["entrypoint"] == "openjury.generate_and_evaluate.main"
    assert metadata["run"]["dataset"] == "alpaca-eval"
    assert metadata["dataset_statistics"]["instruction_index_count"] == 3
    assert metadata["results"]["num_battles"] == 3
    assert metadata["results"]["preferences_count"] == 3
    assert "instruction_indices_sha256" in metadata
    assert "judge_system_prompt_sha256" in metadata
    assert "judge_user_prompt_template_sha256" in metadata
    artifact_paths = {artifact["path"] for artifact in metadata["artifacts"]}
    assert metadata["extras"]["files"]["results"]["relative_path"] in artifact_paths
    assert metadata["extras"]["files"]["annotations"]["relative_path"] in artifact_paths
```
can we rather test the function that generates the metadata, ideally with possible edge cases?
Fixed it with 35f9d4f. Now I have a `test_repro` function that checks `write_run_metadata` only.
openjury/generate_and_evaluate.py
Outdated
```python
except Exception as e:
    print(f"Warning: failed to write run metadata: {e}")
```
We should not catch all exceptions here; a bare `except Exception` is an anti-pattern that can silently swallow unexpected errors. Better to restrict the exception class and let anything else propagate (or re-raise it).
Fixed it with 35f9d4f. Now it only catches `OSError`.
## Summary
This PR introduces a versioned run-metadata artifact for OpenJury to make evaluation runs easier to reproduce, inspect, and compare.

The new metadata is intentionally lightweight and follows the same general idea as lm-evaluation-harness: save the effective run configuration, compact reproducibility signals, and the produced artifacts, without logging unnecessary local machine context.

## What this PR introduces
Each evaluation run now writes a `run-metadata.v1.json` file alongside the existing output artifacts. The metadata includes:

- a `run` configuration block
- `dataset_statistics`
- `git_hash`

## How it is implemented
The main changes are:

- `openjury/repro.py`

An important implementation detail is that the metadata hashes are derived from the effective prompts used by the pipeline, but they do not change the evaluation logic itself.
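A prompt hash along these lines is enough for that purpose (a sketch of the general technique; the actual helper in `openjury/repro.py` may be named and shaped differently):

```python
import hashlib


def sha256_of_text(text: str) -> str:
    """Hex SHA-256 digest of an effective prompt string.

    Storing only the digest lets two runs be compared for prompt
    drift without embedding the full prompt text in the metadata.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two runs that used identical judge prompts then share identical `judge_system_prompt_sha256` values, while any edit to the prompt changes the digest.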
## Resulting metadata shape

At a high level, the saved metadata now looks like:

- `schema_version`
- `timestamps`
- `entrypoint`
- `run`
- `results`
- `dataset_statistics`
- `environment`
- `dependencies`
- `git_hash` (optional)
- `instruction_indices_sha256` (optional)
- `judge_system_prompt_sha256` (optional)
- `judge_user_prompt_template_sha256` (optional)
- `extras` (optional)
- `artifacts`
## Example

An example `run-metadata.v1.json`:

```json
{
  "schema_version": "openjury-run-metadata/v1",
  "timestamps": {
    "started_at_utc": "2026-03-12T11:32:06.178466+00:00",
    "finished_at_utc": "2026-03-12T11:36:03.592860+00:00",
    "duration_sec": 237.414394
  },
  "entrypoint": "openjury.generate_and_evaluate.main",
  "run": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "provide_explanation": false,
    "swap_mode": "fixed",
    "n_instructions": 5,
    "ignore_cache": false,
    "use_tqdm": false,
    "truncate_all_input_chars": 8192,
    "max_out_tokens_models": 1024,
    "max_out_tokens_judge": 1024,
    "max_model_len": 8192,
    "chat_template": null
  },
  "results": {
    "dataset": "alpaca-eval",
    "model_A": "VLLM/Qwen/Qwen2.5-0.5B-Instruct",
    "model_B": "VLLM/Qwen/Qwen2.5-1.5B-Instruct",
    "judge_model": "VLLM/Qwen/Qwen3-8B",
    "num_battles": 5,
    "winrate": 0.0,
    "num_wins": 0,
    "num_losses": 4,
    "num_ties": 0,
    "num_missing": 1,
    "preferences_count": 5
  },
  "dataset_statistics": {
    "instruction_index_count": 5,
    "instructions_count": 5,
    "completions_A_count": 5,
    "completions_B_count": 5
  },
  "environment": {
    "python_version": "3.12.12",
    "platform": "Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28"
  },
  "dependencies": {
    "datasets": "4.5.0",
    "fast-langdetect": "1.0.0",
    "huggingface-hub": "0.36.2",
    "ipython": "9.10.0",
    "joblib": "1.5.3",
    "jupyter": "1.1.1",
    "langchain": "0.3.27",
    "langchain-community": "0.3.31",
    "langchain-openai": "0.3.35",
    "langchain-together": "0.3.1",
    "llama-cpp-python": null,
    "matplotlib": "3.10.8",
    "pandas": "2.3.3",
    "pyyaml": "6.0.3",
    "seaborn": "0.13.2",
    "tqdm": "4.67.3",
    "transformers": "4.57.6",
    "vllm": "0.10.2"
  },
  "git_hash": "a8ee895491613371583aea4b26407314aed9502b",
  "instruction_indices_sha256": "cf2445f7783b200d0290dee72cedcb2fb75774f80277f2b9458690922031d610",
  "judge_system_prompt_sha256": "0912ced093a83ed7f01e169c96824ac510da30b29c112ac998a31854e5c2239c",
  "judge_user_prompt_template_sha256": "ca30b32626266397b0361f911871d76dbfbf36e4dd790bc0ffe5a893a540ffa1",
  "extras": {
    "files": {
      "annotations": {
        "relative_path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv"
      },
      "results": {
        "relative_path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json"
      }
    }
  },
  "artifacts": [
    {
      "path": "alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed-annotations.csv",
      "size_bytes": 26599
    },
    {
      "path": "args-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 565
    },
    {
      "path": "results-alpaca-eval-VLLM_Qwen_Qwen2.5-0.5B-Instruct-VLLM_Qwen_Qwen2.5-1.5B-Instruct-VLLM_Qwen_Qwen3-8B-fixed.json",
      "size_bytes": 405
    }
  ]
}
```