Restore results aggregation tests for job-based architecture #110
Draft
dedeswim wants to merge 2 commits into facebookresearch:main from
Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k, and GroupBy that were inadvertently deleted in PR facebookresearch#43 when the CLI was redesigned to a job-centric approach.

The tests have been adapted to work with the new job-based API:
- Creates job directories with config.yaml and index.jsonl instead of a single index.jsonl file
- Uses _read_all_jobs() instead of the removed _read_index()
- Tests JobConfig and RunIndexEntry model creation

Test coverage includes:
- Basic aggregation and metric averaging
- Multiple timestamps for same task
- Empty and nonexistent directories
- Grouping by all, dataset, agent, agent_name, attack, and dataset_suite
- pass@k metrics (k=1 averaging, k>1 estimator)
- Edge cases: all successes, no successes, insufficient samples
- Multiple k values
- Metadata columns (n_tasks, avg_n_samples)
- Multiple configurations (datasets, agents, attacks)

https://claude.ai/code/session_01QS62MtFCpHvVmc8xdxXX7L
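For readers unfamiliar with the pass@k cases listed above, the following is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples for a task and c the number of successes. The function name `pass_at_k_sketch` is hypothetical, and raising an error when n < k is an assumption made for illustration; whether estimate_pass_at_k in src/prompt_siren/results.py uses this formula or handles insufficient samples this way is not confirmed by this PR.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples succeeds.

    Hypothetical helper, not the project's estimate_pass_at_k.
    n: total samples drawn for a task, c: successful samples, k: budget.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if n < k:
        # Assumption: treat insufficient samples as an error.
        raise ValueError("need at least k samples to estimate pass@k")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a success.
        return 1.0
    # Probability that a random size-k subset contains only failures,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k=1 the expression reduces to c / n, which is consistent with the "k=1 averaging" case named in the coverage list. The "all successes" and "no successes" edge cases correspond to c = n and c = 0.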
Summary
- Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k, and GroupBy that were inadvertently deleted in PR #43 (Redesign CLI to jobs-centric approach with resume/retry support) when the CLI was redesigned to a job-centric approach
- Adapt the tests to the new job-based API (job directories with config.yaml + index.jsonl instead of a single index.jsonl)

Context
PR #43 deleted tests/results/test_aggregator.py (~1,145 lines) without moving or replacing these tests. A reviewer (@evtimovi) questioned this deletion post-merge. The core functionality (aggregate_results, _group_by_task, estimate_pass_at_k, GroupBy) still exists in src/prompt_siren/results.py but was left untested.

Test coverage restored (29 tests)
- Multiple k values: k=[1, 3, 5]
- Metadata columns: n_tasks, avg_n_samples

Test plan
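As an illustration of the job-directory layout the adapted tests rely on (one directory per job holding config.yaml and index.jsonl), here is a minimal standard-library sketch of a test helper. The helper names `make_job_dir` and `read_index`, and the field names `dataset`, `task_id`, and `score`, are hypothetical; the real schemas of JobConfig and RunIndexEntry and the signature of _read_all_jobs() are not shown in this PR, so none of them are used here.

```python
import json
import tempfile
from pathlib import Path


def make_job_dir(
    root: Path, job_name: str, config: dict[str, str], entries: list[dict]
) -> Path:
    """Create root/job_name containing config.yaml and index.jsonl."""
    job_dir = root / job_name
    job_dir.mkdir(parents=True)
    # Flat "key: value" lines are valid YAML for simple string values.
    config_text = "".join(f"{key}: {value}\n" for key, value in config.items())
    (job_dir / "config.yaml").write_text(config_text)
    # One JSON object per line, as the .jsonl extension implies.
    with (job_dir / "index.jsonl").open("w") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
    return job_dir


def read_index(job_dir: Path) -> list[dict]:
    """Read back the entries of a job's index.jsonl (hypothetical reader)."""
    lines = (job_dir / "index.jsonl").read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        job = make_job_dir(
            Path(tmp),
            "job-a",
            {"dataset": "demo"},  # hypothetical config field
            [{"task_id": "t1", "score": 1.0}],  # hypothetical entry fields
        )
        print(read_index(job))
```

A helper of this shape keeps each test's setup to one call, and creating several job directories under the same root is one way to cover the multiple-configuration cases.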