Restore results aggregation tests for job-based architecture by dedeswim · Pull Request #110 · facebookresearch/prompt-siren

dedeswim · 2026-02-02T15:00:56Z

Summary

Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k, and GroupBy that were inadvertently deleted in PR Redesign CLI to jobs-centric approach with resume/retry support #43 when the CLI was redesigned to a job-centric approach
Adapt tests to work with the new job-based API (creating job directories with config.yaml + index.jsonl instead of a single index.jsonl)

Context

PR #43 deleted tests/results/test_aggregator.py (~1,145 lines) without moving or replacing these tests. A reviewer (@evtimovi) questioned this deletion post-merge. The core functionality (aggregate_results, _group_by_task, estimate_pass_at_k, GroupBy) still exists in src/prompt_siren/results.py but was left untested.

Test coverage restored (29 tests)

Basic aggregation - reading jobs, averaging metrics
Multiple timestamps - averaging across runs for same task
Empty/nonexistent directories - graceful handling
Grouping modes - ALL, DATASET, AGENT, AGENT_NAME, ATTACK, DATASET_SUITE
pass@k metrics - k=1 averaging, k>1 estimator, exact k samples
Edge cases - all successes, no successes, insufficient samples error
Multiple k values - testing with k=[1, 3, 5]
Metadata columns - n_tasks, avg_n_samples
Multiple configurations - 2×2×2 (datasets × agents × attacks)

Test plan

All 29 new tests pass
Full test suite (730 tests) passes
Linting and formatting pass

Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k, and GroupBy that were inadvertently deleted in PR facebookresearch#43 when the CLI was redesigned to a job-centric approach. The tests have been adapted to work with the new job-based API: - Creates job directories with config.yaml and index.jsonl instead of a single index.jsonl file - Uses _read_all_jobs() instead of the removed _read_index() - Tests JobConfig and RunIndexEntry model creation Test coverage includes: - Basic aggregation and metric averaging - Multiple timestamps for same task - Empty and nonexistent directories - Grouping by all, dataset, agent, agent_name, attack, and dataset_suite - pass@k metrics (k=1 averaging, k>1 estimator) - Edge cases: all successes, no successes, insufficient samples - Multiple k values - Metadata columns (n_tasks, avg_n_samples) - Multiple configurations (datasets, agents, attacks) https://claude.ai/code/session_01QS62MtFCpHvVmc8xdxXX7L

https://claude.ai/code/session_01QS62MtFCpHvVmc8xdxXX7L

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 2, 2026

dedeswim requested a review from evtimovi February 2, 2026 15:01

Format test_aggregator.py with ufmt

63e82b8

https://claude.ai/code/session_01QS62MtFCpHvVmc8xdxXX7L

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Restore results aggregation tests for job-based architecture#110

Restore results aggregation tests for job-based architecture#110
dedeswim wants to merge 2 commits intofacebookresearch:mainfrom
dedeswim:claude/review-test-deletion-NkwnV

dedeswim commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

dedeswim commented Feb 2, 2026

Summary

Context

Test coverage restored (29 tests)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants