Skip to content

Comments

Restore results aggregation tests for job-based architecture#110

Draft
dedeswim wants to merge 2 commits intofacebookresearch:mainfrom
dedeswim:claude/review-test-deletion-NkwnV
Draft

Restore results aggregation tests for job-based architecture#110
dedeswim wants to merge 2 commits intofacebookresearch:mainfrom
dedeswim:claude/review-test-deletion-NkwnV

Conversation

@dedeswim
Copy link
Collaborator

@dedeswim dedeswim commented Feb 2, 2026

Summary

  • Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k, and GroupBy that were inadvertently deleted in PR Redesign CLI to jobs-centric approach with resume/retry support #43 when the CLI was redesigned to a job-centric approach
  • Adapt tests to work with the new job-based API (creating job directories with config.yaml + index.jsonl instead of a single index.jsonl)

Context

PR #43 deleted tests/results/test_aggregator.py (~1,145 lines) without moving or replacing these tests. A reviewer (@evtimovi) questioned this deletion post-merge. The core functionality (aggregate_results, _group_by_task, estimate_pass_at_k, GroupBy) still exists in src/prompt_siren/results.py but was left untested.

Test coverage restored (29 tests)

  • Basic aggregation - reading jobs, averaging metrics
  • Multiple timestamps - averaging across runs for same task
  • Empty/nonexistent directories - graceful handling
  • Grouping modes - ALL, DATASET, AGENT, AGENT_NAME, ATTACK, DATASET_SUITE
  • pass@k metrics - k=1 averaging, k>1 estimator, exact k samples
  • Edge cases - all successes, no successes, insufficient samples error
  • Multiple k values - testing with k=[1, 3, 5]
  • Metadata columns - n_tasks, avg_n_samples
  • Multiple configurations - 2×2×2 (datasets × agents × attacks)

Test plan

  • All 29 new tests pass
  • Full test suite (730 tests) passes
  • Linting and formatting pass

Re-add tests for aggregate_results, _group_by_task, estimate_pass_at_k,
and GroupBy that were inadvertently deleted in PR facebookresearch#43 when the CLI was
redesigned to a job-centric approach.

The tests have been adapted to work with the new job-based API:
- Creates job directories with config.yaml and index.jsonl instead of
  a single index.jsonl file
- Uses _read_all_jobs() instead of the removed _read_index()
- Tests JobConfig and RunIndexEntry model creation

Test coverage includes:
- Basic aggregation and metric averaging
- Multiple timestamps for same task
- Empty and nonexistent directories
- Grouping by all, dataset, agent, agent_name, attack, and dataset_suite
- pass@k metrics (k=1 averaging, k>1 estimator)
- Edge cases: all successes, no successes, insufficient samples
- Multiple k values
- Metadata columns (n_tasks, avg_n_samples)
- Multiple configurations (datasets, agents, attacks)

https://claude.ai/code/session_01QS62MtFCpHvVmc8xdxXX7L
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 2, 2026
@dedeswim dedeswim requested a review from evtimovi February 2, 2026 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants