
Commit 65ec018

Move DISCO queue to core (#41)
Refactor MMLU benchmark (#41)

- Moved `AnchorPointsTaskQueue` to core as `DISCOQueue` with a new class hierarchy: `SequentialTaskQueue` → `InformativeSubsetQueue` → `DISCOQueue`; `DISCOQueue.load_anchor_points()` now handles `.json`/`.pkl` loading internally
- Added `ModelScorer` ABC (`maseval.core.scorer`) for log-likelihood scoring and `HuggingFaceModelScorer` (`maseval.interface.inference`) as its concrete implementation, extracting all scoring logic out of the benchmark
- Refactored `MMLUBenchmark` into an abstract base (`setup_agents` and `get_model_adapter` left to subclasses) and `DefaultMMLUBenchmark` (renamed from `HuggingFaceMMLUBenchmark`), which delegates scoring to `HuggingFaceModelScorer` via `_ScorerBackedAdapter`
- Removed `MMLUModelAgent`, `MMLUAgentAdapter`, `_DummyCallable`, and `ModelAgentAdapter`; replaced by the scorer-backed adapter pattern
- Replaced all silent `.get()` fallbacks on required fields with direct dict access (`[]`) so missing data raises `KeyError` immediately; added `get_with_assert()` utility in `maseval.core.exceptions`
- Fixed `get_git_info()` crashing on detached HEAD in CI
- Renamed `HuggingFaceModelAdapter` → `HuggingFacePipelineModelAdapter` (backwards-compatible alias retained)
- Added MMLU documentation (`docs/benchmark/mmlu.md`), usage tracking guide (`docs/guides/usage-tracking.md`), scorer and usage API reference pages, `CITATION.cff`, and a BENCHMARKS.md "Implemented" callout
- Added `mmlu`, `lm-eval`, and `disco` optional extras to `pyproject.toml` with a `transformers<5.0.0` pin for `lm-eval` and `disco`
- Added comprehensive test suites for `DISCOQueue`, `ModelScorer`, `Task.freeze`, usage tracking, MMLU data integrity, registry, and the HuggingFace scorer; updated existing integration tests for smolagents, LangGraph, LlamaIndex, CAMEL, and model adapters
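The switch from silent `.get()` fallbacks to direct indexing is about failing fast: a missing required field should raise immediately rather than silently propagate a default. The diff does not show `get_with_assert()` itself, so the helper below is only a hypothetical sketch of what such a utility in `maseval.core.exceptions` might look like:

```python
def get_with_assert(data, key, message=None):
    # Hypothetical sketch: fetch a required field, failing loudly if absent.
    if key not in data:
        raise KeyError(message or f"required field {key!r} is missing")
    return data[key]

task = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"]}

# Direct indexing raises KeyError immediately on missing data:
question = task["question"]

# whereas a silent .get() fallback would hide the bug:
subject = task.get("subject", "unknown")  # quietly wrong if "subject" is required
```

The point of the pattern is that the error surfaces at the access site, not later when a bogus default corrupts results.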
1 parent db68df9 commit 65ec018

39 files changed

Lines changed: 3093 additions & 868 deletions

BENCHMARKS.md

Lines changed: 17 additions & 1 deletion
```diff
@@ -79,7 +79,23 @@ CONVERSE evaluates contextual safety in agent-to-agent conversations. It focuses
 
 ---
 
-## 6. [Name of Next Benchmark]
+## 6. MMLU (Massive Multitask Language Understanding) (Beta)
+
+MMLU evaluates language models on multiple-choice questions spanning 57 academic subjects. The MASEval integration includes anchor-point-based evaluation for DISCO prediction, allowing efficient estimation of full benchmark performance from a subset of tasks.
+
+> **Beta:** This benchmark has been implemented carefully, but we have not yet validated the results against the original implementation. Use with caution when comparing with existing results or the original paper's numbers. Contributions and compute donations welcome!
+
+> **Implemented:** A ready-to-use implementation is available via `DefaultMMLUBenchmark` with HuggingFace model support. Install with `pip install maseval[mmlu]`. See the [MMLU documentation](https://maseval.readthedocs.io/en/stable/benchmark/mmlu/) for usage details.
+
+### Source and License
+
+- **Original Paper:** [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., 2021)
+- **DISCO Paper:** [DISCO: Diversifying Sample Condensation for Efficient Model Evaluation](https://arxiv.org/abs/2510.07959) (Rubinstein et al., ICLR 2026)
+- **Dataset:** [arubique/flattened-MMLU](https://huggingface.co/datasets/arubique/flattened-MMLU)
+
+---
+
+## 7. [Name of Next Benchmark]
 
 (Description for the next benchmark...)
```
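Anchor-point evaluation as described in the section above scores a model only on a small, informative subset of tasks and extrapolates to the full benchmark. DISCO's actual predictor is learned (see the paper); the snippet below is only a toy illustration of the extrapolation idea, with made-up anchor weights:

```python
def estimate_full_accuracy(anchor_scores, anchor_weights):
    """Toy stand-in for subset-based estimation: a weighted average of
    per-anchor scores, where each weight reflects how many full-benchmark
    tasks that anchor point represents. Not DISCO's actual predictor."""
    total_weight = sum(anchor_weights)
    weighted = sum(s * w for s, w in zip(anchor_scores, anchor_weights))
    return weighted / total_weight

# Three anchors standing in for clusters of 50, 30, and 20 MMLU tasks:
estimate = estimate_full_accuracy([1.0, 0.0, 1.0], [50, 30, 20])  # 0.7
```

The benefit is cost: the model is run on 3 tasks instead of 100, at the price of estimation error that the learned DISCO predictor is designed to minimize.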

CHANGELOG.md

Lines changed: 27 additions & 35 deletions
```diff
@@ -11,7 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Core**
 
-- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #PR_NUMBER_PLACEHOLDER)
+- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, causing simulator logs to contain future conversation messages that hadn't occurred at the time of logging. (PR: #48)
+- Fixed `get_git_info()` crashing on detached HEAD (e.g. in CI checkout); now returns `detached@<short-hash>` as the branch name. (PR: #41)
 
 **Interface**
 
```
```diff
@@ -24,16 +25,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
 - Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
 - `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
-
 - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
 - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
+- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34 and #41)
+- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34 and #41)
+- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
+- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
+- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
+- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
+- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
+- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
 
-**Benchmarks**
+**Interface**
+
+- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFacePipelineModelAdapter` pass seeds to underlying APIs (PR: #24)
+- Added `HuggingFaceModelScorer` in `maseval.interface.inference`: a log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34 and #41)
+- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
+- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
+- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
+- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
 
-- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. Includes `MMLUBenchmark`, `HuggingFaceMMLUBenchmark`, `MMLUEnvironment`, `MMLUEvaluator`, `MMLUModelAgent`, `MMLUAgentAdapter`, `AnchorPointsTaskQueue`, `load_tasks()`, and `compute_benchmark_metrics()`. Optional extras: `lm-eval` (for `HuggingFaceMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34)
+**Benchmarks**
 
+- MMLU Benchmark with DISCO support: Integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with DISCO anchor-point methodology. `MMLUBenchmark` is a framework-agnostic base class (`setup_agents()` and `get_model_adapter()` must be implemented by subclasses); `DefaultMMLUBenchmark` provides a ready-made HuggingFace implementation. Also includes `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`), `disco` (for DISCO prediction in the example). (PR: #34 and #41)
 - CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
-
 - GAIA2 Benchmark: Integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
 - `Gaia2Benchmark`, `Gaia2Environment`, `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
 - `DefaultAgentGaia2Benchmark` with ReAct-style agent for direct comparison with ARE reference implementation (PR: #26)
```
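The `SeedGenerator` entries above mention SHA-256-based seed derivation. The exact scheme is not shown in this diff, but hashing a root seed together with a component path is a common way to get stable, decorrelated per-component seeds; a self-contained sketch under that assumption:

```python
import hashlib

def derive_seed(root_seed, component):
    """Derive a stable 32-bit seed for a named component from a root seed.
    Illustrative only; the real DefaultSeedGenerator may use another scheme."""
    digest = hashlib.sha256(f"{root_seed}:{component}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

user_seed = derive_seed(42, "simulators/user")
agent_seed = derive_seed(42, "agents/default_agent")
```

Because the derivation is a pure function of `(root_seed, component)`, reruns with the same root seed reproduce every component's seed, while different components get independent-looking values.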
```diff
@@ -43,7 +58,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
 - Support for 5 capability dimensions: execution, search, adaptability, time, ambiguity (PR: #26, #30)
 - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
-
 - MultiAgentBench Benchmark: Integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
 - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
 - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
```
```diff
@@ -54,32 +68,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 **Examples**
 
 - Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
-
-- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
+- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34 and #41)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
 **Documentation**
 
 - Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
 
-**Core**
-
-- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
-- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
-- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
-- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
-- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if seed is provided) (PR: #24)
-- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, `HuggingFaceModelAdapter` pass seeds to underlying APIs (PR: #24)
-- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
-
-**Interface**
-
-- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
-- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
-- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
-- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
-
 **Testing**
 
 - Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
```
```diff
@@ -91,7 +87,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
 - CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
 - Added `respx` dev dependency for HTTP-level mocking (PR: #29)
-- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34)
+- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)
 
 ### Changed
 
```
```diff
@@ -108,28 +104,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `LlamaIndexAgentAdapter`: Added `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
 - `SmolAgentAdapter`: New `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing false "success" status on empty steps. (PR: #39)
 - `GoogleGenAIModelAdapter`: Consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
+- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
+  - `SmolAgentUser` → `SmolAgentLLMUser`
+  - `LangGraphUser` → `LangGraphLLMUser`
+  - `LlamaIndexUser` → `LlamaIndexLLMUser`
 
 **Benchmarks**
 
 - `MACSBenchmark` and `Tau2Benchmark` benchmarks now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
 - `Gaia2Benchmark`: Seeds `agents/gaia2_agent`, `evaluators/judge`
 - `MACSBenchmark`: Seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, `evaluators/system_gsr`
 - `Tau2Benchmark`: Seeds `simulators/user`, `agents/default_agent`
+- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
 
 **User**
 
 - Refactored `User` class into abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`) with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
 - Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
 
-**Interface**
-
-- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
-  - `SmolAgentUser` → `SmolAgentLLMUser`
-  - `LangGraphUser` → `LangGraphLLMUser`
-  - `LlamaIndexUser` → `LlamaIndexLLMUser`
-
-- All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
-
 **Testing**
 
 - Coverage script (`scripts/coverage_by_feature.py`) now accepts `--exclude` flag to skip additional markers; always excludes `credentialed` and `smoke` by default (PR: #29)
```
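`Task.freeze()` (changelog entry above) makes nested task data read-only. One standard way to get that behaviour in Python is to recursively wrap dicts in `types.MappingProxyType`; the sketch below illustrates the idea only, not the actual implementation, which raises its own `TaskFrozenError`:

```python
from types import MappingProxyType

def deep_freeze(value):
    """Recursively wrap dicts in read-only views (illustrative sketch)."""
    if isinstance(value, dict):
        return MappingProxyType({k: deep_freeze(v) for k, v in value.items()})
    return value

frozen = deep_freeze({"evaluation_data": {"answer": "B"}})
answer = frozen["evaluation_data"]["answer"]  # reading still works
# frozen["evaluation_data"]["answer"] = "C"   # would raise TypeError
```

Note that freezing nested dicts too is what prevents the subtle case where the top-level container is protected but `task.evaluation_data["answer"] = ...` still mutates shared state.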

README.md

Lines changed: 7 additions & 0 deletions
````diff
@@ -102,6 +102,13 @@ pip install "maseval[langgraph]"
 pip install "maseval[llamaindex]"
 ```
 
+Or install benchmark-specific dependencies:
+
+```bash
+# MMLU (HuggingFace models)
+pip install "maseval[mmlu]"
+```
+
 ## Example
 
 Examples are available in the [Documentation](https://maseval.readthedocs.io/en/stable/).
````
