---
title: "Benchmarks"
description: "Learn about CAMEL's Benchmark module."
---

## Overview

The **Benchmark** module in CAMEL provides a framework for evaluating AI agents and language models across various tasks and domains. It includes implementations of multiple benchmarks and a common interface for running evaluations, measuring performance, and generating detailed reports.

The module supports benchmarks for:

- **API calling and tool use** (APIBank, APIBench, Nexus)
- **General AI assistance** (GAIA)
- **Browser-based comprehension** (BrowseComp)
- **Retrieval-Augmented Generation** (RAGBench)

## Architecture
### Base Class: `BaseBenchmark`

All benchmarks inherit from the `BaseBenchmark` abstract class, which provides a common interface for downloading data, loading datasets, running evaluations, and accessing results.

#### BaseBenchmark Methods

| Method / Property | Description | Parameters |
| ----------------- | ---------------------------- | ---------- |
| `__init__()` | Initialize the benchmark | `name`: Benchmark name<br>`data_dir`: Data directory path<br>`save_to`: Results save path<br>`processes`: Number of parallel processes |
| `download()` | Download benchmark data | None |
| `load()` | Load benchmark data | `force_download`: Force re-download |
| `run()` | Run the benchmark evaluation | `agent`: ChatAgent to evaluate<br>`on`: Data split (`"train"`, `"valid"`, `"test"`)<br>`randomize`: Shuffle data<br>`subset`: Limit number of examples |
| `train` (property) | Get training data | None |
| `valid` (property) | Get validation data | None |
| `test` (property) | Get test data | None |
| `results` (property) | Get evaluation results | None |
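
As a rough illustration of this interface, the abstract class might be sketched as follows. This is a simplified stand-in, not CAMEL's actual source; the signatures are assumptions based on the table above:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseBenchmark(ABC):
    """Simplified sketch of the benchmark interface (illustrative only)."""

    def __init__(self, name: str, data_dir: str, save_to: str,
                 processes: int = 1) -> None:
        self.name = name
        self.data_dir = data_dir
        self.save_to = save_to
        self.processes = processes
        self._data: Dict[str, List[Any]] = {}   # populated by load()
        self._results: List[Dict[str, Any]] = []  # populated by run()

    @abstractmethod
    def download(self) -> "BaseBenchmark":
        """Download the benchmark data."""

    @abstractmethod
    def load(self, force_download: bool = False) -> "BaseBenchmark":
        """Load the dataset into self._data."""

    @abstractmethod
    def run(self, agent: Any, on: str = "test", randomize: bool = False,
            subset: Optional[int] = None) -> Dict[str, Any]:
        """Evaluate the agent and populate self._results."""

    @property
    def train(self) -> List[Any]:
        return self._data.get("train", [])

    @property
    def valid(self) -> List[Any]:
        return self._data.get("valid", [])

    @property
    def test(self) -> List[Any]:
        return self._data.get("test", [])

    @property
    def results(self) -> List[Dict[str, Any]]:
        return self._results
```

Concrete benchmarks override the three abstract methods and reuse the shared splits and results accessors.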

## Available Benchmarks

### 1. GAIA Benchmark

**GAIA (General AI Assistants)** is a benchmark for evaluating general-purpose AI assistants on real-world tasks requiring multiple steps, tool use, and reasoning.

### 2. APIBank Benchmark

**APIBank** evaluates the ability of LLMs to make correct API calls and generate appropriate responses in multi-turn conversations.

### 3. APIBench Benchmark

**APIBench (Gorilla)** tests the ability to generate correct API calls for various machine learning frameworks (HuggingFace, TensorFlow Hub, Torch Hub).

### 4. Nexus Benchmark

**Nexus** evaluates function-calling capabilities across multiple domains, including security APIs, location services, and climate data.

#### Available Tasks

| Task | Description |
| ---------------------------- | ----------------------------- |
| `"NVDLibrary"` | CVE and CPE API calls |
| `"VirusTotal"` | Malware and security analysis |
| `"OTX"` | Open Threat Exchange API |
| `"PlacesAPI"` | Location and mapping services |
| `"ClimateAPI"` | Weather and climate data |
| `"VirusTotal-ParallelCalls"` | Multiple parallel API calls |
| `"VirusTotal-NestedCalls"` | Nested API calls |
| `"NVDLibrary-NestedCalls"` | Nested CVE/CPE calls |

### 5. BrowseComp Benchmark

**BrowseComp** evaluates browser-based comprehension by testing agents on questions that require understanding web content.

### 6. RAGBench Benchmark

**RAGBench** evaluates Retrieval-Augmented Generation (RAG) systems using context relevancy and faithfulness metrics.

#### Available Subsets

| Subset | Description |
| ------------ | -------------------------------------------------- |
| `"hotpotqa"` | Multi-hop question answering |
| `"covidqa"` | COVID-19 related questions |
| `"finqa"` | Financial question answering |
| `"cuad"` | Contract understanding |
| `"msmarco"` | Microsoft Machine Reading Comprehension |
| `"pubmedqa"` | Biomedical questions |
| `"expertqa"` | Expert-level questions |
| `"techqa"` | Technical questions |
| Others | `"emanual"`, `"delucionqa"`, `"hagrid"`, `"tatqa"` |
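
To build intuition for the two metric families, here is a deliberately simplified token-overlap sketch. RAGBench's actual metrics are more sophisticated than this, and the helper names below are hypothetical:

```python
def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a crude stand-in for real tokenization."""
    return set(text.lower().split())


def context_relevancy(question: str, context: str) -> float:
    """Fraction of question tokens that appear in the retrieved context.

    Intuition: did retrieval fetch material related to the question?
    """
    q = _tokens(question)
    return len(q & _tokens(context)) / len(q) if q else 0.0


def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that are present in the context.

    Intuition: is the generated answer grounded in the retrieved text,
    or does it introduce unsupported content?
    """
    a = _tokens(answer)
    return len(a & _tokens(context)) / len(a) if a else 0.0
```

A faithful answer scores near 1.0; an answer full of tokens absent from the context scores near 0.0.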

## Common Usage Pattern

All benchmarks follow a similar pattern:

```python
from camel.benchmarks import <BenchmarkName>
from camel.agents import ChatAgent

# 1. Initialize
benchmark = <BenchmarkName>(
    data_dir="./data",
    save_to="./results.json",
    processes=4,
)

# 2. Load data
benchmark.load(force_download=False)

# 3. Create agent
agent = ChatAgent(...)

# 4. Run evaluation
results = benchmark.run(
    agent=agent,
    # benchmark-specific parameters
    randomize=False,
    subset=None,  # or a number of examples
)

# 5. Access results
print(results)            # Summary metrics
print(benchmark.results)  # Detailed per-example results
```

## Implementing Custom Benchmarks

To create a custom benchmark, inherit from `BaseBenchmark` and implement:

1. `download()`: Download the benchmark data
2. `load()`: Load the data into the `self._data` dictionary
3. `run()`: Execute the benchmark and populate `self._results`
4. Optional: Override the `train`, `valid`, and `test` properties
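
The steps above can be sketched as follows. The `BaseBenchmark` stub here stands in for `camel.benchmarks.BaseBenchmark` so the example is self-contained; all method signatures are assumptions based on this page, and the toy dataset exists only for illustration:

```python
import random
from typing import Any, Dict, List, Optional


class BaseBenchmark:  # stand-in for camel.benchmarks.BaseBenchmark
    def __init__(self, name: str, data_dir: str, save_to: str,
                 processes: int = 1) -> None:
        self.name, self.data_dir, self.save_to = name, data_dir, save_to
        self.processes = processes
        self._data: Dict[str, List[Any]] = {}
        self._results: List[Dict[str, Any]] = []

    @property
    def results(self) -> List[Dict[str, Any]]:
        return self._results


class MyBenchmark(BaseBenchmark):
    def download(self) -> "MyBenchmark":
        # 1. Fetch the raw dataset; inlined here for illustration.
        self._raw = [{"question": "2 + 2?", "answer": "4"}]
        return self

    def load(self, force_download: bool = False) -> "MyBenchmark":
        # 2. Populate self._data with train/valid/test splits.
        if force_download or not getattr(self, "_raw", None):
            self.download()
        self._data = {"train": [], "valid": [], "test": self._raw}
        return self

    def run(self, agent: Any, on: str = "test", randomize: bool = False,
            subset: Optional[int] = None) -> Dict[str, float]:
        # 3. Evaluate the agent and populate self._results.
        examples = list(self._data[on])
        if randomize:
            random.shuffle(examples)
        if subset is not None:
            examples = examples[:subset]
        for ex in examples:
            prediction = agent(ex["question"])  # agent is any callable here
            self._results.append({
                "question": ex["question"],
                "prediction": prediction,
                "correct": prediction == ex["answer"],
            })
        accuracy = (sum(r["correct"] for r in self._results)
                    / max(len(self._results), 1))
        return {"accuracy": accuracy}
```

In a real benchmark the agent would be a `ChatAgent` and `run()` would parse its responses; the structure, however, stays the same.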

## References

- **GAIA**: https://huggingface.co/datasets/gaia-benchmark/GAIA
- **APIBank**: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank
- **APIBench (Gorilla)**: https://huggingface.co/datasets/gorilla-llm/APIBench
- **Nexus**: https://huggingface.co/collections/Nexusflow/nexusraven-v2
- **BrowseComp**: https://openai.com/index/browsecomp/
- **RAGBench**: https://arxiv.org/abs/2407.11005

## Other Resources

- Explore the [Agents](./agents.md) module for creating custom agents
