---
title: "Benchmarks"
description: "Learn about CAMEL's Benchmark module."
---

## Overview

The **Benchmark** module in CAMEL provides a framework for evaluating AI agents and language models across various tasks and domains. It includes implementations of multiple benchmarks and provides a unified interface for running evaluations, measuring performance, and generating detailed reports.

The module supports benchmarks for:

- **API calling and tool use** (APIBank, APIBench, Nexus)
- **General AI assistance** (GAIA)
- **Browser-based comprehension** (BrowseComp)
- **Retrieval-Augmented Generation** (RAGBench)

## Architecture

### Base Class: `BaseBenchmark`

All benchmarks inherit from the `BaseBenchmark` abstract class, which provides a common interface for downloading data, loading datasets, running evaluations, and accessing results.

#### BaseBenchmark Methods and Properties

| Member | Description | Parameters |
| ------ | ----------- | ---------- |
| `__init__()` | Initialize the benchmark | `name`: benchmark name<br>`data_dir`: data directory path<br>`save_to`: results save path<br>`processes`: number of parallel processes |
| `download()` | Download benchmark data | None |
| `load()` | Load benchmark data | `force_download`: force re-download |
| `run()` | Run the benchmark evaluation | `agent`: `ChatAgent` to evaluate<br>`on`: data split (`"train"`, `"valid"`, `"test"`)<br>`randomize`: shuffle data<br>`subset`: limit the number of examples |
| `train` | Get training data (property) | None |
| `valid` | Get validation data (property) | None |
| `test` | Get test data (property) | None |
| `results` | Get evaluation results (property) | None |

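In code terms, the interface can be sketched roughly as follows. This is an illustrative outline reconstructed from the table above, not the exact source; check the `camel.benchmarks` package for the real signatures:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Literal, Optional

from camel.agents import ChatAgent


class BaseBenchmark(ABC):
    """Illustrative outline of the benchmark interface described above."""

    def __init__(
        self, name: str, data_dir: str, save_to: str, processes: int = 1
    ):
        self.name = name
        self.data_dir = data_dir
        self.save_to = save_to
        self.processes = processes
        self._data: Dict[str, List[Dict[str, Any]]] = {}  # filled by load()
        self._results: List[Dict[str, Any]] = []  # filled by run()

    @abstractmethod
    def download(self) -> "BaseBenchmark":
        """Fetch the benchmark data into `data_dir`."""

    @abstractmethod
    def load(self, force_download: bool = False) -> "BaseBenchmark":
        """Read the downloaded data into `self._data`."""

    @abstractmethod
    def run(
        self,
        agent: ChatAgent,
        on: Literal["train", "valid", "test"],
        randomize: bool = False,
        subset: Optional[int] = None,
        *args: Any,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Evaluate the agent, populate `self._results`, return metrics."""

    @property
    def train(self) -> List[Dict[str, Any]]:
        return self._data["train"]

    @property
    def valid(self) -> List[Dict[str, Any]]:
        return self._data["valid"]

    @property
    def test(self) -> List[Dict[str, Any]]:
        return self._data["test"]

    @property
    def results(self) -> List[Dict[str, Any]]:
        return self._results
```
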
## Available Benchmarks

### 1. GAIA Benchmark

**GAIA (General AI Assistants)** is a benchmark for evaluating general-purpose AI assistants on real-world tasks requiring multiple steps, tool use, and reasoning.
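
A minimal usage sketch, assuming the class is exported as `GAIABenchmark` and accepts a GAIA-specific `level` argument (GAIA groups tasks into difficulty levels 1-3) on top of the common `run()` parameters:

```python
from camel.agents import ChatAgent
from camel.benchmarks import GAIABenchmark

benchmark = GAIABenchmark(
    data_dir="./gaia_data",
    save_to="./gaia_results.json",
)
benchmark.load()

agent = ChatAgent("You are a general AI assistant.")

# `level` narrows the run to specific difficulty levels; `subset`
# caps the number of examples for a quick smoke test.
results = benchmark.run(agent, on="valid", level="all", subset=5)
print(results)  # summary metrics
print(benchmark.results)  # detailed per-example records
```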

### 2. APIBank Benchmark

**APIBank** evaluates the ability of LLMs to make correct API calls and generate appropriate responses in multi-turn conversations.
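
A sketch under the same pattern. The class name `APIBankBenchmark` and the `level` values (`"level-1"` for scoring individual calls, `"level-2"` for calls that first require retrieving the right API) are assumptions drawn from the upstream benchmark; verify them against the class docstring:

```python
from camel.agents import ChatAgent
from camel.benchmarks import APIBankBenchmark

benchmark = APIBankBenchmark(save_to="./apibank_results.json")
benchmark.load()

agent = ChatAgent("You call APIs on behalf of the user.")

# Hypothetical APIBank-specific selector for the evaluation level.
results = benchmark.run(agent, level="level-1", subset=10)
print(results)
```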

### 3. APIBench Benchmark

**APIBench (Gorilla)** tests the ability to generate correct API calls for various machine learning frameworks (HuggingFace, TensorFlow Hub, Torch Hub).
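
A similar sketch; the per-framework selector is the benchmark-specific part here. The parameter name `dataset_name` and its string values are assumptions mirroring the three frameworks listed above:

```python
from camel.agents import ChatAgent
from camel.benchmarks import APIBenchBenchmark

benchmark = APIBenchBenchmark(
    data_dir="./apibench_data",
    save_to="./apibench_results.json",
)
benchmark.load()

agent = ChatAgent("You write correct API calls for ML frameworks.")

# Hypothetical selector: "huggingface", "tensorflowhub", or "torchhub".
results = benchmark.run(agent, dataset_name="huggingface", subset=10)
print(results)
```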

### 4. Nexus Benchmark

**Nexus** evaluates function calling capabilities across multiple domains including security APIs, location services, and climate data.

#### Available Tasks

| Task | Description |
| ---- | ----------- |
| `"NVDLibrary"` | CVE and CPE API calls |
| `"VirusTotal"` | Malware and security analysis |
| `"OTX"` | Open Threat Exchange API |
| `"PlacesAPI"` | Location and mapping services |
| `"ClimateAPI"` | Weather and climate data |
| `"VirusTotal-ParallelCalls"` | Multiple parallel API calls |
| `"VirusTotal-NestedCalls"` | Nested API calls |
| `"NVDLibrary-NestedCalls"` | Nested CVE/CPE calls |
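
Runs are parameterized by one of the task names above. A hedged sketch, assuming the task is selected through a `task` argument:

```python
from camel.agents import ChatAgent
from camel.benchmarks import NexusBenchmark

benchmark = NexusBenchmark(
    data_dir="./nexus_data",
    save_to="./nexus_results.json",
)
benchmark.load()

agent = ChatAgent("You translate user requests into function calls.")

# Any task name from the table above works here, e.g. CVE/CPE lookups.
results = benchmark.run(agent, task="NVDLibrary", subset=10)
print(results)
```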

### 5. BrowseComp Benchmark

**BrowseComp** evaluates browser-based comprehension by testing agents on questions that require understanding web content.

### 6. RAGBench Benchmark

**RAGBench** evaluates Retrieval-Augmented Generation systems using context relevancy and faithfulness metrics.

#### Available Subsets

| Subset | Description |
| ------ | ----------- |
| `"hotpotqa"` | Multi-hop question answering |
| `"covidqa"` | COVID-19 related questions |
| `"finqa"` | Financial question answering |
| `"cuad"` | Contract understanding |
| `"msmarco"` | Microsoft Machine Reading Comprehension |
| `"pubmedqa"` | Biomedical questions |
| `"expertqa"` | Expert-level questions |
| `"techqa"` | Technical questions |
| `"emanual"`, `"delucionqa"`, `"hagrid"`, `"tatqa"` | Additional domain-specific subsets |
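
A hedged sketch of a RAGBench run. The constructor-level `subset` choice is an assumption, and a real run likely needs a retrieval component wired in alongside the agent, so check the class signature before relying on this:

```python
from camel.agents import ChatAgent
from camel.benchmarks import RAGBenchBenchmark

# Assumed: the dataset slice is chosen at construction time.
benchmark = RAGBenchBenchmark(subset="hotpotqa")
benchmark.load()

agent = ChatAgent("Answer strictly from the retrieved context.")

results = benchmark.run(agent)
print(results)  # e.g. context relevancy and faithfulness scores
```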

## Common Usage Pattern

All benchmarks follow a similar pattern:

```python
from camel.benchmarks import <BenchmarkName>  # placeholder: use a concrete class, e.g. GAIABenchmark
from camel.agents import ChatAgent

# 1. Initialize
benchmark = <BenchmarkName>(
    data_dir="./data",
    save_to="./results.json",
    processes=4,
)

# 2. Load data
benchmark.load(force_download=False)

# 3. Create agent
agent = ChatAgent(...)

# 4. Run evaluation
results = benchmark.run(
    agent=agent,
    # benchmark-specific parameters go here
    randomize=False,
    subset=None,  # or a number of examples
)

# 5. Access results
print(results)  # summary metrics
print(benchmark.results)  # detailed per-example results
```

## Implementing Custom Benchmarks

To create a custom benchmark, inherit from `BaseBenchmark` and implement the following (a minimal sketch follows the list):

1. `download()`: Download the benchmark data
2. `load()`: Load the data into the `self._data` dictionary
3. `run()`: Execute the benchmark and populate `self._results`
4. Optional: Override the `train`, `valid`, and `test` properties

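Everything below the class name in this sketch is illustrative: the JSON file layout, the exact-match scoring, and the assumption that `agent.step()` accepts a plain string and exposes `response.msgs` are stand-ins for your own data and metric:

```python
import json
import os
import random
from typing import Any, Dict, Optional

from camel.agents import ChatAgent
from camel.benchmarks import BaseBenchmark


class MyBenchmark(BaseBenchmark):
    """Toy benchmark scoring exact-match answers from a local JSON file."""

    def __init__(self, data_dir: str, save_to: str, processes: int = 1):
        super().__init__("my_benchmark", data_dir, save_to, processes)

    def download(self) -> "MyBenchmark":
        # Data is assumed to already live in data_dir; nothing to fetch.
        return self

    def load(self, force_download: bool = False) -> "MyBenchmark":
        # Expected layout: [{"question": ..., "answer": ...}, ...]
        path = os.path.join(self.data_dir, "test.json")
        with open(path) as f:
            self._data["test"] = json.load(f)
        return self

    def run(
        self,
        agent: ChatAgent,
        on: str = "test",
        randomize: bool = False,
        subset: Optional[int] = None,
        *args: Any,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        data = list(self._data[on])
        if randomize:
            random.shuffle(data)
        for example in data[:subset]:
            agent.reset()  # clear conversation history between examples
            response = agent.step(example["question"])
            answer = response.msgs[0].content
            self._results.append(
                {
                    "question": example["question"],
                    "expected": example["answer"],
                    "answer": answer,
                    "correct": answer.strip() == example["answer"].strip(),
                }
            )
        correct = sum(r["correct"] for r in self._results)
        return {
            "total": len(self._results),
            "accuracy": correct / len(self._results),
        }
```
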
## References

- **GAIA**: https://huggingface.co/datasets/gaia-benchmark/GAIA
- **APIBank**: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank
- **APIBench (Gorilla)**: https://huggingface.co/datasets/gorilla-llm/APIBench
- **Nexus**: https://huggingface.co/collections/Nexusflow/nexusraven-v2
- **BrowseComp**: https://openai.com/index/browsecomp/
- **RAGBench**: https://arxiv.org/abs/2407.11005

## Other Resources

- Explore the [Agents](./agents.md) module for creating custom agents