Commit 8bb2cac
updated readme
1 parent d9d7fb2

1 file changed: README.md (+86, -99 lines)

<p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p>

<h1 align="center">MCP-TE Benchmark</h1>

A standardized framework for evaluating the efficiency gains and trade-offs of AI agents using Model Context Protocol (MCP) compared to traditional methods.

## Abstract

MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains, resource utilization changes, and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., file search, terminal execution, web search). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols is crucial for optimizing developer productivity and evaluating overall cost-effectiveness.

## Leaderboard

### Overall Performance (Model: claude-3.7-sonnet)

*Environment: Twilio (MCP Server), Cline (MCP Client), Model: claude-3.7-sonnet*

| Metric               | Control  | MCP      | Change |
| :------------------- | :------- | :------- | :----- |
| Average Duration (s) | 62.5     | 49.7     | -20.5% |
| Average API Calls    | 10.3     | 8.3      | -19.3% |
| Average Interactions | 1.1      | 1.0      | -3.3%  |
| Average Tokens       | 2286.1   | 2141.4   | -6.3%  |
| Average Cache Reads  | 191539.5 | 246152.5 | +28.5% |
| Average Cache Writes | 11043.5  | 16973.9  | +53.7% |
| Average Cost ($)     | 0.1      | 0.2      | +27.5% |
| Success Rate         | 92.3%    | 100.0%   | +8.3%  |

*Note: Calculations based on data in `metrics/summary.json`.*
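
The Change column is a plain relative difference between the two modes' averages. A minimal sketch of that calculation, assuming `metrics/summary.json` exposes per-mode averages under hypothetical `control` and `mcp` keys (the real schema may differ):

```js
// compute-change.js: illustrative sketch only; the actual summary.json schema may differ.
const fs = require("fs");

// Assumed shape: { control: { avgDurationSecs: 62.5 }, mcp: { avgDurationSecs: 49.7 } }
const summary = JSON.parse(fs.readFileSync("metrics/summary.json", "utf8"));

// Relative change, e.g. (49.7 - 62.5) / 62.5 = -20.5%
function pctChange(control, mcp) {
  return (((mcp - control) / control) * 100).toFixed(1) + "%";
}

console.log("Duration:", pctChange(summary.control.avgDurationSecs, summary.mcp.avgDurationSecs));
```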

*Key Findings (claude-3.7-sonnet):*

* **Efficiency Gains:** MCP usage resulted in faster task completion (-20.5% duration), fewer API calls (-19.3%), and slightly fewer user interactions (-3.3%). Token usage also saw a modest decrease (-6.3%).
* **Increased Resource Utilization:** MCP significantly increased cache reads (+28.5%) and cache writes (+53.7%).
* **Cost Increase:** This heavier resource utilization, particularly the additional cache operations and potentially different API call patterns under MCP, drove a notable increase in average task cost (+27.5%).
* **Improved Reliability:** MCP achieved a perfect success rate (100%), an 8.3% improvement over the control group.

*Conclusion:* For the `claude-3.7-sonnet` model in this benchmark, MCP offers improvements in speed, API efficiency, and reliability, but with heavier cache utilization and a higher monetary cost per task.

### Task-Specific Performance (Model: claude-3.7-sonnet)

*Calculations based on data in `summary.json`.*

#### Task 1: Purchase a Canadian Phone Number

| Mode    | Duration (s) | API Calls | Interactions | Success Rate |
| :------ | :----------- | :-------- | :----------- | :----------- |
| Control | 79.4         | 12.8      | 1.2          | 100.0%       |
| MCP     | 62.3         | 9.6       | 1.1          | 100.0%       |

#### Task 2: Create a Task Router Activity

| Mode    | Duration (s) | API Calls | Interactions | Success Rate |
| :------ | :----------- | :-------- | :----------- | :----------- |
| Control | 46.4         | 8.4       | 1.0          | 77.8%        |
| MCP     | 30.7         | 5.9       | 1.0          | 100.0%       |

#### Task 3: Create a Queue with Task Filter

| Mode    | Duration (s) | API Calls | Interactions | Success Rate |
| :------ | :----------- | :-------- | :----------- | :----------- |
| Control | 61.8         | 9.5       | 1.0          | 100.0%       |
| MCP     | 56.1         | 9.4       | 1.0          | 100.0%       |

## Benchmark Design & Metrics

The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. Treatment methodology:

* **Control Group:** Completion of API tasks using traditional methods (web search, file search, and terminal capabilities).
* **Treatment Group (MCP):** Completion of the same API tasks using a Twilio Model Context Protocol (MCP) server (a sample client configuration follows this list).
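
For context, MCP clients such as Cline register servers in a JSON settings file. A hypothetical entry for a Twilio MCP server might look like the following; the package name, arguments, and env keys here are assumptions, not this repo's documented configuration:

```json
{
  "mcpServers": {
    "twilio": {
      "command": "npx",
      "args": ["-y", "@twilio-alpha/mcp"],
      "env": {
        "TWILIO_ACCOUNT_SID": "ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "TWILIO_AUTH_TOKEN": "your_auth_token"
      }
    }
  }
}
```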

### Key Metrics Collected

| Metric       | Description                                                                      |
| :----------- | :------------------------------------------------------------------------------- |
| Duration     | Time taken to complete a task from start to finish (in seconds)                  |
| API Calls    | Number of API calls made during task completion                                   |
| Interactions | Number of exchanges between the user and the AI assistant                         |
| Tokens       | Total input and output tokens used during the task                                |
| Cache Reads  | Number of cached tokens read (a measure of cache hit effectiveness)               |
| Cache Writes | Number of tokens written to the cache (a measure of context loading and saving)   |
| Cost         | Estimated cost ($) based on token usage and cache operations (model-specific)     |
| Success Rate | Percentage of tasks completed successfully                                        |
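
As an illustration of how a model-specific cost estimate can combine these counters (the per-million-token rates below are placeholders, not the benchmark's actual pricing table):

```js
// estimate-cost.js: illustrative only; substitute your model's real rates.
const RATES_PER_MILLION = {
  input: 3.0,       // $ per 1M input tokens (assumed)
  output: 15.0,     // $ per 1M output tokens (assumed)
  cacheRead: 0.3,   // $ per 1M cached tokens read (assumed)
  cacheWrite: 3.75, // $ per 1M tokens written to cache (assumed)
};

function estimateCost({ inputTokens, outputTokens, cacheReads, cacheWrites }) {
  return (
    (inputTokens * RATES_PER_MILLION.input +
      outputTokens * RATES_PER_MILLION.output +
      cacheReads * RATES_PER_MILLION.cacheRead +
      cacheWrites * RATES_PER_MILLION.cacheWrite) / 1e6
  );
}

// e.g. a run resembling the MCP averages above
console.log(estimateCost({ inputTokens: 1500, outputTokens: 640, cacheReads: 246152, cacheWrites: 16974 }).toFixed(2));
```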

### Metrics Collection

All metrics are collected automatically from the Claude chat logs using the scripts provided in this repository:

* Run `npm run extract-metrics` to process the chat logs.
* The script analyzes the logs, calculates metrics for each task run, and saves them as individual JSON files in `metrics/tasks/` (a hypothetical example follows this list).
* It also generates or updates a `summary.json` file in the same directory, consolidating all individual results; this file powers the dashboard.
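
For orientation, here is a hypothetical sketch of what one per-task file in `metrics/tasks/` might contain; the field names are illustrative, not the extraction script's actual schema:

```js
// Hypothetical shape of a single metrics file in metrics/tasks/ (illustrative names only).
module.exports = {
  taskId: 2,                  // 1-3, matching the task list below
  mode: "mcp",                // "control" or "mcp"
  model: "claude-3.7-sonnet",
  durationSecs: 30.7,
  apiCalls: 6,
  interactions: 1,
  tokens: 2101,
  cacheReads: 240000,
  cacheWrites: 17000,
  costUsd: 0.18,
  success: true,
};
```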

## Tasks

The current benchmark includes the following tasks specific to the Twilio MCP Server (a control-mode sketch of Task 1 appears after the list):

1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416).
2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom".
3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity.
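
As a concrete reference point, here is one way an agent might complete Task 1 in control mode using the `twilio` Node helper library; this is a sketch for orientation, not the benchmark's scripted solution:

```js
// task1-sketch.js: search for and buy a Canadian number, preferring area code 416.
require("dotenv").config(); // assumes dotenv is installed to load .env credentials
const twilio = require("twilio");

const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

async function purchaseCanadianNumber() {
  // Search available Canadian local numbers with the preferred area code.
  const candidates = await client.availablePhoneNumbers("CA").local.list({ areaCode: 416, limit: 1 });
  if (candidates.length === 0) throw new Error("No matching numbers available");
  // Purchase the first match.
  const purchased = await client.incomingPhoneNumbers.create({ phoneNumber: candidates[0].phoneNumber });
  console.log("Purchased:", purchased.phoneNumber);
}

purchaseCanadianNumber().catch(console.error);
```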

While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols.

## Setup

1. Clone this repository:
   ```bash
   git clone https://github.com/nmogil-tw/mcp-te-benchmark.git
   cd mcp-te-benchmark
   ```
2. Install dependencies:
   ```bash
   npm install
   ```
3. Create your `.env` file from the example:
   ```bash
   cp .env.example .env
   ```
4. Edit the `.env` file with your necessary credentials (e.g., Twilio; see the example after this list).
5. *(Optional)* Run the repository setup script if you need it: `./scripts/setup.sh`
6. Start the dashboard server:
   ```bash
   npm start
   ```
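
The authoritative variable names live in `.env.example`; for Twilio, the standard credential pair typically looks like this (values are placeholders):

```bash
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=your_auth_token
```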

## Running Tests

### Testing Protocol

1. Open Cline (or your chosen MCP client) and start a new chat with the target model (e.g., Claude).
2. Upload the appropriate instruction file as context:
   * For control tests: `agent-instructions/control_instructions.md`
   * For MCP tests: `agent-instructions/mcp_instructions.md`
3. Start the test with the prompt: `Complete Task [TASK_NUMBER] using the commands in the instructions`
4. Allow the AI assistant to complete the task; metrics will be extracted from the chat logs afterwards.
5. Repeat for all desired tasks and modes.

### Extracting Metrics from Chat Logs

After running tests, extract metrics from the chat logs:

```bash
# Extracts metrics and updates summary.json
npm run extract-metrics
```
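
Conceptually, the consolidation step is a fold over the per-task JSON files. A minimal sketch of that idea, reusing the hypothetical field names from the example above (not the script's actual implementation):

```js
// build-summary.js: illustrative sketch of consolidating metrics/tasks/*.json.
const fs = require("fs");
const path = require("path");

const dir = "metrics/tasks";
const runs = fs
  .readdirSync(dir)
  .filter((f) => f.endsWith(".json") && f !== "summary.json")
  .map((f) => JSON.parse(fs.readFileSync(path.join(dir, f), "utf8")));

// Average one numeric field across all runs in a given mode ("control" or "mcp").
function avg(mode, field) {
  const rows = runs.filter((r) => r.mode === mode);
  return rows.reduce((sum, r) => sum + r[field], 0) / rows.length;
}

const summary = {
  control: { avgDurationSecs: avg("control", "durationSecs") },
  mcp: { avgDurationSecs: avg("mcp", "durationSecs") },
};
fs.writeFileSync(path.join(dir, "summary.json"), JSON.stringify(summary, null, 2));
```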
