Commit 820ff33

Merge pull request #3 from twilio-internal/metrics-revamp

Metrics revamp

2 parents 92d4081 + b4442f1

File tree

11 files changed: +1136 −661 lines

LICENSE

Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Twilio Alpha
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md

Lines changed: 116 additions & 110 deletions

@@ -1,151 +1,156 @@
 <p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p>
 <h1 align="center">MCP-TE Benchmark</h1>
 
-A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search.
+A standardized framework for evaluating the efficiency gains and trade-offs of AI agents using Model Context Protocol (MCP) compared to traditional methods.
 
 ## Abstract
 
-MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., documentation lookup, trial-and-error). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols becomes increasingly important for optimizing developer productivity and cost.
+MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains, resource utilization changes, and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., file search, terminal execution, web search). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols is crucial for optimizing developer productivity and evaluating overall cost-effectiveness.
 ## Leaderboard
 
-### Overall Performance
-
-| Metric | Control | MCP | Improvement |
-|--------|---------|-----|-------------|
-| Average Duration (s) | 62.5 | 49.7 | -20.6% |
-| Average API Calls | 10.3 | 8.3 | -19.3% |
-| Average Interactions | 1.1 | 1.0 | -3.3% |
-| Average Tokens | 2286.1 | 2141.4 | -6.3% |
-| Average Cache Reads | 191539.5 | 246152.5 | +28.5% |
-| Average Cache Writes | 11043.5 | 16973.9 | +53.7% |
-| Average Cost ($) | 0.1 | 0.2 | +27.5% |
-| Success Rate | 92.3% | 100.0% | +8.3% |
-
-*Key Improvements:*
-- 20.6% reduction in task completion time
-- 27.5% reduction in overall cost
-- 8.3% improvement in success rate
-- Significant improvements in cache utilization
-
-*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models
-
-### Task-Specific Performance
-
-#### Twilio - Task 1: Purchase a Canadian Phone Number
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 20.7 | 3.5 | 1.0 | 100% |
-| auto | MCP | 38.4 | 2.3 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 64.3 | 9.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | 42.2 | 3.0 | 1.0 | 100% |
-
-#### Twilio - Task 2: Create a Task Router Activity
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 65.6 | 14.0 | 2.0 | 100% |
-| auto | MCP | 43.5 | 3.0 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 35.3 | 4.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A |
-
-#### Twilio - Task 3: Create a Queue with Task Filter
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 38.5 | 5.0 | 1.0 | 100% |
-| auto | MCP | 45.2 | 2.0 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 40.1 | 4.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A |
+### Overall Performance (Model: claude-3.7-sonnet)
+
+*Environment: Twilio (MCP Server), Cline (MCP Client), Model: claude-3.7-sonnet*
+
+| Metric | Control | MCP | Change |
+| :--------------------- | :--------- | :--------- | :------ |
+| Average Duration (s) | 62.54 | 49.68 | -20.56% |
+| Average API Calls | 10.27 | 8.29 | -19.26% |
+| Average Interactions | 1.08 | 1.04 | -3.27% |
+| Average Tokens | 2286.12 | 2141.38 | -6.33% |
+| Average Cache Reads | 191539.50 | 246152.46 | +28.51% |
+| Average Cache Writes | 11043.46 | 16973.88 | +53.70% |
+| Average Cost ($) | 0.13 | 0.17 | +27.55% |
+| Success Rate | 92.31% | 100.0% | +8.33% |
+
+*Note: Calculations based on data in `metrics/summary.json`.*
+
+*Key Findings (claude-3.7-sonnet):*
+* **Efficiency Gains:** MCP usage resulted in faster task completion (-20.6% duration), fewer API calls (-19.3%), and slightly fewer user interactions (-3.3%). Token usage also saw a modest decrease (-6.3%).
+* **Increased Resource Utilization:** MCP significantly increased cache reads (+28.5%) and cache writes (+53.7%).
+* **Cost Increase:** The increased resource utilization, particularly cache operations, led to a notable increase in average task cost (+27.5%).
+* **Improved Reliability:** MCP achieved a perfect success rate (100%), an 8.3% improvement over the control group.
+
+*Conclusion:* For the `claude-3.7-sonnet` model in this benchmark, MCP offers improvements in speed, API efficiency, and reliability, but at the price of heavier cache use and a higher monetary cost per task.
+
+### Task-Specific Performance (Model: claude-3.7-sonnet)
+
+*Calculations based on data in `metrics/summary.json`.*
+
+#### Task 1: Purchase a Canadian Phone Number
+
+| Metric | Control | MCP | Change |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s) | 79.41 | 62.27 | -21.57% |
+| API Calls | 12.78 | 9.63 | -24.67% |
+| Interactions | 1.22 | 1.13 | -7.95% |
+| Tokens | 2359.33 | 2659.88 | +12.74% |
+| Cache Reads | 262556.11 | 281086.13 | +7.06% |
+| Cache Writes | 17196.33 | 25627.63 | +49.03% |
+| Cost ($) | 0.18 | 0.22 | +23.50% |
+| Success Rate | 100.00% | 100.00% | 0.00% |
+
+#### Task 2: Create a Task Router Activity
+
+| Metric | Control | MCP | Change |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s) | 46.37 | 30.71 | -33.77% |
+| API Calls | 8.44 | 5.88 | -30.43% |
+| Interactions | 1.00 | 1.00 | 0.00% |
+| Tokens | 2058.89 | 1306.63 | -36.54% |
+| Cache Reads | 144718.44 | 164311.50 | +13.54% |
+| Cache Writes | 6864.44 | 11219.13 | +63.44% |
+| Cost ($) | 0.10 | 0.11 | +11.09% |
+| Success Rate | 77.78% | 100.00% | +28.57% |
+
+#### Task 3: Create a Queue with Task Filter
+
+| Metric | Control | MCP | Change |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s) | 61.77 | 56.07 | -9.23% |
+| API Calls | 9.50 | 9.38 | -1.32% |
+| Interactions | 1.00 | 1.00 | 0.00% |
+| Tokens | 2459.38 | 2457.63 | -0.07% |
+| Cache Reads | 164319.50 | 293059.75 | +78.35% |
+| Cache Writes | 8822.88 | 14074.88 | +59.53% |
+| Cost ($) | 0.12 | 0.18 | +49.06% |
+| Success Rate | 100.00% | 100.00% | 0.00% |
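The Change columns above are plain relative differences between the Control and MCP averages. As an illustrative sketch (this helper is not part of the repo's scripts):

```javascript
// Relative change of the MCP average versus the Control baseline, in percent.
// Negative values mean MCP used less (time, calls, tokens); positive means more.
function pctChange(control, mcp) {
  return ((mcp - control) / control) * 100;
}

// Reproducing two cells from the Overall Performance table:
console.log(pctChange(62.54, 49.68).toFixed(2)); // -20.56 (Average Duration)
console.log(pctChange(92.31, 100.0).toFixed(2)); // 8.33 (Success Rate, shown as +8.33%)
```

Note that the Success Rate change is also a relative difference (100 vs. 92.31), which is why it reads +8.33% rather than the 7.69-point absolute gap.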

 ## Benchmark Design & Metrics
 
 The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. Treatment methodology:
 
-- **Control Group:** Completion of API tasks using traditional methods (web search, documentation, and terminal capabilities)
-- **Treatment Group:** Completion of the same API tasks using Model Context Protocol (MCP)
+* **Control Group:** Completion of API tasks using traditional methods (web search, file search, and terminal capabilities).
+* **Treatment Group (MCP):** Completion of the same API tasks using a Twilio Model Context Protocol (MCP) server.
 
-### Key Metrics
+### Key Metrics Collected
 
-| Metric | Description |
-|--------|-------------|
-| Duration | Time taken to complete a task from start to finish (in seconds) |
-| API Calls | Number of API calls made during task completion |
-| Interactions | Number of exchanges between the user and the AI assistant |
-| Success Rate | Percentage of tasks completed successfully |
+| Metric | Description |
+| :------------- | :--------------------------------------------------------------------------- |
+| Duration | Time taken to complete a task from start to finish (in seconds) |
+| API Calls | Number of API calls made during task completion |
+| Interactions | Number of exchanges between the user and the AI assistant |
+| Tokens | Total input and output tokens used during the task |
+| Cache Reads | Number of cached tokens read (a measure of cache hit effectiveness) |
+| Cache Writes | Number of tokens written to the cache (a measure of context loading/saving) |
+| Cost | Estimated cost ($) based on token usage and cache operations (model specific) |
+| Success Rate | Percentage of tasks completed successfully |
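The Cost metric is described above as model specific; schematically it is a rate-weighted sum over the token counters. A sketch with placeholder rates (the numbers below are NOT real model pricing, and the repo's actual formula is not documented here):

```javascript
// Placeholder per-million-token rates, purely for illustration.
const RATES = {
  inputPerMTok: 3.0,       // $ per million input tokens (placeholder)
  outputPerMTok: 15.0,     // $ per million output tokens (placeholder)
  cacheReadPerMTok: 0.3,   // cache reads are typically discounted (placeholder)
  cacheWritePerMTok: 3.75, // cache writes often carry a surcharge (placeholder)
};

// Rate-weighted sum over the four token counters, converted from per-million rates.
function estimateCost({ inputTokens, outputTokens, cacheReads, cacheWrites }) {
  return (
    (inputTokens * RATES.inputPerMTok +
      outputTokens * RATES.outputPerMTok +
      cacheReads * RATES.cacheReadPerMTok +
      cacheWrites * RATES.cacheWritePerMTok) / 1e6
  );
}

console.log(
  estimateCost({ inputTokens: 2000, outputTokens: 1000, cacheReads: 250000, cacheWrites: 17000 }).toFixed(2)
); // "0.16"
```

With rates of this shape, cache reads and writes dominate the totals at the volumes in the leaderboard, which is consistent with the cost increase tracking the cache growth.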

 ### Metrics Collection
 
-All metrics are now collected automatically from the Claude chat logs:
-
-- **Duration:** Time from task start to completion, measured automatically
-- **API Calls:** Number of API calls made during task completion, extracted from chat logs
-- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
-- **Token Usage:** Input and output tokens used during the task
-- **Cost:** Estimated cost based on token usage
-- **Success Rate:** Percentage of tasks completed successfully
-
-To extract metrics from chat logs, run:
-```bash
-npm run extract-metrics
-```
-
-This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
+All metrics are collected automatically from Claude chat logs using the scripts provided in this repository:
+
+* Run `npm run extract-metrics` to process chat logs.
+* This script analyzes logs, calculates metrics for each task run, and saves them as individual JSON files in `metrics/tasks/`.
+* It also generates/updates a `summary.json` file in the same directory, consolidating all individual results.
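The exact schema of the per-task files and `summary.json` is not shown here; as an illustration, the per-mode averages in the leaderboard could be produced by a reduction like the following (record and field names are assumptions, not the repo's actual schema):

```javascript
// Hypothetical per-run records, mirroring the metrics listed above.
// Field names are illustrative; real files in metrics/tasks/ may differ.
const runs = [
  { task: 1, mode: "mcp", durationSec: 60.0, apiCalls: 9, success: true },
  { task: 1, mode: "mcp", durationSec: 64.5, apiCalls: 10, success: true },
];

// Average each numeric metric, and turn the success flags into a rate.
function summarize(records) {
  const n = records.length;
  const mean = (key) => records.reduce((sum, r) => sum + r[key], 0) / n;
  return {
    runs: n,
    avgDurationSec: mean("durationSec"),
    avgApiCalls: mean("apiCalls"),
    successRate: (records.filter((r) => r.success).length / n) * 100,
  };
}

console.log(summarize(runs));
// { runs: 2, avgDurationSec: 62.25, avgApiCalls: 9.5, successRate: 100 }
```

A real `summary.json` would presumably hold one such summary per task/mode pair, which is all the leaderboard tables need.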

 ## Tasks
 
 The current benchmark includes the following tasks specific to the Twilio MCP Server:
 
-1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416)
-2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom"
-3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity
+1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416).
+2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom".
+3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity.
 
-While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols.
+(Setup, Running Tests, Extracting Metrics, Dashboard, CLI Summary sections remain largely the same as they accurately describe the repo structure and tools)

 ## Setup
 
-1. Clone this repository:
-```
-git clone https://github.com/nmogil-tw/mcp-te-benchmark.git
-```
-2. Run the setup script:
-```
-./scripts/setup.sh
-```
-3. Create your `.env` file from the example:
-```
-cp .env.example .env
-```
-4. Edit the `.env` file with your Twilio credentials
-5. Install dependencies:
-```
-npm install
-```
-6. Start the dashboard server:
-```
-npm start
-```
+1. Clone this repository:
+```bash
+git clone https://github.com/nmogil-tw/mcp-te-benchmark.git
+cd mcp-te-benchmark
+```
+2. Install dependencies:
+```bash
+npm install
+```
+3. Create your `.env` file from the example:
+```bash
+cp .env.example .env
+```
+4. Edit the `.env` file with the necessary credentials (e.g., Twilio).
+5. *(Optional)* If needed, run any project-specific setup scripts (check the `scripts/` directory if applicable).

 ## Running Tests
 
 ### Testing Protocol
 
-1. Open Cline and start a new chat with Claude
-
-2. Upload the appropriate instruction file as context:
-   - For control tests: `agent-instructions/control_instructions.md`
-   - For MCP tests: `agent-instructions/mcp_instructions.md`
-
-3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
-
-4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
+1. Open Cline (or the specified MCP Client) and start a new chat with the target model (e.g., Claude).
+2. Upload the appropriate instruction file as context:
+   * For control tests: `agent-instructions/control_instructions.md`
+   * For MCP tests: `agent-instructions/mcp_instructions.md`
+3. Start the test with the prompt: `Complete Task [TASK_NUMBER] using the commands in the instructions`
+4. Allow the AI assistant to complete the task. Metrics will be collected from the chat logs later.
+5. Repeat for all desired tasks and modes.
 
 ### Extracting Metrics from Chat Logs
 
-After running tests, extract metrics from Claude chat logs:
+After running tests, extract metrics from the chat logs:
 
 ```bash
+# Extracts metrics and updates summary.json
 npm run extract-metrics
 ```

@@ -204,7 +209,8 @@ The benchmark focuses on these key insights:
 1. **Time Efficiency:** Comparing the time it takes to complete tasks using MCP vs. traditional methods
 2. **API Efficiency:** Measuring the reduction in API calls when using MCP
 3. **Interaction Efficiency:** Evaluating if MCP reduces the number of interactions needed to complete tasks
-4. **Success Rate:** Determining if MCP improves the reliability of task completion
+4. **Cost Efficiency:** Evaluating whether the added MCP context has an impact on token costs
+5. **Success Rate:** Determining if MCP improves the reliability of task completion
 
 Negative percentage changes in duration, API calls, and interactions indicate improvement, while a positive change in success rate indicates improvement.
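That sign convention can be made explicit in code; a small illustrative helper (the metric keys are hypothetical, not identifiers from the repo):

```javascript
// Metrics where a smaller value is better, so a negative change is an improvement.
const lowerIsBetter = new Set(["duration", "apiCalls", "interactions", "tokens", "cost"]);

// For success rate (and any other higher-is-better metric), the sign flips.
function isImprovement(metric, pctChange) {
  return lowerIsBetter.has(metric) ? pctChange < 0 : pctChange > 0;
}

console.log(isImprovement("duration", -20.56));   // true  (faster)
console.log(isImprovement("cost", 27.55));        // false (more expensive)
console.log(isImprovement("successRate", 8.33));  // true  (more reliable)
```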

agent-instructions/control_instructions.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ This document contains three Twilio implementation tasks to complete using web s
 
 ## Environment Setup
 
-- The Cursor coding agent has access to web search and terminal commands
+- The Cline coding agent has access to web search and terminal commands
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
 ## Testing Protocol

agent-instructions/mcp_instructions.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ This document contains three Twilio implementation tasks to complete using the T
 
 ## Environment Setup
 
-- The Cursor coding agent has access to Twilio MCP functions
+- The Cline coding agent has access to Twilio MCP functions
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
 ## Testing Protocol
