Commit 92d4081

Merge pull request #1 from twilio-internal/updating-readme
updated readme & refactored
2 parents 15d585e + d9d7fb2 commit 92d4081

File tree

103 files changed: +5583 -2816 lines changed


.gitignore

Lines changed: 3 additions & 1 deletion
@@ -22,4 +22,6 @@ results/analysis/
 .DS_Store
 Thumbs.db
 
-.logs/
+.logs/
+
+example_task_logs/

README.md

Lines changed: 72 additions & 53 deletions
@@ -1,4 +1,4 @@
-<p align="center"><img src="docs/twilioAlphaLogo.png" height="100" alt="Twilio Alpha"/></p>
+<p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p>
 <h1 align="center">MCP-TE Benchmark</h1>
 
 A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search.
@@ -9,18 +9,26 @@ MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure
 
 ## Leaderboard
 
-**Note:** Due to a limitation in the current MCP Client (Cursor), model selection is restricted in some test runs. 'Auto' indicates the client's automatic model selection. Results for specific models will be added as they become available.
-
 ### Overall Performance
 
 | Metric | Control | MCP | Improvement |
 |--------|---------|-----|-------------|
-| Average Duration (s) | 43.3 | 42.7 | -1.4% |
-| Average API Calls | 6.9 | 2.5 | -63.8% |
-| Average Interactions | 1.2 | 1.0 | -16.7% |
-| Success Rate | 100.0% | 100.0% | 0.0% |
-
-*Environment:* Twilio (MCP Server), Cursor (MCP Client), Mixed models
+| Average Duration (s) | 62.5 | 49.7 | -20.6% |
+| Average API Calls | 10.3 | 8.3 | -19.3% |
+| Average Interactions | 1.1 | 1.0 | -3.3% |
+| Average Tokens | 2286.1 | 2141.4 | -6.3% |
+| Average Cache Reads | 191539.5 | 246152.5 | +28.5% |
+| Average Cache Writes | 11043.5 | 16973.9 | +53.7% |
+| Average Cost ($) | 0.1 | 0.2 | +27.5% |
+| Success Rate | 92.3% | 100.0% | +8.3% |
+
+*Key Improvements:*
+- 20.6% reduction in task completion time
+- 27.5% increase in overall cost
+- 8.3% improvement in success rate
+- Significant improvements in cache utilization
+
+*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models
 
 ### Task-Specific Performance
 
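The Improvement column appears to be computed as (MCP − Control) / Control, so negative values mean MCP consumed less. A minimal TypeScript sketch of that arithmetic (a hypothetical helper, not code from this repo; the displayed averages are rounded, so recomputed percentages can drift by a few tenths):

```ts
// Hypothetical helper mirroring how the Improvement column seems to be derived.
function improvement(control: number, mcp: number): string {
  const pct = ((mcp - control) / control) * 100; // negative = MCP used less
  return `${pct >= 0 ? "+" : ""}${pct.toFixed(1)}%`;
}

console.log(improvement(62.5, 49.7)); // "-20.5%" (table shows -20.6%, from unrounded data)
console.log(improvement(92.3, 100.0)); // "+8.3%", matching the Success Rate row
```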
@@ -67,12 +75,23 @@ The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs.
 | Interactions | Number of exchanges between the user and the AI assistant |
 | Success Rate | Percentage of tasks completed successfully |
 
-### Metric Collection Limitations
+### Metrics Collection
+
+All metrics are now collected automatically from the Claude chat logs:
 
-Some metrics are collected with different methods due to client limitations:
+- **Duration:** Time from task start to completion, measured automatically
+- **API Calls:** Number of API calls made during task completion, extracted from chat logs
+- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
+- **Token Usage:** Input and output tokens used during the task
+- **Cost:** Estimated cost based on token usage
+- **Success Rate:** Percentage of tasks completed successfully
+
+To extract metrics from chat logs, run:
+```bash
+npm run extract-metrics
+```
 
-- **Duration and Success/Failure:** Logged automatically by the metrics server
-- **API Calls and Interactions:** Currently manually counted post-run by observing the agent's behavior in Cursor, as Cursor does not provide detailed execution logs that would allow for automatic extraction of these metrics
+This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
 
 ## Tasks
 
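To make the extractor's output concrete, here is a sketch of what one record in `metrics/tasks/` might look like; the field names are assumptions inferred from the metrics listed above, not the script's actual schema:

```ts
// Assumed shape of a per-task metrics record produced by extract-metrics;
// the real schema is defined by the script itself.
interface TaskMetrics {
  task: 1 | 2 | 3;
  mode: "control" | "mcp";
  model: string;  // e.g. "claude-3.7-sonnet"
  client: string; // e.g. "Cline"
  server: string; // e.g. "Twilio"
  durationSeconds: number;
  apiCalls: number;
  interactions: number;
  inputTokens: number;
  outputTokens: number;
  cacheReads: number;
  cacheWrites: number;
  costUsd: number;
  success: boolean;
}
```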
@@ -103,11 +122,7 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 ```
 npm install
 ```
-6. Start the metrics server:
-```
-npm run start:metrics
-```
-7. Start the dashboard server (optional, for real-time visualization):
+6. Start the dashboard server:
 ```
 npm start
 ```
@@ -116,53 +131,57 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 
 ### Testing Protocol
 
-1. Start the metrics server if not already running:
-```
-npm run start:metrics
-```
+1. Open Cline and start a new chat with Claude
 
-2. Use the run-test.sh script to prepare a specific test:
-```
-./scripts/run-test.sh [control|mcp] [1|2|3] [model-name]
-```
-Where:
-- First parameter is the test mode (control or mcp)
-- Second parameter is the task number (1, 2, or 3)
-- Third parameter is the model name (e.g., "claude-3.7-sonnet")
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
+
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
 
-3. Follow the on-screen instructions:
-   - Open Cursor with the AI Agent
-   - Load the appropriate instructions file (control_instructions.md or mcp_instructions.md) as context
-   - Start the conversation with: "Complete Task [TASK_NUMBER] using the commands in the instructions"
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
 
-4. The AI agent will then:
-   - Read the instructions
-   - Execute the start curl command to begin timing
-   - Complete the required task
-   - Execute the complete curl command to end timing
+### Extracting Metrics from Chat Logs
 
-5. After the AI agent completes the task, press Enter in the terminal window to continue with the next test or generate the summary
+After running tests, extract metrics from Claude chat logs:
 
-6. Important: Before running tests, ensure the instruction documents contain the correct endpoint paths:
-   - The start command should use `/metrics/start`
-   - The complete command should use `/metrics/complete`
-   - The model parameter should be included in the start command
+```bash
+npm run extract-metrics
+```
 
-### Batch Testing
+This script analyzes the Claude chat logs and automatically extracts:
+- Duration of each task
+- Number of API calls
+- Number of user interactions
+- Token usage and estimated cost
+- Success/failure status
 
-To run all tests in sequence:
+You can also specify the model, client, and server names to use in the metrics:
+
+```bash
+npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
 ```
-./scripts/run-test.sh run-all --model [model-name]
+
+For example:
+```bash
+npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
 ```
 
-## Viewing Results
+These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.
+
+Additional options:
+- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
+- `--verbose` or `-v`: Enable verbose logging for debugging
+- `--help` or `-h`: Show help message
+
+The extracted metrics are saved to the `metrics/tasks/` directory and the `summary.json` file is updated.
 
 ### Interactive Dashboard
 
 For a visual representation of results:
 
-1. Start the dashboard server (if not already running):
-```
+1. Start the dashboard server:
+```bash
 npm start
 ```
 2. Open your browser and navigate to:
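Since `summary.json` powers both the dashboard and the text summary, it plausibly aggregates the per-task records into per-mode averages. A sketch of one possible shape (an assumption, not the repo's actual format):

```ts
// Assumed aggregate shape for summary.json; the file is actually produced by
// extract-metrics and may be organized differently.
interface ModeAverages {
  avgDurationSeconds: number;
  avgApiCalls: number;
  avgInteractions: number;
  avgTokens: number;
  avgCostUsd: number;
  successRate: number; // 0..1
}

interface Summary {
  overall: { control: ModeAverages; mcp: ModeAverages };
  byTask: Record<"1" | "2" | "3", { control: ModeAverages; mcp: ModeAverages }>;
}
```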
@@ -174,8 +193,8 @@ For a visual representation of results:
 ### Command Line Summary
 
 Generate a text-based summary of results:
-```
-npm run generate-summary
+```bash
+npm run regenerate-summary
 ```
 
 ## Results Interpretation
@@ -208,4 +227,4 @@ If you use MCP-TE Benchmark in your research or development, please cite:
 
 ## Contact
 
-For questions about MCP-TE Benchmark, please open an issue on this repository or contact the Twilio Emerging Technology & Innovation Team.
+For questions about MCP-TE Benchmark, please open an issue on this repository or contact the Twilio Emerging Technology & Innovation Team.

agent-instructions/control_instructions.md

Lines changed: 1 addition & 16 deletions
@@ -7,22 +7,6 @@ This document contains three Twilio implementation tasks to complete using web s
 - The Cursor coding agent has access to web search and terminal commands
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
-## Metrics Recording
-
-For accurate performance measurement, you must execute these commands:
-
-1. When starting each task:
-```bash
-curl -X POST http://localhost:3000/test/start -H "Content-Type: application/json" -d '{"mode": "control", "taskNumber": TASK_NUMBER, "model": "claude-3.7-sonnet"}'
-```
-
-2. When completing each task:
-```bash
-curl -X POST http://localhost:3000/test/complete -H "Content-Type: application/json" -d '{"testId": "TEST_ID", "success": true|false}'
-```
-
-Replace TASK_NUMBER with the current task number (1, 2, or 3) and TEST_ID with the ID received from the start command.
-
 ## Testing Protocol
 
 For each task:
@@ -39,6 +23,7 @@ Goal: Search for and purchase an available Canadian phone number.
 Requirements:
 - Use area code 416 if available
 - If 416 is not available, any Canadian number is acceptable
+- Name it "Control {{timestamp}}"
 - Store the purchased number for use in Task 3
 
 Success Criteria:

agent-instructions/mcp_instructions.md

Lines changed: 1 addition & 16 deletions
@@ -7,22 +7,6 @@ This document contains three Twilio implementation tasks to complete using the T
 - The Cursor coding agent has access to Twilio MCP functions
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
-## Metrics Recording
-
-For accurate performance measurement, you must execute these commands:
-
-1. When starting each task:
-```bash
-curl -X POST http://localhost:3000/test/start -H "Content-Type: application/json" -d '{"mode": "mcp", "taskNumber": TASK_NUMBER, "model": "claude-3.7-sonnet"}'
-```
-
-2. When completing each task:
-```bash
-curl -X POST http://localhost:3000/test/complete -H "Content-Type: application/json" -d '{"testId": "TEST_ID", "success": true|false}'
-```
-
-Replace TASK_NUMBER with the current task number (1, 2, or 3) and TEST_ID with the ID received from the start command.
-
 ## Testing Protocol
 
 For each task:
@@ -40,6 +24,7 @@ Requirements:
 - Use area code 416 if available
 - If 416 is not available, any Canadian number is acceptable
 - Store the purchased number for use in Task 3
+- Name it "Test {{timestamp}}"
 - Use appropriate MCP functions for number search and purchase
 
 Success Criteria:
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+# Twilio MCP Testing Protocol
+
+## Overview
+
+This document outlines the simplified testing methodology for evaluating performance gains when using Twilio's Model Context Protocol (MCP) compared to traditional API approaches.
+
+## Test Objective
+
+Measure the time required to complete each task using traditional API approaches versus MCP-enabled implementations.
+
+## Metrics Collection
+
+### Automated Approach
+Metrics are now automatically collected from Claude chat logs. The system tracks:
+
+1. **Duration:** Time from task start to completion
+2. **API Calls:** Number of API calls made during task completion
+3. **Interactions:** Number of exchanges between the user and the AI assistant
+4. **Token Usage:** Input and output tokens used during the task
+5. **Cost:** Estimated cost based on token usage
+
+No manual timing commands are needed. The AI assistant simply completes the task, and all metrics are extracted from the chat logs afterward.
+
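To ground the "extracted from the chat logs afterward" step, here is a toy TypeScript sketch of the kind of scan an extractor performs; the log filename and event fields are assumptions, not the real format used by `npm run extract-metrics`:

```ts
import { readFileSync } from "node:fs";

// Toy illustration only: the actual extract-metrics script defines its own
// log location and schema; the fields below are assumed for the example.
interface ChatEvent {
  role: "user" | "assistant";
  apiCall?: boolean; // true when the event wraps a tool/API invocation
  timestamp: number; // ms since epoch
}

const events: ChatEvent[] = JSON.parse(readFileSync("chat-log.json", "utf8"));

const apiCalls = events.filter((e) => e.apiCall).length;
const interactions = events.filter((e) => e.role === "user").length;
const durationSeconds =
  (events[events.length - 1].timestamp - events[0].timestamp) / 1000;

console.log({ apiCalls, interactions, durationSeconds });
```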
+## Test Tasks
+
+### Task 1: Purchase a Canadian Phone Number
+- **Start**: When the AI assistant begins searching for Canadian numbers
+- **End**: When a Canadian phone number has been successfully purchased
+
+### Task 2: Create a Task Router Activity
+- **Start**: When the AI assistant begins creating the activity
+- **End**: When the "Bathroom" activity has been successfully created
+
+### Task 3: Create a Queue with Task Filter
+- **Start**: When the AI assistant begins creating the queue
+- **End**: When the queue with proper task filter has been successfully created
+
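For a sense of what Task 1 involves under the control condition, here is a minimal sketch using the Twilio Node.js helper library; the fallback logic and friendly name follow the instruction files, but this is an illustration, not code from the repo:

```ts
import twilio from "twilio";

// Credentials come from the .env file, as the instruction files specify.
const client = twilio(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN
);

async function purchaseCanadianNumber(): Promise<string> {
  // Prefer area code 416; fall back to any available Canadian number.
  let [candidate] = await client
    .availablePhoneNumbers("CA")
    .local.list({ areaCode: 416, limit: 1 });
  if (!candidate) {
    [candidate] = await client
      .availablePhoneNumbers("CA")
      .local.list({ limit: 1 });
  }

  const purchased = await client.incomingPhoneNumbers.create({
    phoneNumber: candidate.phoneNumber,
    friendlyName: `Control ${Date.now()}`, // "Control {{timestamp}}" per the instructions
  });
  return purchased.phoneNumber; // kept for reuse in Task 3
}
```

Tasks 2 and 3 would similarly go through `client.taskrouter.v1.workspaces(workspaceSid)` to create the activity and the task queue.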
+## Testing Procedure
+
+### Setup
+1. Start the dashboard server:
+```bash
+npm start
+```
+
+### Running Tests
+For each test:
+
+1. Open Cline and start a new chat with Claude
+
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
+
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
+
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
+
+## Extracting Metrics from Chat Logs
+
+After running tests, you need to extract metrics from the Claude chat logs:
+
+```bash
+npm run extract-metrics
+```
+
+This script analyzes the Claude chat logs and automatically extracts:
+- Duration of each task
+- Number of API calls
+- Number of user interactions
+- Token usage and estimated cost
+- Success/failure status
+
+You can also specify the model, client, and server names to use in the metrics:
+
+```bash
+npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
+```
+
+For example:
+```bash
+npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
+```
+
+These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.
+
+Additional options:
+- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
+- `--verbose` or `-v`: Enable verbose logging for debugging
+- `--help` or `-h`: Show help message
+
+## Results Analysis
+
+After tests are complete and metrics are extracted, you have multiple ways to view and analyze results:
+
+### Interactive Dashboard
+The dashboard provides visual comparison of metrics:
+
+1. Access the dashboard:
+```
+http://localhost:3001
+```
+
+2. The dashboard shows:
+   - Task completion time comparison
+   - API calls per task
+   - Interactions per task
+   - Success rate comparison
+   - Detailed results table
+
+3. Use the "Refresh Data" button to update with latest results
+
+### Command Line Summary
+For a text-based summary:
+```bash
+npm run regenerate-summary
+```
+
+The performance improvement will be shown as percentage reduction in task completion time.
+
+## Troubleshooting
+
+If metrics are not being extracted properly:
+- Ensure the chat logs are being saved correctly in Cline
+- Check that the AI assistant completed the task successfully
+- Try running the extraction with the `--verbose` flag for more detailed logging:
+```bash
+npm run extract-metrics -- --verbose
+```
+
+For dashboard issues:
+- Make sure the dashboard server is running (`npm start`)
+- Check browser console for any JavaScript errors
+- Verify metrics files exist in the `src/server/metrics/` directory
