A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search.

MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the task efficiency of AI coding agents when they use Model Context Protocol (MCP) servers compared to traditional approaches such as custom tools, terminal execution, and web search.

## Leaderboard
**Note:** Due to a limitation in the current MCP Client (Cursor), model selection is restricted in some test runs. 'Auto' indicates the client's automatic model selection. Results for specific models will be added as they become available.

The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. MCP methodology, tracking the following metrics:

| Metric | Description |
| --- | --- |
| Interactions | Number of exchanges between the user and the AI assistant |
| Success Rate | Percentage of tasks completed successfully |

### Metrics Collection

All metrics are now collected automatically from the Claude chat logs:

- **Duration:** Time from task start to completion, measured automatically
- **API Calls:** Number of API calls made during task completion, extracted from chat logs
- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
- **Token Usage:** Input and output tokens used during the task
- **Cost:** Estimated cost based on token usage
- **Success Rate:** Percentage of tasks completed successfully

To extract metrics from chat logs, run:

```bash
npm run extract-metrics
```
This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
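
As a rough illustration of how the generated summary could be consumed programmatically, the sketch below reads `summary.json` with Node.js. The file path and every field name (`task`, `approach`, `durationSeconds`, and so on) are assumptions made for this example, not the actual schema produced by the extraction script.

```typescript
// Illustrative sketch only: the path and field names below are assumptions,
// not the real schema written by `npm run extract-metrics`.
import { readFileSync } from "node:fs";

interface TaskRecord {
  task: string;                  // task identifier (hypothetical field)
  approach: "control" | "mcp";   // which arm of the benchmark the run belongs to
  durationSeconds: number;       // time from task start to completion
  apiCalls: number;              // API calls made during the task
  interactions: number;          // exchanges between the user and the assistant
  success: boolean;              // whether the task completed successfully
}

// Load the summary and report how many task records it contains (path assumed).
const records: TaskRecord[] = JSON.parse(
  readFileSync("metrics/tasks/summary.json", "utf8")
);
console.log(`Loaded ${records.length} task records`);
```
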
## Tasks

While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE Benchmark methodology is designed to extend to other MCP servers and task domains.

```
npm install
```
6. Start the dashboard server:

```
npm start
```
### Testing Protocol
1. Open Cline and start a new chat with Claude

You can also specify the model, client, and server names to use in the metrics, for example:

```bash
npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
```
These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.

Additional options:

- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
- `--verbose` or `-v`: Enable verbose logging for debugging
- `--help` or `-h`: Show help message

The extracted metrics are saved to the `metrics/tasks/` directory and the `summary.json` file is updated.
### Interactive Dashboard
For a visual representation of results:

1. Start the dashboard server:

```bash
npm start
```
2. Open your browser and navigate to `http://localhost:3001`

### Command Line Summary
Generate a text-based summary of results:

```bash
npm run regenerate-summary
```
## Results Interpretation
If you use MCP-TE Benchmark in your research or development, please cite:
## Contact
For questions about MCP-TE Benchmark, please open an issue on this repository or contact the Twilio Emerging Technology & Innovation Team.

This document outlines the simplified testing methodology for evaluating performance gains when using Twilio's Model Context Protocol (MCP) compared to traditional API approaches.
## Test Objective
Measure the time required to complete each task using traditional API approaches versus MCP-enabled implementations.
## Metrics Collection
### Automated Approach
Metrics are now automatically collected from Claude chat logs. The system tracks:

1. **Duration:** Time from task start to completion
2. **API Calls:** Number of API calls made during task completion
3. **Interactions:** Number of exchanges between the user and the AI assistant
4. **Token Usage:** Input and output tokens used during the task
5. **Cost:** Estimated cost based on token usage (see the sketch below)

No manual timing commands are needed. The AI assistant simply completes the task, and all metrics are extracted from the chat logs afterward.
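
To make the Cost metric concrete, here is a minimal sketch of how a cost estimate can be derived from token counts. The per-token rates and the function name are placeholders chosen for illustration, not the values or code the extraction script actually uses.

```typescript
// Minimal cost-estimation sketch: cost = input tokens * input rate + output tokens * output rate.
// The rates below are placeholder assumptions; substitute current pricing for the model under test.
const INPUT_COST_PER_MILLION_TOKENS = 3.0;   // USD per 1M input tokens (assumed)
const OUTPUT_COST_PER_MILLION_TOKENS = 15.0; // USD per 1M output tokens (assumed)

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_COST_PER_MILLION_TOKENS +
    (outputTokens / 1_000_000) * OUTPUT_COST_PER_MILLION_TOKENS
  );
}

// Example: a task that consumed 12,000 input tokens and 2,500 output tokens.
console.log(estimateCost(12_000, 2_500).toFixed(4)); // "0.0735"
```
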
## Test Tasks
### Task 1: Purchase a Canadian Phone Number
- **Start**: When the AI assistant begins searching for Canadian numbers
- **End**: When a Canadian phone number has been successfully purchased

### Task 2: Create a Task Router Activity
- **Start**: When the AI assistant begins creating the activity
- **End**: When the "Bathroom" activity has been successfully created

### Task 3: Create a Queue with Task Filter
- **Start**: When the AI assistant begins creating the queue
- **End**: When the queue with the proper task filter has been successfully created

## Testing Procedure
### Setup
1. Start the dashboard server:

```bash
npm start
```
### Running Tests
For each test:
1. Open Cline and start a new chat with Claude
2. Upload the appropriate instruction file as context:
   - For control tests: `agent-instructions/control_instructions.md`
   - For MCP tests: `agent-instructions/mcp_instructions.md`
3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
## Extracting Metrics from Chat Logs
After running tests, you need to extract metrics from the Claude chat logs:

```bash
npm run extract-metrics
```
This script analyzes the Claude chat logs and automatically extracts:

- Duration of each task
- Number of API calls
- Number of user interactions
- Token usage and estimated cost
- Success/failure status

You can also specify the model, client, and server names to use in the metrics:

```bash
npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
```

For example:

```bash
npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
```
These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.

Additional options:

- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
- `--verbose` or `-v`: Enable verbose logging for debugging
- `--help` or `-h`: Show help message

## Results Analysis
After tests are complete and metrics are extracted, you have multiple ways to view and analyze results:
### Interactive Dashboard
The dashboard provides a visual comparison of metrics:
1. Access the dashboard:

```
http://localhost:3001
```
2. The dashboard shows:
   - Task completion time comparison
   - API calls per task
   - Interactions per task
   - Success rate comparison
   - Detailed results table
3. Use the "Refresh Data" button to update with latest results
### Command Line Summary
For a text-based summary:

```bash
npm run regenerate-summary
```
The performance improvement will be shown as a percentage reduction in task completion time.
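
For clarity, the reduction is computed as (control time - MCP time) / control time x 100. Below is a minimal sketch of that calculation, with the function name and example durations chosen purely for illustration:

```typescript
// Percentage reduction in completion time: positive values mean the MCP run was faster.
function percentReduction(controlSeconds: number, mcpSeconds: number): number {
  return ((controlSeconds - mcpSeconds) / controlSeconds) * 100;
}

// Example: 300 s under the control approach vs. 180 s with MCP.
console.log(`${percentReduction(300, 180).toFixed(1)}% reduction`); // "40.0% reduction"
```
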
## Troubleshooting
If metrics are not being extracted properly:

- Ensure the chat logs are being saved correctly in Cline
- Check that the AI assistant completed the task successfully
- Try running the extraction with the `--verbose` flag for more detailed logging:

```bash
npm run extract-metrics -- --verbose
```
For dashboard issues:

- Make sure the dashboard server is running (`npm start`)
- Check the browser console for any JavaScript errors
- Verify that metrics files exist in the `src/server/metrics/` directory