|
1 | 1 | <p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p> |
2 | 2 | <h1 align="center">MCP-TE Benchmark</h1> |
3 | 3 |
|
4 | | -A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search. |
| 4 | +A standardized framework for evaluating the efficiency gains and trade-offs of AI agents using Model Context Protocol (MCP) compared to traditional methods. |
5 | 5 |
|
6 | 6 | ## Abstract |
7 | 7 |
|
8 | | -MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., documentation lookup, trial-and-error). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols becomes increasingly important for optimizing developer productivity and cost. |
| 8 | +MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains, resource utilization changes, and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., file search, terminal execution, web search). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols is crucial for optimizing developer productivity and evaluating overall cost-effectiveness. |
9 | 9 |
|
10 | 10 | ## Leaderboard |
11 | 11 |
|
12 | | -### Overall Performance |
| 12 | +### Overall Performance (Model: claude-3.7-sonnet) |
13 | 13 |
|
14 | | -| Metric | Control | MCP | Improvement | |
15 | | -|--------|---------|-----|-------------| |
16 | | -| Average Duration (s) | 62.5 | 49.7 | -20.6% | |
17 | | -| Average API Calls | 10.3 | 8.3 | -19.3% | |
18 | | -| Average Interactions | 1.1 | 1.0 | -3.3% | |
19 | | -| Average Tokens | 2286.1 | 2141.4 | -6.3% | |
20 | | -| Average Cache Reads | 191539.5 | 246152.5 | +28.5% | |
21 | | -| Average Cache Writes | 11043.5 | 16973.9 | +53.7% | |
22 | | -| Average Cost ($) | 0.1 | 0.2 | +27.5% | |
23 | | -| Success Rate | 92.3% | 100.0% | +8.3% | |
| 14 | +*Environment: Twilio (MCP Server), Cline (MCP Client), Model: claude-3.7-sonnet* |
24 | 15 |
|
25 | | -*Key Improvements:* |
26 | | -- 20.6% reduction in task completion time |
27 | | -- 27.5% reduction in overall cost |
28 | | -- 8.3% improvement in success rate |
29 | | -- Significant improvements in cache utilization |
| 16 | +| Metric | Control | MCP | Change | |
| 17 | +| :--------------------- | :--------- | :--------- | :----- | |
| 18 | +| Average Duration (s) | 62.5 | 49.7 | -20.5% | |
| 19 | +| Average API Calls | 10.3 | 8.3 | -19.3% | |
| 20 | +| Average Interactions | 1.1 | 1.0 | -3.3% | |
| 21 | +| Average Tokens | 2286.1 | 2141.4 | -6.3% | |
| 22 | +| Average Cache Reads | 191539.5 | 246152.5 | +28.5% | |
| 23 | +| Average Cache Writes | 11043.5 | 16973.9 | +53.7% | |
| 24 | +| Average Cost ($) | 0.1 | 0.2 | +27.5% | |
| 25 | +| Success Rate | 92.3% | 100.0% | +8.3% | |
30 | 26 |
|
31 | | -*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models |
| 27 | +*Note: Calculations are based on the data in `metrics/summary.json`. Dollar values are rounded to one decimal place, so the percentage change in the cost row reflects the unrounded figures.*
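
The percentage changes above are plain relative changes against the Control baseline. A minimal sketch that reproduces two of the figures from the table:

```typescript
// Relative change of the MCP value against the Control baseline, in percent.
function percentChange(control: number, mcp: number): number {
  return ((mcp - control) / control) * 100;
}

console.log(percentChange(62.5, 49.7).toFixed(1)); // "-20.5" (Average Duration)
console.log(percentChange(92.3, 100.0).toFixed(1)); // "8.3" (Success Rate, relative)
```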
32 | 28 |
|
33 | | -### Task-Specific Performance |
| 29 | +*Key Findings (claude-3.7-sonnet):* |
| 30 | +* **Efficiency Gains:** MCP usage resulted in faster task completion (-20.5% duration), fewer API calls (-19.3%), and slightly fewer user interactions (-3.3%). Token usage also saw a modest decrease (-6.3%). |
| 31 | +* **Increased Resource Utilization:** MCP significantly increased cache reads (+28.5%) and cache writes (+53.7%). |
| 32 | +* **Cost Increase:** The increased resource utilization, particularly the additional cache operations (and possibly different API call patterns under MCP), led to a notable increase in average task cost (+27.5%).
| 33 | +* **Improved Reliability:** MCP achieved a perfect success rate (100%), a relative improvement of 8.3% over the control group's 92.3%.
34 | 34 |
|
35 | | -#### Twilio - Task 1: Purchase a Canadian Phone Number |
| 35 | +*Conclusion:* For the `claude-3.7-sonnet` model in this benchmark, MCP offers improvements in speed, API efficiency, and reliability, but at the price of increased cache activity and a higher monetary cost per task.
36 | 36 |
|
37 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
38 | | -|-------|------|--------------|-----------|--------------|---------| |
39 | | -| auto | Control | 20.7 | 3.5 | 1.0 | 100% | |
40 | | -| auto | MCP | 38.4 | 2.3 | 1.0 | 100% | |
41 | | -| claude-3.7-sonnet | Control | 64.3 | 9.0 | 1.0 | 100% | |
42 | | -| claude-3.7-sonnet | MCP | 42.2 | 3.0 | 1.0 | 100% | |
| 37 | +### Task-Specific Performance (Model: claude-3.7-sonnet) |
43 | 38 |
|
44 | | -#### Twilio - Task 2: Create a Task Router Activity |
| 39 | +*Calculations based on data in `metrics/summary.json`.*
45 | 40 |
|
46 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
47 | | -|-------|------|--------------|-----------|--------------|---------| |
48 | | -| auto | Control | 65.6 | 14.0 | 2.0 | 100% | |
49 | | -| auto | MCP | 43.5 | 3.0 | 1.0 | 100% | |
50 | | -| claude-3.7-sonnet | Control | 35.3 | 4.0 | 1.0 | 100% | |
51 | | -| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A | |
| 41 | +#### Task 1: Purchase a Canadian Phone Number |
52 | 42 |
|
53 | | -#### Twilio - Task 3: Create a Queue with Task Filter |
| 43 | +| Mode | Duration (s) | API Calls | Interactions | Success Rate | |
| 44 | +| :------ | :----------- | :-------- | :----------- | :----------- | |
| 45 | +| Control | 79.4 | 12.8 | 1.2 | 100.0% | |
| 46 | +| MCP | 62.3 | 9.6 | 1.1 | 100.0% | |
54 | 47 |
|
55 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
56 | | -|-------|------|--------------|-----------|--------------|---------| |
57 | | -| auto | Control | 38.5 | 5.0 | 1.0 | 100% | |
58 | | -| auto | MCP | 45.2 | 2.0 | 1.0 | 100% | |
59 | | -| claude-3.7-sonnet | Control | 40.1 | 4.0 | 1.0 | 100% | |
60 | | -| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A | |
| 48 | +#### Task 2: Create a Task Router Activity |
| 49 | + |
| 50 | +| Mode | Duration (s) | API Calls | Interactions | Success Rate | |
| 51 | +| :------ | :----------- | :-------- | :----------- | :----------- | |
| 52 | +| Control | 46.4 | 8.4 | 1.0 | 77.8% | |
| 53 | +| MCP | 30.7 | 5.9 | 1.0 | 100.0% | |
| 54 | + |
| 55 | +#### Task 3: Create a Queue with Task Filter |
| 56 | + |
| 57 | +| Mode | Duration (s) | API Calls | Interactions | Success Rate | |
| 58 | +| :------ | :----------- | :-------- | :----------- | :----------- | |
| 59 | +| Control | 61.8 | 9.5 | 1.0 | 100.0% | |
| 60 | +| MCP | 56.1 | 9.4 | 1.0 | 100.0% | |
61 | 61 |
|
62 | 62 | ## Benchmark Design & Metrics |
63 | 63 |
|
64 | 64 | The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. Treatment methodology: |
65 | 65 |
|
66 | | -- **Control Group:** Completion of API tasks using traditional methods (web search, documentation, and terminal capabilities) |
67 | | -- **Treatment Group:** Completion of the same API tasks using Model Context Protocol (MCP) |
| 66 | +* **Control Group:** Completion of API tasks using traditional methods (web search, file search, and terminal capabilities). |
| 67 | +* **Treatment Group (MCP):** Completion of the same API tasks using a Twilio Model Context Protocol (MCP) server, as illustrated below.
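
In the treatment condition, each tool invocation travels over MCP's JSON-RPC 2.0 `tools/call` method. A minimal sketch of the request shape follows; the tool name and arguments are hypothetical illustrations, not the Twilio MCP server's actual interface.

```typescript
// General shape of an MCP tool invocation (JSON-RPC 2.0).
// "purchase_phone_number" and its arguments are hypothetical examples.
const toolCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "purchase_phone_number",
    arguments: { country: "CA", areaCode: 416 },
  },
};
```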
68 | 68 |
|
69 | | -### Key Metrics |
| 69 | +### Key Metrics Collected |
70 | 70 |
|
71 | | -| Metric | Description | |
72 | | -|--------|-------------| |
73 | | -| Duration | Time taken to complete a task from start to finish (in seconds) | |
74 | | -| API Calls | Number of API calls made during task completion | |
75 | | -| Interactions | Number of exchanges between the user and the AI assistant | |
76 | | -| Success Rate | Percentage of tasks completed successfully | |
| 71 | +| Metric | Description | |
| 72 | +| :------------- | :-------------------------------------------------------------------------- | |
| 73 | +| Duration | Time taken to complete a task from start to finish (in seconds) | |
| 74 | +| API Calls | Number of API calls made during task completion | |
| 75 | +| Interactions | Number of exchanges between the user and the AI assistant | |
| 76 | +| Tokens | Total input and output tokens used during the task | |
| 77 | +| Cache Reads | Number of cached tokens read (measure of cache hit effectiveness) | |
| 78 | +| Cache Writes | Number of tokens written to the cache (measure of context loading/saving) | |
| 79 | +| Cost           | Estimated cost ($) based on token usage and cache operations (model-specific pricing; see the sketch below) |
| 80 | +| Success Rate | Percentage of tasks completed successfully | |
77 | 81 |
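The cost figure is whatever the extraction script computes; as a rough illustration, a per-task estimate can be derived from the token metrics as sketched below. The per-million-token rates are placeholder assumptions, not the script's actual pricing table.

```typescript
// Hypothetical per-million-token rates; real model pricing will differ.
const RATES = {
  input: 3.0,       // $ per 1M input tokens
  output: 15.0,     // $ per 1M output tokens
  cacheRead: 0.3,   // $ per 1M cached tokens read
  cacheWrite: 3.75, // $ per 1M tokens written to cache
};

interface TaskMetrics {
  inputTokens: number;
  outputTokens: number;
  cacheReads: number;
  cacheWrites: number;
}

// Estimate the dollar cost of a single task run from its token counts.
function estimateCost(m: TaskMetrics): number {
  return (
    (m.inputTokens * RATES.input +
      m.outputTokens * RATES.output +
      m.cacheReads * RATES.cacheRead +
      m.cacheWrites * RATES.cacheWrite) /
    1_000_000
  );
}
```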
|
78 | 82 | ### Metrics Collection |
79 | 83 |
|
80 | | -All metrics are now collected automatically from the Claude chat logs: |
| 84 | +All metrics are collected automatically from Claude chat logs using the scripts provided in this repository: |
81 | 85 |
|
82 | | -- **Duration:** Time from task start to completion, measured automatically |
83 | | -- **API Calls:** Number of API calls made during task completion, extracted from chat logs |
84 | | -- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs |
85 | | -- **Token Usage:** Input and output tokens used during the task |
86 | | -- **Cost:** Estimated cost based on token usage |
87 | | -- **Success Rate:** Percentage of tasks completed successfully |
88 | | - |
89 | | -To extract metrics from chat logs, run: |
90 | | -```bash |
91 | | -npm run extract-metrics |
92 | | -``` |
93 | | - |
94 | | -This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard. |
| 86 | +* Run `npm run extract-metrics` to process chat logs. |
| 87 | +* This script analyzes logs, calculates metrics for each task run, and saves them as individual JSON files in `metrics/tasks/`. |
| 88 | +* It also generates/updates a `summary.json` file in the same directory, consolidating all individual results (see the sketch below).
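
A minimal sketch of consuming `metrics/summary.json` downstream, assuming the file holds an array of per-run records; the field names here are guesses at the schema, so check the generated file for the real structure.

```typescript
import { readFileSync } from "node:fs";

// NOTE: illustrative assumptions about the schema, not documented fields.
interface TaskRun {
  mode: "control" | "mcp";
  duration: number;
  apiCalls: number;
  success: boolean;
}

const runs: TaskRun[] = JSON.parse(
  readFileSync("metrics/summary.json", "utf8"),
);

// Average duration per mode, mirroring the leaderboard tables above.
for (const mode of ["control", "mcp"] as const) {
  const subset = runs.filter((r) => r.mode === mode);
  const avg = subset.reduce((sum, r) => sum + r.duration, 0) / subset.length;
  console.log(`${mode}: ${avg.toFixed(1)}s over ${subset.length} runs`);
}
```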
95 | 89 |
|
96 | 90 | ## Tasks |
97 | 91 |
|
98 | 92 | The current benchmark includes the following tasks specific to the Twilio MCP Server: |
99 | 93 |
|
100 | | -1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416) |
101 | | -2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom" |
102 | | -3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity |
| 94 | +1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416). |
| 95 | +2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom". |
| 96 | +3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity. |
103 | 97 |
|
104 | | -While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols. |
| 98 | +While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols. As a concrete reference point, the sketch below shows what Task 1 involves at the Twilio API level.
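
A minimal sketch, using the Twilio Node helper library, of the API calls Task 1 boils down to. This is for orientation only; it is not part of the benchmark harness, and in a benchmark run the agent must discover the equivalent calls itself.

```typescript
import twilio from "twilio";

// Reads TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN from the environment.
const client = twilio();

// Task 1: find an available Canadian number (preferring area code 416) and buy it.
async function purchaseCanadianNumber(): Promise<string> {
  const candidates = await client
    .availablePhoneNumbers("CA")
    .local.list({ areaCode: 416, limit: 1 });
  if (candidates.length === 0) {
    throw new Error("No available numbers with area code 416");
  }
  const purchased = await client.incomingPhoneNumbers.create({
    phoneNumber: candidates[0].phoneNumber,
  });
  return purchased.phoneNumber;
}
```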
105 | 99 |
|
106 | 100 | ## Setup |
107 | 101 |
|
108 | | -1. Clone this repository: |
109 | | - ``` |
110 | | - git clone https://github.com/nmogil-tw/mcp-te-benchmark.git |
111 | | - ``` |
112 | | -2. Run the setup script: |
113 | | - ``` |
114 | | - ./scripts/setup.sh |
115 | | - ``` |
116 | | -3. Create your `.env` file from the example: |
117 | | - ``` |
118 | | - cp .env.example .env |
119 | | - ``` |
120 | | -4. Edit the `.env` file with your Twilio credentials |
121 | | -5. Install dependencies: |
122 | | - ``` |
123 | | - npm install |
124 | | - ``` |
125 | | -6. Start the dashboard server: |
126 | | - ``` |
127 | | - npm start |
128 | | - ``` |
| 102 | +1. Clone this repository: |
| 103 | + ```bash |
| 104 | + git clone https://github.com/nmogil-tw/mcp-te-benchmark.git |
| 105 | + cd mcp-te-benchmark |
| 106 | + ``` |
| 107 | +2. Install dependencies: |
| 108 | + ```bash |
| 109 | + npm install |
| 110 | + ``` |
| 111 | +3. Create your `.env` file from the example: |
| 112 | + ```bash |
| 113 | + cp .env.example .env |
| 114 | + ``` |
| 115 | +4. Edit the `.env` file with the required credentials (e.g., your Twilio Account SID and Auth Token).
| 116 | +5. *(Optional)* Run any project-specific setup scripts (see the `scripts/` directory, if present).
129 | 117 |
|
130 | 118 | ## Running Tests |
131 | 119 |
|
132 | 120 | ### Testing Protocol |
133 | 121 |
|
134 | | -1. Open Cline and start a new chat with Claude |
135 | | - |
136 | | -2. Upload the appropriate instruction file as context: |
137 | | - - For control tests: `agent-instructions/control_instructions.md` |
138 | | - - For MCP tests: `agent-instructions/mcp_instructions.md` |
139 | | - |
140 | | -3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions` |
141 | | - |
142 | | -4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs |
| 122 | +1. Open Cline (or the specified MCP Client) and start a new chat with the target model (e.g., Claude). |
| 123 | +2. Upload the appropriate instruction file as context: |
| 124 | + * For control tests: `agent-instructions/control_instructions.md` |
| 125 | + * For MCP tests: `agent-instructions/mcp_instructions.md` |
| 126 | +3. Start the test with the prompt: `Complete Task [TASK_NUMBER] using the commands in the instructions` |
| 127 | +4. Allow the AI assistant to complete the task. Metrics will be collected from the chat logs later. |
| 128 | +5. Repeat for all desired tasks and modes. |
143 | 129 |
|
144 | 130 | ### Extracting Metrics from Chat Logs |
145 | 131 |
|
146 | | -After running tests, extract metrics from Claude chat logs: |
| 132 | +After running tests, extract metrics from the chat logs: |
147 | 133 |
|
148 | 134 | ```bash |
| 135 | +# Extracts metrics and updates summary.json |
149 | 136 | npm run extract-metrics |
150 | 137 | ``` |
151 | 138 |
|
|