|
1 | 1 | <p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p> |
2 | 2 | <h1 align="center">MCP-TE Benchmark</h1> |
3 | 3 |
|
4 | | -A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search. |
| 4 | +A standardized framework for evaluating the efficiency gains and trade-offs of AI agents using Model Context Protocol (MCP) compared to traditional methods. |
5 | 5 |
|
6 | 6 | ## Abstract |
7 | 7 |
|
8 | | -MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., documentation lookup, trial-and-error). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols becomes increasingly important for optimizing developer productivity and cost. |
| 8 | +MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains, resource utilization changes, and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., file search, terminal execution, web search). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols is crucial for optimizing developer productivity and evaluating overall cost-effectiveness. |
9 | 9 |
|
10 | 10 | ## Leaderboard |
11 | 11 |
|
12 | | -### Overall Performance |
13 | | - |
14 | | -| Metric | Control | MCP | Improvement | |
15 | | -|--------|---------|-----|-------------| |
16 | | -| Average Duration (s) | 62.5 | 49.7 | -20.6% | |
17 | | -| Average API Calls | 10.3 | 8.3 | -19.3% | |
18 | | -| Average Interactions | 1.1 | 1.0 | -3.3% | |
19 | | -| Average Tokens | 2286.1 | 2141.4 | -6.3% | |
20 | | -| Average Cache Reads | 191539.5 | 246152.5 | +28.5% | |
21 | | -| Average Cache Writes | 11043.5 | 16973.9 | +53.7% | |
22 | | -| Average Cost ($) | 0.1 | 0.2 | +27.5% | |
23 | | -| Success Rate | 92.3% | 100.0% | +8.3% | |
24 | | - |
25 | | -*Key Improvements:* |
26 | | -- 20.6% reduction in task completion time |
27 | | -- 27.5% reduction in overall cost |
28 | | -- 8.3% improvement in success rate |
29 | | -- Significant improvements in cache utilization |
30 | | - |
31 | | -*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models |
32 | | - |
33 | | -### Task-Specific Performance |
34 | | - |
35 | | -#### Twilio - Task 1: Purchase a Canadian Phone Number |
36 | | - |
37 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
38 | | -|-------|------|--------------|-----------|--------------|---------| |
39 | | -| auto | Control | 20.7 | 3.5 | 1.0 | 100% | |
40 | | -| auto | MCP | 38.4 | 2.3 | 1.0 | 100% | |
41 | | -| claude-3.7-sonnet | Control | 64.3 | 9.0 | 1.0 | 100% | |
42 | | -| claude-3.7-sonnet | MCP | 42.2 | 3.0 | 1.0 | 100% | |
43 | | - |
44 | | -#### Twilio - Task 2: Create a Task Router Activity |
45 | | - |
46 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
47 | | -|-------|------|--------------|-----------|--------------|---------| |
48 | | -| auto | Control | 65.6 | 14.0 | 2.0 | 100% | |
49 | | -| auto | MCP | 43.5 | 3.0 | 1.0 | 100% | |
50 | | -| claude-3.7-sonnet | Control | 35.3 | 4.0 | 1.0 | 100% | |
51 | | -| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A | |
52 | | - |
53 | | -#### Twilio - Task 3: Create a Queue with Task Filter |
54 | | - |
55 | | -| Model | Mode | Duration (s) | API Calls | Interactions | Success | |
56 | | -|-------|------|--------------|-----------|--------------|---------| |
57 | | -| auto | Control | 38.5 | 5.0 | 1.0 | 100% | |
58 | | -| auto | MCP | 45.2 | 2.0 | 1.0 | 100% | |
59 | | -| claude-3.7-sonnet | Control | 40.1 | 4.0 | 1.0 | 100% | |
60 | | -| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A | |
| 12 | +### Overall Performance (Model: claude-3.7-sonnet) |
| 13 | + |
| 14 | +*Environment: Twilio (MCP Server), Cline (MCP Client), Model: claude-3.7-sonnet* |
| 15 | + |
| 16 | +| Metric | Control | MCP | Change | |
| 17 | +| :--------------------- | :--------- | :--------- | :----- | |
| 18 | +| Average Duration (s) | 62.54 | 49.68 | -20.56% | |
| 19 | +| Average API Calls | 10.27 | 8.29 | -19.26% | |
| 20 | +| Average Interactions | 1.08 | 1.04 | -3.27% | |
| 21 | +| Average Tokens | 2286.12 | 2141.38 | -6.33% | |
| 22 | +| Average Cache Reads | 191539.50 | 246152.46 | +28.51% | |
| 23 | +| Average Cache Writes | 11043.46 | 16973.88 | +53.70% | |
| 24 | +| Average Cost ($) | 0.13 | 0.17 | +27.55% | |
| 25 | +| Success Rate | 92.31% | 100.0% | +8.33% | |
| 26 | + |
| 27 | +*Note: Calculations based on data in `metrics/summary.json`.* |
| 28 | + |
| 29 | +*Key Findings (claude-3.7-sonnet):* |
| 30 | +* **Efficiency Gains:** MCP usage resulted in faster task completion (-20.6% duration), fewer API calls (-19.3%), and slightly fewer user interactions (-3.3%). Token usage also saw a modest decrease (-6.3%). |
| 31 | +* **Increased Resource Utilization:** MCP significantly increased cache reads (+28.5%) and cache writes (+53.7%). |
| 32 | +* **Cost Increase:** The increased resource utilization, particularly cache operations or potentially different API call patterns within MCP, led to a notable increase in average task cost (+27.5%). |
| 33 | +* **Improved Reliability:** MCP achieved a perfect success rate (100%), an 8.3% relative improvement over the control group's 92.31%. |
| 34 | + |
| 35 | +*Conclusion:* For the `claude-3.7-sonnet` model in this benchmark, MCP offers improvements in speed, API efficiency, and reliability, but at the cost of increased cache operations and overall monetary cost per task. |
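
The Change column above appears to be a relative change, `(MCP - Control) / Control`. The sketch below reproduces a few rows under that assumption; `percent_change` is a hypothetical helper, not part of the repository, and the inputs are the rounded values shown in the table.

```python
def percent_change(control: float, treatment: float) -> float:
    """Relative change of the treatment value versus control, in percent."""
    if control == 0:
        raise ValueError("control value must be non-zero")
    return (treatment - control) / control * 100.0


# Inputs copied from the Overall Performance table (already rounded there).
duration = percent_change(62.54, 49.68)
success = percent_change(92.31, 100.0)
cost_from_rounded = percent_change(0.13, 0.17)

print(f"duration: {duration:+.2f}%")
print(f"success rate: {success:+.2f}%")
print(f"cost (from rounded inputs): {cost_from_rounded:+.2f}%")
```

Duration and success rate match the table to two decimals. The cost row does not, because $0.13 and $0.17 are themselves rounded; the table's +27.55% was presumably computed from unrounded values. Note also that the success-rate change is relative: in absolute terms it is 7.69 percentage points.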
| 36 | + |
| 37 | +### Task-Specific Performance (Model: claude-3.7-sonnet) |
| 38 | + |
| 39 | +*Calculations based on data in `summary.json`.* |
| 40 | + |
| 41 | +#### Task 1: Purchase a Canadian Phone Number |
| 42 | + |
| 43 | +| Metric | Control | MCP | Change | |
| 44 | +| :--------------------- | :--------- | :--------- | :------- | |
| 45 | +| Duration (s) | 79.41 | 62.27 | -21.57% | |
| 46 | +| API Calls | 12.78 | 9.63 | -24.67% | |
| 47 | +| Interactions | 1.22 | 1.13 | -7.95% | |
| 48 | +| Tokens | 2359.33 | 2659.88 | +12.74% | |
| 49 | +| Cache Reads | 262556.11 | 281086.13 | +7.06% | |
| 50 | +| Cache Writes | 17196.33 | 25627.63 | +49.03% | |
| 51 | +| Cost ($) | 0.18 | 0.22 | +23.50% | |
| 52 | +| Success Rate | 100.00% | 100.00% | 0.00% | |
| 53 | + |
| 54 | +#### Task 2: Create a Task Router Activity |
| 55 | + |
| 56 | +| Metric | Control | MCP | Change | |
| 57 | +| :--------------------- | :--------- | :--------- | :------- | |
| 58 | +| Duration (s) | 46.37 | 30.71 | -33.77% | |
| 59 | +| API Calls | 8.44 | 5.88 | -30.43% | |
| 60 | +| Interactions | 1.00 | 1.00 | 0.00% | |
| 61 | +| Tokens | 2058.89 | 1306.63 | -36.54% | |
| 62 | +| Cache Reads | 144718.44 | 164311.50 | +13.54% | |
| 63 | +| Cache Writes | 6864.44 | 11219.13 | +63.44% | |
| 64 | +| Cost ($) | 0.10 | 0.11 | +11.09% | |
| 65 | +| Success Rate | 77.78% | 100.00% | +28.57% | |
| 66 | + |
| 67 | +#### Task 3: Create a Queue with Task Filter |
| 68 | + |
| 69 | +| Metric | Control | MCP | Change | |
| 70 | +| :--------------------- | :--------- | :--------- | :------- | |
| 71 | +| Duration (s) | 61.77 | 56.07 | -9.23% | |
| 72 | +| API Calls | 9.50 | 9.38 | -1.32% | |
| 73 | +| Interactions | 1.00 | 1.00 | 0.00% | |
| 74 | +| Tokens | 2459.38 | 2457.63 | -0.07% | |
| 75 | +| Cache Reads | 164319.50 | 293059.75 | +78.35% | |
| 76 | +| Cache Writes | 8822.88 | 14074.88 | +59.53% | |
| 77 | +| Cost ($) | 0.12 | 0.18 | +49.06% | |
| 78 | +| Success Rate | 100.00% | 100.00% | 0.00% | |
61 | 79 |
|
62 | 80 | ## Benchmark Design & Metrics |
63 | 81 |
|
64 | 82 | The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. Treatment methodology: |
65 | 83 |
|
66 | | -- **Control Group:** Completion of API tasks using traditional methods (web search, documentation, and terminal capabilities) |
67 | | -- **Treatment Group:** Completion of the same API tasks using Model Context Protocol (MCP) |
| 84 | +* **Control Group:** Completion of API tasks using traditional methods (web search, file search, and terminal capabilities). |
| 85 | +* **Treatment Group (MCP):** Completion of the same API tasks using a Twilio Model Context Protocol (MCP) server. |
68 | 86 |
|
69 | | -### Key Metrics |
| 87 | +### Key Metrics Collected |
70 | 88 |
|
71 | | -| Metric | Description | |
72 | | -|--------|-------------| |
73 | | -| Duration | Time taken to complete a task from start to finish (in seconds) | |
74 | | -| API Calls | Number of API calls made during task completion | |
75 | | -| Interactions | Number of exchanges between the user and the AI assistant | |
76 | | -| Success Rate | Percentage of tasks completed successfully | |
| 89 | +| Metric | Description | |
| 90 | +| :------------- | :-------------------------------------------------------------------------- | |
| 91 | +| Duration | Time taken to complete a task from start to finish (in seconds) | |
| 92 | +| API Calls | Number of API calls made during task completion | |
| 93 | +| Interactions | Number of exchanges between the user and the AI assistant | |
| 94 | +| Tokens | Total input and output tokens used during the task | |
| 95 | +| Cache Reads | Number of cached tokens read (measure of cache hit effectiveness) | |
| 96 | +| Cache Writes | Number of tokens written to the cache (measure of context loading/saving) | |
| 97 | +| Cost | Estimated cost ($) based on token usage and cache operations (model-specific) | |
| 98 | +| Success Rate | Percentage of tasks completed successfully | |
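
To illustrate how a cost estimate can combine token usage with cache operations, here is a minimal sketch. It is not the repository's implementation: `Pricing` and `estimate_cost` are hypothetical names, the per-million-token prices are placeholders rather than actual rates, and the token counts are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Placeholder prices in dollars per million tokens (assumed, not real rates)."""
    input: float
    output: float
    cache_read: float
    cache_write: float


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                  cache_reads: int, cache_writes: int) -> float:
    """Sum each token category weighted by its per-million-token price."""
    total = (input_tokens * pricing.input
             + output_tokens * pricing.output
             + cache_reads * pricing.cache_read
             + cache_writes * pricing.cache_write)
    return total / 1_000_000


# Illustrative numbers only: the MCP run uses fewer regular tokens
# but performs more cache reads and writes.
pricing = Pricing(input=3.0, output=15.0, cache_read=0.30, cache_write=3.75)
control_cost = estimate_cost(pricing, 1500, 800, 190_000, 11_000)
mcp_cost = estimate_cost(pricing, 1400, 700, 246_000, 17_000)

print(f"control: ${control_cost:.4f}, mcp: ${mcp_cost:.4f}")
```

Under a model like this, cache traffic can dominate the total, so a run may use fewer regular tokens and still cost more.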
77 | 99 |
|
78 | 100 | ### Metrics Collection |
79 | 101 |
|
80 | | -All metrics are now collected automatically from the Claude chat logs: |
81 | | - |
82 | | -- **Duration:** Time from task start to completion, measured automatically |
83 | | -- **API Calls:** Number of API calls made during task completion, extracted from chat logs |
84 | | -- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs |
85 | | -- **Token Usage:** Input and output tokens used during the task |
86 | | -- **Cost:** Estimated cost based on token usage |
87 | | -- **Success Rate:** Percentage of tasks completed successfully |
88 | | - |
89 | | -To extract metrics from chat logs, run: |
90 | | -```bash |
91 | | -npm run extract-metrics |
92 | | -``` |
| 102 | +All metrics are collected automatically from Claude chat logs using the scripts provided in this repository: |
93 | 103 |
|
94 | | -This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard. |
| 104 | +* Run `npm run extract-metrics` to process chat logs. |
| 105 | +* This script analyzes logs, calculates metrics for each task run, and saves them as individual JSON files in `metrics/tasks/`. |
| 106 | +* It also generates/updates a `summary.json` file in the same directory, consolidating all individual results. |
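
The consolidation step can be pictured as grouping per-run records by mode and averaging each metric. The sketch below works on in-memory records and assumes a hypothetical schema (`mode`, `duration`, `api_calls`, `success`); the actual JSON layout produced by `npm run extract-metrics` may differ.

```python
from collections import defaultdict
from statistics import mean


def summarize(runs: list[dict]) -> dict:
    """Group run records by mode and compute per-mode averages."""
    by_mode = defaultdict(list)
    for run in runs:
        by_mode[run["mode"]].append(run)

    summary = {}
    for mode, group in by_mode.items():
        successes = sum(1 for r in group if r["success"])
        summary[mode] = {
            "runs": len(group),
            "avg_duration": mean(r["duration"] for r in group),
            "avg_api_calls": mean(r["api_calls"] for r in group),
            "success_rate": 100.0 * successes / len(group),
        }
    return summary


# Invented records, for illustration only.
runs = [
    {"mode": "control", "duration": 60.0, "api_calls": 10, "success": True},
    {"mode": "control", "duration": 70.0, "api_calls": 12, "success": False},
    {"mode": "mcp", "duration": 50.0, "api_calls": 8, "success": True},
]
summary = summarize(runs)
print(summary)
```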
95 | 107 |
|
96 | 108 | ## Tasks |
97 | 109 |
|
98 | 110 | The current benchmark includes the following tasks specific to the Twilio MCP Server: |
99 | 111 |
|
100 | | -1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416) |
101 | | -2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom" |
102 | | -3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity |
| 112 | +1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416). |
| 113 | +2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom". |
| 114 | +3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity. |
103 | 115 |
|
104 | | -While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols. |
| 116 | +(Setup, Running Tests, Extracting Metrics, Dashboard, CLI Summary sections remain largely the same as they accurately describe the repo structure and tools) |
105 | 117 |
|
106 | 118 | ## Setup |
107 | 119 |
|
108 | | -1. Clone this repository: |
109 | | - ``` |
110 | | - git clone https://github.com/nmogil-tw/mcp-te-benchmark.git |
111 | | - ``` |
112 | | -2. Run the setup script: |
113 | | - ``` |
114 | | - ./scripts/setup.sh |
115 | | - ``` |
116 | | -3. Create your `.env` file from the example: |
117 | | - ``` |
118 | | - cp .env.example .env |
119 | | - ``` |
120 | | -4. Edit the `.env` file with your Twilio credentials |
121 | | -5. Install dependencies: |
122 | | - ``` |
123 | | - npm install |
124 | | - ``` |
125 | | -6. Start the dashboard server: |
126 | | - ``` |
127 | | - npm start |
128 | | - ``` |
| 120 | +1. Clone this repository: |
| 121 | + ```bash |
| 122 | + git clone https://github.com/nmogil-tw/mcp-te-benchmark.git |
| 123 | + cd mcp-te-benchmark |
| 124 | + ``` |
| 125 | +2. Install dependencies: |
| 126 | + ```bash |
| 127 | + npm install |
| 128 | + ``` |
| 129 | +3. Create your `.env` file from the example: |
| 130 | + ```bash |
| 131 | + cp .env.example .env |
| 132 | + ``` |
| 133 | +4. Edit the `.env` file with your necessary credentials (e.g., Twilio). |
| 134 | +5. *(Optional)* If needed, run any project-specific setup scripts (check `scripts/` directory if applicable). |
129 | 135 |
|
130 | 136 | ## Running Tests |
131 | 137 |
|
132 | 138 | ### Testing Protocol |
133 | 139 |
|
134 | | -1. Open Cline and start a new chat with Claude |
135 | | - |
136 | | -2. Upload the appropriate instruction file as context: |
137 | | - - For control tests: `agent-instructions/control_instructions.md` |
138 | | - - For MCP tests: `agent-instructions/mcp_instructions.md` |
139 | | - |
140 | | -3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions` |
141 | | - |
142 | | -4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs |
| 140 | +1. Open Cline (or the specified MCP Client) and start a new chat with the target model (e.g., Claude). |
| 141 | +2. Upload the appropriate instruction file as context: |
| 142 | + * For control tests: `agent-instructions/control_instructions.md` |
| 143 | + * For MCP tests: `agent-instructions/mcp_instructions.md` |
| 144 | +3. Start the test with the prompt: `Complete Task [TASK_NUMBER] using the commands in the instructions` |
| 145 | +4. Allow the AI assistant to complete the task. Metrics will be collected from the chat logs later. |
| 146 | +5. Repeat for all desired tasks and modes. |
143 | 147 |
|
144 | 148 | ### Extracting Metrics from Chat Logs |
145 | 149 |
|
146 | | -After running tests, extract metrics from Claude chat logs: |
| 150 | +After running tests, extract metrics from the chat logs: |
147 | 151 |
|
148 | 152 | ```bash |
| 153 | +# Extracts metrics and updates summary.json |
149 | 154 | npm run extract-metrics |
150 | 155 | ``` |
151 | 156 |
|
@@ -204,7 +209,8 @@ The benchmark focuses on these key insights: |
204 | 209 | 1. **Time Efficiency:** Comparing the time it takes to complete tasks using MCP vs. traditional methods |
205 | 210 | 2. **API Efficiency:** Measuring the reduction in API calls when using MCP |
206 | 211 | 3. **Interaction Efficiency:** Evaluating if MCP reduces the number of interactions needed to complete tasks |
207 | | -4. **Success Rate:** Determining if MCP improves the reliability of task completion |
| 212 | +4. **Cost Efficiency:** Evaluating whether the added MCP context affects token costs
| 213 | +5. **Success Rate:** Determining if MCP improves the reliability of task completion |
208 | 214 |
|
209 | 215 | For duration, API calls, and interactions, a negative percentage change indicates an improvement; for success rate, a positive change indicates an improvement.
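
The sign convention above can be encoded as a small lookup. This is a sketch only: `is_improvement` and the metric keys are hypothetical names, and metrics for which the text states no direction (such as cache reads) are deliberately left unclassified.

```python
from typing import Optional

# Directions stated in the text: lower is better for these three metrics...
LOWER_IS_BETTER = {"duration", "api_calls", "interactions"}
# ...and higher is better for success rate.
HIGHER_IS_BETTER = {"success_rate"}


def is_improvement(metric: str, change_pct: float) -> Optional[bool]:
    """Return True/False for a classified metric, or None when the
    direction is not defined or the change is zero."""
    if change_pct == 0:
        return None
    if metric in LOWER_IS_BETTER:
        return change_pct < 0
    if metric in HIGHER_IS_BETTER:
        return change_pct > 0
    return None


print(is_improvement("duration", -20.56))
print(is_improvement("success_rate", 8.33))
print(is_improvement("cache_reads", 28.51))
```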
210 | 216 |
|
|