twilio-labs
diff --git a/‎LICENSE‎
Lines changed: 21 additions & 0 deletions b/‎LICENSE‎
Lines changed: 21 additions & 0 deletions
diff --git a/‎README.md‎
Lines changed: 116 additions & 110 deletions b/‎README.md‎
Lines changed: 116 additions & 110 deletions
diff --git a/‎agent-instructions/control_instructions.md‎
Lines changed: 1 addition & 1 deletion b/‎agent-instructions/control_instructions.md‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎agent-instructions/mcp_instructions.md‎
Lines changed: 1 addition & 1 deletion b/‎agent-instructions/mcp_instructions.md‎
Lines changed: 1 addition & 1 deletion
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Twilio Alpha
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -1,151 +1,156 @@
 <p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p>
 <h1 align="center">MCP-TE Benchmark</h1> 
 
-A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search.
+A standardized framework for evaluating the efficiency gains and trade-offs of AI agents using Model Context Protocol (MCP) compared to traditional methods.
 
 ## Abstract
 
-MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., documentation lookup, trial-and-error). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols becomes increasingly important for optimizing developer productivity and cost.
+MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure the efficiency gains, resource utilization changes, and qualitative differences when AI coding agents utilize structured context protocols (like MCP) compared to traditional development methods (e.g., file search, terminal execution, web search). As AI coding assistants become more integrated into development workflows, understanding how they interact with APIs and structured protocols is crucial for optimizing developer productivity and evaluating overall cost-effectiveness.
 
 ## Leaderboard
 
-### Overall Performance
-
-| Metric | Control | MCP | Improvement |
-|--------|---------|-----|-------------|
-| Average Duration (s) | 62.5 | 49.7 | -20.6% |
-| Average API Calls | 10.3 | 8.3 | -19.3% |
-| Average Interactions | 1.1 | 1.0 | -3.3% |
-| Average Tokens | 2286.1 | 2141.4 | -6.3% |
-| Average Cache Reads | 191539.5 | 246152.5 | +28.5% |
-| Average Cache Writes | 11043.5 | 16973.9 | +53.7% |
-| Average Cost ($) | 0.1 | 0.2 | +27.5% |
-| Success Rate | 92.3% | 100.0% | +8.3% |
-
-*Key Improvements:*
-- 20.6% reduction in task completion time
-- 27.5% reduction in overall cost
-- 8.3% improvement in success rate
-- Significant improvements in cache utilization
-
-*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models
-
-### Task-Specific Performance
-
-#### Twilio - Task 1: Purchase a Canadian Phone Number
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 20.7 | 3.5 | 1.0 | 100% |
-| auto | MCP | 38.4 | 2.3 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 64.3 | 9.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | 42.2 | 3.0 | 1.0 | 100% |
-
-#### Twilio - Task 2: Create a Task Router Activity
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 65.6 | 14.0 | 2.0 | 100% |
-| auto | MCP | 43.5 | 3.0 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 35.3 | 4.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A |
-
-#### Twilio - Task 3: Create a Queue with Task Filter
-
-| Model | Mode | Duration (s) | API Calls | Interactions | Success |
-|-------|------|--------------|-----------|--------------|---------|
-| auto | Control | 38.5 | 5.0 | 1.0 | 100% |
-| auto | MCP | 45.2 | 2.0 | 1.0 | 100% |
-| claude-3.7-sonnet | Control | 40.1 | 4.0 | 1.0 | 100% |
-| claude-3.7-sonnet | MCP | N/A | N/A | N/A | N/A |
+### Overall Performance (Model: claude-3.7-sonnet)
+
+*Environment: Twilio (MCP Server), Cline (MCP Client), Model: claude-3.7-sonnet*
+
+| Metric                 | Control    | MCP        | Change  |
+| :--------------------- | :--------- | :--------- | :-----  |
+| Average Duration (s)   | 62.54      | 49.68      | -20.56% |
+| Average API Calls      | 10.27      | 8.29       | -19.26% |
+| Average Interactions   | 1.08       | 1.04       | -3.27%  |
+| Average Tokens         | 2286.12    | 2141.38    | -6.33%  |
+| Average Cache Reads    | 191539.50  | 246152.46  | +28.51% |
+| Average Cache Writes   | 11043.46   | 16973.88   | +53.70% |
+| Average Cost ($)       | 0.13       | 0.17       | +27.55% |
+| Success Rate           | 92.31%     | 100.0%     | +8.33%  |
+
+*Note: Calculations based on data in `metrics/summary.json`.*
+
+*Key Findings (claude-3.7-sonnet):*
+*   **Efficiency Gains:** MCP usage resulted in faster task completion (-20.5% duration), fewer API calls (-19.3%), and slightly fewer user interactions (-3.3%). Token usage also saw a modest decrease (-6.3%).
+*   **Increased Resource Utilization:** MCP significantly increased cache reads (+28.5%) and cache writes (+53.7%).
+*   **Cost Increase:** The increased resource utilization, particularly cache operations or potentially different API call patterns within MCP, led to a notable increase in average task cost (+27.5%).
+*   **Improved Reliability:** MCP achieved a perfect success rate (100%), an 8.3% improvement over the control group.
+
+*Conclusion:* For the `claude-3.7-sonnet` model in this benchmark, MCP offers improvements in speed, API efficiency, and reliability, but at the cost of increased cache operations and overall monetary cost per task.
+
+### Task-Specific Performance (Model: claude-3.7-sonnet)
+
+*Calculations based on data in `summary.json`.*
+
+#### Task 1: Purchase a Canadian Phone Number
+
+| Metric                  | Control    | MCP        | Change   |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s)           | 79.41      | 62.27      | -21.57%  |
+| API Calls              | 12.78      | 9.63       | -24.67%  |
+| Interactions           | 1.22       | 1.13       | -7.95%   |
+| Tokens                 | 2359.33    | 2659.88    | +12.74%  |
+| Cache Reads            | 262556.11  | 281086.13  | +7.06%   |
+| Cache Writes           | 17196.33   | 25627.63   | +49.03%  |
+| Cost ($)               | 0.18       | 0.22       | +23.50%  |
+| Success Rate           | 100.00%    | 100.00%    | 0.00%    |
+
+#### Task 2: Create a Task Router Activity
+
+| Metric                  | Control    | MCP        | Change   |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s)           | 46.37      | 30.71      | -33.77%  |
+| API Calls              | 8.44       | 5.88       | -30.43%  |
+| Interactions           | 1.00       | 1.00       | 0.00%    |
+| Tokens                 | 2058.89    | 1306.63    | -36.54%  |
+| Cache Reads            | 144718.44  | 164311.50  | +13.54%  |
+| Cache Writes           | 6864.44    | 11219.13   | +63.44%  |
+| Cost ($)               | 0.10       | 0.11       | +11.09%  |
+| Success Rate           | 77.78%     | 100.00%    | +28.57%  |
+
+#### Task 3: Create a Queue with Task Filter
+
+| Metric                  | Control    | MCP        | Change   |
+| :--------------------- | :--------- | :--------- | :------- |
+| Duration (s)           | 61.77      | 56.07      | -9.23%   |
+| API Calls              | 9.50       | 9.38       | -1.32%   |
+| Interactions           | 1.00       | 1.00       | 0.00%    |
+| Tokens                 | 2459.38    | 2457.63    | -0.07%   |
+| Cache Reads            | 164319.50  | 293059.75  | +78.35%  |
+| Cache Writes           | 8822.88    | 14074.88   | +59.53%  |
+| Cost ($)               | 0.12       | 0.18       | +49.06%  |
+| Success Rate           | 100.00%    | 100.00%    | 0.00%    |
 
 ## Benchmark Design & Metrics
 
 The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs. Treatment methodology:
 
-- **Control Group:** Completion of API tasks using traditional methods (web search, documentation, and terminal capabilities)
-- **Treatment Group:** Completion of the same API tasks using Model Context Protocol (MCP)
+*   **Control Group:** Completion of API tasks using traditional methods (web search, file search, and terminal capabilities).
+*   **Treatment Group (MCP):** Completion of the same API tasks using a Twilio Model Context Protocol server (MCP).
 
-### Key Metrics
+### Key Metrics Collected
 
-| Metric | Description |
-|--------|-------------|
-| Duration | Time taken to complete a task from start to finish (in seconds) |
-| API Calls | Number of API calls made during task completion |
-| Interactions | Number of exchanges between the user and the AI assistant |
-| Success Rate | Percentage of tasks completed successfully |
+| Metric         | Description                                                                 |
+| :------------- | :-------------------------------------------------------------------------- |
+| Duration       | Time taken to complete a task from start to finish (in seconds)             |
+| API Calls      | Number of API calls made during task completion                             |
+| Interactions   | Number of exchanges between the user and the AI assistant                   |
+| Tokens         | Total input and output tokens used during the task                          |
+| Cache Reads    | Number of cached tokens read (measure of cache hit effectiveness)           |
+| Cache Writes   | Number of tokens written to the cache (measure of context loading/saving)   |
+| Cost           | Estimated cost ($) based on token usage and cache operations (model specific) |
+| Success Rate   | Percentage of tasks completed successfully                                  |
 
 ### Metrics Collection
 
-All metrics are now collected automatically from the Claude chat logs:
-
-- **Duration:** Time from task start to completion, measured automatically
-- **API Calls:** Number of API calls made during task completion, extracted from chat logs
-- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
-- **Token Usage:** Input and output tokens used during the task
-- **Cost:** Estimated cost based on token usage
-- **Success Rate:** Percentage of tasks completed successfully
-
-To extract metrics from chat logs, run:
-```bash
-npm run extract-metrics
-```
+All metrics are collected automatically from Claude chat logs using the scripts provided in this repository:
 
-This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
+*   Run `npm run extract-metrics` to process chat logs.
+*   This script analyzes logs, calculates metrics for each task run, and saves them as individual JSON files in `metrics/tasks/`.
+*   It also generates/updates a `summary.json` file in the same directory, consolidating all individual results.
 
 ## Tasks
 
 The current benchmark includes the following tasks specific to the Twilio MCP Server:
 
-1. **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416)
-2. **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom"
-3. **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity
+1.  **Purchase a Canadian Phone Number:** Search for and purchase an available Canadian phone number (preferably with area code 416).
+2.  **Create a Task Router Activity:** Create a new Task Router activity named "Bathroom".
+3.  **Create a Queue with Task Filter:** Create a queue with a task filter that prevents routing tasks to workers in the "Bathroom" activity.
 
-While the initial task suite focuses on Twilio MCP Server functionality, the MCP-TE framework is designed to be adaptable to other APIs and context protocols.
+(Setup, Running Tests, Extracting Metrics, Dashboard, CLI Summary sections remain largely the same as they accurately describe the repo structure and tools)
 
 ## Setup
 
-1. Clone this repository:
-   ```
-   git clone https://github.com/nmogil-tw/mcp-te-benchmark.git
-   ```
-2. Run the setup script:
-   ```
-   ./scripts/setup.sh
-   ```
-3. Create your `.env` file from the example:
-   ```
-   cp .env.example .env
-   ```
-4. Edit the `.env` file with your Twilio credentials
-5. Install dependencies:
-   ```
-   npm install
-   ```
-6. Start the dashboard server:
-   ```
-   npm start
-   ```
+1.  Clone this repository:
+    ```bash
+    git clone https://github.com/nmogil-tw/mcp-te-benchmark.git
+    cd mcp-te-benchmark
+    ```
+2.  Install dependencies:
+    ```bash
+    npm install
+    ```
+3.  Create your `.env` file from the example:
+    ```bash
+    cp .env.example .env
+    ```
+4.  Edit the `.env` file with your necessary credentials (e.g., Twilio).
+5.  *(Optional)* If needed, run any project-specific setup scripts (check `scripts/` directory if applicable).
 
 ## Running Tests
 
 ### Testing Protocol
 
-1. Open Cline and start a new chat with Claude
-
-2. Upload the appropriate instruction file as context:
-   - For control tests: `agent-instructions/control_instructions.md`
-   - For MCP tests: `agent-instructions/mcp_instructions.md`
-
-3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
-
-4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
+1.  Open Cline (or the specified MCP Client) and start a new chat with the target model (e.g., Claude).
+2.  Upload the appropriate instruction file as context:
+    *   For control tests: `agent-instructions/control_instructions.md`
+    *   For MCP tests: `agent-instructions/mcp_instructions.md`
+3.  Start the test with the prompt: `Complete Task [TASK_NUMBER] using the commands in the instructions`
+4.  Allow the AI assistant to complete the task. Metrics will be collected from the chat logs later.
+5.  Repeat for all desired tasks and modes.
 
 ### Extracting Metrics from Chat Logs
 
-After running tests, extract metrics from Claude chat logs:
+After running tests, extract metrics from the chat logs:
 
 ```bash
+# Extracts metrics and updates summary.json
 npm run extract-metrics
 ```
 
@@ -204,7 +209,8 @@ The benchmark focuses on these key insights:
 1. **Time Efficiency:** Comparing the time it takes to complete tasks using MCP vs. traditional methods
 2. **API Efficiency:** Measuring the reduction in API calls when using MCP
 3. **Interaction Efficiency:** Evaluating if MCP reduces the number of interactions needed to complete tasks
-4. **Success Rate:** Determining if MCP improves the reliability of task completion
+4. **Cost Efficiency** Evalutating if the added MCP context has an impact on Token Costs
+5. **Success Rate:** Determining if MCP improves the reliability of task completion
 
 Negative percentage changes in duration, API calls, and interactions indicate improvements, while positive changes in success rate indicate improvements.
 
 
@@ -4,7 +4,7 @@ This document contains three Twilio implementation tasks to complete using web s
 
 ## Environment Setup
 
-- The Cursor coding agent has access to web search and terminal commands
+- The Cline coding agent has access to web search and terminal commands
 - Use the .env file to access any twilio authentication credentials, like Twilio accound SID
 
 ## Testing Protocol
 
@@ -4,7 +4,7 @@ This document contains three Twilio implementation tasks to complete using the T
 
 ## Environment Setup
 
-- The Cursor coding agent has access to Twilio MCP functions
+- The Cline coding agent has access to Twilio MCP functions
 - Use the .env file to access any twilio authentication credentials, like Twilio accound SID
 
 ## Testing Protocol