Commit 92d4081

Merge pull request #1 from twilio-internal/updating-readme
updated readme & refactored
2 parents 15d585e + d9d7fb2 commit 92d4081

File tree

103 files changed: +5583 -2816 lines changed


.gitignore

Lines changed: 3 additions & 1 deletion
@@ -22,4 +22,6 @@ results/analysis/
 .DS_Store
 Thumbs.db
 
-.logs/
+.logs/
+
+example_task_logs/

README.md

Lines changed: 72 additions & 53 deletions
@@ -1,4 +1,4 @@
-<p align="center"><img src="docs/twilioAlphaLogo.png" height="100" alt="Twilio Alpha"/></p>
+<p align="center"><img src="docs/twilioAlphaLogoLight.png#gh-dark-mode-only" height="100" alt="Twilio Alpha"/><img src="docs/twilioAlphaLogoDark.png#gh-light-mode-only" height="100" alt="Twilio Alpha"/></p>
 <h1 align="center">MCP-TE Benchmark</h1>
 
 A standardized framework for evaluating the efficiency gains of AI agents using Model Context Protocol (MCP) compared to custom tools, such as terminal execution and web search.
@@ -9,18 +9,26 @@ MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measure
 
 ## Leaderboard
 
-**Note:** Due to a limitation in the current MCP Client (Cursor), model selection is restricted in some test runs. 'Auto' indicates the client's automatic model selection. Results for specific models will be added as they become available.
-
 ### Overall Performance
 
 | Metric | Control | MCP | Improvement |
 |--------|---------|-----|-------------|
-| Average Duration (s) | 43.3 | 42.7 | -1.4% |
-| Average API Calls | 6.9 | 2.5 | -63.8% |
-| Average Interactions | 1.2 | 1.0 | -16.7% |
-| Success Rate | 100.0% | 100.0% | 0.0% |
-
-*Environment:* Twilio (MCP Server), Cursor (MCP Client), Mixed models
+| Average Duration (s) | 62.5 | 49.7 | -20.6% |
+| Average API Calls | 10.3 | 8.3 | -19.3% |
+| Average Interactions | 1.1 | 1.0 | -3.3% |
+| Average Tokens | 2286.1 | 2141.4 | -6.3% |
+| Average Cache Reads | 191539.5 | 246152.5 | +28.5% |
+| Average Cache Writes | 11043.5 | 16973.9 | +53.7% |
+| Average Cost ($) | 0.1 | 0.2 | +27.5% |
+| Success Rate | 92.3% | 100.0% | +8.3% |
+
+*Key Improvements:*
+- 20.6% reduction in task completion time
+- 27.5% increase in overall cost
+- 8.3% improvement in success rate
+- Significant improvements in cache utilization
+
+*Environment:* Twilio (MCP Server), Cline (MCP Client), Mixed models
 
 ### Task-Specific Performance
 
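The Improvement column appears to be computed as (MCP − Control) / Control, so negative values mean MCP consumed less. A minimal TypeScript sketch of that arithmetic (a hypothetical helper, not code from this repo; the displayed averages are rounded, so recomputed percentages can drift by a few tenths):

```ts
// Hypothetical helper mirroring how the Improvement column seems to be derived.
function improvement(control: number, mcp: number): string {
  const pct = ((mcp - control) / control) * 100; // negative = MCP used less
  return `${pct >= 0 ? "+" : ""}${pct.toFixed(1)}%`;
}

console.log(improvement(62.5, 49.7)); // "-20.5%" (table shows -20.6%, from unrounded data)
console.log(improvement(92.3, 100.0)); // "+8.3%", matching the Success Rate row
```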
@@ -67,12 +75,23 @@ The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs.
 | Interactions | Number of exchanges between the user and the AI assistant |
 | Success Rate | Percentage of tasks completed successfully |
 
-### Metric Collection Limitations
+### Metrics Collection
+
+All metrics are now collected automatically from the Claude chat logs:
 
-Some metrics are collected with different methods due to client limitations:
+- **Duration:** Time from task start to completion, measured automatically
+- **API Calls:** Number of API calls made during task completion, extracted from chat logs
+- **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
+- **Token Usage:** Input and output tokens used during the task
+- **Cost:** Estimated cost based on token usage
+- **Success Rate:** Percentage of tasks completed successfully
+
+To extract metrics from chat logs, run:
+```bash
+npm run extract-metrics
+```
 
-- **Duration and Success/Failure:** Logged automatically by the metrics server
-- **API Calls and Interactions:** Currently manually counted post-run by observing the agent's behavior in Cursor, as Cursor does not provide detailed execution logs that would allow for automatic extraction of these metrics
+This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
 
 ## Tasks
 
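To make the extractor's output concrete, here is a sketch of what one record in `metrics/tasks/` might look like; the field names are assumptions inferred from the metrics listed above, not the script's actual schema:

```ts
// Assumed shape of a per-task metrics record produced by extract-metrics;
// the real schema is defined by the script itself.
interface TaskMetrics {
  task: 1 | 2 | 3;
  mode: "control" | "mcp";
  model: string;  // e.g. "claude-3.7-sonnet"
  client: string; // e.g. "Cline"
  server: string; // e.g. "Twilio"
  durationSeconds: number;
  apiCalls: number;
  interactions: number;
  inputTokens: number;
  outputTokens: number;
  cacheReads: number;
  cacheWrites: number;
  costUsd: number;
  success: boolean;
}
```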
@@ -103,11 +122,7 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 ```
 npm install
 ```
-6. Start the metrics server:
-```
-npm run start:metrics
-```
-7. Start the dashboard server (optional, for real-time visualization):
+6. Start the dashboard server:
 ```
 npm start
 ```
@@ -116,53 +131,57 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 
 ### Testing Protocol
 
-1. Start the metrics server if not already running:
-```
-npm run start:metrics
-```
+1. Open Cline and start a new chat with Claude
 
-2. Use the run-test.sh script to prepare a specific test:
-```
-./scripts/run-test.sh [control|mcp] [1|2|3] [model-name]
-```
-Where:
-- First parameter is the test mode (control or mcp)
-- Second parameter is the task number (1, 2, or 3)
-- Third parameter is the model name (e.g., "claude-3.7-sonnet")
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
+
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
 
-3. Follow the on-screen instructions:
-   - Open Cursor with the AI Agent
-   - Load the appropriate instructions file (control_instructions.md or mcp_instructions.md) as context
-   - Start the conversation with: "Complete Task [TASK_NUMBER] using the commands in the instructions"
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
 
-4. The AI agent will then:
-   - Read the instructions
-   - Execute the start curl command to begin timing
-   - Complete the required task
-   - Execute the complete curl command to end timing
+### Extracting Metrics from Chat Logs
 
-5. After the AI agent completes the task, press Enter in the terminal window to continue with the next test or generate the summary
+After running tests, extract metrics from Claude chat logs:
 
-6. Important: Before running tests, ensure the instruction documents contain the correct endpoint paths:
-   - The start command should use `/metrics/start`
-   - The complete command should use `/metrics/complete`
-   - The model parameter should be included in the start command
+```bash
+npm run extract-metrics
+```
 
-### Batch Testing
+This script analyzes the Claude chat logs and automatically extracts:
+- Duration of each task
+- Number of API calls
+- Number of user interactions
+- Token usage and estimated cost
+- Success/failure status
 
-To run all tests in sequence:
+You can also specify the model, client, and server names to use in the metrics:
+
+```bash
+npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
 ```
-./scripts/run-test.sh run-all --model [model-name]
+
+For example:
+```bash
+npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
 ```
 
-## Viewing Results
+These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.
+
+Additional options:
+- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
+- `--verbose` or `-v`: Enable verbose logging for debugging
+- `--help` or `-h`: Show help message
+
+The extracted metrics are saved to the `metrics/tasks/` directory and the `summary.json` file is updated.
 
 ### Interactive Dashboard
 
 For a visual representation of results:
 
-1. Start the dashboard server (if not already running):
-```
+1. Start the dashboard server:
+```bash
 npm start
 ```
 2. Open your browser and navigate to:
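Since `summary.json` powers both the dashboard and the text summary, it plausibly aggregates the per-task records into per-mode averages. A sketch of one possible shape (an assumption, not the repo's actual format):

```ts
// Assumed aggregate shape for summary.json; the file is actually produced by
// extract-metrics and may be organized differently.
interface ModeAverages {
  avgDurationSeconds: number;
  avgApiCalls: number;
  avgInteractions: number;
  avgTokens: number;
  avgCostUsd: number;
  successRate: number; // 0..1
}

interface Summary {
  overall: { control: ModeAverages; mcp: ModeAverages };
  byTask: Record<"1" | "2" | "3", { control: ModeAverages; mcp: ModeAverages }>;
}
```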
@@ -174,8 +193,8 @@ For a visual representation of results:
 ### Command Line Summary
 
 Generate a text-based summary of results:
-```
-npm run generate-summary
+```bash
+npm run regenerate-summary
 ```
 
 ## Results Interpretation
@@ -208,4 +227,4 @@ If you use MCP-TE Benchmark in your research or development, please cite:
 
 ## Contact
 
-For questions about MCP-TE Benchmark, please open an issue on this repository or contact the Twilio Emerging Technology & Innovation Team.
+For questions about MCP-TE Benchmark, please open an issue on this repository or contact the Twilio Emerging Technology & Innovation Team.

agent-instructions/control_instructions.md

Lines changed: 1 addition & 16 deletions
@@ -7,22 +7,6 @@ This document contains three Twilio implementation tasks to complete using web s
 - The Cursor coding agent has access to web search and terminal commands
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
-## Metrics Recording
-
-For accurate performance measurement, you must execute these commands:
-
-1. When starting each task:
-```bash
-curl -X POST http://localhost:3000/test/start -H "Content-Type: application/json" -d '{"mode": "control", "taskNumber": TASK_NUMBER, "model": "claude-3.7-sonnet"}'
-```
-
-2. When completing each task:
-```bash
-curl -X POST http://localhost:3000/test/complete -H "Content-Type: application/json" -d '{"testId": "TEST_ID", "success": true|false}'
-```
-
-Replace TASK_NUMBER with the current task number (1, 2, or 3) and TEST_ID with the ID received from the start command.
-
 ## Testing Protocol
 
 For each task:
@@ -39,6 +23,7 @@ Goal: Search for and purchase an available Canadian phone number.
 Requirements:
 - Use area code 416 if available
 - If 416 is not available, any Canadian number is acceptable
+- Name it "Control {{timestamp}}"
 - Store the purchased number for use in Task 3
 
 Success Criteria:

agent-instructions/mcp_instructions.md

Lines changed: 1 addition & 16 deletions
@@ -7,22 +7,6 @@ This document contains three Twilio implementation tasks to complete using the T
 - The Cursor coding agent has access to Twilio MCP functions
 - Use the .env file to access any Twilio authentication credentials, like the Twilio account SID
 
-## Metrics Recording
-
-For accurate performance measurement, you must execute these commands:
-
-1. When starting each task:
-```bash
-curl -X POST http://localhost:3000/test/start -H "Content-Type: application/json" -d '{"mode": "mcp", "taskNumber": TASK_NUMBER, "model": "claude-3.7-sonnet"}'
-```
-
-2. When completing each task:
-```bash
-curl -X POST http://localhost:3000/test/complete -H "Content-Type: application/json" -d '{"testId": "TEST_ID", "success": true|false}'
-```
-
-Replace TASK_NUMBER with the current task number (1, 2, or 3) and TEST_ID with the ID received from the start command.
-
 ## Testing Protocol
 
 For each task:
@@ -40,6 +24,7 @@ Requirements:
 - Use area code 416 if available
 - If 416 is not available, any Canadian number is acceptable
 - Store the purchased number for use in Task 3
+- Name it "Test {{timestamp}}"
 - Use appropriate MCP functions for number search and purchase
 
 Success Criteria:
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+# Twilio MCP Testing Protocol
+
+## Overview
+
+This document outlines the simplified testing methodology for evaluating performance gains when using Twilio's Model Context Protocol (MCP) compared to traditional API approaches.
+
+## Test Objective
+
+Measure the time required to complete each task using traditional API approaches versus MCP-enabled implementations.
+
+## Metrics Collection
+
+### Automated Approach
+Metrics are now automatically collected from Claude chat logs. The system tracks:
+
+1. **Duration:** Time from task start to completion
+2. **API Calls:** Number of API calls made during task completion
+3. **Interactions:** Number of exchanges between the user and the AI assistant
+4. **Token Usage:** Input and output tokens used during the task
+5. **Cost:** Estimated cost based on token usage
+
+No manual timing commands are needed. The AI assistant simply completes the task, and all metrics are extracted from the chat logs afterward.
+
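To ground the "extracted from the chat logs afterward" step, here is a toy TypeScript sketch of the kind of scan an extractor performs; the log filename and event fields are assumptions, not the real format used by `npm run extract-metrics`:

```ts
import { readFileSync } from "node:fs";

// Toy illustration only: the actual extract-metrics script defines its own
// log location and schema; the fields below are assumed for the example.
interface ChatEvent {
  role: "user" | "assistant";
  apiCall?: boolean; // true when the event wraps a tool/API invocation
  timestamp: number; // ms since epoch
}

const events: ChatEvent[] = JSON.parse(readFileSync("chat-log.json", "utf8"));

const apiCalls = events.filter((e) => e.apiCall).length;
const interactions = events.filter((e) => e.role === "user").length;
const durationSeconds =
  (events[events.length - 1].timestamp - events[0].timestamp) / 1000;

console.log({ apiCalls, interactions, durationSeconds });
```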
+## Test Tasks
+
+### Task 1: Purchase a Canadian Phone Number
+- **Start**: When the AI assistant begins searching for Canadian numbers
+- **End**: When a Canadian phone number has been successfully purchased
+
+### Task 2: Create a Task Router Activity
+- **Start**: When the AI assistant begins creating the activity
+- **End**: When the "Bathroom" activity has been successfully created
+
+### Task 3: Create a Queue with Task Filter
+- **Start**: When the AI assistant begins creating the queue
+- **End**: When the queue with proper task filter has been successfully created
+
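For a sense of what Task 1 involves under the control condition, here is a minimal sketch using the Twilio Node.js helper library; the fallback logic and friendly name follow the instruction files, but this is an illustration, not code from the repo:

```ts
import twilio from "twilio";

// Credentials come from the .env file, as the instruction files specify.
const client = twilio(
  process.env.TWILIO_ACCOUNT_SID,
  process.env.TWILIO_AUTH_TOKEN
);

async function purchaseCanadianNumber(): Promise<string> {
  // Prefer area code 416; fall back to any available Canadian number.
  let [candidate] = await client
    .availablePhoneNumbers("CA")
    .local.list({ areaCode: 416, limit: 1 });
  if (!candidate) {
    [candidate] = await client
      .availablePhoneNumbers("CA")
      .local.list({ limit: 1 });
  }

  const purchased = await client.incomingPhoneNumbers.create({
    phoneNumber: candidate.phoneNumber,
    friendlyName: `Control ${Date.now()}`, // "Control {{timestamp}}" per the instructions
  });
  return purchased.phoneNumber; // kept for reuse in Task 3
}
```

Tasks 2 and 3 would similarly go through `client.taskrouter.v1.workspaces(workspaceSid)` to create the activity and the task queue.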
+## Testing Procedure
+
+### Setup
+1. Start the dashboard server:
+```bash
+npm start
+```
+
+### Running Tests
+For each test:
+
+1. Open Cline and start a new chat with Claude
+
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
+
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
+
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
+
+## Extracting Metrics from Chat Logs
+
+After running tests, you need to extract metrics from the Claude chat logs:
+
+```bash
+npm run extract-metrics
+```
+
+This script analyzes the Claude chat logs and automatically extracts:
+- Duration of each task
+- Number of API calls
+- Number of user interactions
+- Token usage and estimated cost
+- Success/failure status
+
+You can also specify the model, client, and server names to use in the metrics:
+
+```bash
+npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
+```
+
+For example:
+```bash
+npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
+```
+
+These arguments are optional and will override any values found in the logs or the default values. This is useful when the information isn't available in the logs or needs to be standardized across different runs.
+
+Additional options:
+- `--force` or `-f`: Force regeneration of all metrics, even if they already exist
+- `--verbose` or `-v`: Enable verbose logging for debugging
+- `--help` or `-h`: Show help message
+
+## Results Analysis
+
+After tests are complete and metrics are extracted, you have multiple ways to view and analyze results:
+
+### Interactive Dashboard
+The dashboard provides visual comparison of metrics:
+
+1. Access the dashboard:
+```
+http://localhost:3001
+```
+
+2. The dashboard shows:
+   - Task completion time comparison
+   - API calls per task
+   - Interactions per task
+   - Success rate comparison
+   - Detailed results table
+
+3. Use the "Refresh Data" button to update with latest results
+
+### Command Line Summary
+For a text-based summary:
+```bash
+npm run regenerate-summary
+```
+
+The performance improvement will be shown as percentage reduction in task completion time.
+
+## Troubleshooting
+
+If metrics are not being extracted properly:
+- Ensure the chat logs are being saved correctly in Cline
+- Check that the AI assistant completed the task successfully
+- Try running the extraction with the `--verbose` flag for more detailed logging:
+```bash
+npm run extract-metrics -- --verbose
+```
+
+For dashboard issues:
+- Make sure the dashboard server is running (`npm start`)
+- Check browser console for any JavaScript errors
+- Verify metrics files exist in the `src/server/metrics/` directory
