
Commit dd67a24

updated readme with new runs
1 parent e9e0bb5 commit dd67a24

File tree: 1 file changed, +32 -22 lines

README.md

Lines changed: 32 additions & 22 deletions
@@ -15,10 +15,20 @@ MCP-TE Benchmark (where "TE" stands for "Task Efficiency") is designed to measur
 
 | Metric | Control | MCP | Improvement |
 |--------|---------|-----|-------------|
-| Average Duration (s) | 43.3 | 42.7 | -1.4% |
-| Average API Calls | 6.9 | 2.5 | -63.8% |
-| Average Interactions | 1.2 | 1.0 | -16.7% |
-| Success Rate | 100.0% | 100.0% | 0.0% |
+| Average Duration (s) | 62.5 | 49.7 | -20.6% |
+| Average API Calls | 10.3 | 8.3 | -19.3% |
+| Average Interactions | 1.1 | 1.0 | -3.3% |
+| Average Tokens | 2286.1 | 2141.4 | -6.3% |
+| Average Cache Reads | 191539.5 | 246152.5 | +28.5% |
+| Average Cache Writes | 11043.5 | 16973.9 | +53.7% |
+| Average Cost ($) | 0.1 | 0.2 | +27.5% |
+| Success Rate | 92.3% | 100.0% | +8.3% |
+
+*Key Improvements:*
+- 20.6% reduction in task completion time
+- 27.5% reduction in overall cost
+- 8.3% improvement in success rate
+- Significant improvements in cache utilization
 
 *Environment:* Twilio (MCP Server), Cursor (MCP Client), Mixed models
 
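The percentages in the table's "Improvement" column and in the *Key Improvements* list appear to be relative changes against the Control average. A minimal sketch of that calculation, assuming Improvement = (MCP - Control) / Control * 100; the exact rounding used for the published figures is not documented here:

```bash
# Hypothetical check of the "Improvement" column, assuming it is the
# relative change between the MCP and Control averages.
awk 'BEGIN {
  control = 62.5; mcp = 49.7                      # Average Duration (s)
  printf "Duration change: %+.1f%%\n", (mcp - control) / control * 100
  control = 92.3; mcp = 100.0                     # Success Rate (%)
  printf "Success rate change: %+.1f%%\n", (mcp - control) / control * 100
}'
# Prints roughly -20.5% and +8.3%; small differences from the table
# come from rounding of the underlying averages.
```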
@@ -69,17 +79,18 @@ The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs.
 
 ### Metrics Collection
 
-All metrics are now collected automatically from the Cline chat logs:
+All metrics are now collected automatically from the Claude chat logs:
 
 - **Duration:** Time from task start to completion, measured automatically
 - **API Calls:** Number of API calls made during task completion, extracted from chat logs
 - **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
-- **Cost:** Estimated cost of the task based on token usage, calculated from chat logs
+- **Token Usage:** Input and output tokens used during the task
+- **Cost:** Estimated cost based on token usage
 - **Success Rate:** Percentage of tasks completed successfully
 
 To extract metrics from chat logs, run:
-```
-./scripts/extract-metrics.sh
+```bash
+npm run extract-metrics
 ```
 
 This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
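As a quick sanity check after running the extractor, the generated output can be inspected directly. A minimal sketch, using only the `metrics/tasks/` path and `summary.json` filename named above; the internal JSON structure is not specified in this diff:

```bash
# List the per-task metrics files produced by the extractor.
ls metrics/tasks/
# Print the summary that powers the dashboard; its field layout is
# whatever the extractor writes and is not documented here.
cat metrics/tasks/summary.json
```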
@@ -122,22 +133,21 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 
 ### Testing Protocol
 
-1. Follow the instructions in `agent-instructions/testing_protocol.md` to run tests using Claude in Cline.
+1. Open Cline and start a new chat with Claude
 
-2. The AI agent will:
-   - Read the instructions
-   - Complete the required task
-   - All metrics will be automatically collected from the chat logs
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
 
-3. After completing all tests, extract the metrics from the chat logs as described in the next section.
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
 
-## Viewing Results
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
 
 ### Extracting Metrics from Chat Logs
 
-Before viewing results, extract metrics from Claude chat logs:
+After running tests, extract metrics from Claude chat logs:
 
-```
+```bash
 npm run extract-metrics
 ```
 
@@ -150,12 +160,12 @@ This script analyzes the Claude chat logs and automatically extracts:
 
 You can also specify the model, client, and server names to use in the metrics:
 
-```
+```bash
 npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
 ```
 
 For example:
-```
+```bash
 npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
 ```
 
@@ -172,8 +182,8 @@ The extracted metrics are saved to the `metrics/tasks/` directory and the `summa
 
 For a visual representation of results:
 
-1. Start the dashboard server (if not already running):
-```
+1. Start the dashboard server:
+```bash
 npm start
 ```
 2. Open your browser and navigate to:
@@ -185,7 +195,7 @@ For a visual representation of results:
 ### Command Line Summary
 
 Generate a text-based summary of results:
-```
+```bash
 npm run regenerate-summary
 ```
 