@@ -69,17 +79,18 @@ The MCP-TE Benchmark evaluates AI coding agents' performance using a Control vs.
 
 ### Metrics Collection
 
-All metrics are now collected automatically from the Cline chat logs:
+All metrics are now collected automatically from the Claude chat logs:
 
 - **Duration:** Time from task start to completion, measured automatically
 - **API Calls:** Number of API calls made during task completion, extracted from chat logs
 - **Interactions:** Number of exchanges between the user and the AI assistant, extracted from chat logs
-- **Cost:** Estimated cost of the task based on token usage, calculated from chat logs
+- **Token Usage:** Input and output tokens used during the task
+- **Cost:** Estimated cost based on token usage
 - **Success Rate:** Percentage of tasks completed successfully
 
 To extract metrics from chat logs, run:
-```
-./scripts/extract-metrics.sh
+```bash
+npm run extract-metrics
 ```
 
 This script will analyze the Claude chat logs and generate metrics files in the `metrics/tasks/` directory, including an updated `summary.json` file that powers the dashboard.
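
Review note: the metrics list above now separates token usage from cost. The following is a minimal sketch of how cost and success rate could be derived from per-task records, not the repository's actual extraction logic. The `TaskMetrics` fields, the helper names, and both per-token prices are assumptions made for illustration; none of them come from the diff.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; not taken from the benchmark.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class TaskMetrics:
    """One task run. Field names are illustrative, not the benchmark's schema."""
    task_id: int
    mode: str          # "control" or "mcp"
    duration_s: float
    api_calls: int
    interactions: int
    tokens_in: int
    tokens_out: int
    success: bool


def estimated_cost(m: TaskMetrics) -> float:
    """Estimate cost in USD from input and output token counts."""
    return (m.tokens_in * INPUT_PRICE_PER_MTOK
            + m.tokens_out * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def success_rate(runs: list[TaskMetrics]) -> float:
    """Percentage of runs that completed successfully (0.0 when there are no runs)."""
    if not runs:
        return 0.0
    return 100.0 * sum(r.success for r in runs) / len(runs)


# Made-up sample runs, one per test mode.
runs = [
    TaskMetrics(1, "control", 120.0, 10, 3, 200_000, 10_000, True),
    TaskMetrics(1, "mcp", 90.0, 7, 2, 100_000, 20_000, False),
]

for run in runs:
    print(f"task {run.task_id} ({run.mode}): ${estimated_cost(run):.2f}")
print(f"success rate: {success_rate(runs):.1f}%")
```

Keeping token counts as the stored values and deriving cost from them means a price change only requires recomputing, not re-extracting from the chat logs.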
@@ -122,22 +133,21 @@ While the initial task suite focuses on Twilio MCP Server functionality, the MCP
 
 ### Testing Protocol
 
-1. Follow the instructions in `agent-instructions/testing_protocol.md` to run tests using Claude in Cline.
+1. Open Cline and start a new chat with Claude
 
-2. The AI agent will:
-   - Read the instructions
-   - Complete the required task
-   - All metrics will be automatically collected from the chat logs
+2. Upload the appropriate instruction file as context:
+   - For control tests: `agent-instructions/control_instructions.md`
+   - For MCP tests: `agent-instructions/mcp_instructions.md`
 
-3. After completing all tests, extract the metrics from the chat logs as described in the next section.
+3. Start the test with: `Complete Task [TASK_NUMBER] using the commands in the instructions`
 
-## Viewing Results
+4. The AI assistant will complete the task, and all metrics will be automatically collected from the chat logs
 
 ### Extracting Metrics from Chat Logs
 
-Before viewing results, extract metrics from Claude chat logs:
+After running tests, extract metrics from Claude chat logs:
 
-```
+```bash
 npm run extract-metrics
 ```
 
@@ -150,12 +160,12 @@ This script analyzes the Claude chat logs and automatically extracts:
 
 You can also specify the model, client, and server names to use in the metrics:
 
-```
+```bash
 npm run extract-metrics -- --model <model-name> --client <client-name> --server <server-name>
 ```
 
 For example:
-```
+```bash
 npm run extract-metrics -- --model claude-3.7-sonnet --client Cline --server Twilio
 ```
 
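
Review note: in the commands above, npm forwards everything after the bare `--` to the underlying script. The sketch below shows one way such flags could be parsed into labels attached to the metrics. It is an illustration under stated assumptions, not the repository's script: the `parse_labels` helper and the `"unknown"` default are hypothetical, and only the three flag names come from the diff.

```python
import argparse


def parse_labels(argv: list[str]) -> dict[str, str]:
    """Parse --model/--client/--server into a label dict (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Labels to record alongside extracted metrics (sketch)."
    )
    for name in ("model", "client", "server"):
        # The "unknown" default is an assumption; the real defaults are not shown.
        parser.add_argument(f"--{name}", default="unknown",
                            help=f"{name} name to store in the metrics")
    args = parser.parse_args(argv)
    return {"model": args.model, "client": args.client, "server": args.server}


# Equivalent to the arguments that follow `--` in the example command.
labels = parse_labels(
    ["--model", "claude-3.7-sonnet", "--client", "Cline", "--server", "Twilio"]
)
print(labels)
```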
@@ -172,8 +182,8 @@ The extracted metrics are saved to the `metrics/tasks/` directory and the `summa
 
 For a visual representation of results:
 
-1. Start the dashboard server (if not already running):
-   ```
+1. Start the dashboard server:
+   ```bash
    npm start
    ```
 2. Open your browser and navigate to:
@@ -185,7 +195,7 @@ For a visual representation of results: