Commit f1b5947
Support analysis along different metrics in the dataset (#11937)
### Summary
- Allow running the benchmark analysis along a target metric in the
dataset
- Add a verbose level to control how much detail is reported
- Fix bugs so that `nan` values in the dataset are handled properly
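
The first and third bullets can be illustrated with a minimal sketch: compute the coefficient of variation (the `CV (%)` column in the report) for a chosen metric while skipping `nan` entries. This is not the script's actual code. The function name `coefficient_of_variation`, the row layout, the `latency` column, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def coefficient_of_variation(rows, metric):
    """Return the CV (%) of `metric` across rows, ignoring missing or NaN values.

    Hypothetical helper: the real script's data structures are not shown
    in this commit message.
    """
    values = []
    for row in rows:
        value = row.get(metric)
        if value is None:
            continue
        value = float(value)
        if math.isnan(value):
            # Skip NaN so it does not poison the mean and standard deviation.
            continue
        values.append(value)

    # A spread cannot be estimated from fewer than two samples.
    if len(values) < 2:
        return float("nan")

    mean = statistics.fmean(values)
    if mean == 0:
        return float("nan")

    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(values) / mean * 100.0


# Illustrative rows; the "latency" column is made up for this example.
rows = [
    {"token_per_sec": 60.0, "latency": 1.0},
    {"token_per_sec": float("nan"), "latency": 1.1},
    {"token_per_sec": 62.0, "latency": 0.9},
    {"token_per_sec": 64.0, "latency": 1.0},
]

cv = coefficient_of_variation(rows, "token_per_sec")
```

Passing a different column name (for example `"latency"`) runs the same calculation along that metric, which is the behavior the first bullet describes.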
### Test plan
Analyze the stability of the reported `token_per_sec` metric for
`Qwen3-0.6B` on all devices with all recipes (hf/optimum-et vs etLLM):
`python .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py
--primary-file private.xlsx --reference-file public.xlsx --metric
token_per_sec --verbose-level 0`
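
For readers unfamiliar with the invocation, the flags used above could be parsed roughly as follows. This is a sketch under assumptions, not the script's real argument parser: only the four flag names come from the command above, while the `required` settings, the integer type, the default of `0`, and the help strings are guesses.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the test plan."""
    parser = argparse.ArgumentParser(
        description="Sketch of a benchmark stability analysis CLI"
    )
    parser.add_argument(
        "--primary-file", required=True, help="Primary (private) Excel file"
    )
    parser.add_argument(
        "--reference-file", default=None, help="Reference (public) Excel file"
    )
    # Assumption: the metric is a column name in each sheet.
    parser.add_argument("--metric", required=True, help="Metric column to analyze")
    # Assumption: an integer level where 0 is the least verbose.
    parser.add_argument("--verbose-level", type=int, default=0)
    return parser


# Parse the same arguments that appear in the test plan command.
args = build_parser().parse_args(
    [
        "--primary-file", "private.xlsx",
        "--reference-file", "public.xlsx",
        "--metric", "token_per_sec",
        "--verbose-level", "0",
    ]
)
```
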
Report results:
```
====================================================================================================
===== Analyzing Stability Against Metric 'token_per_sec' ==========================================
====================================================================================================
Primary dataset: private.xlsx
Reference dataset for comparison: public.xlsx
====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================
successfully fetched 10 sheets from private.xlsx
Loading dataset: table1 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 (private)', 'arch': 'iOS 18.0', 'total_rows': 59, 'aws_type': 'private'}
Loading dataset: table2 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus (private)', 'arch': 'iOS 17.4.1', 'total_rows': 58, 'aws_type': 'private'}
Loading dataset: table3 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Pro (private)', 'arch': 'iOS 18.4.1', 'total_rows': 59, 'aws_type': 'private'}
Loading dataset: table4 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G (private)', 'arch': 'Android 13', 'total_rows': 79, 'aws_type': 'private'}
Loading dataset: table5 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 Ultra 5G (private)', 'arch': 'Android 14', 'total_rows': 79, 'aws_type': 'private'}
Loading dataset: table6 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 (private)', 'arch': 'iOS 18.0', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table7 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus (private)', 'arch': 'iOS 17.4.1', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table8 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Pro (private)', 'arch': 'iOS 18.4.1', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table9 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G (private)', 'arch': 'Android 13', 'total_rows': 78, 'aws_type': 'private'}
Loading dataset: table10 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 Ultra 5G (private)', 'arch': 'Android 14', 'total_rows': 78, 'aws_type': 'private'}
====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================
successfully fetched 6 sheets from public.xlsx
Loading dataset: table1 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15', 'arch': 'iOS 18.0', 'total_rows': 45, 'aws_type': 'public'}
Loading dataset: table2 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus', 'arch': 'iOS 17.4.1', 'total_rows': 43, 'aws_type': 'public'}
Loading dataset: table3 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G', 'arch': 'Android 13', 'total_rows': 71, 'aws_type': 'public'}
Loading dataset: table4 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15', 'arch': 'iOS 18.0', 'total_rows': 43, 'aws_type': 'public'}
Loading dataset: table5 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus', 'arch': 'iOS 17.4.1', 'total_rows': 42, 'aws_type': 'public'}
Loading dataset: table6 with config:{'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G', 'arch': 'Android 13', 'total_rows': 71, 'aws_type': 'public'}
====================================================================================================
===== COMPREHENSIVE STABILITY SUMMARY =============================================================
====================================================================================================
Comprehensive Latency Stability Analysis Summary
================================================================================
Primary (Private) Datasets Summary:
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| Dataset | Model | Device | Mean Value | CV (%) | Stability Score | Stability Rating |
+===========+========================================================+===================================================+==============+==========+===================+====================+
| table10 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 Ultra 5G (private)(Android 14) | 62.82 | 1.45 | 91.17 | Excellent |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table9 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G (private)(Android 13) | 61.79 | 1.85 | 88.38 | Good |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table5 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 Ultra 5G (private)(Android 14) | 64.65 | 2.32 | 86.10 | Good |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table4 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G (private)(Android 13) | 62.27 | 3.02 | 81.37 | Good |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table3 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Pro (private)(iOS 18.4.1) | 24.69 | 3.39 | 78.78 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table8 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Pro (private)(iOS 18.4.1) | 22.88 | 3.65 | 78.23 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table1 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 (private)(iOS 18.0) | 7.66 | 3.75 | 76.56 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table6 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 (private)(iOS 18.0) | 7.14 | 4.18 | 73.67 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table2 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus (private)(iOS 17.4.1) | 6.52 | 4.36 | 73.08 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
| table7 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus (private)(iOS 17.4.1) | 6.11 | 4.50 | 72.90 | Moderate |
+-----------+--------------------------------------------------------+---------------------------------------------------+--------------+----------+-------------------+--------------------+
Reference (Public) Datasets Summary:
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| Dataset | Model | Device | Mean Value | CV (%) | Stability Score | Stability Rating |
+===========+========================================================+===================================+==============+==========+===================+====================+
| table6 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G(Android 13) | 62.78 | 3.72 | 77.73 | Moderate |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| table3 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G(Android 13) | 62.68 | 4.30 | 74.12 | Moderate |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| table2 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus(iOS 17.4.1) | 7.08 | 5.21 | 67.91 | Moderate |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| table5 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus(iOS 17.4.1) | 6.49 | 5.42 | 67.74 | Moderate |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| table4 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15(iOS 18.0) | 7.03 | 7.17 | 55.51 | Poor |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
| table1 | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15(iOS 18.0) | 6.89 | 20.22 | 21.99 | Poor |
+-----------+--------------------------------------------------------+-----------------------------------+--------------+----------+-------------------+--------------------+
Private vs Public Comparison:
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Dataset | Private Device | Public Device | Private Score | Public Score | Score Diff | Private CV (%) | Public CV (%) | CV Diff (%) |
+===========================================================================================+==============================================+====================================+=================+================+==============+==================+=================+===============+
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 (private) | Apple iPhone 15 (private) (iOS 18.0) | Apple iPhone 15 (iOS 18.0) | 76.56 | 21.99 | 54.58 | 3.75 | 20.22 | -16.46 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 (private) | Apple iPhone 15 (private) (iOS 18.0) | Apple iPhone 15 (iOS 18.0) | 73.67 | 55.51 | 18.17 | 4.18 | 7.17 | -2.99 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Samsung Galaxy S22 5G (private) | Samsung Galaxy S22 5G (private) (Android 13) | Samsung Galaxy S22 5G (Android 13) | 88.38 | 77.73 | 10.64 | 1.85 | 3.72 | -1.87 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Samsung Galaxy S22 5G (private) | Samsung Galaxy S22 5G (private) (Android 13) | Samsung Galaxy S22 5G (Android 13) | 81.37 | 74.12 | 7.25 | 3.02 | 4.30 | -1.28 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 Plus (private) | Apple iPhone 15 Plus (private) (iOS 17.4.1) | Apple iPhone 15 Plus (iOS 17.4.1) | 73.08 | 67.91 | 5.17 | 4.36 | 5.21 | -0.86 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 Plus (private) | Apple iPhone 15 Plus (private) (iOS 17.4.1) | Apple iPhone 15 Plus (iOS 17.4.1) | 72.90 | 67.74 | 5.16 | 4.50 | 5.42 | -0.92 |
+-------------------------------------------------------------------------------------------+----------------------------------------------+------------------------------------+-----------------+----------------+--------------+------------------+-----------------+---------------+
Private environment is more stable in 6 of 6 cases.
Public environment is more stable in 0 of 6 cases.
Overall Insights and Recommendations:
Stability Distribution in Private Datasets:
- Moderate: 6 dataset(s)
- Good: 3 dataset(s)
- Excellent: 1 dataset(s)
```
1 parent 124758e
1 file changed under `.ci/scripts/benchmark_tooling`: 108 additions and 117 deletions.