Commit e6c3c9f
Introduce accuracy testing to test_llms.py:
- Add --accuracy-testing argument to test_llms.py tests that we track in benchmark testing.
- Add --batch-size argument to accuracy tests in test_llms.py because the default batch size of 32 does not fit on device
with the longer input sequence length required for accuracy testing.
With batch size 32 and that longer sequence length, the 7B and 8B models failed with OOM errors.
- Run accuracy tests in a separate job, run-n150-accuracy-benchmarks, in the perf-benchmark-experimental workflow
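
A rough sketch of why a longer input sequence can force a smaller batch: KV-cache size grows linearly in both batch size and sequence length. This is only one plausible contributor to the OOM failures, not a diagnosis from the commit. The helper kv_cache_bytes, the 8B-class configuration, and the sequence length of 1024 are all illustrative assumptions.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The leading factor 2 counts keys and values; bytes_per_elem=2 assumes a
    16-bit cache dtype such as bfloat16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 8B-class configuration (illustrative, not taken from the commit).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
seq_len = 1024  # assumed sequence length, for illustration only

for batch in (32, 8, 1):
    gib = kv_cache_bytes(batch, seq_len, **cfg) / 2**30
    print(f"batch={batch:>2}: ~{gib:.3f} GiB of KV cache")
```

Under these assumptions the cache alone shrinks from about 4 GiB at batch 32 to about 1 GiB at batch 8, before counting weights and activations.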
Teacher forcing for accuracy testing:
- Add teacher forcing support to generate_and_benchmark() for accuracy testing mode
- Route to teacher_forced_generate() when ground_truth_tokens are provided
- Update construct_inputs() to support pre-tokenized input and custom prompts
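
The teacher forcing idea above can be sketched in a few lines. This is a minimal illustration with a toy next-token function, not the repository's teacher_forced_generate(); the names teacher_forced_decode and toy_model are hypothetical. At every step the model's prediction is recorded, but the ground truth token is what gets appended to the context.

```python
from typing import Callable, List, Sequence


def teacher_forced_decode(
    next_token_fn: Callable[[Sequence[int]], int],
    prompt: Sequence[int],
    ground_truth: Sequence[int],
) -> List[int]:
    """Record one prediction per step while feeding ground truth back in."""
    context = list(prompt)
    predictions = []
    for gt_token in ground_truth:
        predictions.append(next_token_fn(context))
        # Teacher forcing: extend the context with the reference token,
        # not with the model's own prediction.
        context.append(gt_token)
    return predictions


def toy_model(context: Sequence[int]) -> int:
    """Toy stand-in for a model: predicts (last token + 1) modulo 10."""
    return (context[-1] + 1) % 10


preds = teacher_forced_decode(toy_model, prompt=[1, 2, 3], ground_truth=[4, 9, 0])
print(preds)  # [4, 5, 0]
```

The second prediction (5) misses the reference (9), yet the third step is still evaluated against the correct context, so a single miss does not derail the remaining steps. That property is what makes per-token accuracy comparable across runs.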
Generating ground truth .refpt files (generate_reference_outputs.py):
Add generate_reference_outputs.py: create ground truth .refpt files for accuracy testing
Generate reference top1/top5 token predictions for LLM accuracy benchmarking:
- Loads HuggingFace models on CPU for deterministic inference
- Processes "Tale of Two Cities" text corpus with teacher forcing
- Outputs .refpt files containing reference tokens and top-k predictions
- Used by TokenAccuracy class to validate TOP1/TOP5 accuracy during benchmarks
Ensures reproducibility through eval mode, disabled dropout, greedy decoding,
and StaticCache matching the benchmark environment. Reference files must be
regenerated if input_sequence_length changes.
Usage:
python3 scripts/generate_reference_outputs.py \
--model "meta-llama/Llama-3.2-1B-Instruct" \
--output_file "reference_outputs/Llama-3.2-1B-Instruct.refpt" \
--total_length 128
Adding shared utility for decode (decode_utils.py):
Centralize LLM decode operations used by reference output generation and accuracy testing:
- Teacher forcing generation with ground truth tokens
- Reference top-k prediction generation for .refpt files
- Static cache and accuracy testing initialization
- Logits extraction and top-k token utilities
Prevents implementation drift between reference generation and benchmark paths
by sharing the same decode logic, tokenization, and cache semantics.
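
As an illustration of the top-k token utility mentioned above, here is a minimal pure-Python sketch that picks the k highest-scoring token ids from one position's logits. The function name top_k_token_ids and the tie-breaking rule (lower id wins) are assumptions for this example; the actual helpers in decode_utils.py presumably operate on tensors.

```python
import heapq
from typing import List, Sequence


def top_k_token_ids(logits: Sequence[float], k: int) -> List[int]:
    """Return the ids of the k largest logits, highest first.

    Ties are broken in favor of the lower token id so the result is
    deterministic.
    """
    return heapq.nsmallest(k, range(len(logits)), key=lambda i: (-logits[i], i))


logits = [0.1, 2.5, -1.0, 2.5, 0.7, 1.9]
print(top_k_token_ids(logits, 1))  # [1]
print(top_k_token_ids(logits, 5))  # [1, 3, 5, 4, 0]
```

A deterministic tie-break matters here because the reference generator and the benchmark must agree on the same top-k set for identical logits.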
TokenAccuracy class for validating LLM inference quality (token_accuracy.py):
- Loads precomputed reference data from .refpt files (tokens, top1/top5 predictions)
- Validates torch/transformers versions match reference file for reproducibility
- Splits reference tokens into prefill (input) and decode (ground truth) windows
- Computes TOP1/TOP5 accuracy by comparing model predictions against reference
- Provides teacher forcing tokens for deterministic decode loops
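
The window split and accuracy computation described above might look roughly like the following. This is a sketch under assumptions, not the TokenAccuracy implementation: it assumes a prediction counts as a TOP1 hit when it equals the reference's first choice and as a TOP5 hit when it appears among the reference's five choices. The names split_reference and top_k_accuracy are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_reference(tokens: Sequence[int], prefill_len: int,
                    decode_len: int) -> Tuple[List[int], List[int]]:
    """Split reference tokens into a prefill window and a decode window."""
    if prefill_len + decode_len > len(tokens):
        raise ValueError("reference is shorter than prefill_len + decode_len")
    prefill = list(tokens[:prefill_len])
    decode = list(tokens[prefill_len:prefill_len + decode_len])
    return prefill, decode


def top_k_accuracy(predicted: Sequence[int],
                   reference_top_k: Sequence[Sequence[int]]) -> float:
    """Fraction of steps whose predicted token is in the reference top-k set."""
    if not predicted:
        return 0.0
    hits = sum(1 for p, ref in zip(predicted, reference_top_k) if p in ref)
    return hits / len(predicted)


prefill, decode = split_reference([10, 11, 12, 13, 14, 15], prefill_len=2, decode_len=4)

predicted = [4, 5, 0, 7]
reference_top5 = [[4, 1, 2, 3, 6], [9, 5, 1, 2, 3], [0, 1, 2, 3, 4], [8, 1, 2, 3, 4]]
reference_top1 = [row[:1] for row in reference_top5]

print(top_k_accuracy(predicted, reference_top1))  # 0.5
print(top_k_accuracy(predicted, reference_top5))  # 0.75
```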
Slight refactoring:
- Simplify static cache initialization using init_static_cache helper
- Remove unused variables is_multichip and mesh from generate_and_benchmark function.
1 parent 4db9918 commit e6c3c9f
File tree
34 files changed, +1518 -103 lines changed
- .github/workflows
- benchmark/tt-xla
- reference_outputs
- scripts