@farook-edev
Contributor

I also included some cleanup for debugging prints and such.

This should close #1059, and possibly #1058 and #1060 as well, since no work remains on any of them.

@freedomtan can you confirm?

@farook-edev farook-edev requested a review from a team as a code owner November 10, 2025 09:50
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@freedomtan
Contributor

How do we test this?

@farook-edev
Contributor Author

> How do we test this?

Run the cmdline tool or the app in Performance mode and check the summary: it should show 100 samples instead of 1.
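
For anyone who wants to script that check rather than read the summary by eye, here is a minimal sketch. It assumes the summary is available as plain text in the usual `key : value` layout; the helper names `parse_summary` and `check_sample_count` are hypothetical and not part of this PR or of loadgen.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Collect `key : value` lines from a summary into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; lines without a value are skipped.
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def check_sample_count(text: str, expected: int = 100) -> bool:
    """Return True if performance_sample_count equals the expected value."""
    fields = parse_summary(text)
    # A missing field is treated as a failed check.
    return int(fields.get("performance_sample_count", "-1")) == expected


# Abbreviated summary text used only for illustration.
sample = """\
Scenario : SingleStream
Mode     : PerformanceOnly
min_query_count : 100
performance_sample_count : 100
"""
print(check_sample_count(sample))
```

The default of 100 is taken from the expectation stated above; pass a different `expected` value if the dataset size changes.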

@freedomtan
Contributor

@freedomtan to test it.

@freedomtan
Contributor

@farook-edev please share the datasets so that @anhappdev can upload them to the CDN.

Contributor

@freedomtan freedomtan left a comment

YES, the performance_sample_count is reasonable now on Pixel 10.

================================================
MLPerf Results Summary
================================================
SUT name : TFLite
Scenario : SingleStream
Mode     : PerformanceOnly
90th percentile latency (ns) : 6196843547
90th first token percentile latency (ns) : 5568099822
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Skipped
  Early stopping satisfied: Yes
TTFT Early Stopping Result:
 * Processed at least 64 queries (66).
 * Would discard 0 highest latency queries.
 * Early stopping 90th percentile estimate: 17103484388
 * Not enough queries processed for 99th percentile
 early stopping estimate (would need to process at
 least 662 total queries).
TPOT Early Stopping Result:
 * Processed at least 64 queries (66).
 * Would discard 0 highest latency queries.
 * Early stopping 90th percentile estimate: 834750626
 * Not enough queries processed for 99th percentile
 early stopping estimate (would need to process at
 least 662 total queries).

================================================
Additional Stats
================================================
QPS w/ loadgen overhead         : 0.21
QPS w/o loadgen overhead        : 0.21

Min latency (ns)                : 1436503361
Max latency (ns)                : 17598149232
Mean latency (ns)               : 4682296887
50.00 percentile latency (ns)   : 4288985862
90.00 percentile latency (ns)   : 6196843547
95.00 percentile latency (ns)   : 8122990944
97.00 percentile latency (ns)   : 13729710479
99.00 percentile latency (ns)   : 17598149232
99.90 percentile latency (ns)   : 17598149232

TPS w/ loadgen overhead         : 0.45
TPS w/o loadgen overhead        : 0.44
Min First Token latency (ns)                : 1436326199
Max First Token latency (ns)                : 17103484388
Mean First Token latency (ns)               : 4058709108
50.00 percentile first token latency (ns)   : 3507652997
90.00 percentile first token latency (ns)   : 5568099822
95.00 percentile first token latency (ns)   : 7330172688
97.00 percentile first token latency (ns)   : 12895301937
99.00 percentile first token latency (ns)   : 17103484388
99.90 percentile first token latency (ns)   : 17103484388

Min Time to Output Token (ns)                : -177162
Max Time to Output Token (ns)                : 834750626
Mean Time to Output Token (ns)               : 576319279
50.00 percentile time to output token (ns)   : 515973151
90.00 percentile time to output token (ns)   : 790773282
95.00 percentile time to output token (ns)   : 802750365
97.00 percentile time to output token (ns)   : 834408542
99.00 percentile time to output token (ns)   : 834750626
99.90 percentile time to output token (ns)   : 834750626

================================================
Test Parameters Used
================================================
samples_per_query : 1
target_qps : 1000
ttft_latency (ns): 100000000
tpot_latency (ns): 100000000
max_async_queries : 1
min_duration (ms): 60000
max_duration (ms): 300000
min_query_count : 100
max_query_count : 0
qsl_rng_seed : 3066443479025735752
sample_index_rng_seed : 10688027786191513374
schedule_rng_seed : 14962580496156340209
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 100

No warnings encountered during test.

1 ERROR encountered. See detailed log.
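
The latencies in the summary are in nanoseconds, which makes them hard to read at a glance. As a rough sanity check, the sketch below converts a few of the reported values to seconds and compares the reported QPS with a simple estimate. It assumes that with `max_async_queries : 1` only one query is in flight at a time, so throughput is approximately the reciprocal of the mean latency; the function names are illustrative only.

```python
NS_PER_S = 1_000_000_000

# Values copied from the summary above.
mean_latency_ns = 4682296887   # "Mean latency (ns)"
p90_latency_ns = 6196843547    # "90.00 percentile latency (ns)"
reported_qps = 0.21            # "QPS w/o loadgen overhead"


def ns_to_s(ns: int) -> float:
    """Convert nanoseconds to seconds."""
    return ns / NS_PER_S


def single_stream_qps(mean_ns: int) -> float:
    """Estimate QPS as 1 / mean latency (assumes one query in flight at a time)."""
    return NS_PER_S / mean_ns


mean_s = ns_to_s(mean_latency_ns)
qps = single_stream_qps(mean_latency_ns)
print(f"mean latency: {mean_s:.2f} s, p90: {ns_to_s(p90_latency_ns):.2f} s")
print(f"estimated QPS: {qps:.2f} (reported {reported_qps})")
```

If the estimate and the reported figure diverge noticeably on another run, that would be worth a closer look at the detailed log.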

@farook-edev farook-edev merged commit 2510839 into submission-v6.0 Nov 11, 2025
27 checks passed
@farook-edev farook-edev deleted the farook/full-dataset-performance branch November 11, 2025 08:18
@github-actions github-actions bot locked and limited conversation to collaborators Nov 11, 2025

Development

Successfully merging this pull request may close these issues.

LLM IFEval Dataset Implementation
LLM TinyMMLU Dataset Implementation
LLM Dataset Implementation

3 participants