Conversation

@farook-edev
Contributor

This PR adds a new element (BenchmarkSet) which bundles together benchmarks that are mostly similar but need to be run separately (e.g. different models or datasets but the same function).

Under the hood the benchmarks work exactly the same; no C++ logic has been changed. The added configuration is only for the frontend.

The way it works is by bundling similar benchmarks under a set, and having each benchmark be active only when all of the options it requires are active. For example, take LLM: say we have 3 models and 3 dataset implementations to test (ModelA-DatasetB, ModelC-DatasetA, and so on); that'll be 9 benchmarks.
Benchmark ModelA-DatasetC will define 2 required options, Model-A and Dataset-C, and the benchmark set will contain 6 options in 2 categories: Models (A, B, C) and Datasets (A, B, C).
If a user then enables Models A and C, and Dataset A, the set will automatically activate ModelA-DatasetA and ModelC-DatasetA and disable all the others.
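
A rough sketch of what such a set could look like in the task config's text proto. The field names here (`benchmark_set`, `option`, `required_option`) are illustrative only, not necessarily the schema this PR introduces:

```
benchmark_set {
  set_id: "llm"
  # One option per model and per dataset, grouped by category.
  option { category: "Models" name: "Model-A" }
  option { category: "Models" name: "Model-B" }
  option { category: "Models" name: "Model-C" }
  option { category: "Datasets" name: "Dataset-A" }
  option { category: "Datasets" name: "Dataset-B" }
  option { category: "Datasets" name: "Dataset-C" }

  # A benchmark becomes active only when all of its required options are enabled.
  benchmark {
    benchmark_id: "llm_model_a_dataset_c"
    required_option: "Model-A"
    required_option: "Dataset-C"
  }
  # ... one benchmark entry per model/dataset combination
}
```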

The benefit of this approach is that instead of having 9 benchmarks that are basically the same, we'll have 1 set containing 6 options, while the core benchmarking code never sees the sets or options.

This PR also applies the implementation described above to image_classification_v2, combining the default and offline versions into a set and providing 2 options to enable and disable the benchmarks. This is only a secondary improvement, since the system is really meant to tidy up the (at least) 4 benchmarks that LLM will add.
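
Under the same illustrative schema, the image classification set stays small (the option names and benchmark ids below are assumptions for the sketch, not values taken from the PR):

```
benchmark_set {
  set_id: "image_classification_v2"
  option { category: "Modes" name: "Default" }
  option { category: "Modes" name: "Offline" }

  benchmark { benchmark_id: "image_classification_v2" required_option: "Default" }
  benchmark { benchmark_id: "image_classification_offline_v2" required_option: "Offline" }
}
```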

I've also included a video of the system in action:

optionvid.mp4

Closes #1082

@github-actions

github-actions bot commented Jan 6, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@farook-edev farook-edev marked this pull request as ready for review January 12, 2026 00:25
@farook-edev farook-edev requested review from a team and anhappdev as code owners January 12, 2026 00:25

@farook-edev farook-edev linked an issue Jan 12, 2026 that may be closed by this pull request
@freedomtan
Contributor

@freedomtan to check whether the running order (the offline one should be the last one to run) is kept unchanged.


```diff
 benchmark_setting {
-  benchmark_id: "image_classification_v2"
+  benchmark_id: "image_classification_online_v2"
```
Collaborator

We collect run results shared by users. This change will require a migration if we analyse the results over time.

@freedomtan
Contributor

  • Select the benchmark, not the datasets (they should not be selectable), from the UI. E.g., assuming we have both ifeval and tinymmlu as planned, they are not supposed to be selectable by the user.

@farook-edev farook-edev marked this pull request as draft January 13, 2026 19:08
@Mostelk

Mostelk commented Jan 15, 2026

@farook-edev @anhappdev would like to test, but we don't have a download link for the reference models:

```
benchmark_setting {
  benchmark_id: "llm"
  framework: "TFLite"
  delegate_choice: {
    delegate_name: "CPU"
    accelerator_name: "cpu"
    accelerator_desc: "CPU"
    model_file: {
      model_path: "local:///mlperf_models/llama_q8_ekv3072.tflite"
      model_checksum: "54efe0be372b55303673245067beef62"
    }
    model_file: {
      model_path: "local:///mlperf_models/llama3_1b.spm.model"
      model_checksum: "2ad260fc18b965ce16006d76c9327082"
    }
  }
}
```

Is it possible to make them download automatically?

@farook-edev
Contributor Author

> @farook-edev @anhappdev would like to test, but we don't have a download link for the reference models: `benchmark_setting { ... }` Is it possible to make them download automatically?

The files are available here; I think the model isn't there because it hadn't been decided at the time.
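
For reference, a minimal sketch of what an auto-downloading entry could look like, assuming `model_path` accepts remote URLs the way other benchmark settings do; the URL is a placeholder, not a real download link:

```
model_file: {
  # Placeholder URL; the real hosting location hadn't been decided at the time.
  model_path: "https://example.com/mlperf_models/llama_q8_ekv3072.tflite"
  model_checksum: "54efe0be372b55303673245067beef62"
}
```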



Development

Successfully merging this pull request may close these issues.

Allow more than 1 LLM benchmark
