Conversation

@farook-edev
Contributor

This PR adds a new element (BenchmarkSet) which bundles together benchmarks that are mostly similar but need to be run separately (e.g. different models or datasets but the same function).

Under the hood the benchmarks work exactly the same; no C++ logic has been changed. The added configuration is only for the frontend.

The way it works is by bundling similar benchmarks under a set, and having each benchmark be active only when all of the options it requires are active. For example, take LLM: say we have 3 models and 3 dataset implementations to test (ModelA-DatasetB, ModelC-DatasetA, and so on); that'll be 9 benchmarks.
Benchmark ModelA-DatasetC will define 2 required options, Model-A and Dataset-C, and the benchmark set will contain 6 options in 2 categories: Models (A, B, C) and Datasets (A, B, C).
If a user then enables Models A and C, and Dataset A, the set will automatically activate ModelA-DatasetA and ModelC-DatasetA and disable all the others.
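
A rough sketch of what such a set could look like in the task config's text proto. The field names here (`benchmark_set`, `option`, `required_option`) are illustrative only, not necessarily the schema this PR introduces:

```
benchmark_set {
  set_id: "llm"
  # One option per model and per dataset, grouped by category.
  option { category: "Models" name: "Model-A" }
  option { category: "Models" name: "Model-B" }
  option { category: "Models" name: "Model-C" }
  option { category: "Datasets" name: "Dataset-A" }
  option { category: "Datasets" name: "Dataset-B" }
  option { category: "Datasets" name: "Dataset-C" }

  # A benchmark becomes active only when all of its required options are enabled.
  benchmark {
    benchmark_id: "llm_model_a_dataset_c"
    required_option: "Model-A"
    required_option: "Dataset-C"
  }
  # ... one benchmark entry per model/dataset combination
}
```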

The benefit of this approach is that instead of having 9 benchmarks that are basically the same, we'll have 1 set containing 6 options, while the core benchmarking code never sees the sets or options.

This PR also applies the implementation described above to image_classification_v2, combining the default and offline versions into a set and providing 2 options to enable and disable the benchmarks. This is only a secondary improvement, since the system is really meant to tidy up the (at least) 4 benchmarks that LLM will add.
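
Under the same illustrative schema, the image classification set stays small (the option names and benchmark ids below are assumptions for the sketch, not values taken from the PR):

```
benchmark_set {
  set_id: "image_classification_v2"
  option { category: "Modes" name: "Default" }
  option { category: "Modes" name: "Offline" }

  benchmark { benchmark_id: "image_classification_v2" required_option: "Default" }
  benchmark { benchmark_id: "image_classification_offline_v2" required_option: "Offline" }
}
```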

I've also included a video of the system in action:

optionvid.mp4

Closes #1082

@github-actions

github-actions bot commented Jan 6, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@farook-edev farook-edev marked this pull request as ready for review January 12, 2026 00:25
@farook-edev farook-edev requested review from a team and anhappdev as code owners January 12, 2026 00:25

@farook-edev farook-edev linked an issue Jan 12, 2026 that may be closed by this pull request
@freedomtan
Contributor

@freedomtan to check whether the running order (the offline one should be the last one to run) is kept unchanged.


```diff
 benchmark_setting {
-  benchmark_id: "image_classification_v2"
+  benchmark_id: "image_classification_online_v2"
```
Collaborator

We collect run results shared by users. This change will require a migration if we analyse the results over time.

@freedomtan
Contributor

  • Select the benchmark, not the datasets (they should not be selectable), from the UI. E.g., assuming we have both ifeval and tinymmlu as planned, they are not supposed to be selectable by the user.

@farook-edev farook-edev marked this pull request as draft January 13, 2026 19:08
@Mostelk

Mostelk commented Jan 15, 2026

@farook-edev @anhappdev would like to test, but we don't have a download link for the reference models:

```
benchmark_setting {
  benchmark_id: "llm"
  framework: "TFLite"
  delegate_choice: {
    delegate_name: "CPU"
    accelerator_name: "cpu"
    accelerator_desc: "CPU"
    model_file: {
      model_path: "local:///mlperf_models/llama_q8_ekv3072.tflite"
      model_checksum: "54efe0be372b55303673245067beef62"
    }
    model_file: {
      model_path: "local:///mlperf_models/llama3_1b.spm.model"
      model_checksum: "2ad260fc18b965ce16006d76c9327082"
    }
  }
}
```

Is it possible to make them download automatically?

@farook-edev
Contributor Author

> @farook-edev @anhappdev would like to test, but we don't have a download link for the reference models: `benchmark_setting { ... }` Is it possible to make them download automatically?

The files are available here; I think the model isn't there because it hadn't been decided at the time.
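
For reference, a minimal sketch of what an auto-downloading entry could look like, assuming `model_path` accepts remote URLs the way other benchmark settings do; the URL is a placeholder, not a real download link:

```
model_file: {
  # Placeholder URL; the real hosting location hadn't been decided at the time.
  model_path: "https://example.com/mlperf_models/llama_q8_ekv3072.tflite"
  model_checksum: "54efe0be372b55303673245067beef62"
}
```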



Development

Successfully merging this pull request may close these issues.

Allow more than 1 LLM benchmark
