[VLM] Accuracy Evaluation #2393
Merged: hanyunfan merged 35 commits into mlcommons:master from CentML:jcalderon/vlm-accuracy-eval on Dec 2, 2025.
Changes from 8 commits.

Commits (35):
35c9704  Initial proposal for VLM - evaluation (johncalesp)
f5995c3  address review comments and test hiclass implementation (johncalesp)
46e4fc0  [Automated Commit] Format Codebase (github-actions[bot])
0dc13dc  additional fixes to reviews (johncalesp)
4aa5e0d  [Automated Commit] Format Codebase (github-actions[bot])
5e1590b  address PR comments (johncalesp)
e1ccc85  [Automated Commit] Format Codebase (github-actions[bot])
7e0c444  add a more detail description of the field dataset.split (johncalesp)
b35b057  Enable exception logging in _query_endpoint_async (wangshangsam)
48b5bdb  [Automated Commit] Format Codebase (github-actions[bot])
0e4c5ee  Merge branch 'master' into jcalderon/vlm-accuracy-eval (wangshangsam)
f8e1498  [Automated Commit] Format Codebase (github-actions[bot])
f464499  Trigger CI/CD pipeline (johncalesp)
9609cd0  Merge branch 'master' into jcalderon/vlm-accuracy-eval (wangshangsam)
bc56ec9  Add performance_sample_count_override as a CLI flag. (wangshangsam)
b8e2909  Merge branch 'jcalderon/vlm-accuracy-eval' of github.com:CentML/mlper… (wangshangsam)
8b43239  [Automated Commit] Format Codebase (github-actions[bot])
9466529  Merge branch 'master' into jcalderon/vlm-accuracy-eval (wangshangsam)
dae5065  add json format to queries (johncalesp)
c840dd6  [Automated Commit] Format Codebase (github-actions[bot])
0b45001  added schema file and made necessary changes (johncalesp)
5f1d02c  [Automated Commit] Format Codebase (github-actions[bot])
1849d6c  refactoring and linting (wangshangsam)
eef83eb  [Automated Commit] Format Codebase (github-actions[bot])
dafa7f1  Add Dockerfile (wangshangsam)
ee91e7f  Add use_guided_decoding to let user choose to use guided_decoding or … (wangshangsam)
b9dd5ad  [Automated Commit] Format Codebase (github-actions[bot])
ace336e  add f1 scores of uniform random selection (johncalesp)
60f72be  [Automated Commit] Format Codebase (github-actions[bot])
9c7b793  Enabling mlperf-inf-mm-vl2l benchmark vllm. (wangshangsam)
443ff3d  Merge branch 'jcalderon/vlm-accuracy-eval' of github.com:CentML/mlper… (wangshangsam)
36ab421  [Automated Commit] Format Codebase (github-actions[bot])
ea1e465  Commit to trigger the GitHub Actions in inference PR (anandhu-eng)
93a1a3e  Merge pull request #6 from anandhu-eng/patch-39 (wangshangsam)
a1e6d76  empty commit (wangshangsam)
208 changes: 208 additions & 0 deletions
multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/evaluation.py
@@ -0,0 +1,208 @@
"""Evaluation metrics for the VL2L benchmark."""

from __future__ import annotations

import json
from pathlib import Path
from typing import TYPE_CHECKING

import numpy as np
from datasets import load_dataset
from hiclass.metrics import f1
from loguru import logger
from sklearn.metrics import f1_score
from tabulate import tabulate

if TYPE_CHECKING:
    from pydantic import FilePath

    from .cli import Dataset as DatasetCLI


def get_hierarchical_components(predicted_path: str,
                                true_path: str,
                                separator: str = " > ") -> tuple[int, int, int]:
    """Calculates the components for Hierarchical Precision.

    Args:
        predicted_path: Categories predicted by the VLM.
        true_path: Ground truth categories.
        separator: String used to separate each category.

    Returns:
        Tuple of number of intersections,
        correctly predicted categories and
        ground truth categories.
    """
    # 1. Split the paths into categories (nodes)
    predicted_categories = [c.strip() for c in predicted_path.split(separator)]
    true_categories = [c.strip() for c in true_path.split(separator)]

    # Check for empty paths
    if not predicted_categories or not true_categories:
        return 0, len(predicted_categories), len(true_categories)

    # 2. Count the intersection (longest common prefix)
    intersection_count = 0

    # Iterate through the paths simultaneously
    for pred_cat, true_cat in zip(predicted_categories,
                                  true_categories,
                                  strict=False):
        if pred_cat == true_cat:
            intersection_count += 1
        else:
            # Stop as soon as a mismatch is found (enforces hierarchical match)
            break

    pred_length = len(predicted_categories)
    true_length = len(true_categories)

    return intersection_count, pred_length, true_length
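
# Worked example (illustrative values):
# get_hierarchical_components("Electronics > Phones > Accessories",
#                              "Electronics > Phones > Cases")
# returns (2, 3, 3): the longest common prefix is ["Electronics", "Phones"],
# and both paths contain three categories.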


def calculate_hierarchical_f1(data: list[tuple[str, str]]) -> float:
    """Calculates the aggregate hF scores for a list of samples.

    Args:
        data: A list of tuples, where each tuple is
            (predicted_path_str, true_path_str).

    Returns:
        F1 score
    """
    total_intersection = 0
    total_predicted_length = 0
    total_true_length = 0

    # 1. Aggregate the components across all samples
    for pred_path, true_path in data:
        intersection, pred_len, true_len = \
            get_hierarchical_components(pred_path, true_path)

        total_intersection += intersection
        total_predicted_length += pred_len
        total_true_length += true_len

    # 2. Calculate hP and hR
    hp = total_intersection / total_predicted_length \
        if total_predicted_length > 0 else 0.0
    hr = total_intersection / total_true_length \
        if total_true_length > 0 else 0.0

    return 0.0 if hp + hr == 0 else 2 * (hp * hr) / (hp + hr)
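
# Worked example (illustrative): for the pairs
#   ("A > B > C", "A > B") -> intersection 2, predicted length 3, true length 2
#   ("A > X", "A > B")     -> intersection 1, predicted length 2, true length 2
# the totals are 3, 5 and 4, so hP = 0.6, hR = 0.75 and hF ~= 0.667.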


def calculate_exact_match(generated_text: str, original_text: str) -> float:
    """Calculates binary Exact Match (EM) score.

    We clean the text (lowercase, strip whitespace) for a fairer comparison.

    Args:
        generated_text: Output from the VLM.
        original_text: Ground truth information from the dataset.

    Returns:
        1 if the values match or 0 otherwise
    """
    gen = generated_text.strip().lower()
    orig = original_text.strip().lower()

    return 1.0 if gen == orig else 0.0
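
# Example: calculate_exact_match("  Yes", "yes") == 1.0, while
# calculate_exact_match("yes!", "yes") == 0.0 (only whitespace and case
# are normalised before comparison).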


def calculate_secondhand_f1(data: list[tuple[int, int]]) -> float:
    """Calculate F1 score of the is_secondhand field.

    Args:
        data: List of tuples of predicted and true values
    Returns:
        f1 score
    """
    y_pred = []
    y_src = []
    for pred, src in data:
        y_pred.append(pred)
        y_src.append(src)

    return f1_score(y_src, y_pred)
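
# Example: data = [(1, 1), (0, 1), (1, 0)] gives y_pred = [1, 0, 1] and
# y_src = [1, 1, 0]; with one true positive, one false positive and one
# false negative, sklearn's f1_score returns 0.5.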


def calculate_hiclass_f1(data: list[tuple[str, str]]) -> float:
    """Alternative method to calculate hierarchical F1 (via hiclass).

    Args:
        data: List of tuples of predicted and true values
    Returns:
        f1 score
    """
    y_pred_raw = []
    y_true_raw = []

    for pred, src in data:
        path1 = pred.split(" > ")
        path2 = src.split(" > ")

        y_pred_raw.append(path1)
        y_true_raw.append(path2)

    # 2. Find the global maximum length across ALL samples
    # We check the longest path in both true and pred lists
    max_len = max(len(p) for p in y_true_raw + y_pred_raw)

    # 3. Pad all lists to the global max_len
    for i in range(len(y_true_raw)):
        # Pad Truth
        pad_len_true = max_len - len(y_true_raw[i])
        y_true_raw[i] += [""] * pad_len_true

        # Pad Prediction
        pad_len_pred = max_len - len(y_pred_raw[i])
        y_pred_raw[i] += [""] * pad_len_pred

    # 4. Convert to numpy arrays
    y_true = np.array(y_true_raw)
    y_pred = np.array(y_pred_raw)

    # 5. Calculate Score
    return f1(y_true, y_pred)
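
# Example of the padding step (illustrative): for
#   y_true_raw = [["Home", "Kitchen"]] and
#   y_pred_raw = [["Home", "Kitchen", "Cookware"]]
# max_len is 3, so the truth row is padded to ["Home", "Kitchen", ""] before
# both arrays are passed to hiclass.metrics.f1.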


def run_evaluation(filename: FilePath, dataset: DatasetCLI) -> None:
    """Main function to run the evaluation."""
    with Path.open(filename) as f:
        model_output = json.load(f)

    original_data = load_dataset(
        dataset.repo_id,
        token=dataset.token,
        split="+".join(dataset.split),
    )

    category_dataset_pred_src = []
    is_secondhand_pred_src = []
    for elem in model_output:
        byte_data = bytes.fromhex(elem["data"])
        idx = elem["qsl_idx"]
        pred_text_decode = byte_data.decode("utf-8")
        pred_item = json.loads(pred_text_decode)
        ground_truth_item = original_data[idx]
        category_dataset_pred_src.append((pred_item["category"],
                                          ground_truth_item["ground_truth_category"]))
        is_secondhand_pred_src.append((int(pred_item["is_secondhand"]),
                                       int(ground_truth_item["ground_truth_is_secondhand"])))

    category_f1_score = calculate_hierarchical_f1(category_dataset_pred_src)
    hiclass_f1 = calculate_hiclass_f1(category_dataset_pred_src)
    is_secondhand_f1_score = calculate_secondhand_f1(is_secondhand_pred_src)

    data = [
        ["category", category_f1_score, hiclass_f1],
        ["is_secondhand", is_secondhand_f1_score],
    ]

    logger.info("Results:\n{}", tabulate(data,
                                         headers=["Fields", "F1 Score",
                                                  "HiClass F1 Score"],
                                         tablefmt="fancy_grid"))
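
For reference, a minimal sketch of how these metrics can be exercised on toy data outside the harness. It assumes the package is importable as mlperf_inference_multimodal_vl2l (inferred from the file path above) and mimics the accuracy-log format consumed by run_evaluation, in which each record's "data" field is the hex encoding of a UTF-8 JSON prediction with "category" and "is_secondhand" keys; all sample values below are made up.

import json

# Import path inferred from multimodal/vl2l/src/mlperf_inference_multimodal_vl2l/.
from mlperf_inference_multimodal_vl2l.evaluation import (
    calculate_hiclass_f1,
    calculate_hierarchical_f1,
    calculate_secondhand_f1,
)

# One accuracy-log-style record: "data" holds hex-encoded UTF-8 JSON,
# exactly the shape run_evaluation decodes for each element.
prediction = {"category": "Electronics > Phones", "is_secondhand": True}
record = {"qsl_idx": 0, "data": json.dumps(prediction).encode("utf-8").hex()}
decoded = json.loads(bytes.fromhex(record["data"]).decode("utf-8"))
assert decoded == prediction

# Toy (predicted, ground-truth) pairs for the two scored fields.
category_pairs = [
    ("Electronics > Phones", "Electronics > Phones > Cases"),
    ("Home > Kitchen", "Home > Kitchen"),
]
secondhand_pairs = [(1, 1), (0, 0), (1, 0)]

print(calculate_hierarchical_f1(category_pairs))  # micro-averaged hierarchical F1
print(calculate_hiclass_f1(category_pairs))       # hiclass-based alternative
print(calculate_secondhand_f1(secondhand_pairs))  # binary F1 for is_secondhand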