Commit f1b5947

Support analysis along different metrics in the dataset (#11937)
### Summary

- Allow running the benchmark analysis along a target metric in the dataset
- Add a verbose level to control how much detail is reported
- Bug fixes to properly handle `nan` values in the dataset

### Test plan

Analyze the stability of the reported metrics along `token_per_sec` for `Qwen3-0.6B` on all devices with all recipes (hf/optimum-et vs etLLM):

`python .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py --primary-file private.xlsx --reference-file public.xlsx --metric token_per_sec --verbose-level 0`

Report results:

```
====================================================================================================
===== Analyzing Stability Against Metric 'token_per_sec' ==========================================
====================================================================================================
Primary dataset: private.xlsx
Reference dataset for comparison: public.xlsx
====================================================================================================
===== LOADING PRIMARY DATASETS (Private) ==========================================================
====================================================================================================
successfully fetched 10 sheets from private.xlsx
Loading dataset: table1 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 (private)', 'arch': 'iOS 18.0', 'total_rows': 59, 'aws_type': 'private'}
Loading dataset: table2 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus (private)', 'arch': 'iOS 17.4.1', 'total_rows': 58, 'aws_type': 'private'}
Loading dataset: table3 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Pro (private)', 'arch': 'iOS 18.4.1', 'total_rows': 59, 'aws_type': 'private'}
Loading dataset: table4 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G (private)', 'arch': 'Android 13', 'total_rows': 79, 'aws_type': 'private'}
Loading dataset: table5 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 Ultra 5G (private)', 'arch': 'Android 14', 'total_rows': 79, 'aws_type': 'private'}
Loading dataset: table6 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 (private)', 'arch': 'iOS 18.0', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table7 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus (private)', 'arch': 'iOS 17.4.1', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table8 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Pro (private)', 'arch': 'iOS 18.4.1', 'total_rows': 57, 'aws_type': 'private'}
Loading dataset: table9 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G (private)', 'arch': 'Android 13', 'total_rows': 78, 'aws_type': 'private'}
Loading dataset: table10 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 Ultra 5G (private)', 'arch': 'Android 14', 'total_rows': 78, 'aws_type': 'private'}
====================================================================================================
===== LOADING REFERENCE DATASETS (Public) =========================================================
====================================================================================================
successfully fetched 6 sheets from public.xlsx
Loading dataset: table1 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15', 'arch': 'iOS 18.0', 'total_rows': 45, 'aws_type': 'public'}
Loading dataset: table2 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus', 'arch': 'iOS 17.4.1', 'total_rows': 43, 'aws_type': 'public'}
Loading dataset: table3 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'et_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G', 'arch': 'Android 13', 'total_rows': 71, 'aws_type': 'public'}
Loading dataset: table4 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15', 'arch': 'iOS 18.0', 'total_rows': 43, 'aws_type': 'public'}
Loading dataset: table5 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Apple iPhone 15 Plus', 'arch': 'iOS 17.4.1', 'total_rows': 42, 'aws_type': 'public'}
Loading dataset: table6 with config: {'model': 'Qwen/Qwen3-0.6B', 'backend': 'hf_xnnpack_custom_spda_kv_cache_8da4w', 'device': 'Samsung Galaxy S22 5G', 'arch': 'Android 13', 'total_rows': 71, 'aws_type': 'public'}
====================================================================================================
===== COMPREHENSIVE STABILITY SUMMARY =============================================================
====================================================================================================
Comprehensive Latency Stability Analysis Summary
================================================================================

Primary (Private) Datasets Summary:
| Dataset | Model | Device | Mean Value | CV (%) | Stability Score | Stability Rating |
|---------|-------|--------|------------|--------|-----------------|------------------|
| table10 | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 Ultra 5G (private)(Android 14) | 62.82 | 1.45 | 91.17 | Excellent |
| table9  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G (private)(Android 13) | 61.79 | 1.85 | 88.38 | Good |
| table5  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 Ultra 5G (private)(Android 14) | 64.65 | 2.32 | 86.10 | Good |
| table4  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G (private)(Android 13) | 62.27 | 3.02 | 81.37 | Good |
| table3  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Pro (private)(iOS 18.4.1) | 24.69 | 3.39 | 78.78 | Moderate |
| table8  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Pro (private)(iOS 18.4.1) | 22.88 | 3.65 | 78.23 | Moderate |
| table1  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 (private)(iOS 18.0) | 7.66 | 3.75 | 76.56 | Moderate |
| table6  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 (private)(iOS 18.0) | 7.14 | 4.18 | 73.67 | Moderate |
| table2  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus (private)(iOS 17.4.1) | 6.52 | 4.36 | 73.08 | Moderate |
| table7  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus (private)(iOS 17.4.1) | 6.11 | 4.50 | 72.90 | Moderate |

Reference (Public) Datasets Summary:
| Dataset | Model | Device | Mean Value | CV (%) | Stability Score | Stability Rating |
|---------|-------|--------|------------|--------|-----------------|------------------|
| table6  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G(Android 13) | 62.78 | 3.72 | 77.73 | Moderate |
| table3  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Samsung Galaxy S22 5G(Android 13) | 62.68 | 4.30 | 74.12 | Moderate |
| table2  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus(iOS 17.4.1) | 7.08 | 5.21 | 67.91 | Moderate |
| table5  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15 Plus(iOS 17.4.1) | 6.49 | 5.42 | 67.74 | Moderate |
| table4  | Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15(iOS 18.0) | 7.03 | 7.17 | 55.51 | Poor |
| table1  | Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) | Apple iPhone 15(iOS 18.0) | 6.89 | 20.22 | 21.99 | Poor |

Private vs Public Comparison:
| Dataset | Private Device | Public Device | Private Score | Public Score | Score Diff | Private CV (%) | Public CV (%) | CV Diff (%) |
|---------|----------------|---------------|---------------|--------------|------------|----------------|---------------|-------------|
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 (private) | Apple iPhone 15 (private) (iOS 18.0) | Apple iPhone 15 (iOS 18.0) | 76.56 | 21.99 | 54.58 | 3.75 | 20.22 | -16.46 |
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 (private) | Apple iPhone 15 (private) (iOS 18.0) | Apple iPhone 15 (iOS 18.0) | 73.67 | 55.51 | 18.17 | 4.18 | 7.17 | -2.99 |
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Samsung Galaxy S22 5G (private) | Samsung Galaxy S22 5G (private) (Android 13) | Samsung Galaxy S22 5G (Android 13) | 88.38 | 77.73 | 10.64 | 1.85 | 3.72 | -1.87 |
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Samsung Galaxy S22 5G (private) | Samsung Galaxy S22 5G (private) (Android 13) | Samsung Galaxy S22 5G (Android 13) | 81.37 | 74.12 | 7.25 | 3.02 | 4.30 | -1.28 |
| Qwen/Qwen3-0.6B(et_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 Plus (private) | Apple iPhone 15 Plus (private) (iOS 17.4.1) | Apple iPhone 15 Plus (iOS 17.4.1) | 73.08 | 67.91 | 5.17 | 4.36 | 5.21 | -0.86 |
| Qwen/Qwen3-0.6B(hf_xnnpack_custom_spda_kv_cache_8da4w) on Apple iPhone 15 Plus (private) | Apple iPhone 15 Plus (private) (iOS 17.4.1) | Apple iPhone 15 Plus (iOS 17.4.1) | 72.90 | 67.74 | 5.16 | 4.50 | 5.42 | -0.92 |

Private environment is more stable in 6 of 6 cases.
Public environment is more stable in 0 of 6 cases.

Overall Insights and Recommendations:

Stability Distribution in Private Datasets:
- Moderate: 6 dataset(s)
- Good: 3 dataset(s)
- Excellent: 1 dataset(s)
```
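For context, here is a minimal sketch of the NaN-safe metric extraction this change introduces, simplified from `calculate_stability_metrics`. The sheet name `table1` and the `token_per_sec` column are illustrative assumptions, not part of the script itself:

```python
import numpy as np
import pandas as pd

# Hypothetical sheet/column names, for illustration only.
df = pd.read_excel("private.xlsx", sheet_name="table1")

# NaN entries in the target metric are dropped before any statistics are
# computed, so a few failed runs no longer skew the mean or the CV.
values = df["token_per_sec"].dropna().values

mean = np.mean(values)
std = np.std(values, ddof=1)       # sample standard deviation
cv_percent = (std / mean) * 100    # coefficient of variation, as shown in the summary tables
print(f"mean={mean:.2f}, CV={cv_percent:.2f}%")
```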
1 parent 124758e commit f1b5947

File tree

1 file changed (+108, -117 lines changed)

.ci/scripts/benchmark_tooling/analyze_benchmark_stability.py

Lines changed: 108 additions & 117 deletions
```diff
@@ -66,11 +66,16 @@ def is_matching_dataset(primary_sheet, reference_sheet):
 
 
 def analyze_latency_stability( # noqa: C901
-    primary_file, reference_file=None, output_dir="stability_analysis_results"
+    target_metric,
+    primary_file,
+    reference_file=None,
+    output_dir="stability_analysis_results",
+    verbose_level=0,
 ):
-    print(f"Analyzing latency stability from primary file: {primary_file}")
+    print_section_header(f"Analyzing Stability Against Metric '{target_metric}'")
+    print(f"Primary dataset: {primary_file}")
     if reference_file:
-        print(f"Using reference file for comparison: {reference_file}")
+        print(f"Reference dataset for comparison: {reference_file}")
 
     # Create output directory if it doesn't exist
     if not os.path.exists(output_dir):
@@ -99,31 +104,19 @@ def analyze_latency_stability( # noqa: C901
 
         model, full_device, base_device, os_version = parse_model_device_config(config)
 
-        # Check if required columns exist
-        required_cols = ["avg_inference_latency(ms)", "metadata_info.timestamp"]
-        if "trimmean_inference_latency(ms)" in df.columns:
-            trimmed_col = "trimmean_inference_latency(ms)"
-            required_cols.append(trimmed_col)
-        else:
-            trimmed_col = None
-
-        if "TPS" in df.columns:
-            tps_col = "TPS"
-            required_cols.append(tps_col)
-        else:
-            tps_col = None
-
         # Skip sheets without required columns
+        required_cols = [target_metric, "metadata_info.timestamp"]
         if not all(col in df.columns for col in required_cols):
            print(f" Skipping {sheetName}: Missing required columns")
            continue
 
         # Convert Date to datetime
         df["Date"] = pd.to_datetime(df["metadata_info.timestamp"])
 
-        # Calculate stability metrics
+        # Calculate stability metrics along the target column in the dataset
         metrics = calculate_stability_metrics(
-            df, "avg_inference_latency(ms)", trimmed_col, tps_col
+            df,
+            target_metric,
         )
 
         primary_datasets[sheetName] = {
@@ -161,21 +154,8 @@ def analyze_latency_stability( # noqa: C901
                 config
             )
 
-            # Check if required columns exist
-            required_cols = ["avg_inference_latency(ms)", "metadata_info.timestamp"]
-            if "trimmean_inference_latency(ms)" in df.columns:
-                trimmed_col = "trimmean_inference_latency(ms)"
-                required_cols.append(trimmed_col)
-            else:
-                trimmed_col = None
-
-            if "TPS" in df.columns:
-                tps_col = "TPS"
-                required_cols.append(tps_col)
-            else:
-                tps_col = None
-
             # Skip sheets without required columns
+            required_cols = [target_metric, "metadata_info.timestamp"]
             if not all(col in df.columns for col in required_cols):
                 print(
                     f" Skipping reference {sheetName}: Missing required columns{required_cols}"
@@ -187,7 +167,8 @@ def analyze_latency_stability( # noqa: C901
 
             # Calculate stability metrics
             metrics = calculate_stability_metrics(
-                df, "avg_inference_latency(ms)", trimmed_col, tps_col
+                df,
+                target_metric,
             )
 
             reference_datasets[sheetName] = {
@@ -201,30 +182,33 @@ def analyze_latency_stability( # noqa: C901
             }
 
     # Process primary datasets
-    print_section_header("ANALYZING PRIMARY DATASETS")
-    for sheet, info in primary_datasets.items():
-        # Generate dataset report
-        generate_dataset_report(
-            sheet,
-            info["model"],
-            info["full_device"],
-            "Primary",
-            info["df"],
-            info["metrics"],
-            output_dir,
-        )
+    if verbose_level > 2:
+        print_section_header("ANALYZING PRIMARY DATASETS")
+        for sheet, info in primary_datasets.items():
+            # Generate dataset report
+            generate_dataset_report(
+                sheet,
+                target_metric,
+                info["model"],
+                info["full_device"],
+                "Primary",
+                info["df"],
+                info["metrics"],
+                output_dir,
+            )
 
-        # Generate time series plot
-        if len(info["df"]) > 5: # Only create plot if enough data points
-            generate_time_series_plot(sheet, info["df"], output_dir, "Primary")
+            # Generate time series plot
+            if len(info["df"]) > 5: # Only create plot if enough data points
+                generate_time_series_plot(sheet, info["df"], output_dir, "Primary")
 
     # Process reference datasets if provided
-    if reference_file:
+    if reference_file and verbose_level > 3:
         print_section_header("ANALYZING REFERENCE DATASETS")
         for sheet, info in reference_datasets.items():
             # Generate dataset report
             generate_dataset_report(
                 sheet,
+                target_metric,
                 info["model"],
                 info["full_device"],
                 "Reference",
@@ -238,7 +222,7 @@ def analyze_latency_stability( # noqa: C901
                 generate_time_series_plot(sheet, info["df"], output_dir, "Reference")
 
     # Generate comparison reports for matching datasets
-    if reference_file:
+    if reference_file and verbose_level > 1:
         print_section_header("PRIVATE VS PUBLIC STABILITY COMPARISON")
         matches_found = False
 
@@ -270,9 +254,10 @@ def analyze_latency_stability( # noqa: C901
         if not matches_found:
             print("No matching datasets found between primary and reference files.")
 
-    # Generate intra-primary summary (comparing across different models/devices)
-    print_section_header("INTRA-PRIMARY STABILITY COMPARISON")
-    generate_intra_primary_summary(primary_datasets, output_dir)
+    if verbose_level > 0:
+        # Generate intra-primary summary (comparing across different models/devices)
+        print_section_header("INTRA-PRIMARY STABILITY COMPARISON")
+        generate_intra_primary_summary(primary_datasets, output_dir)
 
     # Generate summary report for all datasets
     print_section_header("COMPREHENSIVE STABILITY SUMMARY")
@@ -285,28 +270,17 @@ def analyze_latency_stability( # noqa: C901
 
 
 def calculate_stability_metrics( # noqa: C901
-    df, raw_col, trimmed_col=None, tps_col=None
+    df,
+    target_metric,
 ):
     """Calculate stability metrics for the given dataset"""
     metrics = {}
-
-    # Extract data
-    raw_latency = df[raw_col].values
-    if trimmed_col and trimmed_col in df.columns:
-        trimmed_latency = df[trimmed_col].values
-    else:
-        trimmed_latency = None
-    if tps_col and tps_col in df.columns:
-        tps = df[tps_col].values
-    else:
-        tps = None
+    # Extract data and ingore NaN values
+    raw_latency = df[target_metric].dropna().values
 
     # Central tendency metrics
     metrics["mean_raw_latency"] = np.mean(raw_latency)
     metrics["median_raw_latency"] = np.median(raw_latency)
-    if trimmed_latency is not None:
-        metrics["mean_trimmed_latency"] = np.mean(trimmed_latency)
-        metrics["median_trimmed_latency"] = np.median(trimmed_latency)
 
     # Dispersion metrics
     metrics["std_raw_latency"] = np.std(raw_latency, ddof=1)
@@ -316,20 +290,10 @@ def calculate_stability_metrics( # noqa: C901
     metrics["iqr_raw_latency"] = np.percentile(raw_latency, 75) - np.percentile(
         raw_latency, 25
     )
-    if trimmed_latency is not None:
-        metrics["std_trimmed_latency"] = np.std(trimmed_latency, ddof=1)
-        metrics["cv_trimmed_latency"] = (
-            metrics["std_trimmed_latency"] / metrics["mean_trimmed_latency"]
-        ) * 100
-        metrics["iqr_trimmed_latency"] = np.percentile(
-            trimmed_latency, 75
-        ) - np.percentile(trimmed_latency, 25)
 
     # Percentile metrics
     for p in [50, 90, 95, 99]:
         metrics[f"p{p}_raw_latency"] = np.percentile(raw_latency, p)
-        if trimmed_latency is not None:
-            metrics[f"p{p}_trimmed_latency"] = np.percentile(trimmed_latency, p)
 
     # Inter-jitter metrics (variability between runs)
     if np.min(raw_latency) > 0:
@@ -342,37 +306,45 @@ def calculate_stability_metrics( # noqa: C901
         metrics["p99_raw_latency"] / metrics["p50_raw_latency"]
     )
 
-    if trimmed_latency is not None:
-        if np.min(trimmed_latency) > 0:
-            metrics["max_min_range_ratio_trimmed"] = np.max(trimmed_latency) / np.min(
-                trimmed_latency
-            )
-        else:
-            metrics["max_min_range_ratio_trimmed"] = float("inf")
-            print(
-                "Warning: Minimum trimmed latency value is zero, max/min ratio set to infinity"
+    # Intra-jitter proxy (if both raw and trimmed latency are available)
+    trimmed_metric_col = "trimmean_inference_latency(ms)"
+    if (
+        target_metric == "avg_inference_latency(ms)"
+        and trimmed_metric_col in df.columns
+    ):
+        trimmed_latency = df[trimmed_metric_col].values
+        if trimmed_latency is not None:
+            metrics["mean_trimmed_latency"] = np.mean(trimmed_latency)
+            metrics["median_trimmed_latency"] = np.median(trimmed_latency)
+            metrics["std_trimmed_latency"] = np.std(trimmed_latency, ddof=1)
+            metrics["cv_trimmed_latency"] = (
+                metrics["std_trimmed_latency"] / metrics["mean_trimmed_latency"]
+            ) * 100
+            metrics["iqr_trimmed_latency"] = np.percentile(
+                trimmed_latency, 75
+            ) - np.percentile(trimmed_latency, 25)
+            for p in [50, 90, 95, 99]:
+                metrics[f"p{p}_trimmed_latency"] = np.percentile(trimmed_latency, p)
+            if np.min(trimmed_latency) > 0:
+                metrics["max_min_range_ratio_trimmed"] = np.max(
+                    trimmed_latency
+                ) / np.min(trimmed_latency)
+            else:
+                metrics["max_min_range_ratio_trimmed"] = float("inf")
+                print(
+                    "Warning: Minimum trimmed latency value is zero, max/min ratio set to infinity"
+                )
+            metrics["p99_p50_ratio_trimmed"] = (
+                metrics["p99_trimmed_latency"] / metrics["p50_trimmed_latency"]
             )
-
-        metrics["p99_p50_ratio_trimmed"] = (
-            metrics["p99_trimmed_latency"] / metrics["p50_trimmed_latency"]
-        )
-
-    # Intra-jitter proxy (if both raw and trimmed are available)
-    if trimmed_latency is not None:
-        trimming_effect = (raw_latency - trimmed_latency) / raw_latency
-        metrics["mean_trimming_effect_ratio"] = np.mean(trimming_effect)
-        metrics["max_trimming_effect_ratio"] = np.max(trimming_effect)
-
-    # TPS metrics
-    if tps is not None:
-        metrics["mean_tps"] = np.mean(tps)
-        metrics["std_tps"] = np.std(tps, ddof=1)
-        metrics["cv_tps"] = (metrics["std_tps"] / metrics["mean_tps"]) * 100
+            trimming_effect = (raw_latency - trimmed_latency) / raw_latency
+            metrics["mean_trimming_effect_ratio"] = np.mean(trimming_effect)
+            metrics["max_trimming_effect_ratio"] = np.max(trimming_effect)
 
     # Time-based stability (rolling window of 5 samples)
     if len(df) >= 5:
         df_sorted = df.sort_values("Date")
-        rolling_std = df_sorted[raw_col].rolling(window=5).std()
+        rolling_std = df_sorted[target_metric].rolling(window=5).std()
         metrics["mean_rolling_std"] = rolling_std.mean()
         metrics["max_rolling_std"] = rolling_std.max()
 
@@ -419,7 +391,7 @@ def calculate_stability_metrics( # noqa: C901
 
 
 def generate_dataset_report( # noqa: C901
-    sheet_name, model, device, dataset_type, df, metrics, output_dir
+    sheet_name, target_column, model, device, dataset_type, df, metrics, output_dir
 ):
     """Generate a detailed report for a single dataset"""
     report_file = f"{output_dir}/{sheet_name}_{dataset_type.lower()}_report.txt"
@@ -436,7 +408,9 @@ def generate_dataset_report( # noqa: C901
 
     # Dataset overview
     report_content.append("Dataset Overview:")
-    report_content.append(f" - Number of samples: {len(df)}")
+    report_content.append(
+        f" - Number of samples: {len(df[target_column].dropna().values)}"
+    )
     report_content.append(f" - Date range: {df['Date'].min()} to {df['Date'].max()}")
     report_content.append("")
 
@@ -719,12 +693,12 @@ def generate_comparison_report( # noqa: C901
 
     # Add key metrics to the table
     metrics_to_compare = [
-        ("Mean Latency (ms)", "mean_raw_latency", "ms"),
-        ("Median Latency (ms)", "median_raw_latency", "ms"),
-        ("Standard Deviation (ms)", "std_raw_latency", "ms"),
+        ("Mean Value", "mean_raw_latency", ""),
+        ("Median Value", "median_raw_latency", ""),
+        ("Standard Deviation", "std_raw_latency", ""),
         ("CV (%)", "cv_raw_latency", "%"),
-        ("IQR (ms)", "iqr_raw_latency", "ms"),
-        ("P99 (ms)", "p99_raw_latency", "ms"),
+        ("IQR", "iqr_raw_latency", ""),
+        ("P99", "p99_raw_latency", ""),
         ("Max/Min Ratio", "max_min_range_ratio_raw", ""),
         ("P99/P50 Ratio", "p99_p50_ratio_raw", ""),
         ("Stability Score", "stability_score", ""),
@@ -1056,7 +1030,7 @@ def generate_intra_primary_summary(primary_datasets, output_dir): # noqa: C901
                 "Sheet": sheet_name,
                 "Model": info["model"],
                 "Device": info["full_device"],
-                "Mean Latency (ms)": info["metrics"]["mean_raw_latency"],
+                "Mean Value": info["metrics"]["mean_raw_latency"],
                 "CV (%)": info["metrics"]["cv_raw_latency"],
                 "Stability Score": info["metrics"]["stability_score"],
                 "Stability Rating": info["metrics"]["stability_rating"],
@@ -1293,7 +1267,7 @@ def generate_summary_report( # noqa: C901
                 "Dataset": sheet_name,
                 "Model": model,
                 "Device": device_display,
-                "Mean Latency (ms)": info["metrics"]["mean_raw_latency"],
+                "Mean Value": info["metrics"]["mean_raw_latency"],
                 "CV (%)": info["metrics"]["cv_raw_latency"],
                 "Stability Score": info["metrics"]["stability_score"],
                 "Stability Rating": info["metrics"]["stability_rating"],
@@ -1330,7 +1304,7 @@ def generate_summary_report( # noqa: C901
                 "Dataset": sheet_name,
                 "Model": model,
                 "Device": device_display,
-                "Mean Latency (ms)": info["metrics"]["mean_raw_latency"],
+                "Mean Value": info["metrics"]["mean_raw_latency"],
                 "CV (%)": info["metrics"]["cv_raw_latency"],
                 "Stability Score": info["metrics"]["stability_score"],
                 "Stability Rating": info["metrics"]["stability_rating"],
@@ -1541,17 +1515,34 @@ def main():
         help="Path to Excel file containing reference (public) benchmark data for comparison",
         default=None,
     )
+    parser.add_argument(
+        "--metric",
+        help="Target metric to analyze (default: avg_inference_latency(ms)). Examples: avg_inference_latency(ms), token_per_sec",
+        default="avg_inference_latency(ms)",
+    )
     parser.add_argument(
         "--output-dir",
         default="stability_analysis_results",
         help="Directory to save analysis results (default: stability_analysis_results)",
     )
-
+    parser.add_argument(
+        "--verbose-level",
+        type=int,
+        default=0,
+        choices=range(4),
+        help="Verbose level 0-3 (default: 0) to control analysis output detail. Higher values show more detailed results.",
+    )
     # Parse arguments
    args = parser.parse_args()
 
     # Run analysis
-    analyze_latency_stability(args.primary_file, args.reference_file, args.output_dir)
+    analyze_latency_stability(
+        args.metric,
+        args.primary_file,
+        args.reference_file,
+        args.output_dir,
+        args.verbose_level,
+    )
 
 
 if __name__ == "__main__":
```
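With the new positional signature, the analysis can also be driven programmatically. A sketch mirroring the wiring in `main()`; the import path is an assumption and depends on how the script is made importable:

```python
# Sketch only: assumes analyze_benchmark_stability.py is on the import path as a module.
from analyze_benchmark_stability import analyze_latency_stability

analyze_latency_stability(
    "token_per_sec",                          # target metric; any column present in the sheets
    "private.xlsx",                           # primary (private) benchmark export
    reference_file="public.xlsx",             # optional public export for comparison
    output_dir="stability_analysis_results",  # where per-dataset reports and plots are written
    verbose_level=0,                          # 0 = comprehensive summary only; higher levels add more sections
)
```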
