compare.py: Add --statistics to show statistical significance with a t-test. #288

aemerson · 2025-10-16T20:57:27Z

The alpha level can also be adjusted with --alpha

tomershafir · 2025-10-19T11:38:48Z

utils/compare.py

+
+        # Compute statistics on raw data before merging (if requested)
+        if config.statistics:
+            # Get metrics early for statistics computation


This part duplicates the logic that follows the conditional, and lacks some of the checks that exist there. The two should be unified.

tomershafir · 2025-10-19T11:45:32Z

utils/compare.py

+            if len(temp_metrics) == 0:
+                defaults = ["Exec_Time", "exec_time", "Value", "Runtime"]
+                for defkey in defaults:
+                    if defkey in lhs_d.columns:


Maybe check if defkey in lhs_d.columns or defkey in rhs_d.columns, to be compatible with the existing flow, or unify the two in some other way.
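
One way to read this suggestion is to pick the fallback metric by looking at both inputs rather than only lhs_d. The sketch below is a minimal illustration, not the code in compare.py: the helper name pick_default_metric is hypothetical, plain lists stand in for lhs_d.columns and rhs_d.columns (the `in` test behaves the same way on a pandas Index), and it assumes only the first matching default is wanted.

```python
DEFAULT_METRICS = ["Exec_Time", "exec_time", "Value", "Runtime"]


def pick_default_metric(lhs_columns, rhs_columns, defaults=DEFAULT_METRICS):
    """Return the first default metric present in either input, or None.

    Hypothetical helper: plain sequences stand in for lhs_d.columns and
    rhs_d.columns.
    """
    for defkey in defaults:
        # Accept the key if either side reports it, not only the lhs.
        if defkey in lhs_columns or defkey in rhs_columns:
            return defkey
    return None
```

Keeping the lookup in one helper would also give the statistics path and the existing path a single place to share, which is the unification asked for above.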

tomershafir · 2025-10-19T11:55:39Z

utils/compare.py

+    stats_dict = {}
+
+    for metric in metrics:
+        if metric not in lhs_d.columns:


Maybe check if metric not in lhs_d.columns or metric not in rhs_d.columns, to be compatible with the existing flow, or unify the two in some other way.

tomershafir · 2025-10-19T11:59:03Z

utils/compare.py

    for metric in data.columns.levels[0]:
        data = add_diff_column(metric, data, absolute_diff=config.absolute_diff)

+    if config.statistics and stats_dict is not None:


Better to check if config.statistics and stats_dict is not None and stats_dict, to cover the case where compute returns an empty dict (it cannot return None).
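
Since the comment notes that compute never returns None, the guard can rely on plain truthiness, because an empty dict is falsy in Python. A minimal sketch, where should_add_statistics is a hypothetical name and a bare boolean stands in for config.statistics:

```python
def should_add_statistics(statistics_enabled, stats_dict):
    """Return True only when statistics were requested and some were computed.

    Hypothetical helper. An empty dict (and None) is falsy, so a single
    truthiness test covers both "is not None" and "is not empty".
    """
    return bool(statistics_enabled and stats_dict)
```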

tomershafir · 2025-10-19T12:08:17Z

utils/compare.py

+
+        # Group by program
+        for program in lhs_d.index.get_level_values(1).unique():
+            lhs_values = lhs_d.loc[(slice(None), program), metric].dropna()


Can't we do something more efficient, like this pseudocode:

for program, group in lhs_d.groupby(level=1):
    lhs_values = group[metric].dropna()
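
To make the pseudocode concrete, here is a small runnable sketch of the groupby form. The toy frame, its index level names, and the helper values_by_program are all assumptions for illustration; the only thing taken from the patch is that the program sits in index level 1. Whether this is measurably faster on real inputs has not been checked here; it simply avoids a separate .loc lookup per program.

```python
import pandas as pd

# Toy stand-in for lhs_d: a two-level index whose second level is the program.
index = pd.MultiIndex.from_tuples(
    [("run0", "a.test"), ("run0", "b.test"), ("run1", "a.test"), ("run1", "b.test")],
    names=["run", "Program"],
)
lhs_d = pd.DataFrame({"exec_time": [1.0, 2.0, 1.5, None]}, index=index)


def values_by_program(frame, metric):
    """Map each program (index level 1) to its non-NaN values for `metric`."""
    return {
        program: group[metric].dropna()
        for program, group in frame.groupby(level=1)
    }


per_program = values_by_program(lhs_d, "exec_time")
```

The same helper could be applied to rhs_d, after which the per-program loop only needs dictionary lookups.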

tomershafir · 2025-10-19T12:10:56Z

utils/compare.py

+
+            stats_dict[metric][program] = {
+                'std_lhs': lhs_values.std(ddof=1) if len(lhs_values) >= 2 else float('nan'),
+                'std_rhs': rhs_values.std(ddof=1) if len(rhs_values) >= 2 else float('nan'),


This should take the name from --rhs-name, e.g. f'std_{rhs_name}'.

tomershafir · 2025-10-19T12:12:59Z

utils/compare.py

+            lhs_values = lhs_d.loc[(slice(None), program), metric].dropna()
+            rhs_values = rhs_d.loc[(slice(None), program), metric].dropna()
+
+            stats_dict[metric][program] = {


Sink this conditional calculation into the if len(lhs_values) >= 2 and len(rhs_values) >= 2: block.
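
A rough sketch of what sinking the calculation could look like. This is not the patch's code: summarize_program, lhs_name and rhs_name are hypothetical names, plain lists stand in for the pandas Series, and the standard library's statistics.stdev (sample standard deviation, n - 1 denominator) is swapped in for Series.std(ddof=1).

```python
import math
import statistics


def summarize_program(lhs_values, rhs_values, lhs_name="lhs", rhs_name="rhs"):
    """Build one per-program entry, computing std only when both sides have >= 2 samples.

    Hypothetical helper: plain lists stand in for the pandas Series.
    """
    # Start from NaN so every key exists even when the guard below fails.
    entry = {f"std_{lhs_name}": math.nan, f"std_{rhs_name}": math.nan}
    if len(lhs_values) >= 2 and len(rhs_values) >= 2:
        # statistics.stdev divides by n - 1, matching Series.std(ddof=1).
        entry[f"std_{lhs_name}"] = statistics.stdev(lhs_values)
        entry[f"std_{rhs_name}"] = statistics.stdev(rhs_values)
        # The t-test from the patch would be computed in this same block.
    return entry
```

One consequence worth noting: when only one side has fewer than two samples, the other side's standard deviation is reported as NaN too, whereas the per-side conditionals in the patch would still compute it.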

tomershafir · 2025-10-19T12:14:59Z

utils/compare.py

+                t_stat, p_val = stats.ttest_ind(lhs_values, rhs_values)
+                stats_dict[metric][program]['t-value'] = t_stat
+                stats_dict[metric][program]['p-value'] = p_val
+                stats_dict[metric][program]['significant'] = "✅" if p_val < alpha else "❌"


Maybe change this to something more parsable but still human-readable, like Y/N.
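
For illustration, the Y/N form could be as small as the sketch below. The helper name significance_marker is hypothetical; alpha is passed in explicitly because the excerpt does not show its default.

```python
def significance_marker(p_val, alpha):
    """Return "Y" when p_val is below alpha, otherwise "N".

    Hypothetical helper. A NaN p-value yields "N", because any comparison
    with NaN evaluates to False.
    """
    return "Y" if p_val < alpha else "N"
```

Plain ASCII markers like these survive CSV export and are easy to filter on with grep or awk.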

tomershafir · 2025-10-19T13:28:53Z

utils/compare.py

+def add_precomputed_statistics(data, stats_dict):
+    """Add precomputed statistics to the unstacked dataframe."""
+    for metric in data.columns.levels[0]:
+        if metric not in stats_dict:


As part of unifying the metrics logic, should this effectively become an (inverted) assert?

tomershafir · 2025-10-19T13:44:39Z

utils/compare.py

+            values = []
+            for program in data.index:
+                if program in stats_dict[metric]:
+                    values.append(stats_dict[metric][program].get(stat_name, float('nan') if stat_name != 'significant' else ""))


Eliminate the duplicated expression float('nan') if stat_name != 'significant' else "" by hoisting it to the top-level loop.
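
A minimal sketch of the hoisted form, under stated assumptions: collect_stat_values is a hypothetical helper, metric_stats stands in for stats_dict[metric], and a program missing from the dict is assumed to receive the same default (the word "duplicated" suggests this, but the other branch is not visible in the excerpt).

```python
def collect_stat_values(programs, metric_stats, stat_name):
    """Collect one statistic per program, filling gaps with a single default.

    Hypothetical helper: `metric_stats` stands in for stats_dict[metric].
    """
    # Decide the placeholder once per statistic instead of once per lookup.
    default = "" if stat_name == "significant" else float("nan")
    return [
        metric_stats.get(program, {}).get(stat_name, default)
        for program in programs
    ]
```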

compare.py: Add --statistics to show statistical significance with a …

c72a17a

…t-test. The alpha level can also be adjusted with --alpha

aemerson requested review from fhahn, guy-david and tomershafir October 16, 2025 20:59

tomershafir requested changes Oct 19, 2025
