-
Notifications
You must be signed in to change notification settings - Fork 365
compare.py: Add --statistics to show statistical significance with a t-test. #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…t-test. The alpha level can also be adjusted with --alpha
|
||
# Compute statistics on raw data before merging (if requested) | ||
if config.statistics: | ||
# Get metrics early for statistics computation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part duplicates logic from after the conditional, and lacks some checks that exist there. Need to unify this logic
if len(temp_metrics) == 0: | ||
defaults = ["Exec_Time", "exec_time", "Value", "Runtime"] | ||
for defkey in defaults: | ||
if defkey in lhs_d.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check if defkey in lhs_d.columns or defkey in rhs_d.columns
to be compatible with the existing flow, or unify elseway
stats_dict = {} | ||
|
||
for metric in metrics: | ||
if metric not in lhs_d.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check if metric not in lhs_d.columns or metric not in rhs_d.columns:
to be compatible with the existing flow, or unify elseway
for metric in data.columns.levels[0]: | ||
data = add_diff_column(metric, data, absolute_diff=config.absolute_diff) | ||
|
||
if config.statistics and stats_dict is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better check if config.statistics and stats_dict is not None and stats_dict:
for when compute returns an empty dict (it cannot return None)
|
||
# Group by program | ||
for program in lhs_d.index.get_level_values(1).unique(): | ||
lhs_values = lhs_d.loc[(slice(None), program), metric].dropna() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cant we do something more efficient like this pseudo code:
for program, group in lhs_d.groupby(level=1):
lhs_values = group[metric].dropna()
|
||
stats_dict[metric][program] = { | ||
'std_lhs': lhs_values.std(ddof=1) if len(lhs_values) >= 2 else float('nan'), | ||
'std_rhs': rhs_values.std(ddof=1) if len(rhs_values) >= 2 else float('nan'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should take the name from --rhs-name
like f'std_{rhs_name}'
lhs_values = lhs_d.loc[(slice(None), program), metric].dropna() | ||
rhs_values = rhs_d.loc[(slice(None), program), metric].dropna() | ||
|
||
stats_dict[metric][program] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sink this conditional calculation into if len(lhs_values) >= 2 and len(rhs_values) >= 2:
t_stat, p_val = stats.ttest_ind(lhs_values, rhs_values) | ||
stats_dict[metric][program]['t-value'] = t_stat | ||
stats_dict[metric][program]['p-value'] = p_val | ||
stats_dict[metric][program]['significant'] = "✅" if p_val < alpha else "❌" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change to something more parsable but still human readable like Y/N
def add_precomputed_statistics(data, stats_dict): | ||
"""Add precomputed statistics to the unstacked dataframe.""" | ||
for metric in data.columns.levels[0]: | ||
if metric not in stats_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of metrics logic unification, should this effectively become an assert (inverted)?
values = [] | ||
for program in data.index: | ||
if program in stats_dict[metric]: | ||
values.append(stats_dict[metric][program].get(stat_name, float('nan') if stat_name != 'significant' else "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eliminate duplicated expression float('nan') if stat_name != 'significant' else ""
and hoist it to the top level loop
The alpha level can also be adjusted with --alpha