Merged — changes from 1 commit
25 changes: 14 additions & 11 deletions src/teehr/metrics/deterministic_funcs.py
@@ -2,6 +2,7 @@
 import numpy as np
 import numpy.typing as npt
 import pandas as pd
+from scipy.stats import rankdata

 from teehr.models.metrics.basemodels import MetricsBasemodel
 from teehr.models.metrics.basemodels import TransformEnum
@@ -340,19 +341,21 @@ def spearman_correlation_inner(p: pd.Series, s: pd.Series) -> float:
     """Spearman Rank Correlation Coefficient."""
     p, s = _transform(p, s, model)

-    primary_rank = p.rank()
-    secondary_rank = s.rank()
-    count = len(p)
+    # calculate ranks (average method for ties)
+    primary_ranks = rankdata(p, method='average')
+    secondary_ranks = rankdata(s, method='average')
+
+    # calculate covariance between the primary and secondary ranks
+    covariance = np.cov(primary_ranks, secondary_ranks)[0, 1]
+
+    # calculate sample standard deviations of ranks (ddof=1 to match np.cov)
+    std_primary = np.std(primary_ranks, ddof=1)
+    std_secondary = np.std(secondary_ranks, ddof=1)

     if model.add_epsilon:
-        result = 1 - (
-            6 * np.sum(np.abs(primary_rank - secondary_rank)**2)
-            / (count * (count**2 - 1)) + EPSILON
-        )
+        result = covariance / (std_primary * std_secondary + EPSILON)
     else:
-        result = 1 - (
-            6 * np.sum(np.abs(primary_rank - secondary_rank)**2)
-            / (count * (count**2 - 1))
-        )
+        result = covariance / (std_primary * std_secondary)

     return result
@samlamont (Collaborator) — Oct 27, 2025:
While doing some manual testing with the setup_v0_3_study evaluation, I was getting slightly different results from this function than from the pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr) methods, which agreed with each other exactly.

I'm not sure what causes the discrepancy, but it would be good to understand why they differ. We could also just use the pandas or scipy method here; scipy's spearmanr has a nan_policy argument that could be helpful.

Contributor Author:
After thorough testing and discussion, the current implementation appears to be the most accurate with respect to handling ties in the ranked data. It also allows error handling via EPSILON for the edge case where the primary or secondary timeseries is a constant array, which produces a divide by zero (and a NaN result from scipy.stats.spearmanr() or pandas.corr(method='spearman')). The differing results between the proposed implementation and the scipy/pandas built-ins seem to stem from their use of the Spearman approximation, res = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1)), with the results diverging more as more ties are present.

Requesting a re-review for merge.
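The trade-off discussed in this thread can be sketched as follows. This is an illustrative comparison, not the TEEHR code itself: the names spearman_cov and spearman_approx and the EPSILON value are assumptions for the sketch. spearman_cov mirrors the covariance-of-ranks form from the diff (with sample standard deviations so the ddof matches np.cov's default), while spearman_approx is the classic 1 - 6Σd²/(n(n²-1)) formula, which is exact only when there are no ties:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

EPSILON = 1e-10  # assumed value for illustration


def spearman_cov(p, s, epsilon=0.0):
    """Covariance-of-ranks Spearman; handles ties exactly."""
    pr = rankdata(p, method='average')
    sr = rankdata(s, method='average')
    cov = np.cov(pr, sr)[0, 1]                       # normalized by n-1
    stds = np.std(pr, ddof=1) * np.std(sr, ddof=1)   # ddof=1 to match np.cov
    return cov / (stds + epsilon)


def spearman_approx(p, s):
    """Classic 1 - 6*sum(d^2)/(n*(n^2-1)); exact only without ties."""
    pr = rankdata(p, method='average')
    sr = rankdata(s, method='average')
    n = len(p)
    return 1 - 6 * np.sum((pr - sr) ** 2) / (n * (n ** 2 - 1))


p = np.array([1.0, 2.0, 2.0, 3.0, 4.0])  # tied values in both series
s = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# With ties, the covariance form agrees with scipy; the approximation drifts.
print(spearman_cov(p, s))      # ~0.6842
print(spearmanr(p, s)[0])      # ~0.6842
print(spearman_approx(p, s))   # 0.7

# Constant input: scipy returns NaN, while the epsilon guard yields 0.0.
print(spearman_cov(np.ones(5), s, epsilon=EPSILON))
```

Whether to keep the in-house epsilon guard or delegate to scipy (e.g. its nan_policy handling) is the design question raised in this review.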

