updates pearson and r_squared to consider epsilon #562
Conversation
samland1116
commented
Oct 15, 2025
- Updates pearson and r_squared to consider epsilon
- Adds tests to confirm the logic replicates the numpy built-ins
still need to increment dev version post-review
samlamont
left a comment
Added a few comments to see if we can avoid the loop, and to update the test data.

Also, out of scope here, but we could think about using numpy masked arrays in the metric functions to handle NaNs in the future, which could become more prevalent when we change how the join works. Just a thought for later.
```python
sum_sq_y += diff_s ** 2

# calculate denominator with epsilon
denominator = np.sqrt(sum_sq_x * sum_sq_y) + EPSILON
```
Could we use numpy's `cov` and `std` here instead of looping over every value, which could be slow for large datasets? Something like:

```python
numerator = np.cov(p, s)[0][1]
denominator = (np.nanstd(p) * np.nanstd(s)) + EPSILON
# calculate result
result = numerator / denominator
```

Not 100% sure this is correct, but it would be good if we could avoid the loop.
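One wrinkle with the snippet above: `np.cov` defaults to `ddof=1` while `np.nanstd` uses `ddof=0`, so the numerator and denominator would be normalized differently. A minimal sketch of a consistent vectorized version (the `EPSILON` value here is a placeholder, not the package's actual constant):

```python
import numpy as np

EPSILON = 1e-10  # placeholder; use the package's own EPSILON constant


def pearson_vectorized(p: np.ndarray, s: np.ndarray) -> float:
    """Vectorized Pearson r with an epsilon guard against divide-by-zero."""
    # bias=True makes np.cov normalize by N (ddof=0), matching np.std's
    # default, so numerator and denominator use the same normalization
    numerator = np.cov(p, s, bias=True)[0][1]
    denominator = (np.std(p) * np.std(s)) + EPSILON
    return numerator / denominator
```

With a constant input array the denominator collapses to `EPSILON` and the result is 0 rather than NaN.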
```python
pearson_e.add_epsilon = True

# get metrics_df
metrics_df_tansformed_e = eval.metrics.query(
```
Looks like this already existed here, but `eval` is a built-in Python function, so we should avoid using it as a variable name; otherwise weird things can happen.
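To illustrate the kind of weirdness shadowing a built-in can cause (a toy sketch, not code from this PR):

```python
def demo_shadowing():
    # rebinding the name locally shadows the built-in eval()
    eval = "an Evaluation object, perhaps"
    try:
        eval("1 + 1")  # the local string is not callable
    except TypeError:
        return "shadowed"
    return "ok"
```

Renaming the variable to something like `ev` avoids the collision entirely.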
```python
        pearson_e
    ]
).to_pandas()
```
It seems like the test data as it is doesn't produce any divide-by-zero warnings, so it's hard to know if the update is actually working. One approach would be to set the primary values to a constant before calculating metrics, to be sure the epsilon is doing its job:
```python
from pyspark.sql.functions import lit

sdf = ev.joined_timeseries.to_sdf()
sdf = sdf.withColumn("primary_value", lit(100.0))
ev.joined_timeseries._write_spark_df(sdf, write_mode="overwrite")
```
Oops, hold off on review. Forgot we added the spearman rank issue to this ticket.
```python
)
result = covariance / (std_primary * std_secondary)

return result
```
I was doing some manual testing using the setup_v0_3_study evaluation and was getting slightly different results between this function and the pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr) methods (which both produced identical results).

Not really sure what the difference between the functions is, but it would be good to understand why they differ. We could also just use the pandas or scipy method here? Looks like scipy has a `nan_policy` argument that could be helpful.
After thorough testing and discussion, it appears the current implementation is the most accurate with regard to handling ties in the ranked data, while also allowing error handling via epsilon for the edge case where the primary/secondary timeseries are constant arrays, resulting in a divide by zero (this produces a NaN result when using `scipy.stats.spearmanr()` or `pandas.corr(method='spearman')`). The differing results between the proposed implementation and the scipy/pandas built-ins seem to stem from their use of the Spearman approximation (`res = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))`), with the results diverging more when more ties are present.
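A small numpy-only sketch of the tie effect described above (the data values are made up for illustration): with ties present, the `d**2` approximation no longer matches the tie-aware definition, i.e. the Pearson correlation of the average ranks.

```python
import numpy as np


def avg_rank(x: np.ndarray) -> np.ndarray:
    """1-based ranks, averaging over tied values (like scipy.stats.rankdata)."""
    order = np.argsort(x, kind="stable")
    sx = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        # extend j to the end of the tied group starting at i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j + 2) / 2.0  # average 1-based rank
        i = j + 1
    return ranks


p = np.array([1.0, 2.0, 2.0, 2.0, 3.0])  # made-up data with ties
s = np.array([1.0, 1.0, 2.0, 3.0, 3.0])
rp, rs = avg_rank(p), avg_rank(s)
n = len(p)

# Spearman approximation (exact only when there are no ties)
d = rp - rs
approx = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))

# tie-aware definition: Pearson r of the ranks
tie_aware = np.corrcoef(rp, rs)[0, 1]
```

Here `approx` comes out to 0.75 while `tie_aware` is about 0.707, consistent with the observation that the two diverge as the number of ties grows.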
Requesting a rereview for merge.
closes #556
- updates pearson and r_squared to consider epsilon
- addressed PR feedback
- increment dev version to 0.5.1dev9 from 0.5.1dev8
- update naive spearman method to consider ties
- increment dev version in pyproject.toml