
updates pearson and r_squared to consider epsilon #562

Merged
samland1116 merged 5 commits into main from
556-divide-by-zero-still-occurs-with-add_espilson-true-for-pearsoncorrelation-and-rsquared
Oct 29, 2025

Conversation

@samland1116 (Contributor)

  • Updates pearson and r_squared to consider epsilon
  • Adds tests to confirm the logic replicates NumPy built-ins
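
For context, "replicates NumPy built-ins" means checks along these lines (an illustrative sketch only; the real tests live in this PR, and the package function name is hypothetical):

    import numpy as np

    p = np.array([1.0, 2.0, 3.0, 4.0])
    s = np.array([1.1, 1.9, 3.2, 3.8])

    expected = np.corrcoef(p, s)[0, 1]               # NumPy's built-in Pearson
    # assert np.isclose(our_pearson(p, s), expected) # hypothetical package name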

@samland1116 added this to the v0.6 Release milestone Oct 15, 2025
@samland1116 self-assigned this Oct 15, 2025
@samland1116 added the bug (Something isn't working) label Oct 15, 2025
@samland1116 (Contributor, Author)

Still need to increment the dev version post-review.

@samlamont (Collaborator) left a comment

Added a few comments to see if we can avoid the loop, and to update the test data.

Also, out of scope here, but we could think about using NumPy masked arrays in the metric functions to handle NaNs in the future, which could become more prevalent when we change how the join works. Just a thought for later.
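
For instance (just a sketch of the idea, not the package API):

    import numpy as np

    s = np.ma.masked_invalid([1.0, np.nan, 3.0])  # mask the NaN entry
    print(s.mean())  # 2.0 -- masked values are excluded from statistics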

sum_sq_y += diff_s ** 2

# calculate denominator with epsilon
denominator = np.sqrt(sum_sq_x * sum_sq_y) + EPSILON
@samlamont (Collaborator)

Could we use NumPy's cov and std here instead of looping over every value, which could be slow for large datasets?

Something like

            numerator = np.cov(p, s)[0][1]
            denominator = (np.nanstd(p) * np.nanstd(s)) + EPSILON

            # calculate result
            result = numerator / denominator

Not 100% sure this is correct, but it would be good if we could avoid the loop.
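
One wrinkle worth noting (a minimal sketch; the names and EPSILON value are illustrative): np.cov defaults to ddof=1 while np.std defaults to ddof=0, so mixing them skews the estimate slightly. Passing ddof=0 to np.cov keeps the numerator and denominator consistent:

    import numpy as np

    EPSILON = 1e-10  # illustrative; the package defines its own constant

    def pearson_with_epsilon(p: np.ndarray, s: np.ndarray) -> float:
        # np.cov defaults to ddof=1; np.std defaults to ddof=0.
        # Use ddof=0 in both so the two parts of the ratio agree.
        numerator = np.cov(p, s, ddof=0)[0][1]
        denominator = (np.std(p) * np.std(s)) + EPSILON
        return numerator / denominator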


# calculate denominator with epsilon
denominator = np.sqrt(sum_sq_x * sum_sq_y) + EPSILON

@samlamont (Collaborator)

Same as above.

pearson_e.add_epsilon = True

# get metrics_df
metrics_df_transformed_e = eval.metrics.query(
@samlamont (Collaborator)

Looks like this already existed here, but eval is a built-in Python function, so we should avoid using it as a variable name; otherwise weird things can happen.
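
A toy illustration of the hazard (the dict just stands in for whatever object gets bound to the name):

    eval = {"metrics": []}   # shadows Python's built-in eval()
    eval("1 + 1")            # TypeError: 'dict' object is not callable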

pearson_e
]
).to_pandas()

@samlamont (Collaborator)

It seems like the test data as-is doesn't trigger any divide-by-zero warnings, so it's hard to know whether the update is actually working. One approach would be to set the primary values to a constant before calculating metrics, to be sure the epsilon is doing its job:

    from pyspark.sql.functions import lit

    # Force zero variance in the primary series so the divide-by-zero
    # path (and the epsilon guard) is actually exercised
    sdf = ev.joined_timeseries.to_sdf()
    sdf = sdf.withColumn("primary_value", lit(100.0))
    ev.joined_timeseries._write_spark_df(sdf, write_mode="overwrite")
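
As a sanity check on the idea (a minimal NumPy sketch with an illustrative EPSILON, not the package's test code), a constant primary series zeroes the denominator, and only the epsilon keeps the result finite:

    import numpy as np

    EPSILON = 1e-10                              # illustrative value

    p = np.full(10, 100.0)                       # constant -> zero variance
    s = np.linspace(1.0, 10.0, 10)

    denominator = (np.std(p) * np.std(s)) + EPSILON
    result = np.cov(p, s, ddof=0)[0][1] / denominator
    assert np.isfinite(result)                   # 0.0 here, not NaN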

@samland1116 requested a review from samlamont October 21, 2025 20:45
@samland1116 (Contributor, Author)

Oops, hold off on review. Forgot we added the Spearman rank issue to this ticket.

)
result = covariance / (std_primary * std_secondary)

return result
@samlamont (Collaborator) Oct 27, 2025

I was doing some manual testing using the setup_v0_3_study evaluation and was getting slightly different results between this function and the pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr) methods (which both produced identical results).

Not really sure what the difference between the functions is, but it would be good to understand why they differ. We could also just use the pandas or scipy method here? It looks like scipy has a nan_policy argument that could be helpful.
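
For reference (a minimal sketch with made-up arrays), nan_policy="omit" tells scipy to ignore NaN entries when computing the statistic:

    import numpy as np
    from scipy.stats import spearmanr

    p = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
    s = np.array([2.0, 1.0, 3.0, 4.0, 6.0])

    rho, pval = spearmanr(p, s, nan_policy="omit")  # NaNs ignored
    print(rho)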

@samland1116 (Contributor, Author)

After thorough testing and discussion, it appears that the current implementation is the most accurate with regard to handling ties in the ranked data, while also allowing error handling via epsilon for the edge case where the primary/secondary timeseries are constant arrays, producing a divide by zero (that case yields a NaN result from scipy.stats.spearmanr() or pandas.corr(method='spearman')). The differing results between the proposed implementation and the scipy/pandas built-ins seem to come from their use of the Spearman approximation, res = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1)), with the results diverging further as more ties are present.
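
To illustrate (a small sketch with made-up tied data, using scipy.stats.rankdata for average ranks): Pearson on the ranks and the d-squared shortcut diverge once ties appear:

    import numpy as np
    from scipy.stats import rankdata

    p = np.array([1.0, 2.0, 2.0, 2.0, 3.0])   # tied values
    s = np.array([1.0, 1.0, 2.0, 3.0, 3.0])

    rp, rs = rankdata(p), rankdata(s)          # average ranks for ties

    # Exact form: Pearson correlation of the ranks
    exact = np.corrcoef(rp, rs)[0, 1]

    # Shortcut: only valid when every rank is distinct
    d, n = rp - rs, len(p)
    approx = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))

    print(exact, approx)                       # ~0.707 vs 0.75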

Requesting a re-review for merge.

@samland1116 requested a review from samlamont October 29, 2025 17:25
@samland1116 merged commit 961a7f0 into main Oct 29, 2025
@samland1116 deleted the 556-divide-by-zero-still-occurs-with-add_espilson-true-for-pearsoncorrelation-and-rsquared branch October 29, 2025 19:21
@samland1116 (Contributor, Author)

closes #556

samlamont pushed a commit that referenced this pull request Nov 3, 2025
* updates pearson and r_squared to consider epsilon

* addressed PR feedback

* increment dev version to 0.5.1dev9 from 0.5.1dev8

* update naive spearman method to consider ties

* increment dev version in pyproject.toml
