updates pearson and r_squared to consider epsilon #562
Conversation
samland1116
commented
Oct 15, 2025
- Updates pearson and r_squared to consider epsilon
- Adds tests to confirm the logic replicates the numpy built-ins
still need to increment dev version post-review
samlamont
left a comment
Added a few comments to see if we can avoid the loop, and to update the test data.

Also, out of scope here, but we could think about using numpy masked arrays in the metric functions to handle NaNs in the future, which could become more prevalent when we change how the join works. Just a thought for later.
```python
sum_sq_y += diff_s ** 2

# calculate denominator with epsilon
denominator = np.sqrt(sum_sq_x * sum_sq_y) + EPSILON
```
Could we use numpy's `cov` and `std` here instead of looping over every value, which could be slow for large datasets? Something like:

```python
numerator = np.cov(p, s)[0][1]
denominator = (np.nanstd(p) * np.nanstd(s)) + EPSILON
# calculate result
result = numerator / denominator
```

Not 100% sure this is correct, but it would be good if we could avoid the loop.
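One wrinkle with the snippet above: `np.cov` defaults to `ddof=1` while `np.nanstd` uses `ddof=0`, so the numerator and denominator would be normalized differently. A minimal sketch of a consistent vectorized version (the `EPSILON` value here is a placeholder, not the package's actual constant):

```python
import numpy as np

EPSILON = 1e-10  # placeholder; use the package's own EPSILON constant


def pearson_vectorized(p: np.ndarray, s: np.ndarray) -> float:
    """Vectorized Pearson r with an epsilon guard against divide-by-zero."""
    # bias=True makes np.cov normalize by N (ddof=0), matching np.std's
    # default, so numerator and denominator use the same normalization
    numerator = np.cov(p, s, bias=True)[0][1]
    denominator = (np.std(p) * np.std(s)) + EPSILON
    return numerator / denominator
```

With a constant input array the denominator collapses to `EPSILON` and the result is 0 rather than NaN.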
```python
pearson_e.add_epsilon = True

# get metrics_df
metrics_df_tansformed_e = eval.metrics.query(
```
Looks like this already existed here, but `eval` is a built-in Python function, so we should avoid using it as a variable name; otherwise weird things can happen.
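To illustrate the kind of weirdness shadowing a built-in can cause (a toy sketch, not code from this PR):

```python
def demo_shadowing():
    # rebinding the name locally shadows the built-in eval()
    eval = "an Evaluation object, perhaps"
    try:
        eval("1 + 1")  # the local string is not callable
    except TypeError:
        return "shadowed"
    return "ok"
```

Renaming the variable to something like `ev` avoids the collision entirely.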
```python
        pearson_e
    ]
).to_pandas()
```
It seems like the test data as it is doesn't produce any divide-by-zero warnings, so it's hard to know if the update is actually working. One approach would be to set the primary values to a constant before calculating metrics, to be sure the epsilon is doing its job:
```python
from pyspark.sql.functions import lit

sdf = ev.joined_timeseries.to_sdf()
sdf = sdf.withColumn("primary_value", lit(100.0))
ev.joined_timeseries._write_spark_df(sdf, write_mode="overwrite")
```
Oops, hold off on review. Forgot we added the spearman rank issue to this ticket.
```python
)
result = covariance / (std_primary * std_secondary)

return result
```
I was doing some manual testing using the setup_v0_3_study evaluation and was getting slightly different results between this function and the pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) and scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr) methods (which both produced identical results).

Not really sure what the difference between the functions is, but it would be good to understand why they differ. We could also just use the pandas or scipy method here? Looks like scipy has a `nan_policy` argument that could be helpful.
After thorough testing and discussion, it appears the current implementation is the most accurate with regard to handling ties in the ranked data, while also allowing error handling via epsilon for the edge case where the primary/secondary timeseries are constant arrays, resulting in a divide by zero (this produces a NaN result when using `scipy.stats.spearmanr()` or `pandas.corr(method='spearman')`). The differing results between the proposed implementation and the scipy/pandas built-ins seem to stem from their use of the Spearman approximation (`res = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))`), with the results diverging more when more ties are present.
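A small numpy-only sketch of the tie effect described above (the data values are made up for illustration): with ties present, the `d**2` approximation no longer matches the tie-aware definition, i.e. the Pearson correlation of the average ranks.

```python
import numpy as np


def avg_rank(x: np.ndarray) -> np.ndarray:
    """1-based ranks, averaging over tied values (like scipy.stats.rankdata)."""
    order = np.argsort(x, kind="stable")
    sx = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        # extend j to the end of the tied group starting at i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j + 2) / 2.0  # average 1-based rank
        i = j + 1
    return ranks


p = np.array([1.0, 2.0, 2.0, 2.0, 3.0])  # made-up data with ties
s = np.array([1.0, 1.0, 2.0, 3.0, 3.0])
rp, rs = avg_rank(p), avg_rank(s)
n = len(p)

# Spearman approximation (exact only when there are no ties)
d = rp - rs
approx = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))

# tie-aware definition: Pearson r of the ranks
tie_aware = np.corrcoef(rp, rs)[0, 1]
```

Here `approx` comes out to 0.75 while `tie_aware` is about 0.707, consistent with the observation that the two diverge as the number of ties grows.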
Requesting a rereview for merge.
closes #556
- updates pearson and r_squared to consider epsilon
- addressed PR feedback
- increment dev version to 0.5.1dev9 from 0.5.1dev8
- update naive spearman method to consider ties
- increment dev version in pyproject.toml