Skip to content

Note difference in rounding behavior between Sklearn and Onnx models #767

Merged
mike-w-wilson merged 6 commits intomainfrom
dm/sklearn_warning
Sep 17, 2025
Merged

Note difference in rounding behavior between Sklearn and Onnx models #767
mike-w-wilson merged 6 commits intomainfrom
dm/sklearn_warning

Conversation

@matren395
Copy link
Contributor

Note that gnomAD's assign_population_pcs() code DOES round outputs for prob_gen_anc when using an sklearn model but does NOT when using an onnx model.

For a given cutoff of 0.75 (for example) for prob_nfe , some sample with a real probability of 0.7499999999 could be imputed as 'Remaining' when using an onnx model but as 'Non-Finnish European' when using a sklearn model.

@matren395 matren395 self-assigned this Feb 21, 2025
Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this in! I suggest a little more detail in the message which I think clarifies where to expect the difference. Please rework if you dont like it but if you do, LGTM.

raise TypeError("The supplied model is not an sklearn model!")

logger.warning(
"sklearn models have different rounding behavior than ONNX models. This may lead to subtly different results around cutoffs."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"sklearn models have different rounding behavior than ONNX models. This may lead to subtly different results around cutoffs."
"sklearn models have different rounding behavior than ONNX models. This may "
"lead to subtly different assignment results for samples around probability "
"cutoffs."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, thanks! I originally had something wordier, but was worried it was too much.

GitHub's browser is giving me a hard time committing your suggestion as-is, so I had to do it in a separate commit, then I'll merge it

@matren395
Copy link
Contributor Author

1 - split this into two logger.warning() statements, one that .pickle/sklearn is generally just kinda worse practice now, and 2- that changing can lead to those rounding errors

article Julia G. sent me abt this - https://medium.com/featurepreneur/pickle-is-sour-lets-use-onnx-90c0805338ac

@matren395 matren395 requested a review from a team as a code owner July 15, 2025 14:25
@matren395
Copy link
Contributor Author

Sorry, forgot about this for a bit! Rebased it and the commits are a bit messy , but up to date!

matren395 and others added 6 commits September 17, 2025 11:26
more descriptive
sklearn!

# Conflicts:
#	gnomad/sample_qc/ancestry.py
# Conflicts:
#	gnomad/sample_qc/ancestry.py
@mike-w-wilson mike-w-wilson merged commit 897d750 into main Sep 17, 2025
6 checks passed
@mike-w-wilson mike-w-wilson deleted the dm/sklearn_warning branch September 17, 2025 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants