@KaishinShaw

Description

This PR improves the observability of the data preprocessing pipeline.

Currently, if a user encounters a "No gene exist." error, it is difficult to pinpoint whether the issue stems from:

  1. Sample ID mismatch during intersection.
  2. Mismatched Gene IDs in the provided gene list.
  3. Overly aggressive expression filtering.

Changes

  • Added logging.info statements to report gene and sample counts after key steps:
    • Raw phenotype loading.
    • User-provided gene list loading.
    • create_readydata (intersection of genotype, phenotype, and covariates).
    • Zero-expression filtering.
    • Final gene list and expression threshold filtering.
  • Added a logging.warning if the dataset becomes empty immediately after intersection.
  • Added a logging.error if the final gene count is zero.

These changes allow users to easily debug data input issues by checking the log output.
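As a rough illustration of the logging pattern described above, here is a minimal, self-contained sketch. It is not the code from this PR: the function name `filter_genes_with_logging`, the nested-dict data layout, and the `min_mean` threshold are all hypothetical stand-ins, and the sketch does not call `create_readydata` because its signature is not shown here. It only demonstrates reporting counts after each step, warning on an empty intersection, and logging an error when no genes remain.

```python
import logging

logger = logging.getLogger("preprocess_sketch")  # hypothetical logger name


def filter_genes_with_logging(expr, geno_samples, covar_samples, gene_list, min_mean=0.0):
    """Return the genes that survive every filtering step.

    expr is a hypothetical layout: {gene_id: {sample_id: expression_value}}.
    """
    pheno_samples = {s for values in expr.values() for s in values}
    logger.info("Raw phenotype: %d genes, %d samples", len(expr), len(pheno_samples))
    logger.info("User-provided gene list: %d genes", len(gene_list))

    # Step 1: keep only samples present in genotype, phenotype, and covariates.
    shared = pheno_samples & set(geno_samples) & set(covar_samples)
    expr = {g: {s: v for s, v in values.items() if s in shared}
            for g, values in expr.items()}
    logger.info("After intersection: %d genes, %d samples", len(expr), len(shared))
    if not shared:
        logger.warning("No samples shared across inputs; check for sample ID mismatches.")

    # Step 2: drop genes with zero expression in every remaining sample.
    expr = {g: values for g, values in expr.items()
            if any(v != 0 for v in values.values())}
    logger.info("After zero-expression filtering: %d genes", len(expr))

    # Step 3: apply the gene list and the (assumed) mean-expression threshold.
    wanted = set(gene_list)
    kept = [g for g, values in expr.items()
            if g in wanted and sum(values.values()) / len(values) > min_mean]
    logger.info("After gene list and threshold filtering: %d genes", len(kept))
    if not kept:
        logger.error("No genes remain; check gene IDs and the expression threshold.")
    return kept
```

Calling `logging.basicConfig(level=logging.INFO)` before running the function makes the per-step counts visible, so the step at which the count drops to zero indicates which of the three causes listed in the description applies.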
@zixuanzhang
Collaborator

Thank you for bringing this to our attention and making the edits accordingly! We are working on improving the software and will merge these changes in a future release!

@quattro
Contributor

quattro commented Dec 15, 2025

Wow, thank you so much. As @zixuanzhang mentioned, we have an updated version on another branch that we're hoping to release soon, and it also includes the features you developed here.

Hoping to get them out before the holidays.
