Skip to content

Speedup and refactor remove variants cmd#6094

Open
northwestwitch wants to merge 22 commits intomainfrom
speedup_remove_variants
Open

Speedup and refactor remove variants cmd#6094
northwestwitch wants to merge 22 commits intomainfrom
speedup_remove_variants

Conversation

@northwestwitch
Copy link
Member

@northwestwitch northwestwitch commented Mar 4, 2026

This PR adds a functionality or fixes a bug.

Testing on cg-vm1 server (Clinical Genomics Stockholm)

Prepare for testing

  1. Make sure the PR is pushed and available on Docker Hub
  2. First book your testing time using the Pax software available at https://pax.scilifelab.se/. The resource you are going to call dibs on is scout-stage and the server is cg-vm1.
  3. ssh <USER.NAME>@cg-vm1.scilifelab.se
  4. sudo -iu hiseq.clinical
  5. ssh localhost
  6. (optional) Find out which scout branch is currently deployed on cg-vm1: podman ps
  7. Stop the service with current deployed branch: systemctl --user stop scout@<name_of_currently_deployed_branch>
  8. Start the scout service with the branch to test: systemctl --user start scout@<this_branch>
  9. Make sure the branch is deployed: systemctl --user status scout.target
  10. After testing is done, repeat procedure at https://pax.scilifelab.se/, which will release the allocated resource (scout-stage) to be used for testing by other users.
Testing on hasta server (Clinical Genomics Stockholm)

Prepare for testing

  1. ssh <USER.NAME>@hasta.scilifelab.se
  2. Book your testing time using the Pax software. us; paxa -u <user> -s hasta -r scout-stage. You can also use the WSGI Pax app available at https://pax.scilifelab.se/.
  3. (optional) Find out which scout branch is currently deployed on cg-vm1: conda activate S_scout; pip freeze | grep scout-browser
  4. Deploy the branch to test: bash /home/proj/production/servers/resources/hasta.scilifelab.se/update-tool-stage.sh -e S_scout -t scout -b <this_branch>
  5. Make sure the branch is deployed: us; scout --version
  6. After testing is done, repeat the paxa procedure, which will release the allocated resource (scout-stage) to be used for testing by other users.

How to test:

  1. how to test it, possibly with real cases/data

Expected outcome:
The functionality should be working
Take a screenshot and attach or copy/paste the output.

Review:

  • code approved by DN
  • tests executed by CR

@northwestwitch northwestwitch changed the title Speedup remove variants Speedup and refactor remove variants cmd Mar 4, 2026
@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 4, 2026

@sonarqubecloud
Copy link

sonarqubecloud bot commented Mar 4, 2026

@northwestwitch northwestwitch force-pushed the speedup_remove_variants branch from 9095321 to 0e92415 Compare March 4, 2026 13:45
@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 92.15686% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.90%. Comparing base (4602f73) to head (d348692).

Files with missing lines Patch % Lines
scout/commands/delete/variants.py 92.63% 7 Missing ⚠️
scout/adapter/mongo/query.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6094      +/-   ##
==========================================
- Coverage   83.93%   83.90%   -0.04%     
==========================================
  Files         343      343              
  Lines       21174    21207      +33     
==========================================
+ Hits        17773    17793      +20     
- Misses       3401     3414      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@northwestwitch northwestwitch force-pushed the speedup_remove_variants branch from 0408d18 to 5705c49 Compare March 9, 2026 10:37
@northwestwitch
Copy link
Member Author

northwestwitch commented Mar 9, 2026

Ok, now the dry run is quite quick: less than 1m23s (it took hours before crashing last time)

Tested with the following command for instance: time scout delete variants -u chiara.rasi@scilifelab.se --rank-threshold 5 --variants-threshold 50000 --dry-run

image

It created a tab-separated list of cases 113 with more than 50K variants which can be used in the following step (delete without dry run):

NOTE THAT THE NON DRY-RUN COMMAND IS STILL SLOW but I'm reasoning that we don't care that much because we are processing the removal by batches of cases after having confirmed with the clinicians that those cases are fine to purge - at least we did like this last time. Note that the script accepts also a list of case IDs for this reason.

@northwestwitch northwestwitch marked this pull request as ready for review March 9, 2026 12:40
remove_n_variants = 0
remove_n_omics_variants = 0

if not dry_run:
Copy link
Member Author

@northwestwitch northwestwitch Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before if it was dry-run it was counting exactly how many vars would be deleted. I figured WHO CARES: what we are interested in is a list of cases with more than a number of variants and their specifics, to be be processed afterwards

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though it isn't perhaps a full dry-run: maybe we should have another name for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be perhaps more appropriate, but I don't think it's worth because it would be a major change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. We could add it to that list perhaps! Or split into two, a true dry-run, and a case variants count option. But lets not overdo it - its a use-once-a-year kind of thing.

@dnil
Copy link
Member

dnil commented Mar 9, 2026

Nice! I never thought I'd see the words "speedup" and "add progressbar" in the same PR! 😁

Copy link
Member

@dnil dnil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super! Good continued subdivision of the problems, nice with the explicit logging to file and very nice to start treating the research variants separately. I think the latter still needs a bit of work, if I read this correctly. While those "snv_research" categories are very nice on the CLI (same as for upload file categories, etc) they don't really exist on the db for searching.

I still feel it would be helpful to just save the cases from an open cursor to a list, while we are working. The individual case queries are so slow, they easily time out the operation, even if one makes manual batches.

remove_n_variants = 0
remove_n_omics_variants = 0

if not dry_run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though it isn't perhaps a full dry-run: maybe we should have another name for it.

return
delete_stats["case_counter"] += 1
case_evaluated, _ = store.evaluated_variants(
case_id=case["_id"], institute_id=case["owner"], limit_dismissed=5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for the 5? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to speed it up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I do think it is correct: just wondering if you saw it somehow safer than say 2, or less so than 15?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I haven't tested. We can have 15 if you prefer, doesn't matter to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also fine with 0 as long as we understand what is happening?

],
}
if remove_ctg:
query["category"] = {"$in": remove_ctg}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we never split the CTGs right? So these would still have those whatever_research on them? But it would never propagate to a { category: 'whatever', variant_type: "research" }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, now that you tell me I remember. I'll fix, or perhaps I'll just leave it as it was

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted now! Very good point!

@northwestwitch
Copy link
Member Author

When you have time I'm ready for another round on this one @dnil. Thanks!

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Delete variants command does not allow to specifically remove research variants Speed up variant cleanup

3 participants