Fix VCF integration assertion, revert extract scatters [VS-1820]#9340
Fix VCF integration assertion, revert extract scatters [VS-1820]#9340mcovarr merged 9 commits intoah_var_storefrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a broken numeric comparison in the VCF integration WDL assertion by using awk’s correct float comparison operator, ensuring the cost-diff tolerance check actually fails when it should.
Changes:
- Replace invalid
-gtusage in anawknumeric comparison with>so the assertion behaves correctly. - Update
.dockstore.ymlbranch filters to include the PR branch for the integration workflow.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl | Corrects the awk comparison operator so tolerance assertions don’t silently pass. |
| .dockstore.yml | Updates Dockstore branch filter to include the VS-1820 fix branch for GvsQuickstartIntegration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f669666 to
2bb4c5c
Compare
koncheto-broad
left a comment
There was a problem hiding this comment.
To be clear, we're changing the behavior of extract to all users with less than 5000 samples in this case in order to fix this assertion? Because we only changed the scattering to 1 in order to satisfy the demand from some users for "one shard per chromosome" when possible. But that made extract longer and less reliable, so we changed it back to sharding to 500 for callsets below 5k. I'm not 100% sure that we should revert extract behavior to one shard per chromosome for the sake of making our integrations tests succeed without at least pricing out the actual cost consequences of leaving it the way it is and just changing the assertion. The api we use to pull the data is cheap and a relatively small fraction of extract cost, so I'd love to know how much it is actually costing us to leave things the (inefficient) way they are for now and potentially disable the test until we solve the issue. Could you calculate the actual cost consequences in two runs with the different sharding behaviors?
|
Github actions tests reported job failures from actions build 22922237195
|
awk does not use
-gtfor numeric comparisons; this should be (and used to be)>. Unfortunately our version of awk doesn't seem to complain when we try to use-gtand instead just silently exits with rc 0. 😞Appropriately failing integration test here with just the comparison fixed.
After further research, I found that these large discrepancies between expected and actual values only seem to happen with 3 sample integration runs. For AnVIL 3K, using the 500-wide scatter actually significantly reduced Storage API bytes scanned during extract. Shards appear to have downloaded only the data they needed (more or less) and were far less subject to preemptions due to their much shorter runtimes.
The final changes here were to update the
cost_observability.tsvfile for Exome, BGE, VETS and VQSR for both 20/X/Y and all chromosomes to match the currently observed Storage API usage with a 500 wide extract scatter.