Skip to content

Fix VCF integration assertion, revert extract scatters [VS-1820]#9340

Merged
mcovarr merged 9 commits intoah_var_storefrom
vs_1820_fix_vcf_integration_assertion
Mar 10, 2026
Merged

Fix VCF integration assertion, revert extract scatters [VS-1820]#9340
mcovarr merged 9 commits intoah_var_storefrom
vs_1820_fix_vcf_integration_assertion

Conversation

@mcovarr
Copy link
Collaborator

@mcovarr mcovarr commented Mar 2, 2026

awk does not use -gt for numeric comparisons; this should be (and used to be) >. Unfortunately our version of awk doesn't seem to complain when we try to use -gt and instead just silently exits with rc 0. 😞

$ DIFF_FOUND=0.2
$ TOLERANCE=0.1
$ awk "BEGIN{ exit ($DIFF_FOUND -gt $TOLERANCE) }"
$ echo $?
0
$ awk "BEGIN{ exit ($DIFF_FOUND > $TOLERANCE) }"
$ echo $?
1
$ awk "BEGIN{ exit ($DIFF_FOUND -lt $TOLERANCE) }"
$ echo $?
0
$ awk "BEGIN{ exit ($DIFF_FOUND < $TOLERANCE) }"
$ echo $?
0

Appropriately failing integration test here with just the comparison fixed.

After further research, I found that these large discrepancies between expected and actual values only seem to happen with 3 sample integration runs. For AnVIL 3K, using the 500-wide scatter actually significantly reduced Storage API bytes scanned during extract. Shards appear to have downloaded only the data they needed (more or less) and were far less subject to preemptions due to their much shorter runtimes.

The final changes here were to update the cost_observability.tsv file for Exome, BGE, VETS and VQSR for both 20/X/Y and all chromosomes to match the currently observed Storage API usage with a 500 wide extract scatter.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a broken numeric comparison in the VCF integration WDL assertion by using awk’s correct float comparison operator, ensuring the cost-diff tolerance check actually fails when it should.

Changes:

  • Replace invalid -gt usage in an awk numeric comparison with > so the assertion behaves correctly.
  • Update .dockstore.yml branch filters to include the PR branch for the integration workflow.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl Corrects the awk comparison operator so tolerance assertions don’t silently pass.
.dockstore.yml Updates Dockstore branch filter to include the VS-1820 fix branch for GvsQuickstartIntegration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mcovarr mcovarr marked this pull request as draft March 2, 2026 21:28
@mcovarr mcovarr force-pushed the vs_1820_fix_vcf_integration_assertion branch from f669666 to 2bb4c5c Compare March 3, 2026 14:05
@mcovarr mcovarr marked this pull request as ready for review March 3, 2026 19:55
@mcovarr mcovarr changed the title Fix VCF integration assertion [VS-1820] Fix VCF integration assertion, revert extract scatters [VS-1820] Mar 3, 2026
Copy link
Collaborator

@koncheto-broad koncheto-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, we're changing the behavior of extract to all users with less than 5000 samples in this case in order to fix this assertion? Because we only changed the scattering to 1 in order to satisfy the demand from some users for "one shard per chromosome" when possible. But that made extract longer and less reliable, so we changed it back to sharding to 500 for callsets below 5k. I'm not 100% sure that we should revert extract behavior to one shard per chromosome for the sake of making our integrations tests succeed without at least pricing out the actual cost consequences of leaving it the way it is and just changing the assertion. The api we use to pull the data is cheap and a relatively small fraction of extract cost, so I'd love to know how much it is actually costing us to leave things the (inefficient) way they are for now and potentially disable the test until we solve the issue. Could you calculate the actual cost consequences in two runs with the different sharding behaviors?

@mcovarr mcovarr merged commit caaa54f into ah_var_store Mar 10, 2026
14 of 17 checks passed
@mcovarr mcovarr deleted the vs_1820_fix_vcf_integration_assertion branch March 10, 2026 20:17
@gatk-bot
Copy link

Github actions tests reported job failures from actions build 22922237195
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 22922237195.3 logs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants