Integrate GOTTCHA2 hits into shared state management #17
Open
nrminor wants to merge 5 commits into schema-v2.5.0 from
Conversation
Add a source field ("blast" or "gottcha2") to HitRecord and HitObservation so hits from different identification tools can coexist in the same parquet store. Rename contig_id to sequence_id to generalize beyond BLAST contigs (GOTTCHA2 hits are individual reads, not assembled contigs).
Backward compatibility with pre-migration parquet files uses a three-part strategy:
1. New writes include both sequence_id (the real value) and contig_id=NULL. This ensures DuckDB's union_by_name=true always surfaces both columns when reading a mix of old and new files. (union_by_name only creates columns present in at least one file — if all files lack sequence_id, referencing it in SQL is a binder error that TRY() cannot catch.)
2. The DuckDB view uses EXCLUDE (contig_id, sequence_id) + COALESCE(sequence_id, contig_id) AS sequence_id, so downstream queries see a single normalized sequence_id column regardless of file vintage.
3. Compaction calls normalize_schema() which renames contig_id to sequence_id, adds source="blast" default for legacy rows, and retains a NULL contig_id stub so the EXCLUDE/COALESCE pattern continues to work even when reading only compacted files.
Once all pre-migration parquet files have been compacted and no legacy data remains on disk, a future cleanup can stop writing contig_id=NULL in new files, drop the EXCLUDE/COALESCE from the view, and remove the contig_id stub from normalize_schema().
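The EXCLUDE/COALESCE normalization in step 2 can be sketched in plain Python. This is an illustrative model only, using row dicts in place of parquet files; `normalize_row` is hypothetical, not part of the actual codebase, and the SQL in the comment is only the approximate shape of the view.

```python
# Illustrative sketch of the view's normalization, using row dicts in
# place of parquet files read with union_by_name=true. Column names come
# from the PR description; the function itself is hypothetical.
# The corresponding DuckDB view clause is roughly:
#   SELECT * EXCLUDE (contig_id, sequence_id),
#          COALESCE(sequence_id, contig_id) AS sequence_id
#   FROM read_parquet('hits/*.parquet', union_by_name=true)

def normalize_row(row: dict) -> dict:
    """Mimic EXCLUDE (contig_id, sequence_id) + COALESCE(...) AS sequence_id."""
    out = {k: v for k, v in row.items() if k not in ("contig_id", "sequence_id")}
    seq = row.get("sequence_id")
    out["sequence_id"] = seq if seq is not None else row.get("contig_id")
    return out

# Post-migration write: real sequence_id plus the contig_id=NULL stub.
new_row = {"sequence_id": "read_0042", "contig_id": None, "source": "gottcha2"}
# Pre-migration row: only contig_id has a value (sequence_id is surfaced
# as NULL by union_by_name when old files are mixed with new ones).
old_row = {"sequence_id": None, "contig_id": "contig_7", "source": "blast"}

print(normalize_row(new_row))  # {'source': 'gottcha2', 'sequence_id': 'read_0042'}
print(normalize_row(old_row))  # {'source': 'blast', 'sequence_id': 'contig_7'}
```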
Remove when: guards from all GOTTCHA2 processes (GOTTCHA2_PROFILE_NANOPORE, GOTTCHA2_PROFILE_ILLUMINA, GENERATE_FASTA) and LabKey validation processes (VALIDATE_GOTTCHA2_FULL_LIST, VALIDATE_GOTTCHA2_FASTA_LIST). Empty-channel gating already prevents these from executing when GOTTCHA2 isn't selected, making the when: guards redundant.

Replace the procedural if (params.labkey) block around BUNDLE_GOTTCHA2_FOR_LABKEY with a ch_labkey_gate channel ternary. When LabKey is disabled, ch_labkey_gate is Channel.empty(), so combining it with data channels produces nothing and the bundle subworkflow never executes. This keeps the DAG shape consistent regardless of LabKey configuration.

Remove the VALIDATE_GOTTCHA2_BLAST_VERIFIED_LIST process and its call from VALIDATE_LK_GOTTCHA2. This validated a LabKey list that nothing ever wrote to (the VERIFY_WITH_BLAST feature was never completed). The labkey_gottcha_blast_verified_full_list param is left in the schema as a dormant no-op, since removing it would require a major semver bump. Also remove it from LABKEY_GOTTCHA2_PARAMS in NvdUtils.groovy so it is no longer required when LabKey is enabled, and remove the commented-out VERIFY_WITH_BLAST line.
Wire CHECK_RUN_STATE and COMPLETE_RUN into the GOTTCHA2 workflow, following the stat_blast_workflow pattern. This gives GOTTCHA2 the same run-state lifecycle as BLAST: duplicate-run prevention, sample locking, hit registration to the shared parquet store, and run completion tracking.

Fix the GOTTCHA2 input in main.nf to use PREPROCESS_READS.out instead of GATHER_READS.out. Previously GOTTCHA2 received raw reads, bypassing host scrubbing and any other preprocessing. This was a latent bug since preprocessing was added.

Add REGISTER_GOTTCHA2_HITS process in modules/gottcha2.nf and bin/register_gottcha2_hits.py. The script parses taxonomy directly from GOTTCHA2 extracted FASTA headers, which encode LEVEL, NAME, and TAXID per sequence (names use underscores in place of spaces, restored during parsing). Each extracted sequence becomes a HitRecord with source="gottcha2", using the same hit key computation, sequence compression, and parquet write path as BLAST hits. The full profile TSV is accepted as an input for provenance but not consumed during registration.

Run completion gating follows the same if/else pattern as stat_blast: when LabKey is enabled, COMPLETE_RUN waits on BUNDLE_GOTTCHA2_FOR_LABKEY; otherwise it waits on REGISTER_GOTTCHA2_HITS.

Both STAT_BLAST_WORKFLOW and GOTTCHA2_WORKFLOW now independently call CHECK_RUN_STATE and COMPLETE_RUN. When both run together (--tools all), the second CHECK_RUN_STATE sees the run already registered and treats it as a resume, which is harmless. The second COMPLETE_RUN overwrites completed_at with the later timestamp, so the recorded runtime extends to whenever the slower workflow finishes. This is arguably the correct behavior. Ideally, state management would be hoisted to main.nf with ch_run_context passed into both workflows as a take: parameter, but the workflows are internally tangled enough (LabKey gating, validation subworkflows, per-workflow completion logic) that this refactor is better deferred.
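A minimal sketch of the FASTA header parsing described above, using the header layout quoted in this PR. The function name, regex, and returned field names are illustrative, not the actual register_gottcha2_hits.py API, and the example header values are invented:

```python
import re

# Hypothetical parser for headers of the form
#   >{read}|{ref}:{range} LEVEL={level} NAME={name} TAXID={taxid}
# (layout taken from the PR text; everything else is a sketch).
HEADER_RE = re.compile(
    r"^>(?P<read>[^|]+)\|(?P<ref>[^:]+):(?P<range>\S+)"
    r"\s+LEVEL=(?P<level>\S+)\s+NAME=(?P<name>\S+)\s+TAXID=(?P<taxid>\d+)"
)

def parse_gottcha2_header(header: str) -> dict:
    m = HEADER_RE.match(header)
    if m is None:
        raise ValueError(f"unrecognized GOTTCHA2 header: {header!r}")
    fields = m.groupdict()
    # NAME uses underscores in place of spaces; restore them.
    fields["name"] = fields["name"].replace("_", " ")
    return fields

hit = parse_gottcha2_header(
    ">read_001|NC_045512.2:266-21555 LEVEL=species "
    "NAME=Severe_acute_respiratory_syndrome-related_coronavirus TAXID=694009"
)
print(hit["name"])   # Severe acute respiratory syndrome-related coronavirus
print(hit["taxid"])  # 694009
```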
The GOTTCHA2 upload processes accepted --sample-set-id and called py_nvd.state for duplicate detection, but the Nextflow wiring never passed sample_set_id or state_dir, so all state tracking was dead code. Uploads also ran without gating on LabKey list validation, and used params.condor_cluster (a legacy CHTC param not in the schema) instead of workflow.runName for provenance.

VALIDATE_LK_GOTTCHA2 now emits a validated signal. BUNDLE_GOTTCHA2_FOR_LABKEY accepts experiment_id, run_id, validation_complete, and run_context as explicit value channel inputs, threading them to all upload processes. The bundle emits upload_log so COMPLETE_RUN gates on upload completion.

The ch_labkey_gate channel ternary from the previous commit is replaced with an if (params.labkey) block for the bundle call, since ch_validation_gate depends on VALIDATE_LK_GOTTCHA2.out.validated, which only exists inside that block. This matches how stat_blast_workflow.nf handles its bundle invocation.
When both BLAST and GOTTCHA2 run together (--tools all), whichever workflow's uploads finish first calls mark_sample_uploaded, setting processed_samples.status to 'uploaded'. On a re-run where only one workflow completed its uploads, CHECK_RUN_STATE sees the sample as already uploaded and can hard-abort the other workflow before it gets a chance to upload. The root cause is that get_uploaded_sample_ids queries processed_samples.status, which has no upload_type dimension. The uploads table already tracks upload_type (blast, blast_fasta, gottcha2, gottcha2_fasta) and is the actual source of truth for what was uploaded. This commit switches get_uploaded_sample_ids to query the uploads table instead, filtered by workflow-specific upload types passed from each Nextflow workflow through CHECK_RUN_STATE. mark_sample_uploaded becomes dead code and is removed.
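The shape of this fix can be sketched as follows. This assumes an SQLite-style state store and simplified table columns for illustration; the actual py_nvd.state backend, schema, and function signature may differ:

```python
import sqlite3

# Illustrative sketch: query the uploads table filtered by
# workflow-specific upload types, instead of a single status flag on
# processed_samples. Table and column names follow the PR text; the
# SQLite backend and this exact signature are assumptions.

def get_uploaded_sample_ids(conn, sample_set_id, upload_types):
    placeholders = ",".join("?" for _ in upload_types)
    rows = conn.execute(
        f"SELECT DISTINCT sample_id FROM uploads "
        f"WHERE sample_set_id = ? AND upload_type IN ({placeholders})",
        [sample_set_id, *upload_types],
    )
    return {sample_id for (sample_id,) in rows}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uploads (sample_set_id TEXT, sample_id TEXT, upload_type TEXT)"
)
conn.executemany(
    "INSERT INTO uploads VALUES (?, ?, ?)",
    [("set1", "s1", "blast"), ("set1", "s1", "blast_fasta"), ("set1", "s2", "gottcha2")],
)

# The BLAST workflow asks only about BLAST upload types, so s2's
# GOTTCHA2 upload no longer makes it look "already uploaded".
print(get_uploaded_sample_ids(conn, "set1", ["blast", "blast_fasta"]))  # {'s1'}
```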
Summary
GOTTCHA2 hits now coexist with BLAST hits in the same parquet store, and the GOTTCHA2 workflow participates in the same run-state lifecycle (duplicate-run prevention, sample locking, hit registration, upload tracking, run completion) as BLAST. This also fixes a cross-workflow bug where uploading in one workflow could cause the other to incorrectly skip samples on re-run.
This PR is expected to receive additional commits before merge.
Commit walkthrough
The PR is structured as five sequential commits, each building on the last. Reviewing commit-by-commit is recommended.
1. Evolve hit schema: add source field, rename contig_id to sequence_id

Adds a source field ("blast" | "gottcha2") to HitRecord and HitObservation so hits from different tools share one parquet store. Renames contig_id → sequence_id since GOTTCHA2 hits are individual reads, not assembled contigs. Backward compatibility with pre-migration parquet files uses a three-part strategy: new writes include both sequence_id and a contig_id=NULL stub; the DuckDB view uses COALESCE(sequence_id, contig_id); compaction normalizes via normalize_schema().

Files: hits.py, models.py, register_hits.py, commands/hits.py, and their tests.

2. Refactor GOTTCHA2 workflow to declarative style and remove dead code
Removes when: guards from GOTTCHA2 processes (empty-channel gating already handles this). Replaces the procedural if (params.labkey) block around the bundle with a channel ternary. Removes the never-completed VERIFY_WITH_BLAST feature and its validation process.

Files: gottcha2_workflow.nf, gottcha2.nf, validate_lk_gottcha2_lists.nf, NvdUtils.groovy.

3. Add state management and GOTTCHA2 hit registration
Wires CHECK_RUN_STATE and COMPLETE_RUN into the GOTTCHA2 workflow. Adds REGISTER_GOTTCHA2_HITS, which parses taxonomy from GOTTCHA2 extracted FASTA headers (format: >{read}|{ref}:{range} LEVEL={level} NAME={name} TAXID={taxid}) and writes HitRecords with source="gottcha2".

Also fixes a latent bug: GOTTCHA2 was receiving raw reads from GATHER_READS.out instead of preprocessed reads from PREPROCESS_READS.out, bypassing host scrubbing.

Files: main.nf, gottcha2_workflow.nf, gottcha2.nf, new register_gottcha2_hits.py + 18 tests.

4. Fully integrate GOTTCHA2 LabKey uploads into state management
The GOTTCHA2 upload scripts already accepted --sample-set-id and called py_nvd.state, but the Nextflow wiring never passed these values, so all state tracking was dead code. This commit threads run_context (sample_set_id + state_dir) through the bundle subworkflow to all upload processes, adds validation gating, and replaces the legacy params.condor_cluster with workflow.runName.

Files: bundle_gottcha2_for_labkey.nf, validate_lk_gottcha2_lists.nf, gottcha2_workflow.nf, labkey_upload_gottcha2_full.py, labkey_upload_gottcha2_fasta.py.

5. Fix cross-workflow upload status contamination in CHECK_RUN_STATE
When both workflows run together (--tools all), whichever finishes uploading first called mark_sample_uploaded, setting processed_samples.status = 'uploaded' with no upload-type dimension. On re-run, the other workflow would see samples as "already uploaded" and could hard-abort.

Fix: get_uploaded_sample_ids now queries the uploads table (which already has upload_type granularity) filtered by workflow-specific types passed from Nextflow. mark_sample_uploaded is removed entirely; record_upload is the sole source of truth.

Files: state.py, check_run_state.py, utils.nf, both workflow files, all four upload scripts, test_state.py.

Testing
915 passed, 27 pre-existing failures (unrelated mock_taxonomy_open tests), 0 new failures, 0 new lint warnings.