
Integrate GOTTCHA2 hits into shared state management#17

Open
nrminor wants to merge 5 commits into schema-v2.5.0 from gottcha2-refactor

Conversation


@nrminor nrminor commented Feb 13, 2026

Summary

GOTTCHA2 hits now coexist with BLAST hits in the same parquet store, and the GOTTCHA2 workflow participates in the same run-state lifecycle (duplicate-run prevention, sample locking, hit registration, upload tracking, run completion) as BLAST. This also fixes a cross-workflow bug where uploading in one workflow could cause the other to incorrectly skip samples on re-run.

This PR is expected to receive additional commits before merge.

Commit walkthrough

The PR is structured as five sequential commits, each building on the last. Reviewing commit-by-commit is recommended.

1. Evolve hit schema: add source field, rename contig_id to sequence_id

Adds a source field ("blast" | "gottcha2") to HitRecord and HitObservation so hits from different tools share one parquet store. Renames contig_id to sequence_id since GOTTCHA2 hits are individual reads, not assembled contigs.

Backward compatibility with pre-migration parquet files uses a three-part strategy: new writes include both sequence_id and a contig_id=NULL stub; the DuckDB view uses COALESCE(sequence_id, contig_id); compaction normalizes via normalize_schema().

Files: hits.py, models.py, register_hits.py, commands/hits.py, and their tests.

2. Refactor GOTTCHA2 workflow to declarative style and remove dead code

Removes when: guards from GOTTCHA2 processes (empty-channel gating already handles this). Replaces the procedural if (params.labkey) block around the bundle with a channel ternary. Removes the never-completed VERIFY_WITH_BLAST feature and its validation process.

Files: gottcha2_workflow.nf, gottcha2.nf, validate_lk_gottcha2_lists.nf, NvdUtils.groovy.

3. Add state management and GOTTCHA2 hit registration

Wires CHECK_RUN_STATE and COMPLETE_RUN into the GOTTCHA2 workflow. Adds REGISTER_GOTTCHA2_HITS, which parses taxonomy from GOTTCHA2 extracted FASTA headers (format: >{read}|{ref}:{range} LEVEL={level} NAME={name} TAXID={taxid}) and writes HitRecords with source="gottcha2".

Also fixes a latent bug: GOTTCHA2 was receiving raw reads from GATHER_READS.out instead of preprocessed reads from PREPROCESS_READS.out, bypassing host scrubbing.

Files: main.nf, gottcha2_workflow.nf, gottcha2.nf, new register_gottcha2_hits.py + 18 tests.

4. Fully integrate GOTTCHA2 LabKey uploads into state management

The GOTTCHA2 upload scripts already accepted --sample-set-id and called py_nvd.state, but the Nextflow wiring never passed these values — all state tracking was dead code. This commit threads run_context (sample_set_id + state_dir) through the bundle subworkflow to all upload processes, adds validation gating, and replaces the legacy params.condor_cluster with workflow.runName.

Files: bundle_gottcha2_for_labkey.nf, validate_lk_gottcha2_lists.nf, gottcha2_workflow.nf, labkey_upload_gottcha2_full.py, labkey_upload_gottcha2_fasta.py.

5. Fix cross-workflow upload status contamination in CHECK_RUN_STATE

When both workflows ran together (--tools all), whichever finished uploading first called mark_sample_uploaded, setting processed_samples.status = 'uploaded' with no upload-type dimension. On re-run, the other workflow would see samples as "already uploaded" and could hard-abort.

Fix: get_uploaded_sample_ids now queries the uploads table (which already has upload_type granularity) filtered by workflow-specific types passed from Nextflow. mark_sample_uploaded is removed entirely; record_upload is the sole source of truth.

Files: state.py, check_run_state.py, utils.nf, both workflow files, all four upload scripts, test_state.py.

Testing

915 passed, 27 pre-existing failures (unrelated mock_taxonomy_open tests), 0 new failures, 0 new lint warnings.

Add a source field ("blast" or "gottcha2") to HitRecord and HitObservation so hits from different identification tools can coexist in the same parquet store. Rename contig_id to sequence_id to generalize beyond BLAST contigs (GOTTCHA2 hits are individual reads, not assembled contigs).

Backward compatibility with pre-migration parquet files uses a three-part strategy:

1. New writes include both sequence_id (the real value) and contig_id=NULL. This ensures DuckDB's union_by_name=true always surfaces both columns when reading a mix of old and new files. (union_by_name only creates columns present in at least one file — if all files lack sequence_id, referencing it in SQL is a binder error that TRY() cannot catch.)

2. The DuckDB view uses EXCLUDE (contig_id, sequence_id) + COALESCE(sequence_id, contig_id) AS sequence_id, so downstream queries see a single normalized sequence_id column regardless of file vintage.

3. Compaction calls normalize_schema() which renames contig_id to sequence_id, adds source="blast" default for legacy rows, and retains a NULL contig_id stub so the EXCLUDE/COALESCE pattern continues to work even when reading only compacted files.

Once all pre-migration parquet files have been compacted and no legacy data remains on disk, a future cleanup can stop writing contig_id=NULL in new files, drop the EXCLUDE/COALESCE from the view, and remove the contig_id stub from normalize_schema().
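As an illustrative sketch only (not the project's actual normalize_schema, whose signature and row representation may differ), the compaction-time normalization in step 3 could look like this, with a hit row represented as a plain dict:

```python
# Hypothetical sketch of the compaction-time normalization described above.
# Field names beyond contig_id / sequence_id / source are assumptions.

def normalize_schema(row: dict) -> dict:
    """Normalize a legacy (pre-migration) hit row to the current schema."""
    out = dict(row)
    # 1. Rename: legacy rows carry only contig_id, so move it to sequence_id.
    if out.get("sequence_id") is None:
        out["sequence_id"] = out.get("contig_id")
    # 2. Legacy rows predate the GOTTCHA2 integration, so they are BLAST hits.
    out.setdefault("source", "blast")
    # 3. Keep a NULL contig_id stub so the view's EXCLUDE/COALESCE pattern
    #    still binds even when reading only compacted files.
    out["contig_id"] = None
    return out

legacy = {"contig_id": "contig_42", "taxid": 10239}
row = normalize_schema(legacy)
assert row == {
    "contig_id": None,
    "taxid": 10239,
    "sequence_id": "contig_42",
    "source": "blast",
}
```

The NULL stub in step 3 mirrors the write path: as long as at least one file in a scan carries each column, DuckDB's union_by_name can surface both, and the COALESCE collapses them back to one.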
Remove when: guards from all GOTTCHA2 processes (GOTTCHA2_PROFILE_NANOPORE, GOTTCHA2_PROFILE_ILLUMINA, GENERATE_FASTA) and LabKey validation processes (VALIDATE_GOTTCHA2_FULL_LIST, VALIDATE_GOTTCHA2_FASTA_LIST). Empty-channel gating already prevents these from executing when GOTTCHA2 isn't selected, making the when: guards redundant.

Replace the procedural if (params.labkey) block around BUNDLE_GOTTCHA2_FOR_LABKEY with a ch_labkey_gate channel ternary. When LabKey is disabled, ch_labkey_gate is Channel.empty(), so combining it with data channels produces nothing and the bundle subworkflow never executes. This keeps the DAG shape consistent regardless of LabKey configuration.

Remove the VALIDATE_GOTTCHA2_BLAST_VERIFIED_LIST process and its call from VALIDATE_LK_GOTTCHA2. This validated a LabKey list that nothing ever wrote to (the VERIFY_WITH_BLAST feature was never completed). The labkey_gottcha_blast_verified_full_list param is left in the schema as a dormant no-op since removing it would require a major semver bump. Also remove it from LABKEY_GOTTCHA2_PARAMS in NvdUtils.groovy so it is no longer required when LabKey is enabled.

Remove the commented-out VERIFY_WITH_BLAST line.
Wire CHECK_RUN_STATE and COMPLETE_RUN into the GOTTCHA2 workflow, following the stat_blast_workflow pattern. This gives GOTTCHA2 the same run-state lifecycle as BLAST: duplicate-run prevention, sample locking, hit registration to the shared parquet store, and run completion tracking.

Fix the GOTTCHA2 input in main.nf to use PREPROCESS_READS.out instead of GATHER_READS.out. Previously GOTTCHA2 received raw reads, bypassing host scrubbing and any other preprocessing. This was a latent bug since preprocessing was added.

Add REGISTER_GOTTCHA2_HITS process in modules/gottcha2.nf and bin/register_gottcha2_hits.py. The script parses taxonomy directly from GOTTCHA2 extracted FASTA headers, which encode LEVEL, NAME, and TAXID per sequence (names use underscores in place of spaces, restored during parsing). Each extracted sequence becomes a HitRecord with source="gottcha2", using the same hit key computation, sequence compression, and parquet write path as BLAST hits. The full profile TSV is accepted as an input for provenance but not consumed during registration.
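A minimal sketch of the header parsing described above, assuming the documented format >{read}|{ref}:{range} LEVEL={level} NAME={name} TAXID={taxid}; the regex and field names here are illustrative assumptions, not the actual register_gottcha2_hits.py implementation:

```python
import re

# Assumed header shape: >{read}|{ref}:{range} LEVEL={level} NAME={name} TAXID={taxid}
HEADER_RE = re.compile(
    r"^>(?P<read>[^|]+)\|(?P<ref>[^:]+):(?P<range>\S+)"
    r"\s+LEVEL=(?P<level>\S+)\s+NAME=(?P<name>\S+)\s+TAXID=(?P<taxid>\d+)$"
)

def parse_header(header: str) -> dict:
    """Extract taxonomy fields from one GOTTCHA2 extracted-FASTA header."""
    m = HEADER_RE.match(header)
    if m is None:
        raise ValueError(f"unparseable GOTTCHA2 header: {header!r}")
    fields = m.groupdict()
    # Names use underscores in place of spaces; restore them during parsing.
    fields["name"] = fields["name"].replace("_", " ")
    return fields

h = ">read001|NC_001477.1:100-250 LEVEL=species NAME=Dengue_virus TAXID=12637"
print(parse_header(h)["name"])  # Dengue virus
```

Each parsed header would then feed the same hit-key computation and parquet write path as BLAST hits, with source="gottcha2".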

Run completion gating follows the same if/else pattern as stat_blast: when LabKey is enabled, COMPLETE_RUN waits on BUNDLE_GOTTCHA2_FOR_LABKEY; otherwise it waits on REGISTER_GOTTCHA2_HITS.

Both STAT_BLAST_WORKFLOW and GOTTCHA2_WORKFLOW now independently call CHECK_RUN_STATE and COMPLETE_RUN. When both run together (--tools all), the second CHECK_RUN_STATE sees the run already registered and treats it as a resume, which is harmless. The second COMPLETE_RUN overwrites completed_at with the later timestamp, so the recorded runtime extends to whenever the slower workflow finishes. This is arguably the correct behavior. Ideally, state management would be hoisted to main.nf with ch_run_context passed into both workflows as a take: parameter, but the workflows are internally tangled enough (LabKey gating, validation subworkflows, per-workflow completion logic) that this refactor is better deferred.

The GOTTCHA2 upload processes accepted --sample-set-id and called py_nvd.state for duplicate detection, but the Nextflow wiring never passed sample_set_id or state_dir, so all state tracking was dead code. Uploads also ran without gating on LabKey list validation, and used params.condor_cluster (a legacy CHTC param not in the schema) instead of workflow.runName for provenance.

VALIDATE_LK_GOTTCHA2 now emits a validated signal. BUNDLE_GOTTCHA2_FOR_LABKEY accepts experiment_id, run_id, validation_complete, and run_context as explicit value channel inputs, threading them to all upload processes. The bundle emits upload_log so COMPLETE_RUN gates on upload completion.

The ch_labkey_gate channel ternary from the previous commit is replaced with an if (params.labkey) block for the bundle call, since ch_validation_gate depends on VALIDATE_LK_GOTTCHA2.out.validated which only exists inside that block. This matches how stat_blast_workflow.nf handles its bundle invocation.
When both BLAST and GOTTCHA2 run together (--tools all), whichever workflow's uploads finish first calls mark_sample_uploaded, setting processed_samples.status to 'uploaded'. On a re-run where only one workflow completed its uploads, CHECK_RUN_STATE sees the sample as already uploaded and can hard-abort the other workflow before it gets a chance to upload.

The root cause is that get_uploaded_sample_ids queries processed_samples.status, which has no upload_type dimension. The uploads table already tracks upload_type (blast, blast_fasta, gottcha2, gottcha2_fasta) and is the actual source of truth for what was uploaded. This commit switches get_uploaded_sample_ids to query the uploads table instead, filtered by workflow-specific upload types passed from each Nextflow workflow through CHECK_RUN_STATE. mark_sample_uploaded becomes dead code and is removed.
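A minimal sketch of the fixed lookup, using sqlite3 and a simplified uploads table for illustration; the real py_nvd.state schema, backend, and function signatures may differ:

```python
import sqlite3

def get_uploaded_sample_ids(conn, upload_types):
    """Return sample IDs with a recorded upload of any of the given types.

    Queries the uploads table (the source of truth, with upload_type
    granularity) rather than a type-less processed_samples.status flag.
    """
    placeholders = ",".join("?" * len(upload_types))
    rows = conn.execute(
        f"SELECT DISTINCT sample_id FROM uploads "
        f"WHERE upload_type IN ({placeholders})",
        list(upload_types),
    )
    return {r[0] for r in rows}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploads (sample_id TEXT, upload_type TEXT)")
conn.executemany(
    "INSERT INTO uploads VALUES (?, ?)",
    [("s1", "blast"), ("s1", "blast_fasta"), ("s2", "gottcha2")],
)
# A GOTTCHA2 re-run filters on GOTTCHA2 upload types only, so s1's BLAST
# uploads no longer cause it to be skipped.
print(get_uploaded_sample_ids(conn, ["gottcha2", "gottcha2_fasta"]))  # {'s2'}
```

Because each workflow passes only its own upload types through CHECK_RUN_STATE, one workflow's completed uploads can no longer mask the other's pending ones.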