@samarth8392
Contributor

Changes

XAVIER could fail at Snakefile parse time with:
Fatal: Either a valid pairs file or sample names must be provided.
Sample names provided: set()

even when valid *.fastq.gz inputs were provided and tumor-only mode should have proceeded.

Root cause

Sample discovery depends on name_symlinks, which is normally created by sym_safe() symlinking discovered inputs into:
input_files/fastq/ (FASTQ mode) or
input_files/bam/ (BAM mode)

However, the Snakefile logic previously did this:
If input_files/fastq existed, it only globbed input_files/fastq/*.fastq.gz and did not run sym_safe() again.
If the directory existed but was empty (e.g., from a partial init/failed run/manual mkdir), then name_symlinks=[] → samples=set() → read_pairsfile() raised the fatal error before any rules executed.
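
The failure mode can be reproduced outside the pipeline. The following is a minimal sketch, not the actual Snakefile code: `discover_old` and the way sample names are derived from file names are assumptions made for illustration, and the `sym_safe()` branch is stubbed out.

```python
import glob
import os
import tempfile


def discover_old(input_fqdir):
    """Sketch of the previous behaviour (hypothetical helper name)."""
    if os.path.isdir(input_fqdir):
        # Directory exists, so it is only globbed; symlinks are never recreated.
        return glob.glob(os.path.join(input_fqdir, "*.fastq.gz"))
    # The real Snakefile would call sym_safe() here; stubbed out in this sketch.
    return []


with tempfile.TemporaryDirectory() as workdir:
    stale_dir = os.path.join(workdir, "input_files", "fastq")
    os.makedirs(stale_dir)  # simulates a partial init, failed run, or manual mkdir
    name_symlinks = discover_old(stale_dir)
    # Assumed naming rule for illustration: sample name = text before the first dot.
    samples = {os.path.basename(p).split(".")[0] for p in name_symlinks}
    print(name_symlinks, samples)  # prints: [] set()
```

With an empty `samples` set and no pairs file, the check described above has nothing to work with, which matches the `Sample names provided: set()` line in the error.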

Issues

Harden sample discovery to repopulate symlinks when the input directory exists but contains no files:
Always os.makedirs(input_fqdir, exist_ok=True) / os.makedirs(input_bamdir, exist_ok=True)
Prefer existing symlinks when present
If globbing the directory returns empty, call sym_safe(...) to (re)populate it
Add an early, actionable error message if samples still cannot be inferred

This makes tumor-only runs robust to stale/empty input_files/* directories and prevents parse-time failure.
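
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the merged code: `link_inputs` is a hypothetical stand-in for `sym_safe()` (whose real signature is not shown in this PR), and `discover_inputs` and the error wording are invented for the example.

```python
import glob
import os
import tempfile


def link_inputs(input_files, target_dir):
    """Hypothetical stand-in for sym_safe(): symlink each input into target_dir."""
    for src in input_files:
        dest = os.path.join(target_dir, os.path.basename(src))
        if not os.path.lexists(dest):  # never clobber an existing link
            os.symlink(os.path.abspath(src), dest)


def discover_inputs(input_files, input_fqdir):
    """Return FASTQ symlinks, repopulating an empty input directory if needed."""
    pattern = os.path.join(input_fqdir, "*.fastq.gz")
    os.makedirs(input_fqdir, exist_ok=True)     # always ensure the directory exists
    name_symlinks = sorted(glob.glob(pattern))  # prefer existing symlinks
    if not name_symlinks:                       # empty or stale directory
        link_inputs(input_files, input_fqdir)
        name_symlinks = sorted(glob.glob(pattern))
    if not name_symlinks:                       # fail early with an actionable message
        raise ValueError(
            f"No *.fastq.gz files found in {input_fqdir}; check the --input paths."
        )
    return name_symlinks


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "sampleA.R1.fastq.gz")
    open(raw, "w").close()                      # placeholder input file
    fqdir = os.path.join(workdir, "input_files", "fastq")
    os.makedirs(fqdir)                          # stale, empty directory
    found = [os.path.basename(p) for p in discover_inputs([raw], fqdir)]
    print(found)  # prints: ['sampleA.R1.fastq.gz']
```

Globbing again after linking, instead of trusting the helper's return value, keeps a single source of truth for what is actually in the directory.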

Fixes #172
Fixes #173

PR Checklist

(Strikethrough any points that are not applicable.)

  • This comment contains a description of changes with justifications, with any relevant issues linked.
  • [ ] Update docs if there are any API changes.
  • Update CHANGELOG.md with a short description of any user-facing changes and reference the PR number. Guidelines: https://keepachangelog.com/en/1.1.0/

@kelly-sovacool
Member

Looks good @samarth8392! Can you just confirm that you tested this branch on biowulf and it works as expected now?

@samarth8392
Contributor Author

Yes, I ran this branch on the test dataset:

./bin/xavier run \
	--input ./tests/data/*R?*.fastq.gz \
	--output $MAINDIR/runs/xavier_hg38 \
	--genome hg38 \
	--mode slurm \
	--runmode run

and I can confirm that the bam_check jobs completed without issues:

 $ grep "bam_check" snakemake.log.jobby.short.tsv | less

bam_check.samples=WES_NC_N_1_sub        COMPLETED       
bam_check.samples=WES_NC_T_1_sub        COMPLETED

@samarth8392
Contributor Author

This PR also fixes #173, since I updated the sample names in the test data pairs.tsv file.

Member

@kelly-sovacool kelly-sovacool left a comment

excellent, thanks for these fixes Samarth!

@kelly-sovacool kelly-sovacool merged commit 52a4a22 into main Jan 29, 2026
6 checks passed
@kelly-sovacool kelly-sovacool deleted the iss-172 branch January 29, 2026 20:43
Development

Successfully merging this pull request may close these issues.

RuleException in rule collect_cohort_mafs when using --pairs
Fatal: Either a valid pairs file or sample names must be provided