Bambu discovery mode failing for large dataset

Hello,

I’m running bambu (v3.9.0) on 566 Nanopore cDNA BAMs aligned to GRCh38.
Quantification-only (discovery = FALSE) completes fine, but enabling discovery = TRUE keeps failing with a BiocParallel error, usually at different samples (e.g. sample 103 or 108). I've previously run bambu successfully for ~400 samples so I'm not sure what has changed on my end.

**Environment:**
- R version: R 4.5.1
- bambu version: 3.9.0 (also tested latest GitHub)
- Bioconductor: 3.21
- OS: Linux (HPC SLURM cluster)
- Threads: tested both ncore=16 and ncore=1

**Example script**
```
suppressPackageStartupMessages({
  library(bambu)
  library(Rsamtools)
  library(GenomeInfoDb)
})

fa   <- "references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa"
gtf  <- "references/gencode.v49.annotation.gtf"
bams <- sort(Sys.glob("minimap2/*.sorted.bam"))

ann <- prepareAnnotations(gtf)
seqlevelsStyle(ann) <- "UCSC"

outdir <- "bambu"
rcdir  <- file.path(outdir, "rcOutDir")
dir.create(rcdir, recursive = TRUE, showWarnings = FALSE)

res <- bambu(
  reads       = BamFileList(bams, asMates = FALSE, yieldSize = 1e7),
  annotations = ann,
  genome      = fa,
  rcOutDir    = rcdir,
  ncore       = 1,
  yieldSize   = 1e7,
  discovery   = TRUE,
  verbose     = TRUE
)
```

**Error output**
```
Running Bambu-v3.9.0
--- Start generating read class files ---
sample_1.sorted.bam
Number of alignments/reads: 14469864
prediction accuracy (CV) (higher for splice donor than splice acceptor)
Error: BiocParallel errors
  1 remote errors, element index: 1
  565 unevaluated and other errors
  first remote error:
Error in fisher.test(table(predictions > 0.5, labels.train.cv.test)): 
  'x' must have at least 2 rows and columns
```

Occasionally the error message is:
```
Error in bambu.processReadsByFile(...): 
  not all chromosomes present in reference annotations, annotations might be incomplete.
```
But I’ve verified that BAMs, FASTA, and GTF share the expected UCSC-style seqlevels and pruning with keepSeqlevels does not resolve the fisher.test error.

**Notes:**
- The same BAMs and reference/annotation work fine in quantification-only mode.
- The crash occurs at different BAMs on different runs, so it doesn’t seem to be a single corrupted file.
- Increasing min.readCount and min.sampleNumber in opt.discovery did not prevent the error.
- Running with ncore=1 produces the same error, so it doesn’t appear to be a threading issue.

**Questions:**
- Is this a known issue with the discovery step’s splice motif model (fisher.test)?
- Do you have recommendations on specific discovery parameters, or a workaround, to avoid this error? I can't see any issues reporting this, so I would appreciate your help!

**R and bash scripts, and standard output using stable and updated releases:**
- [bambu_bioconductor.txt](https://github.com/user-attachments/files/22683810/bambu_bioconductor.txt)
- [bambu_bioconductor.sh](https://github.com/user-attachments/files/22683815/bambu_bioconductor.sh)
- [bambu_bioconductor_28814141.txt](https://github.com/user-attachments/files/22683808/bambu_bioconductor_28814141.txt)
- [bambu_github.txt](https://github.com/user-attachments/files/22683807/bambu_github.txt)
- [bambu_github.sh](https://github.com/user-attachments/files/22683813/bambu_github.sh)
- [bambu_github_28814145.txt](https://github.com/user-attachments/files/22683814/bambu_github_28814145.txt)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bambu discovery mode failing for large dataset #508

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bambu discovery mode failing for large dataset #508

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions