Skip to content

Bambu discovery mode failing for large dataset #508

@Max-Tomlinson

Description

@Max-Tomlinson

Hello,

I’m running bambu (v3.9.0) on 566 Nanopore cDNA BAMs aligned to GRCh38.
Quantification-only (discovery = FALSE) completes fine, but enabling discovery = TRUE keeps failing with a BiocParallel error, usually at different samples (e.g. sample 103 or 108). I've previously run bambu successfully for ~400 samples so I'm not sure what has changed on my end.

Environment:

  • R version: R 4.5.1
  • bambu version: 3.9.0 (also tested latest GitHub)
  • Bioconductor: 3.21
  • OS: Linux (HPC SLURM cluster)
  • Threads: tested both ncore=16 and ncore=1

Example script

suppressPackageStartupMessages({
  library(bambu)
  library(Rsamtools)
  library(GenomeInfoDb)
})

fa   <- "references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa"
gtf  <- "references/gencode.v49.annotation.gtf"
bams <- sort(Sys.glob("minimap2/*.sorted.bam"))

ann <- prepareAnnotations(gtf)
seqlevelsStyle(ann) <- "UCSC"

outdir <- "bambu"
rcdir  <- file.path(outdir, "rcOutDir")
dir.create(rcdir, recursive = TRUE, showWarnings = FALSE)

res <- bambu(
  reads       = BamFileList(bams, asMates = FALSE, yieldSize = 1e7),
  annotations = ann,
  genome      = fa,
  rcOutDir    = rcdir,
  ncore       = 1,
  yieldSize   = 1e7,
  discovery   = TRUE,
  verbose     = TRUE
)

Error output

Running Bambu-v3.9.0
--- Start generating read class files ---
sample_1.sorted.bam
Number of alignments/reads: 14469864
prediction accuracy (CV) (higher for splice donor than splice acceptor)
Error: BiocParallel errors
  1 remote errors, element index: 1
  565 unevaluated and other errors
  first remote error:
Error in fisher.test(table(predictions > 0.5, labels.train.cv.test)): 
  'x' must have at least 2 rows and columns

Occasionally the error message is:

Error in bambu.processReadsByFile(...): 
  not all chromosomes present in reference annotations, annotations might be incomplete.

But I’ve verified that BAMs, FASTA, and GTF share the expected UCSC-style seqlevels and pruning with keepSeqlevels does not resolve the fisher.test error.

Notes:

  • The same BAMs and reference/annotation work fine in quantification-only mode.
  • The crash occurs at different BAMs on different runs, so it doesn’t seem to be a single corrupted file.
  • Increasing min.readCount and min.sampleNumber in opt.discovery did not prevent the error.
  • Running with ncore=1 produces the same error, so it doesn’t appear to be a threading issue.

Questions:

  • Is this a known issue with the discovery step’s splice motif model (fisher.test)?
  • Do you have recommendations on specific discovery parameters, or a workaround, to avoid this error? I can't see any issues reporting this, so I would appreciate your help!

R and bash scripts, and standard output using stable and updated releases:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions