
Long processing time for large dataset (92 FASTQs, 71 GB) #8

@katievigil

Hi Marti,

Thank you for developing this very helpful tool! I recently set it up on our HPC, running DIAMOND blastx against the NCBI viral protein database, and it's running smoothly.

I’m working with 92 FASTQ files (~71 GB total), and the job has been running for about 9 days without completing yet. I was wondering if you have any recommendations for speeding up the analysis, or if you could provide an estimate of how long a dataset of this size typically takes to process?

Thanks so much for your help!

#### SLURM script

```bash
#!/bin/bash
#SBATCH --job-name=marti_viral
#SBATCH --account=loni_virome2025
#SBATCH --partition=single
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=7-00:00:00
#SBATCH --output=slurm-%j.out-%N
#SBATCH --error=slurm-%j.err-%N
#SBATCH -D /work/kvigil/marti_out/ONR_viral

# --- environment ---
set -euo pipefail
echo "== SLURM info =="
echo "JobID: $SLURM_JOB_ID Node: $(hostname)"
echo "CPUs: $SLURM_CPUS_PER_TASK"

# MARTi & DIAMOND on PATH (adjust if needed)
export PATH="$HOME/MARTi/bin:$PATH"
which diamond && diamond --version || { echo "diamond not found on PATH"; exit 1; }

# Fast node-local temp; DIAMOND uses this
export TMPDIR="/work/kvigil/tmp/${SLURM_JOB_ID}"
mkdir -p "$TMPDIR"

# Config file
CONF="$HOME/marti_viral_diamond_longreads.conf"

# Sanity check: thread config should match cpus-per-task (2 jobs x 8 threads = 16)
grep -E 'LocalSchedulerMaxJobs|BlastThreads' "$CONF" || true

# Recommended: resume safely from any partial state
# rm -f /work/kvigil/marti_out/ONR_viral/progress.info  # uncomment for a clean restart

echo "== Running MARTi =="
marti -config "$CONF" -loglevel
```
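If wall time matters more than node count, one common HPC pattern is to spread the 92 FASTQs across a SLURM job array so files are processed in parallel rather than serially on a single node. A hypothetical sketch only (`FASTQ_DIR`, the array bounds, and per-file invocation are assumptions, not taken from the script above):

```shell
#!/bin/bash
# Hypothetical array-job sketch: one task per FASTQ file (assumed layout).
#SBATCH --array=0-91              # 92 files -> task indices 0..91
FASTQ_DIR="/work/kvigil/fastq"    # assumed input directory

# Build a stable, sorted file list so every task sees the same ordering,
# then pick the file that corresponds to this task's index.
mapfile -t FILES < <(ls "$FASTQ_DIR"/*.fastq | sort)
FQ="${FILES[$SLURM_ARRAY_TASK_ID]}"
echo "Task $SLURM_ARRAY_TASK_ID -> $FQ"
# ...run the classification step on "$FQ" here...
```

Each array task would then point at its single file; whether MARTi supports a clean per-file (or per-barcode-directory) invocation like this is worth confirming in its documentation before restructuring the run.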

#### marti_viral_diamond_longreads.conf

```
ProcessBarcodes:

Scheduler:local
LocalSchedulerMaxJobs:2

InactivityTimeout:10
StopProcessingAfter:0

TaxonomyDir:/work/kvigil/db/taxdump
LCAMaxHits:20
LCAScorePercent:90
LCAMinIdentity:60
LCAMinQueryCoverage:0
LCAMinCombinedScore:0
LCAMinLength:50

ConvertFastQ
ReadsPerBlast:8000
ReadFilterMinQ:9
ReadFilterMinLength:500

BlastProcess
Name:diamond-nr
Program:diamond
Database:/work/kvigil/db/viral_proteins_tax.dmnd
MaxE:0.001
MaxTargetSeqs:100
BlastThreads:8
UseToClassify
Options: --ultra-sensitive --long-reads --frameshift 15 --range-culling --outfmt 6
```
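For a rough wall-time sanity check: with `ReadsPerBlast:8000`, MARTi queues one DIAMOND chunk per 8000 filtered reads, and `LocalSchedulerMaxJobs:2` means only two chunks run concurrently. A back-of-envelope sketch (the per-file read count below is an assumption, not a measurement):

```shell
# Back-of-envelope chunk count for the run above.
reads_per_blast=8000          # ReadsPerBlast from the config
files=92
reads_per_file=500000         # assumed; check with: awk 'END{print NR/4}' file.fastq
total_reads=$(( files * reads_per_file ))
chunks=$(( (total_reads + reads_per_blast - 1) / reads_per_blast ))
echo "$chunks chunks; with 2 concurrent jobs that's $(( (chunks + 1) / 2 )) sequential waves"
```

Under these assumed numbers that is thousands of sequential DIAMOND waves, which may explain the multi-day runtime; raising `LocalSchedulerMaxJobs` (with `BlastThreads` adjusted so their product still matches `--cpus-per-task`) would shorten it proportionally.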
