Description
I followed the installation and tutorial instructions, but I run into an error when I try to run the Site Frequency Spectrum (SFS) routine.
angsd-wrapper SFS ./Site_Frequency_Spectrum_Config
This is my output. It appears to fail when it tries to find the file needed to fold (or not fold) the spectrum.
WRAPPER: Zipping advanced arguments onto basic ones
-> angsd version: 0.911-44-g1c0ebb6 (htslib: 1.3.1-30-gbb03b02) build(Oct 31 2021 11:04:52)
-> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
-> Reading fasta: /mnt/steelhead/remote/Sophie/Programs/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
-> (Using Filipe G Vieira modification of: abcSaf.cpp)
-> Parsing 11 number of samples
-> Region lookup 1/1
-> We have now allocated approximately 10 Megabytes of raw nodes to the nodepool
-> Printing at chr: 10 pos:17551496 chunknumber 1100
-> We have now allocated approximately 20 Megabytes of raw nodes to the nodepool
-> Printing at chr: 10 pos:19386992 chunknumber 2000 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
-> Printing at chr: 10 pos:22395913 chunknumber 3200 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
[emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=10
-> Printing at chr: 10 pos:24004662 chunknumber 3600 [emFrequency_F] caught nan will not exit
logLike (3*nInd). nInd=11
keepList (nInd)
used logLike (3*length(keep))=11
-> Printing at chr: 10 pos:24908040 chunknumber 4000
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> npools:26 unfreed tnodes before clean:0
-> Output filenames:
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.arg"
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.mafs.gz"
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.geno.gz"
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz"
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz"
->"/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx"
-> Sun Oct 31 12:08:56 2021
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 199.08 sec
[ALL done] walltime used = 130.00 sec
-> Version of fname:/mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.idx is:2
-> Assuming .saf.gz file: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.gz
-> Assuming .saf.pos.gz: /mnt/steelhead/remote/Sophie/scratch/Maize/SFS/Maize_SFSOut.saf.pos.gz
-> Problem opening file: '-fold'
Looking at the wrapper shell script (Site_Frequency_Spectrum.sh), it appears to fail in the final section, at the realSFS call whose output is redirected to the final file. That file does get created in my scratch directory, but it is empty.
#!/usr/bin/env bash
set -e
set -o pipefail
# Load variables from supplied config file
source "$1"
# Are we using Common_Config? If so, source it
if [[ -f "${COMMON}" ]]
then
    source "${COMMON}"
fi
# Where is angsd-wrapper located?
SOURCE=$2
# Where is ANGSD?
ANGSD_DIR=${SOURCE}/dependencies/angsd
# Variables created from transforming other variables
# The number of individuals in the taxon we are analyzing
N_IND=$(wc -l < "${SAMPLE_LIST}")
# How many inbreeding coefficients are supplied?
N_F=$(wc -l < "${SAMPLE_INBREEDING}")
# For ANGSD, the actual sample size is twice the number of individuals, since each individual has two chromosomes.
# The individual inbreeding coefficients take care of the mismatch between these two numbers
# Perform a check to see if number of individuals matches number of inbreeding coefficients
if [ "${N_IND}" -ne "${N_F}" ]
then
    echo "Mismatch between number of samples in ${SAMPLE_LIST} and ${SAMPLE_INBREEDING}"
    exit 1
fi
# Check to see if ancestral state is supplied: If not, polarize samples using
# the reference sequence and generate folded saf.
if [ ! -f "${ANC_SEQ}" ]
then
    echo "Ancestral state data not found, using reference sequence to polarize alignment data. BAQ will likewise not be calculated."
    if [ ! -f "${REF_SEQ}" ]
    then
        echo "No reference sequence supplied, unable to perform calculations."
        exit 2
    else
        ANC_SEQ=$REF_SEQ
        REF_SEQ=
        BAQ=0
        FOLD=1
    fi
else
    FOLD=0
fi
# Create outdirectory
OUT="${SCRATCH}"/"${PROJECT}"/SFS
mkdir -p "${OUT}"
# Now we actually run the command, this creates a binary file that contains the prior SFS
if [[ -f "${OUT}"/"${PROJECT}"_SFSOut.mafs.gz ]] && [ "$OVERRIDE" = "false" ]
then
    echo "WRAPPER:maf already exists and OVERRIDE=false, skipping angsd -bam..."
else
    # Do we have a regions file?
    if [[ -f "${REGIONS}" ]]
    then
        WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -rf "${REGIONS}" \
            -doPost "${DO_POST}")
    # Are we missing a definition for regions?
    elif [[ -z "${REGIONS}" ]]
    then
        WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}")
    # Assuming a single region was defined in config file
    else
        WRAPPER_ARGS=$(echo -bam "${SAMPLE_LIST}" \
            -out "${OUT}"/"${PROJECT}"_SFSOut \
            -indF "${SAMPLE_INBREEDING}" \
            -doSaf "${DO_SAF}" \
            -uniqueOnly "${UNIQUE_ONLY}" \
            -anc "${ANC_SEQ}" \
            -folded "${FOLD}" \
            -minMapQ "${MIN_MAPQ}" \
            -minQ "${MIN_BASEQUAL}" \
            -nInd "${N_IND}" \
            -minInd "${MIN_IND}" \
            -baq "${BAQ}" \
            -ref "${REF_SEQ}" \
            -GL "${GT_LIKELIHOOD}" \
            -P "${N_CORES}" \
            -doMajorMinor "${DO_MAJORMINOR}" \
            -doMaf "${DO_MAF}" \
            -doGeno "${DO_GENO}" \
            -doPost "${DO_POST}" \
            -r "${REGIONS}")
    fi
fi
# Check for advanced arguments, and overwrite any overlapping definitions
FINAL_ARGS=($(source "${SOURCE}/Wrappers/Arg_Zipper.sh" "${WRAPPER_ARGS}" "${ADVANCED_ARGS}"))
# DEBUGGING
# echo "Wrapper arguments: ${WRAPPER_ARGS}" 1<&2
# echo -e "Final arguments:" ${FINAL_ARGS} 1<&2
"${ANGSD_DIR}"/angsd "${FINAL_ARGS[@]}"
"${ANGSD_DIR}"/misc/realSFS \
    "${OUT}"/"${PROJECT}"_SFSOut.saf.idx \
    -P "${N_CORES}" \
    -fold "${FOLD}" \
    > "${OUT}"/"${PROJECT}"_DerivedSFS.graph.me
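A note on the empty output file: as far as I understand, the shell creates (or truncates) the target of a `>` redirect before the command itself starts, so an empty file only shows that the command was launched, not that it produced anything. The sketch below illustrates this with a stand-in command. The file name and variable names are made up for the demonstration and are not part of angsd-wrapper.

```shell
#!/usr/bin/env bash
# Sketch: a failing command still leaves an empty file behind,
# because the shell opens the redirect target before running the command.
tmpdir=$(mktemp -d)
trap 'rm -rf "${tmpdir}"' EXIT
outfile="${tmpdir}/demo_output.txt"   # hypothetical stand-in for the realSFS output file

# 'false' stands in for a command that exits with an error and prints nothing to stdout
if false > "${outfile}"
then
    status="succeeded"
else
    status="failed"
fi

# -f: file exists; -s: file exists and is non-empty
if [[ -f "${outfile}" && ! -s "${outfile}" ]]
then
    result="exists but is empty"
else
    result="missing or non-empty"
fi

echo "command ${status}; output file ${result}"
```

If the same thing is happening here, the empty output file would be a side effect of the realSFS failure rather than a separate problem.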
I can also include my configuration file if helpful (Site_Frequency_Spectrum_Config), which also directs the script to another configuration file in the same directory (Common_Config). I'm wondering whether anyone else has run into this error while working through the tutorial. I am trying to figure out whether this is a file path issue, or whether the SFS is not running correctly and there is some other error in the output that I am not identifying.
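One way to rule out a plain file path problem would be to confirm that the three SAF files named in the log exist and are non-empty before realSFS runs. Below is a minimal sketch of such a check. `check_saf_inputs` is a hypothetical helper, not something angsd-wrapper provides, and the demonstration uses temporary placeholder files rather than real ANGSD output.

```shell
#!/usr/bin/env bash
# check_saf_inputs is a hypothetical helper (not part of angsd-wrapper).
# Given an output prefix, it reports whether each SAF file exists and is non-empty.
# Returns 0 if all three are present, 1 otherwise.
check_saf_inputs() {
    local prefix="$1"
    local missing=0
    local suffix
    for suffix in saf.idx saf.gz saf.pos.gz
    do
        if [[ -s "${prefix}.${suffix}" ]]
        then
            echo "OK: ${prefix}.${suffix}"
        else
            echo "MISSING OR EMPTY: ${prefix}.${suffix}"
            missing=1
        fi
    done
    return "${missing}"
}

# Self-contained demonstration on placeholder files
demo_dir=$(mktemp -d)
trap 'rm -rf "${demo_dir}"' EXIT
echo "placeholder" > "${demo_dir}/Demo_SFSOut.saf.idx"
echo "placeholder" > "${demo_dir}/Demo_SFSOut.saf.gz"
# saf.pos.gz is deliberately left out to show the failure case

if check_saf_inputs "${demo_dir}/Demo_SFSOut"
then
    demo_status="all present"
else
    demo_status="incomplete"
fi
echo "${demo_status}"
```

With my data the call would be `check_saf_inputs "${OUT}"/"${PROJECT}"_SFSOut`, placed just before the realSFS command in the wrapper script.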