EI-CoreBioinformatics · yuxuanlan · Apr 12, 2023 · Apr 19, 2023 · Apr 20, 2023 · Dec 14, 2023
diff --git a/README.md b/README.md
@@ -1,6 +1,7 @@
 # Smart-Seq2 quality control pipeline
 
 Smart-seq2 quantification and quality control pipeline, developed for single-cell service at the Earlham Institue.
+Version 1.1.
 
 Starting with the FASTQ files generated from the SmartSeq2 experiment and a sample sheet containing the appropriate metadata, transcript level counts are generated, merged and used to produce QC metrics.
 A QC report document is generated for each of the plates, and the whole experiment.
@@ -9,8 +10,6 @@ The pipeline does not require internet connection, but some required files have
 
 ## Inputs
 
-* Demulitplexed read data (.fastq.gz)
-* Sample sheet (.csv)
 * Config file
 
 ## Outputs
@@ -23,11 +22,40 @@ The pipeline does not require internet connection, but some required files have
 
 ## Tools used
 
-* kallisto 0.45.1
-* nextflow 19.04
-* R 3.5.2 and R packages (scater, rmarkdown, tidyverse)
+* kallisto 0.48.0
+* nextflow
+* R and R packages (scater 1.28.0, rjson 0.2.21, rmarkdown, tinytex, knitr, dplyr)
 * singularity 2.4.2
-
+* pandoc, texlive, ghostscript
+
+## Preprocessing
+1. Convert the sample sheet to UNIX line endings
+```
+source dos2unix-7.4.1_CBG
+dos2unix -n old_samplesheet samplesheet
+```
+2. Format the FASTQ file names. The pipeline accepts symbolic links and requires file names in the following format:
+```
+${Sample_Plate}_${demultiplexed_readname}
+``` 
+For example, the following FASTQ file from plate `CUB35DAY0` with a filename:
+```
+R0882-S0001_A68701_CUB35DAY0A10_H3VY5DRX2_CGTACTAG-AGAGGATA_L001_R1.fastq.gz
+```
+becomes
+```
+CUB35DAY0_R0882-S0001_A68701_CUB35DAY0A10_H3VY5DRX2_CGTACTAG-AGAGGATA_L001_R1.fastq.gz
+```
+3. Make samples.csv
+```
+# Build samples.csv with one row per sample ID taken from column 2 of the sample sheet.
+echo "sampleId,R1,R2" > "$analysis_dir/samples.csv"
+for sampleid in $(sed '1d' "$samplesheet" | cut -d',' -f2); do
+    R1=$(ls "$symlinkdir"/*_"${sampleid}"_*_R1.fastq.gz)
+    R2=$(ls "$symlinkdir"/*_"${sampleid}"_*_R2.fastq.gz)
+    echo "${sampleid},${R1},${R2}" >> "$analysis_dir/samples.csv"
+done
+```
 
 ## Running the pipeline
 
@@ -36,27 +64,37 @@ Pipeline is written in Nextflow, so a run is usually initiated in the following
 
 Examples of a config file and sample sheet are in the repository.
 
+An example:
+
+```
+cd /ei/cb/development/lany/CB-GENANNO-525_Charlotte_Utting_EI_CU_ENQ-5286_A_01/Analysis/scqc_reqs-1.1/4plates.run2
+sbatch -p ei-cb -J scqc_GENANNO-525.all_plate -o scqc_GENANNO-525.all_plate.%j.%N.log -c 1 --mem 10G \
+    --mail-type=ALL --mail-user=user.email.com \
+    --wrap "source singlecellQC-1.1_CBG && cd $analysis_dir && \
+    nextflow run /ei/software/cb/singlecellQC/1.1/x86_64/bin/scqc_nf.sh \
+    -c GENANNO-525.scqc-1.1.all_plates.config -with-report -resume"
+```
+
 ## Config file
 
-Parameters with 'params.' prefix can be passed when starting the pipeline by adding them at the end of the pipeline start call, e.g.
+Parameters inside the 'params' scope can be overridden when starting the pipeline by appending them to the pipeline start call, e.g.
 'nextflow run scqc_nf.sh -c scqc.config --qcoutdir=my_qc_directory'. Alternatively, they can be edited in the config file.
 
-Parameters related to the output, organism species and this pipeline specifics.
+The following parameters relate to the output, the organism species and the pipeline itself:
 
 * quantificationsdir - Directory to contain qunatifications produced for each of the samples and the counts matrices
 * qcoutdir' - Directory to contain the final QC report and other QC-related files
-* reads -  Location of sample FASTQ files.
-* species - 'Hsapiens' or 'Mmusculus'
 * samplesheet - .csv file containing information about sample names, wells, control status and other sample metadata.
+* reads - Location of sample FASTQ files.
+* species - E.g., 'Hsapiens', 'Mmusculus'.
 * plate_ids - List of plate identificators (typically 4 strings), as they appear in the names of raw data samples. This is how the pipeline merges
     samples into plate-level matrices which are then used for plate-level QC.
 * mtnamefile - In case of non-human species, .rds file for mitochondrial gene. Leave empty ('') if human.
     This is a vector in R, containing the list of Ensembl transcript IDs, saved as .rds (using saveRDS())
-
-* idx - Location of the kallisto index to use. Indices are precomputed. If you want to use a new one, one can be build with 'kallisto index'.
 * pattern - The format of the FASTQ endings showing how they should be grouped, as a glob pattern
+* trans2gen_tsv - Transcript ID to gene ID mapping file (.tsv).
 
-General HPC parameters:
+General HPC parameters, set in the `executor` scope:
 
 * executor - Type of HPC scheduler.
 
@@ -66,26 +104,32 @@ General HPC parameters:
 
 * queueSize - Maximum number of jobs the pipeline will submit at once.
 
+Process-specific parameters can be set in the `process` scope.
+
 For more information on configuration file options, check out [nextflow documentation](https://www.nextflow.io/docs/latest/config.html).
 
 
 ## Sample sheet
 
 Sample sheet will be unique for every run.
 
-It is a .csv file that has to have the following columns. Additional columns are not a problem, but are not used.
+It is a .csv file that must contain the following columns. Additional columns are allowed, but are not used.
+Make sure that no entry contains white space.
 
-* unique_sample_id_suffix - Part of the FASTQ file name that uniquely matches to one sample. This is how the pipeline connects the raw data with sample sheet information.
-* well - Row/column location of the sample on the plate. Something like A01, A02, etc. These are needed for plate position plots.
-* plate_id - A string corresponding to one of the plates used in the experiment.
-* number_of_cells - Number of cells in a well.
+* Sample_ID, Sample_Name - These two columns are also in the Illumina sample sheet.
+* Sample_Plate - A string corresponding to one of the plates used in the experiment.
+* Sample_Well - Row/column location of the sample on the plate. Something like A01, A02, etc. These are needed for plate position plots.
+* row - Row location of the sample on the plate. Normally ranges from A to H.
+* column - Column location of the sample on the plate. Normally ranges from 1 to 12.
 * control - TRUE if the well contains a control, FALSE otherwise.
-* experiment - Name of the experiment. Can be anything as long as it's the same througout the column.
+* number_of_cells - Number of cells in a well, e.g., 0, 1, 2, 20, 50. Wells containing more or fewer than one cell will be labeled as "control".
+* meta_1 - Metadata field 1, categorical value.
+* meta_2 - Metadata field 2, categorical value.
 
 ## Required precomputed resources
 
 * kallisto index
 * list of mitochondrial genes
+* transcript ID to gene ID mapping file (.tsv)
 * singularity images
-* biomaRt annotation
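
Note on preprocessing step 2: the README states the required `${Sample_Plate}_${demultiplexed_readname}` naming but does not show how to produce it. The following is a minimal sketch of one way to create such symbolic links. The function name `link_with_plate`, its arguments and the directory layout are hypothetical and not taken from the repository; the plate ID is assumed to be known for each file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): symlink one demultiplexed
# FASTQ file into a directory under the name
# ${Sample_Plate}_${demultiplexed_readname}.
link_with_plate() {
    local plate="$1" fastq="$2" symlinkdir="$3"
    mkdir -p "$symlinkdir"
    # Resolve to an absolute path so the link stays valid from any directory.
    local target
    target="$(cd "$(dirname "$fastq")" && pwd)/$(basename "$fastq")"
    # -s: symbolic link, -f: replace an existing link, -n: do not follow it.
    ln -sfn "$target" "$symlinkdir/${plate}_$(basename "$fastq")"
}
```

For the example in the README, a call such as `link_with_plate CUB35DAY0 R0882-S0001_A68701_CUB35DAY0A10_H3VY5DRX2_CGTACTAG-AGAGGATA_L001_R1.fastq.gz "$symlinkdir"` would create the prefixed link in `$symlinkdir`, leaving the original file untouched.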