You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+31Lines changed: 31 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -42,6 +42,29 @@ Currently, we support three distinct approaches for quantifying the number of re
42
42
43
43
The workflows assume that the files needed by PathSeq and CAMMiQ are pre-built and their location is specified in `config/PathSeq-config.yaml`. The index used by PathSeq in this project is ~41 GB while the indices used by CAMMiQ are > 200 GB so we do not distribute them with the rest of the package although they are available upon request.
44
44
45
+
The existing projects assume that these database files will be in the top-level directory so the first step is to create the directory `CSI-Microbes-identification/data` (if it doesn't already exist) using
46
+
47
+
```
48
+
mkdir data
49
+
```
50
+
51
+
The PathSeq files can be divided into two groups: the microbial reference files and the host (human) reference files. To download the host reference files (which can take a while), use the below commands
The microbial reference fasta file (`microbev1.fa`), VecScreen hits (`microbev1-vecscreen-combined-matches.bed`) and taxonomy hierarchy file (`microbev1_release201_taxonomy.db`) used in our paper are available from [zenodo](https://zenodo.org/record/5604433). The microbial reference fasta file (`microbev1.fa`) can be used to build the additional microbial files required by PathSeq (`microbev1.fa.fai`, `microbev1.dict` and `microbev1.fa.img`) using the below command.
61
+
62
+
```
63
+
./scripts/run-build-PathSeq-microbe-files.sh
64
+
```
65
+
66
+
Please note that building the `microbev1.fa.img` file will take ~1 day.
67
+
45
68
### Construction of PathSeq files
46
69
47
70
An example of how to build the PathSeq index (and other required files) is available in `build-PathSeq-microbes-files/Snakefile`.
@@ -360,6 +383,14 @@ to
360
383
}
361
384
```
362
385
386
+
### What if I don't have access to Biowulf?
387
+
388
+
Biowulf is the NIH's linux cluster that uses the slurm workload manager. It should be relatively straightforward to run CSI-Microbes-identification to another linux cluster that uses the slurm workload manager. Users should ensure that there is a module system that includes snakemake (6.0.5)<sup>[REF](#Snakemake)</sup>, sratoolkit (2.10.9)<sup>[REF](#SRAToolkit)</sup>, cellranger (5.0.1)<sup>[REF](#CellRanger)</sup>, samtools (1.11)<sup>[REF](#SAMtools)</sup>, bedtools (2.29.2)<sup>[REF](#BedTools)</sup>, and picard (latest=2.25.0)<sup>[REF](#Picard)</sup>.
389
+
390
+
Users may also need to change the names of the partitions to the partitions used by their server (see above question for an example). To speed up and effectively parallelize the PathSeq step, it is important to use nodes with at least 200 GB of local storage because we copy the PathSeq files to the local node before running PathSeq. In our experience, running PathSeq on multiple nodes using the same reference files will be much slower because of network latency as well competition for access to the one PathSeq reference files between the nodes (which will also sometimes cause errors).
391
+
392
+
Currently, the example 10x analyses use an HG38 genome that exists on biowulf. Therefore, the `CellRanger`: `genome_dir` value in `config/PathSeq-config.yaml` will need to updated as well.
393
+
363
394
### What are the expected output files?
364
395
365
396
The expected output files from CSI-Microbes-identification are pathseq.txt files, which are output in `output/PathSeq`. For example, the pathseq file for cell barcode TTTCCTCTCCACTGGG-1 from sample GSM3454529 (exposed to _Salmonella_) is located at `output/PathSeq/Pt0-GSM3454529-TTTCCTCTCCACTGGG-1/pathseq.txt`. These output files are used as input to [CSI-Microbes-analysis](https://github.com/ruppinlab/CSI-Microbes-analysis), which computes the differential abundance of microbes across cell-types.
0 commit comments