.github/CONTRIBUTING.md
13 additions & 41 deletions
@@ -61,21 +61,16 @@ For further information/help, please consult the [nf-core/fetchngs documentation

 To make the nf-core/fetchngs code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.

-### Adding a new step
-
-If you wish to contribute a new step, please use the following coding standards:
-
-1. Define the corresponding input channel into your new process from the expected previous process channel
-2. Write the process block (see below).
-3. Define the output channel if needed (see below).
-4. Add any new flags/options to `nextflow.config` with a default (see below).
-5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build`).
-6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
-7. Add sanity checks for all relevant parameters.
-8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
-9. Do local tests that the new code works properly and as expected.
-10. Add a new test command in `.github/workflow/ci.yml`.
-11. Add any descriptions of output files to `docs/output.md`.
+### Adding a new step or module
+
+If you wish to contribute a new step or module please see the [official guidelines](https://nf-co.re/developers/adding_modules#new-module-guidelines-and-pr-review-checklist) and use the following coding standards:
+
+1. Add any new flags/options to `nextflow.config` with a default (see section below).
+2. Add any new flags/options to `nextflow_schema.json` with help text via `nf-core schema build`.
+3. Add sanity checks for all relevant parameters.
+4. Perform local tests to validate that the new code works as expected.
+5. If applicable, add a new test command in `.github/workflow/ci.yml`.
+6. Add any descriptions of output files to `docs/output.md`.

 ### Default values

@@ -87,40 +82,17 @@ Once there, use `nf-core schema build` to add to `nextflow_schema.json`.

 Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.

-The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block.
-
-### Naming schemes
+### Channel naming convention

 Please use the following naming schemes, to make it easy to understand what is going where.

-* initial process channel: `ch_output_from_<process>`
-* intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
+* Initial process channel: `ch_output_from_<process>`
+* Intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`

 ### Nextflow version bumping

 If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

-### Software version reporting
-
-If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.
-
-Add to the script block of the process, something like the following:
-```
-<YOUR_TOOL> --help | head -n 1 &> v_<YOUR_TOOL>.txt 2>&1 || true
-```
-
-You then need to edit the script `bin/scrape_software_versions.py` to:
-
-1. Add a Python regex for your tool's `--version` output (as in stored in the `v_<YOUR_TOOL>.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
-2. Add a HTML entry to the `OrderedDict` for formatting in MultiQC.
-
 ### Images and figures

 For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).
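
The channel naming convention kept by the hunk above can be illustrated with a small sketch. It is written in Python purely for illustration (the pipeline itself is Nextflow), and the helper functions and the process names `download` and `checksum` are hypothetical, not part of nf-core/fetchngs.

```python
def initial_channel(process: str) -> str:
    """Name for the channel emitted by the first process in a chain."""
    return f"ch_output_from_{process}"


def linking_channel(previous_process: str, next_process: str) -> str:
    """Name for an intermediate or terminal channel between two processes."""
    return f"ch_{previous_process}_for_{next_process}"


if __name__ == "__main__":
    # Hypothetical process names, used only to show the resulting pattern.
    print(initial_channel("download"))
    print(linking_channel("download", "checksum"))
```

Encoding both the producer and the consumer in the name makes it possible to read the data flow from the channel names alone.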
* Convert pipeline to updated Nextflow DSL2 syntax for future adoption across nf-core
+* Added a workflow to download FastQ files and to create samplesheets for ids from the [Synapse platform](https://www.synapse.org/) hosted by [Sage Bionetworks](https://sagebionetworks.org/).
+* SRA identifiers not available for direct download via the ENA FTP will now be downloaded via [`sra-tools`](https://github.com/ncbi/sra-tools).
+* Added `--force_sratools_download` parameter to preferentially download all FastQ files via `sra-tools` instead of ENA FTP.
+* Correctly handle errors from SRA identifiers that do **not** return metadata, for example, due to being private.
+* Retry an error in prefetch via bash script in order to allow it to resume interrupted downloads.
+* Name output FastQ files by `{EXP_ACC}_{RUN_ACC}*fastq.gz` instead of `{EXP_ACC}_{T*}*fastq.gz` for run id provenance
+* [[#46](https://github.com/nf-core/fetchngs/issues/46)] - Bug in sra_ids_to_runinfo.py
+* Added support for [DDBJ ids](https://www.ddbj.nig.ac.jp/index-e.html). See examples below:
> Fukuda A, Kodama Y, Mashima J, Fujisawa T, Ogasawara O. DDBJ update: streamlining submission and access of human data. Nucleic Acids Res. 2021 Jan 8;49(D1):D71-D75. doi: 10.1093/nar/gkaa982. PubMed PMID: 33156332; PubMed Central PMCID: PMC7779041.
+
 * [GEO](https://pubmed.ncbi.nlm.nih.gov/23193258/)
 > Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5. doi: 10.1093/nar/gks1193. Epub 2012 Nov 27. PubMed PMID: 23193258; PubMed Central PMCID: PMC3531084.
README.md
21 additions & 8 deletions
@@ -16,21 +16,34 @@

 ## Introduction

-**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from public databases. At present, the pipeline supports SRA / ENA / GEO ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
+**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).

 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

 On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/fetchngs/results).

 ## Pipeline summary

-Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/public_database_ids.txt)) the pipeline performs the following steps:
+Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:
+
+### SRA / ENA / DDBJ / GEO ids

 1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
-2. Fetch extensive id metadata including direct download links to FastQ files via ENA API
-3. Download FastQ files in parallel via `curl` and perform `md5sum` check
+2. Fetch extensive id metadata via ENA API
+3. Download FastQ files:
+    - If direct download links are available from the ENA API, fetch in parallel via `curl` and perform `md5sum` check
+    - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
 4. Collate id metadata and paths to FastQ files in a single samplesheet

+### Synapse ids
+
+1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.
+2. Retrieve FastQ file metadata including FastQ file names, md5sums, etags, annotations and other data provenance via the `synapse show` command.
+3. Download FastQ files in parallel via `synapse get`
+4. Collate paths to FastQ files in a single samplesheet
+
+### Samplesheet format
+
 The columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input) and the Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format). You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core.

 ## Quick Start
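
The download step described in the hunk above (direct link via `curl` plus an `md5sum` check, otherwise `sra-tools`) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the pipeline's actual implementation: the function names are hypothetical, and the `force_sratools` argument only mirrors the idea behind the `--force_sratools_download` parameter.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_md5: str) -> bool:
    """Compare a downloaded file against the checksum reported in the metadata."""
    return md5_of_file(path) == expected_md5.strip().lower()


def choose_download_method(fastq_url: Optional[str], force_sratools: bool = False) -> str:
    """Pick 'curl' when a direct link exists, otherwise fall back to 'sra-tools'.

    Hypothetical helper: the real pipeline makes this decision in Nextflow.
    """
    if fastq_url and not force_sratools:
        return "curl"
    return "sra-tools"
```

Reading the file in chunks keeps memory use flat for large FastQ files, and normalising the expected checksum avoids spurious mismatches caused by case or trailing whitespace.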
@@ -45,9 +58,9 @@ The columns in the auto-created samplesheet can be tailored to be accepted out-o
 nextflow run nf-core/fetchngs -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
 ```

->* Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
->* If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
->* If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
+>- Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
+>- If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
+>- If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.

 4. Start running your own analysis!

@@ -61,7 +74,7 @@ The nf-core/fetchngs pipeline comes with documentation about the pipeline [usage

 ## Credits

-nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [The Bioinformatics & Biostatistics Group](https://www.crick.ac.uk/research/science-technology-platforms/bioinformatics-and-biostatistics/) at [The Francis Crick Institute, London](https://www.crick.ac.uk/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/).
+nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/). Support for download of sequencing reads without FTP links via sra-tools was added by Moritz E. Beber ([@Midnighter](https://github.com/Midnighter)) from [Unseen Bio ApS, Denmark](https://unseenbio.com). The Synapse workflow was added by Daisy Han [@daisyhan97](https://github.com/daisyhan97) and Bruno Grande [@BrunoGrandePhD](https://github.com/BrunoGrandePhD) from [Sage Bionetworks, Seattle](https://sagebionetworks.org/).