Commit 0c43cc7

Merge pull request #54 from nf-core/dev
Dev -> Master for 1.4 release
2 parents 2d593fb + 8b0cbb9 commit 0c43cc7

46 files changed, with 1888 additions and 767 deletions

.github/CONTRIBUTING.md

Lines changed: 13 additions & 41 deletions
Original file line number | Diff line number | Diff line change
@@ -61,21 +61,16 @@ For further information/help, please consult the [nf-core/fetchngs documentation
6161

6262
To make the nf-core/fetchngs code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written.
6363

64-
### Adding a new step
65-
66-
If you wish to contribute a new step, please use the following coding standards:
67-
68-
1. Define the corresponding input channel into your new process from the expected previous process channel
69-
2. Write the process block (see below).
70-
3. Define the output channel if needed (see below).
71-
4. Add any new flags/options to `nextflow.config` with a default (see below).
72-
5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build`).
73-
6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
74-
7. Add sanity checks for all relevant parameters.
75-
8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
76-
9. Do local tests that the new code works properly and as expected.
77-
10. Add a new test command in `.github/workflow/ci.yml`.
78-
11. Add any descriptions of output files to `docs/output.md`.
64+
### Adding a new step or module
65+
66+
If you wish to contribute a new step or module please see the [official guidelines](https://nf-co.re/developers/adding_modules#new-module-guidelines-and-pr-review-checklist) and use the following coding standards:
67+
68+
1. Add any new flags/options to `nextflow.config` with a default (see section below).
69+
2. Add any new flags/options to `nextflow_schema.json` with help text via `nf-core schema build`.
70+
3. Add sanity checks for all relevant parameters.
71+
4. Perform local tests to validate that the new code works as expected.
72+
5. If applicable, add a new test command in `.github/workflow/ci.yml`.
73+
6. Add any descriptions of output files to `docs/output.md`.
7974

8075
### Default values
8176

@@ -87,40 +82,17 @@ Once there, use `nf-core schema build` to add to `nextflow_schema.json`.
8782

8883
Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
8984

90-
The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block.
91-
92-
### Naming schemes
85+
### Channel naming convention
9386

9487
Please use the following naming schemes, to make it easy to understand what is going where.
9588

96-
* initial process channel: `ch_output_from_<process>`
97-
* intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
89+
* Initial process channel: `ch_output_from_<process>`
90+
* Intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`
9891

9992
### Nextflow version bumping
10093

10194
If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`
10295

103-
### Software version reporting
104-
105-
If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_version` process.
106-
107-
Add to the script block of the process, something like the following:
108-
109-
```bash
110-
<YOUR_TOOL> --version &> v_<YOUR_TOOL>.txt 2>&1 || true
111-
```
112-
113-
or
114-
115-
```bash
116-
<YOUR_TOOL> --help | head -n 1 &> v_<YOUR_TOOL>.txt 2>&1 || true
117-
```
118-
119-
You then need to edit the script `bin/scrape_software_versions.py` to:
120-
121-
1. Add a Python regex for your tool's `--version` output (as in stored in the `v_<YOUR_TOOL>.txt` file), to ensure the version is reported as a `v` and the version number e.g. `v2.1.1`
122-
2. Add a HTML entry to the `OrderedDict` for formatting in MultiQC.
123-
12496
### Images and figures
12597

12698
For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).

.github/workflows/ci.yml

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -18,12 +18,11 @@ jobs:
1818
if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/fetchngs') }}
1919
runs-on: ubuntu-latest
2020
env:
21-
NXF_VER: ${{ matrix.nxf_ver }}
2221
NXF_ANSI_LOG: false
2322
strategy:
2423
matrix:
25-
# Nextflow versions: check pipeline minimum and current latest
26-
nxf_ver: ["21.04.0", ""]
24+
# Nextflow versions: check pipeline minimum and latest edge version
25+
nxf_ver: ["NXF_VER=21.04.0", "NXF_EDGE=1"]
2726
steps:
2827
- name: Check out pipeline code
2928
uses: actions/checkout@v2
@@ -34,8 +33,10 @@ jobs:
3433
run: |
3534
wget -qO- get.nextflow.io | bash
3635
sudo mv nextflow /usr/local/bin/
36+
export ${{ matrix.nxf_ver }}
37+
nextflow self-update
3738
38-
- name: Run pipeline with test data
39+
- name: Run pipeline with SRA test data
3940
run: |
4041
nextflow run ${GITHUB_WORKSPACE} -profile test,docker
4142
@@ -53,6 +54,7 @@ jobs:
5354
"--nf_core_pipeline rnaseq",
5455
"--ena_metadata_fields run_accession,experiment_accession,library_layout,fastq_ftp,fastq_md5 --sample_mapping_fields run_accession,library_layout",
5556
--skip_fastq_download,
57+
--force_sratools_download
5658
]
5759
steps:
5860
- name: Check out pipeline code

.nf-core.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,9 @@ lint:
22
files_unchanged:
33
- .github/CONTRIBUTING.md
44
- assets/sendmail_template.txt
5+
- lib/NfcoreSchema.groovy
56
- lib/NfcoreTemplate.groovy
7+
files_exist:
8+
- bin/scrape_software_versions.py
9+
- modules/local/get_software_versions.nf
10+
actions_ci: False

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,30 @@
33
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
44
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
55

6+
## [[1.4](https://github.com/nf-core/fetchngs/releases/tag/1.4)] - 2021-11-09
7+
8+
### Enhancements & fixes
9+
10+
* Convert pipeline to updated Nextflow DSL2 syntax for future adoption across nf-core
11+
* Added a workflow to download FastQ files and to create samplesheets for ids from the [Synapse platform](https://www.synapse.org/) hosted by [Sage Bionetworks](https://sagebionetworks.org/).
12+
* SRA identifiers not available for direct download via the ENA FTP will now be downloaded via [`sra-tools`](https://github.com/ncbi/sra-tools).
13+
* Added `--force_sratools_download` parameter to preferentially download all FastQ files via `sra-tools` instead of ENA FTP.
14+
* Correctly handle errors from SRA identifiers that do **not** return metadata, for example, due to being private.
15+
* Retry an error in prefetch via bash script in order to allow it to resume interrupted downloads.
16+
* Name output FastQ files by `{EXP_ACC}_{RUN_ACC}*fastq.gz` instead of `{EXP_ACC}_{T*}*fastq.gz` for run id provenance
17+
* [[#46](https://github.com/nf-core/fetchngs/issues/46)] - Bug in sra_ids_to_runinfo.py
18+
* Added support for [DDBJ ids](https://www.ddbj.nig.ac.jp/index-e.html). See examples below:
19+
20+
| `DDBJ` |
21+
|---------------|
22+
| PRJDB4176 |
23+
| SAMD00114846 |
24+
| DRA008156 |
25+
| DRP004793 |
26+
| DRR171822 |
27+
| DRS090921 |
28+
| DRX162434 |
29+
630
## [[1.3](https://github.com/nf-core/fetchngs/releases/tag/1.3)] - 2021-09-15
731

832
### Enhancements & fixes

CITATIONS.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,8 @@
1414

1515
* [Requests](https://docs.python-requests.org/)
1616

17+
* [sra-tools](https://github.com/ncbi/sra-tools)
18+
1719
## Pipeline resources
1820

1921
* [ENA](https://pubmed.ncbi.nlm.nih.gov/33175160/)
@@ -22,9 +24,15 @@
2224
* [SRA](https://pubmed.ncbi.nlm.nih.gov/21062823/)
2325
> Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2011 Jan;39 (Database issue):D19-21. doi: 10.1093/nar/gkq1019. Epub 2010 Nov 9. PubMed PMID: 21062823; PubMed Central PMCID: PMC3013647.
2426
27+
* [DDBJ](https://pubmed.ncbi.nlm.nih.gov/33156332/)
28+
> Fukuda A, Kodama Y, Mashima J, Fujisawa T, Ogasawara O. DDBJ update: streamlining submission and access of human data. Nucleic Acids Res. 2021 Jan 8;49(D1):D71-D75. doi: 10.1093/nar/gkaa982. PubMed PMID: 33156332; PubMed Central PMCID: PMC7779041.
29+
2530
* [GEO](https://pubmed.ncbi.nlm.nih.gov/23193258/)
2631
> Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5. doi: 10.1093/nar/gks1193. Epub 2012 Nov 27. PubMed PMID: 23193258; PubMed Central PMCID: PMC3531084.
2732
33+
* [Synapse](https://pubmed.ncbi.nlm.nih.gov/24071850/)
34+
> Omberg L, Ellrott K, Yuan Y, Kandoth C, Wong C, Kellen MR, Friend SH, Stuart J, Liang H, Margolin AA. Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas. Nat Genet. 2013 Oct;45(10):1121-6. doi: 10.1038/ng.2761. PMID: 24071850; PMCID: PMC3950337.
35+
2836
## Software packaging/containerisation tools
2937

3038
* [Anaconda](https://anaconda.com)

README.md

Lines changed: 21 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -16,21 +16,34 @@
1616

1717
## Introduction
1818

19-
**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from public databases. At present, the pipeline supports SRA / ENA / GEO ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
19+
**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
2020

2121
The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
2222

2323
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/fetchngs/results).
2424

2525
## Pipeline summary
2626

27-
Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/public_database_ids.txt)) the pipeline performs the following steps:
27+
Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:
28+
29+
### SRA / ENA / DDBJ / GEO ids
2830

2931
1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
30-
2. Fetch extensive id metadata including direct download links to FastQ files via ENA API
31-
3. Download FastQ files in parallel via `curl` and perform `md5sum` check
32+
2. Fetch extensive id metadata via ENA API
33+
3. Download FastQ files:
34+
- If direct download links are available from the ENA API, fetch in parallel via `curl` and perform `md5sum` check
35+
- Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
3236
4. Collate id metadata and paths to FastQ files in a single samplesheet
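
The `md5sum` check in step 3 can be illustrated with a short sketch. The pipeline itself uses `curl` and `md5sum`; the Python below is only a stand-in for the same idea, and the function names `file_md5` and `md5_matches` are hypothetical. It assumes the expected checksum comes from a metadata field such as `fastq_md5`.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare a downloaded file's MD5 with the expected value (assumed hex string)."""
    return file_md5(path) == expected_md5.strip().lower()
```

A download whose checksum does not match would normally be treated as failed and retried, rather than passed on to the samplesheet.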
3337

38+
### Synapse ids
39+
40+
1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.
41+
2. Retrieve FastQ file metadata including FastQ file names, md5sums, etags, annotations and other data provenance via the `synapse show` command.
42+
3. Download FastQ files in parallel via `synapse get`
43+
4. Collate paths to FastQ files in a single samplesheet
44+
45+
### Samplesheet format
46+
3447
The columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input) and the Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format). You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core.
3548

3649
## Quick Start
@@ -45,9 +58,9 @@ The columns in the auto-created samplesheet can be tailored to be accepted out-o
4558
nextflow run nf-core/fetchngs -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
4659
```
4760

48-
> * Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
49-
> * If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
50-
> * If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
61+
> - Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
62+
> - If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
63+
> - If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
5164

5265
4. Start running your own analysis!
5366

@@ -61,7 +74,7 @@ The nf-core/fetchngs pipeline comes with documentation about the pipeline [usage
6174

6275
## Credits
6376

64-
nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [The Bioinformatics & Biostatistics Group](https://www.crick.ac.uk/research/science-technology-platforms/bioinformatics-and-biostatistics/) at [The Francis Crick Institute, London](https://www.crick.ac.uk/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/).
77+
nf-core/fetchngs was originally written by Harshil Patel ([@drpatelh](https://github.com/drpatelh)) from [Seqera Labs, Spain](https://seqera.io/) and Jose Espinosa-Carrasco ([@JoseEspinosa](https://github.com/JoseEspinosa)) from [The Comparative Bioinformatics Group](https://www.crg.eu/en/cedric_notredame) at [The Centre for Genomic Regulation, Spain](https://www.crg.eu/). Support for download of sequencing reads without FTP links via sra-tools was added by Moritz E. Beber ([@Midnighter](https://github.com/Midnighter)) from [Unseen Bio ApS, Denmark](https://unseenbio.com). The Synapse workflow was added by Daisy Han [@daisyhan97](https://github.com/daisyhan97) and Bruno Grande [@BrunoGrandePhD](https://github.com/BrunoGrandePhD) from [Sage Bionetworks, Seattle](https://sagebionetworks.org/).
6578

6679
## Contributions and Support
6780

assets/schema_input.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,8 @@
88
"type": "array",
99
"items": {
1010
"type": "string",
11-
"pattern": "^[SEPG][RAS][RXSMPAJXE][EN]?[AB]?\\d{4,9}$",
12-
"errorMessage": "Please provide a valid SRA, GEO or ENA identifier"
11+
"pattern":"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\\d+)$",
12+
"errorMessage": "Please provide a valid SRA, ENA, DDBJ or GEO identifier"
1313
}
1414
}
1515
}
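
To see what the pattern change in `assets/schema_input.json` does in practice, the sketch below checks a few ids against the old and new expressions. It is an illustration only: the regexes are copied from the diff with the JSON escaping removed, the helper `is_valid_id` is hypothetical, and `syn26240435` is a made-up id used purely to exercise the `syn` branch. The DDBJ ids are the examples listed in the CHANGELOG.

```python
import re

# Patterns taken from the diff; re.ASCII keeps \d limited to 0-9,
# which is how ECMAScript-style schema validators usually treat it.
OLD_PATTERN = re.compile(r"^[SEPG][RAS][RXSMPAJXE][EN]?[AB]?\d{4,9}$", re.ASCII)
NEW_PATTERN = re.compile(
    r"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\d+)$",
    re.ASCII,
)

# Example DDBJ ids from the 1.4 CHANGELOG entry.
DDBJ_IDS = [
    "PRJDB4176", "SAMD00114846", "DRA008156", "DRP004793",
    "DRR171822", "DRS090921", "DRX162434",
]


def is_valid_id(identifier, pattern=NEW_PATTERN):
    """Return True if the identifier matches the given pattern."""
    return pattern.match(identifier) is not None


if __name__ == "__main__":
    for identifier in DDBJ_IDS + ["SRR000001", "syn26240435", "ABC123"]:
        print(identifier, is_valid_id(identifier, OLD_PATTERN), is_valid_id(identifier))
```

Under these assumptions the DDBJ ids are rejected by the old pattern and accepted by the new one. The new pattern also accepts `syn`-prefixed ids, although the updated `errorMessage` does not mention Synapse.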
