Commit b79cde2

Merge pull request #103 from nf-core/dev
Dev -> Master for v1.7 release
2 parents 7b7ab2f + 80bdaa8 commit b79cde2

13 files changed, +113 -33 lines changed


CHANGELOG.md

Lines changed: 20 additions & 0 deletions
@@ -3,6 +3,26 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [[1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] - 2022-07-01
+
+### :warning: Major enhancements
+
+Support for GEO ids has been dropped in this release due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).
+
+As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline:
+
+- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
+- Click `SRA Run Selector` at the bottom of the GEO accession page
+- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
+
+This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
+
+### Enhancements & fixes
+
+- [#97](https://github.com/nf-core/fetchngs/pull/97) - Add support for generating nf-core/taxprofiler compatible samplesheets.
+- [#99](https://github.com/nf-core/fetchngs/issues/99) - SRA_IDS_TO_RUNINFO fails due to bad request
+- Add `enum` field for `--nf_core_pipeline` to parameter schema so only supported pipelines are accepted
+
 ## [[1.6](https://github.com/nf-core/fetchngs/releases/tag/1.6)] - 2022-05-17

 - [#57](https://github.com/nf-core/fetchngs/pull/57) - fetchngs fails if FTP is blocked

README.md

Lines changed: 21 additions & 3 deletions
@@ -17,7 +17,7 @@

 ## Introduction

-**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
+**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).

 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

@@ -27,7 +27,7 @@ On release, automated continuous integration tests run the pipeline on a full-si

 Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:

-### SRA / ENA / DDBJ / GEO ids
+### SRA / ENA / DDBJ ids

 1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
 2. Fetch extensive id metadata via ENA API
@@ -36,6 +36,18 @@ Via a single file of ids, provided one-per-line (see [example input file](https:
    - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
 4. Collate id metadata and paths to FastQ files in a single samplesheet

+### GEO ids
+
+Support for GEO ids was dropped in [[v1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).
+
+As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline instead:
+
+- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
+- Click `SRA Run Selector` at the bottom of the GEO accession page
+- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
+
+This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
+
 ### Synapse ids

 1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.
@@ -45,7 +57,13 @@ Via a single file of ids, provided one-per-line (see [example input file](https:

 ### Samplesheet format

-The columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input) and the Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format). You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core.
+The columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include:
+
+- [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input)
+- Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format)
+- [nf-core/taxprofiler](https://nf-co.re/nf-core/taxprofiler)
+
+You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core.

 ## Quick Start


assets/schema_input.json

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@
         "type": "array",
         "items": {
             "type": "string",
-            "pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\\d+)$",
-            "errorMessage": "Please provide a valid SRA, ENA, DDBJ or GEO identifier"
+            "pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(syn))(\\d+)$",
+            "errorMessage": "Please provide a valid SRA, ENA, DDBJ identifier"
         }
     }
 }
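
To see what the narrower schema pattern changes in practice, here is a minimal Python sketch that runs both patterns from the hunk above against the example identifiers listed in docs/usage.md. It assumes Python's `re` module treats this pattern the same way as the regex engine used by the schema validator; the helper name `accepted` is made up for illustration and is not part of the pipeline.

```python
import re

# Patterns copied from the schema diff (the JSON escape "\\d" is the regex \d).
OLD_PATTERN = re.compile(
    r"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\d+)$"
)
NEW_PATTERN = re.compile(
    r"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(syn))(\d+)$"
)


def accepted(identifier: str, pattern: re.Pattern) -> bool:
    """Return True if the whole identifier matches the given pattern."""
    return pattern.fullmatch(identifier) is not None


# Example identifiers taken from the docs/usage.md table in this commit.
examples = [
    "SRR11605097",
    "ERX4009132",
    "SAMD00114846",
    "PRJEB37513",
    "syn26240435",
    "GSM4432381",
    "GSE147507",
]

# Identifiers that passed the old schema but fail the new one.
rejected_now = [
    i for i in examples
    if accepted(i, OLD_PATTERN) and not accepted(i, NEW_PATTERN)
]
print(rejected_now)  # prints ['GSM4432381', 'GSE147507']
```

Only the two GEO accessions change status; every SRA, ENA, DDBJ and Synapse example is accepted by both patterns.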

bin/sra_ids_to_runinfo.py

Lines changed: 17 additions & 8 deletions
@@ -193,9 +193,11 @@ def is_valid(cls, identifier):
 class DatabaseResolver:
     """Define a service class for resolving various identifiers to experiments."""

-    _GEO_PREFIXES = {"GSE"}
+    _GEO_PREFIXES = {
+        "GSE",
+        "GSM"
+    }
     _SRA_PREFIXES = {
-        "GSM",
         "PRJNA",
         "SAMN",
         "SRR",
@@ -207,7 +209,9 @@ class DatabaseResolver:
         "PRJDB",
         "SAMD",
     }
-    _ENA_PREFIXES = {"ERR"}
+    _ENA_PREFIXES = {
+        "ERR"
+    }

     @classmethod
     def expand_identifier(cls, identifier):
@@ -246,13 +250,13 @@ def _content_check(cls, response, identifier):
     def _id_to_srx(cls, identifier):
         """Resolve the identifier to SRA experiments."""
         params = {
-            "save": "efetch",
+            "id": identifier,
             "db": "sra",
             "rettype": "runinfo",
-            "term": identifier,
+            "retmode": "text"
         }
         response = fetch_url(
-            f"https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?{urlencode(params)}"
+            f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
         )
         cls._content_check(response, identifier)
         return [row["Experiment"] for row in open_table(response, delimiter=",")]
@@ -261,9 +265,14 @@ def _id_to_srx(cls, identifier):
     def _gse_to_srx(cls, identifier):
         """Resolve the identifier to SRA experiments."""
         ids = []
-        params = {"acc": identifier, "targ": "gsm", "view": "data", "form": "text"}
+        params = {
+            "id": identifier,
+            "db": "gds",
+            "rettype": "runinfo",
+            "retmode": "text"
+        }
         response = fetch_url(
-            f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?{urlencode(params)}"
+            f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
         )
         cls._content_check(response, identifier)
         gsm_ids = [
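
The `_id_to_srx` change swaps the SRA trace endpoint for the E-utilities `efetch` endpoint and replaces the `save`/`term` parameters with `id`/`retmode`. As a rough illustration, the sketch below builds the new request URL with the same parameter names and order as the diff. `build_runinfo_url` is a hypothetical helper name; the script's own `fetch_url` and `open_table` are not reproduced, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint taken from the diff above.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_runinfo_url(identifier: str) -> str:
    """Build the efetch URL for SRA run info (hypothetical helper, no request sent)."""
    params = {
        "id": identifier,
        "db": "sra",
        "rettype": "runinfo",
        "retmode": "text",
    }
    # urlencode keeps dict insertion order, so the query string is deterministic.
    return f"{EFETCH_URL}?{urlencode(params)}"


print(build_runinfo_url("SRR11605097"))
# prints https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=SRR11605097&db=sra&rettype=runinfo&retmode=text
```

The `_gse_to_srx` hunk builds its URL the same way but with `"db": "gds"`.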

docs/output.md

Lines changed: 3 additions & 3 deletions
@@ -9,19 +9,19 @@ This document describes the output produced by the pipeline. The directories lis
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data depending on the type of ids provided:

 - Download FastQ files and create samplesheet from:
-  1. [SRA / ENA / DDBJ / GEO ids](#sra--ena--ddbj--geo-ids)
+  1. [SRA / ENA / DDBJ ids](#sra--ena--ddbj-ids)
   2. [Synapse ids](#synapse-ids)
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

 Please see the [usage documentation](https://nf-co.re/fetchngs/usage#introduction) for a list of supported public repository identifiers and how to provide them to the pipeline.

-### SRA / ENA / DDBJ / GEO ids
+### SRA / ENA / DDBJ ids

 <details markdown="1">
 <summary>Output files</summary>

 - `fastq/`
-  - `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ / GEO.
+  - `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ.
 - `fastq/md5/`
   - `*.md5`: Files containing `md5` sum for FastQ files downloaded from the ENA.
 - `samplesheet/`

docs/usage.md

Lines changed: 16 additions & 10 deletions
@@ -8,15 +8,15 @@

 The pipeline has been set-up to automatically download and process the raw FastQ files from both public and private repositories. Identifiers can be provided in a file, one-per-line via the `--input` parameter. Currently, the following types of example identifiers are supported:

-| `SRA`        | `ENA`        | `DDBJ`       | `GEO`      | `Synapse`   |
-| ------------ | ------------ | ------------ | ---------- | ----------- |
-| SRR11605097  | ERR4007730   | DRR171822    | GSM4432381 | syn26240435 |
-| SRX8171613   | ERX4009132   | DRX162434    | GSE147507  |             |
-| SRS6531847   | ERS4399630   | DRS090921    |            |             |
-| SAMN14689442 | SAMEA6638373 | SAMD00114846 |            |             |
-| SRP256957    | ERP120836    | DRP004793    |            |             |
-| SRA1068758   | ERA2420837   | DRA008156    |            |             |
-| PRJNA625551  | PRJEB37513   | PRJDB4176    |            |             |
+| `SRA`        | `ENA`        | `DDBJ`       | `Synapse`   |
+| ------------ | ------------ | ------------ | ----------- |
+| SRR11605097  | ERR4007730   | DRR171822    | syn26240435 |
+| SRX8171613   | ERX4009132   | DRX162434    |             |
+| SRS6531847   | ERS4399630   | DRS090921    |             |
+| SAMN14689442 | SAMEA6638373 | SAMD00114846 |             |
+| SRP256957    | ERP120836    | DRP004793    |             |
+| SRA1068758   | ERA2420837   | DRA008156    |             |
+| PRJNA625551  | PRJEB37513   | PRJDB4176    |             |

 ### SRR / ERR / DRR ids

@@ -55,7 +55,13 @@ The final sample information for the FastQ files used for samplesheet generation

 ### Samplesheet format

-As a bonus, the columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input) and the Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format). You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core. It is highly recommended that you double-check that all of the identifiers you defined using `--input` are represented in the samplesheet. Also, public databases don't reliably hold information such as strandedness information so you may need to amend these entries too if for example your samplesheet was created by providing `--nf_core_pipeline rnaseq`.
+As a bonus, the columns in the auto-created samplesheet can be tailored to be accepted out-of-the-box by selected nf-core pipelines, these currently include:
+
+- [nf-core/rnaseq](https://nf-co.re/rnaseq/usage#samplesheet-input)
+- Illumina processing mode of [nf-core/viralrecon](https://nf-co.re/viralrecon/usage#illumina-samplesheet-format)
+- [nf-core/taxprofiler](https://nf-co.re/nf-core/taxprofiler)
+
+You can use the `--nf_core_pipeline` parameter to customise this behaviour e.g. `--nf_core_pipeline rnaseq`. More pipelines will be supported in due course as we adopt and standardise samplesheet input across nf-core. It is highly recommended that you double-check that all of the identifiers you defined using `--input` are represented in the samplesheet. Also, public databases don't reliably hold information such as strandedness information so you may need to amend these entries too if for example your samplesheet was created by providing `--nf_core_pipeline rnaseq`.

 ### Bypass `FTP` data download


lib/WorkflowMain.groovy

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ class WorkflowMain {
             if (num_match == total_ids) {
                 is_sra = true
             } else {
-                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
+                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
                 System.exit(1)
             }
         }
@@ -129,7 +129,7 @@ class WorkflowMain {
             if (num_match == total_ids) {
                 is_synapse = true
             } else {
-                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
+                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
                 System.exit(1)
             }
         }

lib/WorkflowSra.groovy

Lines changed: 17 additions & 0 deletions
@@ -29,4 +29,21 @@ class WorkflowSra {
             " running nf-core/other pipelines.\n" +
             "==================================================================================="
     }
+
+    // Fail pipeline if input ids are from the GEO
+    public static void isGeoFail(ids, log) {
+        def pattern = /^(GS[EM])(\d+)$/
+        for (id in ids) {
+            if (id =~ pattern) {
+                log.error "===================================================================================\n" +
+                    " GEO id detected: ${id}\n" +
+                    " Support for GEO ids was dropped in v1.7 due to breaking changes in the NCBI API.\n" +
+                    " Please remove any GEO ids from the input samplesheet.\n\n" +
+                    " Please see:\n" +
+                    " https://github.com/nf-core/fetchngs/pull/102\n" +
+                    "==================================================================================="
+                System.exit(1)
+            }
+        }
+    }
 }
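
Because `isGeoFail` stops the run at the first GEO id, it can be handy to screen an id list before launching the pipeline. The following is a minimal Python sketch of such a pre-check, using the same `GS[EM]` + digits pattern as the Groovy code above. The function name `split_geo_ids` is an assumption for illustration, and unlike `isGeoFail` it collects every GEO id rather than exiting on the first one.

```python
import re

# Same pattern as in isGeoFail: GSE or GSM followed by digits only.
GEO_PATTERN = re.compile(r"(GS[EM])(\d+)")


def split_geo_ids(ids):
    """Split ids into (geo_ids, other_ids), preserving input order."""
    geo_ids, other_ids = [], []
    for identifier in ids:
        if GEO_PATTERN.fullmatch(identifier):
            geo_ids.append(identifier)
        else:
            other_ids.append(identifier)
    return geo_ids, other_ids


geo, other = split_geo_ids(["GSE147507", "SRR11605097", "GSM4432381"])
if geo:
    print(f"GEO ids detected: {', '.join(geo)}")
```

Any ids reported here would need to be replaced by their SRA run accessions, for example via the `SRA Run Selector` workaround described in the README.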

main.nf

Lines changed: 2 additions & 2 deletions
@@ -44,7 +44,7 @@ if (WorkflowMain.isSraId(ch_input, log)) {
 } else if (WorkflowMain.isSynapseId(ch_input, log)) {
     input_type = 'synapse'
 } else {
-    exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ / GEO or Synapse ids!'
+    exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ or Synapse ids!'
 }

 if (params.input_type == input_type) {
@@ -63,7 +63,7 @@ if (params.input_type == input_type) {
 workflow NFCORE_FETCHNGS {

     //
-    // WORKFLOW: Download FastQ files for SRA / ENA / DDBJ / GEO ids
+    // WORKFLOW: Download FastQ files for SRA / ENA / DDBJ ids
     //
     if (params.input_type == 'sra') {
         SRA ( ch_ids )

modules/local/sra_to_samplesheet.nf

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ process SRA_TO_SAMPLESHEET {
     if (pipeline) {
         if (pipeline == 'rnaseq') {
             pipeline_map << [ strandedness: 'unstranded' ]
+        } else if (pipeline == 'taxprofiler') {
+            pipeline_map << [ fasta: '' ]
         }
     }
     pipeline_map << meta_map
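
The effect of the new `taxprofiler` branch on the samplesheet columns can be sketched in Python. This is only an approximation of the Groovy logic in the hunk above: `build_row` is a hypothetical name, and the `base` columns (`sample`, `fastq_1`, `fastq_2`) are placeholders, since the initial contents of `pipeline_map` are not shown in this diff.

```python
from typing import Optional


def build_row(base: dict, meta_map: dict, pipeline: Optional[str] = None) -> dict:
    """Hypothetical mirror of the pipeline_map logic: base columns,
    then pipeline-specific columns, then metadata columns."""
    row = dict(base)
    if pipeline == "rnaseq":
        row["strandedness"] = "unstranded"
    elif pipeline == "taxprofiler":
        row["fasta"] = ""
    # Like Groovy's `<<` on a map, update() adds or overwrites entries.
    row.update(meta_map)
    return row


# Placeholder values for illustration only.
base = {"sample": "S1", "fastq_1": "S1_1.fastq.gz", "fastq_2": "S1_2.fastq.gz"}
meta = {"run_accession": "SRR11605097"}

print(list(build_row(base, meta, "taxprofiler")))
# prints ['sample', 'fastq_1', 'fastq_2', 'fasta', 'run_accession']
```

With `--nf_core_pipeline taxprofiler` the only extra column is an empty `fasta` field, in the same position where `rnaseq` gets its `strandedness` column.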
