Commit 80bdaa8

Merge pull request #102 from drpatelh/api
Fix breaking changes introduced in the NCBI API
2 parents 9fbfd67 + 3696f40 commit 80bdaa8

12 files changed: +89 -31 lines changed

CHANGELOG.md

Lines changed: 14 additions & 1 deletion
@@ -3,11 +3,24 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unpublished Version / DEV]
+## [[1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] - 2022-07-01
+
+### :warning: Major enhancements
+
+Support for GEO ids has been dropped in this release due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).
+
+As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline:
+
+- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
+- Click `SRA Run Selector` at the bottom of the GEO accession page
+- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
+
+This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
 
 ### Enhancements & fixes
 
 - [#97](https://github.com/nf-core/fetchngs/pull/97) - Add support for generating nf-core/taxprofiler compatible samplesheets.
+- [#99](https://github.com/nf-core/fetchngs/issues/99) - SRA_IDS_TO_RUNINFO fails due to bad request
 - Add `enum` field for `--nf_core_pipeline` to parameter schema so only accept supported pipelines are accepted
 
 ## [[1.6](https://github.com/nf-core/fetchngs/releases/tag/1.6)] - 2022-05-17

README.md

Lines changed: 14 additions & 2 deletions
@@ -17,7 +17,7 @@
 
 ## Introduction
 
-**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
+**nf-core/fetchngs** is a bioinformatics pipeline to fetch metadata and raw FastQ files from both public and private databases. At present, the pipeline supports SRA / ENA / DDBJ / Synapse ids (see [usage docs](https://nf-co.re/fetchngs/usage#introduction)).
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
 
@@ -27,7 +27,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
 
 Via a single file of ids, provided one-per-line (see [example input file](https://raw.githubusercontent.com/nf-core/test-datasets/fetchngs/sra_ids_test.txt)) the pipeline performs the following steps:
 
-### SRA / ENA / DDBJ / GEO ids
+### SRA / ENA / DDBJ ids
 
 1. Resolve database ids back to appropriate experiment-level ids and to be compatible with the [ENA API](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access.html)
 2. Fetch extensive id metadata via ENA API
@@ -36,6 +36,18 @@ Via a single file of ids, provided one-per-line (see [example input file](https:
    - Otherwise use [`sra-tools`](https://github.com/ncbi/sra-tools) to download `.sra` files and convert them to FastQ
 4. Collate id metadata and paths to FastQ files in a single samplesheet
 
+### GEO ids
+
+Support for GEO ids was dropped in [[v1.7](https://github.com/nf-core/fetchngs/releases/tag/1.7)] due to breaking changes introduced in the NCBI API. For more detailed information please see [this PR](https://github.com/nf-core/fetchngs/pull/102).
+
+As a workaround, if you have a GEO accession you can directly download a text file containing the appropriate SRA ids to pass to the pipeline instead:
+
+- Search for your GEO accession on [GEO](https://www.ncbi.nlm.nih.gov/geo)
+- Click `SRA Run Selector` at the bottom of the GEO accession page
+- Select the desired samples in the `SRA Run Selector` and then download the `Accession List`
+
+This downloads a text file called `SRR_Acc_List.txt` that can be directly provided to the pipeline e.g. `--input SRR_Acc_List.txt`.
+
 ### Synapse ids
 
 1. Resolve Synapse directory ids to their corresponding FastQ files ids via the `synapse list` command.

assets/schema_input.json

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@
         "type": "array",
         "items": {
             "type": "string",
-            "pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(GS[EM])|(syn))(\\d+)$",
-            "errorMessage": "Please provide a valid SRA, ENA, DDBJ or GEO identifier"
+            "pattern": "^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(syn))(\\d+)$",
+            "errorMessage": "Please provide a valid SRA, ENA, DDBJ identifier"
         }
     }
 }
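
To illustrate the effect of the updated `pattern`, here is a minimal Python sketch. The regular expression is copied from the new schema line above; the helper name `is_supported_id` is hypothetical and is not part of the pipeline, which validates the schema through its own tooling.

```python
import re

# Pattern copied from the updated assets/schema_input.json (GEO prefixes removed).
ID_PATTERN = re.compile(
    r"^(((SR|ER|DR)[APRSX])|(SAM(N|EA|D))|(PRJ(NA|EB|DB))|(syn))(\d+)$"
)


def is_supported_id(identifier: str) -> bool:
    """Return True if the identifier matches the updated schema pattern."""
    return ID_PATTERN.match(identifier) is not None


if __name__ == "__main__":
    # Example ids taken from the docs/usage.md table in this commit.
    for example in ["SRR11605097", "PRJEB37513", "syn26240435", "GSE147507", "GSM4432381"]:
        print(example, is_supported_id(example))
```

With this pattern the two GEO examples (`GSE147507`, `GSM4432381`) no longer match, while the SRA, ENA, DDBJ and Synapse examples still do.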

bin/sra_ids_to_runinfo.py

Lines changed: 17 additions & 8 deletions
@@ -193,9 +193,11 @@ def is_valid(cls, identifier):
 class DatabaseResolver:
     """Define a service class for resolving various identifiers to experiments."""
 
-    _GEO_PREFIXES = {"GSE"}
+    _GEO_PREFIXES = {
+        "GSE",
+        "GSM"
+    }
     _SRA_PREFIXES = {
-        "GSM",
         "PRJNA",
         "SAMN",
         "SRR",
@@ -207,7 +209,9 @@ class DatabaseResolver:
         "PRJDB",
         "SAMD",
     }
-    _ENA_PREFIXES = {"ERR"}
+    _ENA_PREFIXES = {
+        "ERR"
+    }
 
     @classmethod
     def expand_identifier(cls, identifier):
@@ -246,13 +250,13 @@ def _content_check(cls, response, identifier):
     def _id_to_srx(cls, identifier):
         """Resolve the identifier to SRA experiments."""
         params = {
-            "save": "efetch",
+            "id": identifier,
             "db": "sra",
             "rettype": "runinfo",
-            "term": identifier,
+            "retmode": "text"
         }
         response = fetch_url(
-            f"https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?{urlencode(params)}"
+            f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
         )
         cls._content_check(response, identifier)
         return [row["Experiment"] for row in open_table(response, delimiter=",")]
@@ -261,9 +265,14 @@ def _id_to_srx(cls, identifier):
     def _gse_to_srx(cls, identifier):
         """Resolve the identifier to SRA experiments."""
         ids = []
-        params = {"acc": identifier, "targ": "gsm", "view": "data", "form": "text"}
+        params = {
+            "id": identifier,
+            "db": "gds",
+            "rettype": "runinfo",
+            "retmode": "text"
+        }
         response = fetch_url(
-            f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?{urlencode(params)}"
+            f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?{urlencode(params)}"
         )
         cls._content_check(response, identifier)
         gsm_ids = [
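
The request that the updated `_id_to_srx` now issues can be sketched in isolation as follows. This is only an illustration of how the query string is assembled with `urllib.parse.urlencode`; the helper `build_efetch_url` is hypothetical, the parameter values are those shown in the diff, and no network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the added lines of the diff.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(identifier: str, db: str = "sra") -> str:
    """Build an efetch URL with the parameters used in the diff (hypothetical helper)."""
    params = {
        "id": identifier,
        "db": db,
        "rettype": "runinfo",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_efetch_url("SRR11605097")
    print(url)
    # Round-trip the query string to inspect the individual parameters.
    print(parse_qs(urlsplit(url).query))
```

Compared with the removed `sra.cgi` call, the identifier is passed as `id` rather than `term`, and the `save` parameter is replaced by an explicit `retmode`.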

docs/output.md

Lines changed: 3 additions & 3 deletions
@@ -9,19 +9,19 @@ This document describes the output produced by the pipeline. The directories lis
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data depending on the type of ids provided:
 
 - Download FastQ files and create samplesheet from:
-  1. [SRA / ENA / DDBJ / GEO ids](#sra--ena--ddbj--geo-ids)
+  1. [SRA / ENA / DDBJ ids](#sra--ena--ddbj-ids)
   2. [Synapse ids](#synapse-ids)
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
 Please see the [usage documentation](https://nf-co.re/fetchngs/usage#introduction) for a list of supported public repository identifiers and how to provide them to the pipeline.
 
-### SRA / ENA / DDBJ / GEO ids
+### SRA / ENA / DDBJ ids
 
 <details markdown="1">
 <summary>Output files</summary>
 
 - `fastq/`
-  - `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ / GEO.
+  - `*.fastq.gz`: Paired-end/single-end reads downloaded from the SRA / ENA / DDBJ.
 - `fastq/md5/`
   - `*.md5`: Files containing `md5` sum for FastQ files downloaded from the ENA.
 - `samplesheet/`

docs/usage.md

Lines changed: 9 additions & 9 deletions
@@ -8,15 +8,15 @@
 
 The pipeline has been set-up to automatically download and process the raw FastQ files from both public and private repositories. Identifiers can be provided in a file, one-per-line via the `--input` parameter. Currently, the following types of example identifiers are supported:
 
-| `SRA`        | `ENA`        | `DDBJ`       | `GEO`      | `Synapse`   |
-| ------------ | ------------ | ------------ | ---------- | ----------- |
-| SRR11605097  | ERR4007730   | DRR171822    | GSM4432381 | syn26240435 |
-| SRX8171613   | ERX4009132   | DRX162434    | GSE147507  |             |
-| SRS6531847   | ERS4399630   | DRS090921    |            |             |
-| SAMN14689442 | SAMEA6638373 | SAMD00114846 |            |             |
-| SRP256957    | ERP120836    | DRP004793    |            |             |
-| SRA1068758   | ERA2420837   | DRA008156    |            |             |
-| PRJNA625551  | PRJEB37513   | PRJDB4176    |            |             |
+| `SRA`        | `ENA`        | `DDBJ`       | `Synapse`   |
+| ------------ | ------------ | ------------ | ----------- |
+| SRR11605097  | ERR4007730   | DRR171822    | syn26240435 |
+| SRX8171613   | ERX4009132   | DRX162434    |             |
+| SRS6531847   | ERS4399630   | DRS090921    |             |
+| SAMN14689442 | SAMEA6638373 | SAMD00114846 |             |
+| SRP256957    | ERP120836    | DRP004793    |             |
+| SRA1068758   | ERA2420837   | DRA008156    |             |
+| PRJNA625551  | PRJEB37513   | PRJDB4176    |             |
 
 ### SRR / ERR / DRR ids
 

lib/WorkflowMain.groovy

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ class WorkflowMain {
             if (num_match == total_ids) {
                 is_sra = true
             } else {
-                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
+                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
                 System.exit(1)
             }
         }
@@ -129,7 +129,7 @@ class WorkflowMain {
             if (num_match == total_ids) {
                 is_synapse = true
             } else {
-                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ / GEO or Synapse ids!"
+                log.error "Mixture of ids provided via --input: ${no_match_ids.join(', ')}\nPlease provide either SRA / ENA / DDBJ or Synapse ids!"
                 System.exit(1)
             }
         }

lib/WorkflowSra.groovy

Lines changed: 17 additions & 0 deletions
@@ -29,4 +29,21 @@ class WorkflowSra {
             " running nf-core/other pipelines.\n" +
             "==================================================================================="
     }
+
+    // Fail pipeline if input ids are from the GEO
+    public static void isGeoFail(ids, log) {
+        def pattern = /^(GS[EM])(\d+)$/
+        for (id in ids) {
+            if (id =~ pattern) {
+                log.error "===================================================================================\n" +
+                    " GEO id detected: ${id}\n" +
+                    " Support for GEO ids was dropped in v1.7 due to breaking changes in the NCBI API.\n" +
+                    " Please remove any GEO ids from the input samplesheet.\n\n" +
+                    " Please see:\n" +
+                    " https://github.com/nf-core/fetchngs/pull/102\n" +
+                    "==================================================================================="
+                System.exit(1)
+            }
+        }
+    }
 }
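
The check performed by `isGeoFail` can be mirrored outside the pipeline to pre-screen an id list before it is passed via `--input`. The following Python sketch reuses the same regular expression; the helper `find_geo_ids` is hypothetical, and unlike the Groovy method (which exits on the first GEO id it meets) it collects every offending id.

```python
import re

# Same expression as the Groovy `pattern` in isGeoFail.
GEO_PATTERN = re.compile(r"^(GS[EM])(\d+)$")


def find_geo_ids(ids):
    """Return the ids that look like GEO series (GSE) or sample (GSM) accessions."""
    return [i.strip() for i in ids if GEO_PATTERN.match(i.strip())]


if __name__ == "__main__":
    # Example ids taken from the docs/usage.md table in this commit.
    ids = ["SRR11605097", "GSE147507", "ERR4007730", "GSM4432381"]
    geo_ids = find_geo_ids(ids)
    if geo_ids:
        print("GEO ids detected (unsupported since v1.7): " + ", ".join(geo_ids))
```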

main.nf

Lines changed: 2 additions & 2 deletions
@@ -44,7 +44,7 @@ if (WorkflowMain.isSraId(ch_input, log)) {
 } else if (WorkflowMain.isSynapseId(ch_input, log)) {
     input_type = 'synapse'
 } else {
-    exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ / GEO or Synapse ids!'
+    exit 1, 'Ids provided via --input not recognised please make sure they are either SRA / ENA / DDBJ or Synapse ids!'
 }
 
 if (params.input_type == input_type) {
@@ -63,7 +63,7 @@ if (params.input_type == input_type) {
 workflow NFCORE_FETCHNGS {
 
     //
-    // WORKFLOW: Download FastQ files for SRA / ENA / DDBJ / GEO ids
+    // WORKFLOW: Download FastQ files for SRA / ENA / DDBJ ids
     //
     if (params.input_type == 'sra') {
         SRA ( ch_ids )

nextflow.config

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ manifest {
     description = 'Pipeline to fetch metadata and raw FastQ files from public databases'
     mainScript = 'main.nf'
     nextflowVersion = '!>=21.10.3'
-    version = '1.7dev'
+    version = '1.7'
 }
 
 // Load modules.config for DSL2 module specific options
