
Commit 3a0ac32

Merge pull request #12 from nextflow-io/abhinav/aws-athena-docs

Add docs for aws-athena integration

2 parents 8778cfa + fcfe68f

File tree: 2 files changed (+90, −25 lines)

README.md

Lines changed: 23 additions & 25 deletions

@@ -1,44 +1,43 @@
# SQL DB plugin for Nextflow

This plugin provides an extension to implement built-in support for SQL DB access and manipulation in Nextflow scripts.

It provides the ability to create a Nextflow channel from SQL queries and to populate database tables.
The current version provides out-of-the-box support for the following databases:

* [H2](https://www.h2database.com)
* [MySQL](https://www.mysql.com/)
* [MariaDB](https://mariadb.org/)
* [PostgreSQL](https://www.postgresql.org/)
* [SQLite](https://www.sqlite.org/index.html)
* [DuckDB](https://duckdb.org/)
* [AWS Athena](https://aws.amazon.com/athena/) (Setup guide [here](/docs/aws-athena.md))

NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.

This repository only holds plugin artefacts. Source code is available at this [link](https://github.com/nextflow-io/nextflow/tree/master/plugins/nf-sqldb).

## Get started

Make sure to have Nextflow `22.08.1-edge` or later. Add the following snippet to your `nextflow.config` file.

```
plugins {
    id 'nf-sqldb'
}
```

The above declaration allows the use of the SQL plugin functionalities in your Nextflow pipelines.
See the section below to configure the connection properties with a database instance.

## Configuration

The target database connection coordinates are specified in the `nextflow.config` file using the
`sql.db` scope. The following options are available:

| Config option | Description |
|--- |--- |
| `sql.db.'<DB-NAME>'.url` | The database connection URL based on the Java [JDBC standard](https://docs.oracle.com/javase/tutorial/jdbc/basics/connecting.html#db_connection_url).
| `sql.db.'<DB-NAME>'.driver` | The database driver class name (optional).
| `sql.db.'<DB-NAME>'.user` | The database connection user name.
| `sql.db.'<DB-NAME>'.password` | The database connection password.
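
For example, a `nextflow.config` fragment declaring a database handle named `foo` could look like the following (a sketch: the handle name and connection values are illustrative, not part of the plugin):

```
sql {
    db {
        foo {
            // illustrative MySQL coordinates; replace with your own
            url = 'jdbc:mysql://localhost:3306/demo'
            user = 'my-user'
            password = 'my-password'
        }
    }
}
```

The handle name (`foo` here) is what the `db` option of the operators below refers to.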
@@ -78,10 +77,10 @@ ch = channel.fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo')

The following options are available:

| Operator option | Description |
|--- |--- |
| `db` | The database handle. It must be a `sql.db` name defined in the `nextflow.config` file.
| `batchSize` | Performs the query in batches of the specified size. This is useful to avoid loading the complete result set in memory for queries returning a large number of entries. NOTE: this feature requires the underlying SQL database to support the `LIMIT` and `OFFSET` capability.
| `emitColumns` | When `true`, the column names in the `select` statement are emitted as the first tuple in the resulting channel.
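
As an illustration, the options can be combined as follows (the table and column names follow the example above; the `foo` handle is assumed to be defined under `sql.db` in `nextflow.config`):

```
include { fromQuery } from 'plugin/nf-sqldb'

// Stream the result set in batches of 100 rows and emit the
// column names as the first element of the channel
channel
    .fromQuery('select alpha, delta, omega from SAMPLE', db: 'foo', batchSize: 100, emitColumns: true)
    .view()
```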
### sqlInsert
@@ -111,8 +110,8 @@ NOTE: the target table (e.g. `SAMPLE` in the above example) must be created ahea

The following options are available:

| Operator option | Description |
|-------------------|--- |
| `db` | The database handle. It must be a `sql.db` name defined in the `nextflow.config` file.
| `into` | The database table name into which the data needs to be stored.
| `columns` | The database table column names to be filled with the channel data. The column names order and cardinality must match the tuple values emitted by the channel. The columns can be specified as a `List` object or a comma-separated value string.
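
A minimal usage sketch (the table, column names, and `foo` handle are illustrative; the target table must be created ahead of time):

```
include { sqlInsert } from 'plugin/nf-sqldb'

// Insert each emitted tuple as a row of the SAMPLE table;
// tuple order must match the declared columns
channel
    .of(['alpha', 1], ['delta', 2], ['omega', 3])
    .sqlInsert(into: 'SAMPLE', columns: 'NAME, VALUE', db: 'foo')
```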
@@ -146,17 +145,16 @@ To query this file in a Nextflow script use the following snippet:
  .view()
```

The `CSVREAD` function provided by the H2 database engine allows access to a CSV file in your computer's file system;
you can replace `test.csv` with a CSV file path of your choice. The `foo>=2` condition shows how to define a filtering
clause using conventional SQL `WHERE` constraints.
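
Pieced together, the full snippet might read as follows (a sketch; the `h2db` handle name is illustrative and assumes an H2 connection configured under `sql.db`):

```
include { fromQuery } from 'plugin/nf-sqldb'

// H2's CSVREAD treats the CSV file as a queryable table
channel
    .fromQuery("SELECT * FROM CSVREAD('test.csv') WHERE foo>=2", db: 'h2db')
    .view()
```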

## Important

This plugin is not expected to be used to store and access a pipeline's status in a synchronous manner during the pipeline
execution.

This means that if your script has a `sqlInsert` operation followed by a successive `fromQuery` operation, the query
may *not* contain previously inserted data due to the asynchronous nature of Nextflow operators.

The SQL support provided by this plugin is meant to be used to fetch DB data from a previous run or to populate DB tables
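
The caveat above can be sketched as follows (illustrative table, column, and handle names):

```
include { fromQuery; sqlInsert } from 'plugin/nf-sqldb'

// Insert rows into the SAMPLE table...
channel
    .of(['alpha', 1], ['delta', 2])
    .sqlInsert(into: 'SAMPLE', columns: 'NAME, VALUE', db: 'foo')

// ...and query the same table in the same script: the rows inserted
// above may NOT be visible here, because the two operators are
// launched concurrently by the dataflow runtime
channel
    .fromQuery('select NAME, VALUE from SAMPLE', db: 'foo')
    .view()
```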

docs/aws-athena.md

Lines changed: 67 additions & 0 deletions

@@ -0,0 +1,67 @@
# AWS Athena integration setup

## Pre-requisites

1. An AWS Athena S3 data-source
2. An AWS Glue crawler and database
3. The AWS Glue crawler has been run at least once to populate the tables in the database

## Usage

In the example below, it is assumed that the [NCBI SRA Metadata](https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/) has been used as the data source. You can refer to the [tutorial from NCBI](https://www.youtube.com/watch?v=_F4FhcDWSJg&ab_channel=TheNationalLibraryofMedicine) for setting up the AWS resources correctly.

### Configuration

```nextflow config
// NOTE: Replace the values in the config file as per your setup

params {
    aws_glue_db = "sra-glue-db"
    aws_glue_db_table = "metadata"
}

plugins {
    id 'nf-sqldb'
}

sql {
    db {
        athena {
            url = 'jdbc:awsathena://AwsRegion=<YOUR_AWS_REGION>;S3OutputLocation=<YOUR_S3_BUCKET>'
            user = '<YOUR_AWS_ACCESS_KEY>'
            password = '<YOUR_AWS_SECRET_KEY>'
        }
    }
}
```

### Pipeline

Once the configuration has been set up correctly, you can use it in the Nextflow code as shown below.

```nextflow
include { fromQuery } from 'plugin/nf-sqldb'

def sqlQuery = """
    SELECT *
    FROM \"${params.aws_glue_db}\".${params.aws_glue_db_table}
    WHERE organism = 'Mycobacterium tuberculosis'
    LIMIT 10;
    """

Channel.fromQuery(sqlQuery, db: 'athena')
    .view()
```

### Output

When you execute the above code, you'll see the AWS Athena query results on the console.

```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]
```
