# (PART\*) Importing Data from SRA {-}


```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# Quick Start: Importing a single file {#quick-start-sra}

In this example, we'll bring some metagenomic data into AnVIL.

This data comes from [this BioProject](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA904247), which collected soil samples to study bacterial communities in tallgrass prairie. Bacteria play an important role in this ecosystem, but their communities can be altered by disturbance, management, and the presence of herbivores.

We will bring this data into AnVIL from the **Sequence Read Archive**, or SRA. You can check out the [SRA website](https://www.ncbi.nlm.nih.gov/sra) to learn more:

> Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.

The SRA data corresponding to this project is located [here](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP409181&o=acc_s%3Aa).

```{r, fig.align='center', echo = FALSE, fig.alt= "Microbiome diversity has many beneficial properties for soil and plant health.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_217")
```

::: {.dictionary}
You might hear new terms for moving data around in the cloud. **Ingress** is when data comes to you, similar to downloading a file or receiving an email with an attachment. **Egress** is sending the data to another resource, similar to uploading or sending an attached file via email. There is no fee for ingressing data to AnVIL from SRA.
:::

## Clone Workspace

Clone the Workspace `https://anvil.terra.bio/#workspaces/anvil-outreach/SRA-data-on-AnVIL`.

For this demo, we have given the cloned Workspace the name `SRA-data-on-AnVIL-example`.

## Set Up Samples

Navigate to the WORKFLOWS tab and select the SRA_Fetch Workflow.

```{r, fig.align='center', echo = FALSE, fig.alt= "Workflows tab with SRA_Fetch.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g1f25a933000_0_0")
```

Select "Run workflow(s) with inputs defined by data table".

```{r, fig.align='center', echo = FALSE, fig.alt= "'Run workflow(s) with inputs defined by data table' has been selected.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g1f25a933000_0_10")
```

Set the "Select root entity type" to "sample" and click SELECT DATA.

```{r, fig.align='center', echo = FALSE, fig.alt= "Step 1 and 2 for setting up the Workflow.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_0")
```

On the Select Data popup, select only the first sample, `SRR22375322`, and click OK.

```{r, fig.align='center', echo = FALSE, fig.alt= "The first sample selected from the data table.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_8")
```

## Launch Workflow

Click on the space underneath "Attribute" and select `this.sample_id`.

```{r, fig.align='center', echo = FALSE, fig.alt= "'this.sample_id' must be selected under the Workflow Attribute", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_17")
```

Click SAVE.

```{r, fig.align='center', echo = FALSE, fig.alt= "The SAVE button is highlighted", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_26")
```

You are ready to launch the Workflow! Click RUN ANALYSIS.

```{r, fig.align='center', echo = FALSE, fig.alt= "The RUN ANALYSIS button is highlighted", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_34")
```

Voilà! Your Workflow is running.

::: {.notice}
Because the Workflow is happening in the cloud, you can close your browser or shut down your computer without interrupting the transfer.
:::

```{r, fig.align='center', echo = FALSE, fig.alt= "The Workflow status page describes submission statistics and job status", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_42")
```

## Check Workflow

Click on the JOB HISTORY tab. The job might take a few minutes to finish. Once it has completed, the job status should read "Done".

```{r, fig.align='center', echo = FALSE, fig.alt= "The check mark indicates the Workflow has completed successfully", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_50")
```

## Locate Data

Click on the DATA tab and click on the "sample" table on the left.

```{r, fig.align='center', echo = FALSE, fig.alt= "Navigate to the Files folder under the DATA tab", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_31")
```

You should now see the file associated with the first sample!

```{r, fig.align='center', echo = FALSE, fig.alt= "The imported file is now visible in the sample table", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_41")
```

## Summary

- Clone [Workspace](https://anvil.terra.bio/#workspaces/anvil-outreach/SRA-data-on-AnVIL)
- Go to the WORKFLOWS tab
- Select sample via data table ("Run workflow(s) with inputs defined by data table")
- Set the Attribute to `this.sample_id`
- SAVE and RUN ANALYSIS
- Go to DATA tab and click "sample" table to see file populated

# Multiple SRA files {#multiple-sra-files}

More than likely, you will be importing multiple files from SRA. Luckily, this is quite easy in AnVIL! In contrast to downloading files one at a time on your local computer, the SRA_Fetch Workflow imports files in parallel, so importing several files does not take substantially longer than importing one.

## Select Workflow Data

Navigate to the WORKFLOWS tab and select the SRA_Fetch Workflow.

```{r, fig.align='center', echo = FALSE, fig.alt= "Workflows tab with SRA_Fetch.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g1f25a933000_0_0")
```

Select "Run workflow(s) with inputs defined by data table".

```{r, fig.align='center', echo = FALSE, fig.alt= "'Run workflow(s) with inputs defined by data table' has been selected.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g1f25a933000_0_10")
```

Set the "Select root entity type" to "sample" and click SELECT DATA.

```{r, fig.align='center', echo = FALSE, fig.alt= "Step 1 and 2 for setting up the Workflow.", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208af248fb0_0_0")
```

Select the second through fifth samples and click OK on the bottom right.

```{r, fig.align='center', echo = FALSE, fig.alt= "Select multiple files from the sample table", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_54")
```

Ensure the "Attribute" is set to `this.sample_id` and click RUN ANALYSIS.

```{r, fig.align='center', echo = FALSE, fig.alt= "Confirm `this.sample_id` and click the RUN ANALYSIS button", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_64")
```

Click LAUNCH. You can close your browser or shut down your computer without interrupting the transfer.

```{r, fig.align='center', echo = FALSE, fig.alt= "Click the LAUNCH button; the 4 analyses being run is called out", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_73")
```

::: {.notice}
The Workflow knows that you probably want to parallelize the import of your SRA files. This means that each import is happening at the same time. Notice how this workflow with multiple samples actually launched 4 different jobs/analyses! This means that AnVIL can help you process lots of files much faster than working with them one by one.
:::

## Check Workflow

Click on the JOB HISTORY tab. Submissions are listed with the newest at the top. Once the jobs have finished, you should see that the status of the most recent submission is "Done".

```{r, fig.align='center', echo = FALSE, fig.alt= "An arrow pointing to 'Done' indicates the Workflow has completed successfully", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_83")
```


## Locate Data

Click on the DATA tab and click on the "sample" table on the left.

```{r, fig.align='center', echo = FALSE, fig.alt= "Navigate to the Files folder under the DATA tab", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_31")
```

You should now see the files associated with the second through fifth samples!

```{r, fig.align='center', echo = FALSE, fig.alt= "The imported files are now visible in the sample table", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_92")
```

## Summary

- Go to the WORKFLOWS tab
- Select **multiple** samples via data table ("Run workflow(s) with inputs defined by data table")
- Set the Attribute to `this.sample_id`
- Click RUN ANALYSIS, then LAUNCH
- Go to DATA tab and click "sample" table to see files populated

# Customize Samples

You will probably need to select different samples than the ones in this demo.

We've created another workspace, `SRA-data-on-AnVIL-example2`, to demonstrate how to upload your own sample IDs.

If you go to the DATA tab, you'll notice the same samples (ending in 22-26). These are here because data tables are copied when you clone a workspace. However, let's add a second set of samples ending in 27-31.

```{r, fig.align='center', echo = FALSE, fig.alt= "The 'sample' data table has been cloned from the original Workspace, including the sample IDs", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_101")
```

## Import Data

Click on IMPORT DATA and select "Upload TSV".

```{r, fig.align='center', echo = FALSE, fig.alt= "The IMPORT DATA button and 'Upload TSV' option", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_144")
```

This opens a popup that looks like this:

```{r, fig.align='center', echo = FALSE, fig.alt= "The popup is titled Import Data Table and has the option to click to select a .tsv file", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_153")
```

However, let's take a moment to get acquainted with the new file we'll be uploading.

## The `samples.tsv` File

First, download the samples file here. You might have to right-click and "Save as".

[Download `samples.tsv`](https://raw.githubusercontent.com/fhdsl/AnVIL_SRA_Data/main/samples.tsv)

Next, open the file on your local machine. This is what it might look like in Microsoft Excel:

```{r, fig.align='center', echo = FALSE, fig.alt= "The samples we want to import from SRA are listed in rows in `samples.tsv`", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_163")
```

::: {.notice}
The column header `entity:sample_id` is important. The `entity:` prefix is required, and the text between `entity:` and `_id` (here, `sample`) becomes the name of the data table. For example, if our header was `entity:reference_id`, a data table called "reference" would be created in AnVIL. If you don't want to risk overwriting anything in the original "sample" table, you can change the column header so that a new table is created. If you keep the same header, no data will be overwritten as long as none of the IDs match existing ones.
:::
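
If you would rather build the file in R than edit it by hand, a minimal sketch is shown below. It only illustrates the header convention described above. The run accessions are placeholder values assumed for this example, and the output path is a temporary file, so substitute your own IDs and location.

```r
# Placeholder run accessions, assumed for illustration; replace with your own.
sample_ids <- paste0("SRR223753", 27:31)

# One column whose header starts with "entity:" and ends with "_id".
# The text in between ("sample") is the name of the data table.
samples <- data.frame(sample_ids, stringsAsFactors = FALSE)
names(samples) <- "entity:sample_id"

# Write a tab-separated file without quotes or row names.
out_file <- file.path(tempdir(), "samples.tsv")
write.table(samples, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the result: the header line followed by one ID per line.
tsv_lines <- readLines(out_file)
print(tsv_lines)
```

Changing the header to something like `entity:reference_id` in the `names()` line would target a "reference" table instead.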

## Upload the TSV

Back on AnVIL, click to select a TSV file. This should be the `samples.tsv` file you just downloaded. You will see a warning about potentially overwriting existing entries. We know that none of the IDs in the new samples file overlap with the existing ones, so click START IMPORT JOB.

```{r, fig.align='center', echo = FALSE, fig.alt= "The warning is now visible on the popup and the START IMPORT JOB button is highlighted", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_171")
```
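
When you upload your own IDs, you may not know in advance whether they overlap with rows already in the table. A quick check along the following lines can help before you start the import. This is a sketch only: both ID vectors are typed in by hand and are assumed values for this demo, not read from AnVIL.

```r
# IDs already present in the "sample" table (assumed values for this demo).
existing_ids <- paste0("SRR223753", 22:26)

# IDs listed in the TSV you are about to upload (placeholder values).
new_ids <- paste0("SRR223753", 27:31)

# Any ID appearing in both vectors would have its row overwritten on import.
overlap <- intersect(existing_ids, new_ids)

if (length(overlap) == 0) {
  message("No overlapping IDs; the import should only add new rows.")
} else {
  message("These IDs already exist: ", paste(overlap, collapse = ", "))
}
```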

New samples have been added!

```{r, fig.align='center', echo = FALSE, fig.alt= "The new samples have been appended to the end of the samples data table", out.width = '100%'}
ottrpal::include_slide("https://docs.google.com/presentation/d/1l0P0gFpsPkYG7blqJ_5JyYYlztJFZDD39CnIB4svrY8/edit#slide=id.g208b8f790dc_23_179")
```

::: {.notice}
You can now proceed with running the Workflow as you did in the [Quick Start](#quick-start-sra) and [Multiple Files](#multiple-sra-files) sections.
:::

## Summary

- Go to the DATA tab
- Select IMPORT DATA and "Upload TSV"
- Add your custom file and click START IMPORT JOB

# Additional Resources