README.md
+89 -17 (89 additions & 17 deletions)
Original file line number
Diff line number
Diff line change
@@ -1,4 +1,3 @@
1
-
2
1
<p align="center">
3
2
<img width="488" alt="Screenshot 2024-08-12 at 2 46 32 PM" src="https://github.com/user-attachments/assets/d3281745-1e08-4287-8c0b-207fe80a2c85">
4
3
<img width="722" alt="Screenshot 2024-08-12 at 11 35 07 AM" src="https://github.com/user-attachments/assets/f82ae7b8-71c2-490f-aedc-cb359e39762b">
@@ -8,6 +7,7 @@
8
7
</p>
9
8
10
9
# Table of Contents
10
+
11
11
- [About](#about)
12
12
- [Research Purpose](#research-purpose)
13
13
- [Introduction](#introduction)
@@ -32,12 +32,19 @@
32
32
- [Acknowledgements](#acknowledgements)
33
33
34
34
# About
35
+
35
36
Easily generate differential expression results from [SPARC](https://sparc.science) scRNA-seq data in a FAIR manner.
37
+
36
38
# Research Purpose
39
+
37
40
## Introduction
41
+
38
42
sPARcRNA_Viz is an **all-in-one gene expression visualization utility** that integrates with [o²S²PARC](https://osparc.io/). Using sPARcRNA_Viz, researchers can create an interactive t-SNE plot from single-cell RNA-sequencing data, as well as perform in silico GSEA to determine the most highly expressed genes. From these statistically significant genes, researchers can determine potential gene ontologies arising from their sample(s). In addition, the seamless integration of sPARcRNA_Viz with the o²S²PARC computing platform enables data accessibility concordant with **FAIR Data Principles**.
43
+
39
44
### Notable Features of sPARcRNA_Viz
45
+
40
46
sPARcRNA_Viz provides the user with the ability to fine-tune multiple **gene expression parameters**:
47
+
41
48
- Minimum number of cells expressing a gene
42
49
- Minimum number of features (genes) per cell
43
50
- Maximum number of features (genes) per cell
@@ -47,59 +54,94 @@ sPARcRNA_Viz provides the user with the ability to fine-tune multiple **gene exp
47
54
- Log fold-change threshold for FindAllMarkers
48
55
- Minimum gene set size for GSEA
49
56
- MSigDB category for GSEA
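
The cell and gene thresholds in the list above can be pictured with a small filtering sketch. This is only an illustration in plain Python, not the tool's actual R code: the function name `filter_counts`, the dictionary layout, the inclusive bounds, and the order in which the two filters run are all assumptions.

```python
from typing import Dict

# Hypothetical layout: cell barcode -> {gene name -> raw count}
Counts = Dict[str, Dict[str, int]]


def filter_counts(counts: Counts, min_cells: int,
                  min_features: int, max_features: int) -> Counts:
    """Drop rarely detected genes, then cells with too few or too many genes."""
    # Count the number of cells in which each gene is detected (count > 0).
    cells_per_gene: Dict[str, int] = {}
    for genes in counts.values():
        for gene, n in genes.items():
            if n > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, c in cells_per_gene.items() if c >= min_cells}

    # Keep cells whose number of detected (kept) genes lies within the bounds.
    filtered: Counts = {}
    for cell, genes in counts.items():
        detected = {g: n for g, n in genes.items() if n > 0 and g in kept_genes}
        if min_features <= len(detected) <= max_features:
            filtered[cell] = detected
    return filtered


demo = {
    "c1": {"A": 3, "B": 1},
    "c2": {"A": 2},
    "c3": {"A": 1, "B": 4, "C": 2},
}
result = filter_counts(demo, min_cells=2, min_features=2, max_features=5)
```

In this toy input, gene `C` is detected in only one cell and is dropped, and cell `c2` is removed because it has a single detected gene.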
57
+
50
58
### Technology Stack
59
+
51
60
- o²S²PARC
52
61
- R
53
62
- GNU Make
54
63
- Python3
55
64
- Docker
56
-
- Astro
57
65
- HTML
58
66
- JavaScript
59
67
- Tailwind CSS
68
+
60
69
## Background
70
+
61
71
In recent years, **single-cell RNA-sequencing** (scRNA-seq) has emerged as a preeminent method for the analysis of gene expression in biological tissue, providing researchers access to genetic data previously inaccessible. This is largely due to advancements in wet lab and dry lab techniques, as well as computing power; together, these improvements enable the collection of large datasets often spanning hundreds of millions of entries. With this newfound wealth of data, a need has arisen for high-efficiency bioinformatics pipelines and tools that allow for the analysis of scRNA-seq data. One computational method currently in use is **differential gene expression (DGE) analysis**, which identifies statistically significant genes (i.e., results that are minimally confounded by experimental errors) and determines the expression level of a gene relative to the entire dataset.<sup>2</sup> Using these statistically significant results, it is possible to correlate the most highly expressed genes to their tangible, biological effects through the use of **gene ontology** databases such as the [Gene Ontology Knowledgebase (GO)](https://www.geneontology.org/).
62
72
<br></br>
63
73
The SPARC Portal currently hosts a rich collection of scRNA-seq data across several different tissues and species. Therefore, the SPARC platform could be further enhanced by the inclusion of data visualization and the aforementioned DGE tools. This is achieved in sPARcRNA_Viz through the use of **t-SNE plotting** and **GSEA**.
74
+
64
75
### About t-SNE Plots
76
+
65
77
t-distributed Stochastic Neighbor Embedding (t-SNE) is a plotting and visualization technique that focuses on pairwise similarities among data points. Like PCA, it is a dimensionality reduction technique. Because it is useful for comparing large, complex datasets, t-SNE is commonly employed by RNA-seq researchers.
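
To make "pairwise similarities" concrete, the sketch below computes the Gaussian conditional affinities that t-SNE builds in the high-dimensional space. It is a simplified illustration: real t-SNE tunes a separate bandwidth per point from a perplexity target, whereas this version assumes one fixed `sigma`, and the function name `conditional_affinities` is hypothetical.

```python
import math
from typing import List, Sequence


def conditional_affinities(points: Sequence[Sequence[float]],
                           sigma: float = 1.0) -> List[List[float]]:
    """Return p[i][j], the similarity of point j to point i (each row sums to 1)."""
    n = len(points)
    p: List[List[float]] = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                # A point is never treated as its own neighbor.
                weights.append(0.0)
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-d2 / (2 * sigma ** 2)))
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
affinities = conditional_affinities(pts)
```

Nearby points receive most of the probability mass, so in `pts` the second point is far more similar to the first than the distant third point is.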
78
+
66
79
### About GSEA
80
+
67
81
Gene Set Enrichment Analysis (GSEA) is a popular technique for determining statistically significant genes, as well as those that are upregulated and downregulated.<sup>5</sup> This is achieved through a ranking system whereby genes are ordered by statistical significance.
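
The ranking idea can be sketched as the running-sum enrichment score described in the GSEA paper cited above. This is a minimal plain-Python illustration, not the code sPARcRNA_Viz runs: the function name `enrichment_score`, the input layout, and the default `weight=1.0` are assumptions, and no permutation-based significance test is included.

```python
from typing import List, Set, Tuple


def enrichment_score(ranked: List[Tuple[str, float]], gene_set: Set[str],
                     weight: float = 1.0) -> float:
    """Walk down a ranked gene list and return the largest deviation from zero.

    `ranked` must already be sorted from the highest to the lowest score.
    """
    hits = [abs(score) ** weight if gene in gene_set else 0.0
            for gene, score in ranked]
    hit_total = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for hit, (gene, _) in zip(hits, ranked):
        if gene in gene_set:
            running += hit / hit_total   # step up on a gene in the set
        else:
            running -= 1.0 / n_miss      # step down on a gene outside it
        if abs(running) > abs(best):
            best = running
    return best


ranked_genes = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0)]
es_top = enrichment_score(ranked_genes, {"A", "B"})
es_bottom = enrichment_score(ranked_genes, {"C", "D"})
```

A set concentrated at the top of the ranking yields a positive score, and a set concentrated at the bottom yields a negative one.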
68
82
69
83
## Current SPARC Portal Tools
84
+
70
85
As of 8/12/24, the [Transcriptomic_oSPARC](https://github.com/SPARC-FAIR-Codeathon/Transcriptomic_oSPARC) utility<sup>1</sup> appears to be the most prominent SPARC tool relating to the analysis of gene expression. This tool is very effective in displaying industry-standard static graphical outputs, which can prove quite useful to researchers. However, its current level of customization may be a limitation; it may be necessary to edit the code itself to change particular parameters. There was also a niche to explore in adding interactivity to the graphs, further enhancing the user experience.
86
+
71
87
## The Problem
88
+
72
89
The gene expression data in SPARC is somewhat limited and is in a raw data format, rendering it less interoperable. Our goal was to make it more interoperable and easy to use. Therefore, our team sought to create an RNA-seq visualization utility that supports the specification of **specific parameters**, as well as **interactivity**. There was also room for experimentation in predicting gene ontology with **GSEA**.
90
+
73
91
## Our Solution: sPARcRNA_Viz
92
+
74
93
To address this challenge, we present **sPARcRNA_Viz**, an scRNA-seq visualization tool for potential entry alongside Transcriptomic_oSPARC. In incorporating flexible parameters, interactivity, and an additional DEA metric, sPARcRNA_Viz will complement Transcriptomic_oSPARC as part of a growing SPARC gene expression toolkit.
75
94
76
95
# Using sPARcRNA_Viz
96
+
77
97
## sPARcRNA_Viz Requirements
98
+
78
99
- GNU Make
79
100
- Python3
80
-
- [``Docker``](https://docs.docker.com/get-docker/) (if you wish to build and test the service locally)
101
+
- [`Docker`](https://docs.docker.com/get-docker/) (if you wish to build and test the service locally)
102
+
81
103
### Required Input Format
104
+
82
105
sPARcRNA_Viz currently supports single-cell matrices in the following file formats: **.csv/.tsv** (barcode and feature files) and **.mtx** (matrix file), along with R data. All three files (barcodes, features, and matrix) are required to run the analysis successfully.
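
For readers unfamiliar with the **.mtx** format, the sketch below parses a small Matrix Market file in coordinate form. It is an illustration only: the function name `parse_matrix_market` is hypothetical, it handles only the plain `coordinate` layout with numeric values, and treating rows as features and columns as barcodes is an assumption about the input rather than something the tool is confirmed to require.

```python
from typing import Dict, Tuple


def parse_matrix_market(text: str) -> Tuple[int, int, Dict[Tuple[int, int], float]]:
    """Parse coordinate-format Matrix Market text into (rows, cols, entries)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing %%MatrixMarket header line")

    # Remaining lines that start with '%' are comments.
    body = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())

    entries: Dict[Tuple[int, int], float] = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        # The file uses 1-based indices; convert to 0-based.
        entries[(int(row) - 1, int(col) - 1)] = float(value)

    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


sample = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% rows are assumed to be features, columns barcodes\n"
    "3 2 2\n"
    "1 1 5\n"
    "3 2 7\n"
)
rows, cols, entries = parse_matrix_market(sample)
```

Only non-zero counts are stored, which is why the format suits sparse single-cell matrices.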
106
+
83
107
## sPARcRNA_Viz Pipeline Workflow
108
+
84
109
For details, refer to [PIPELINE.md](PIPELINE.md).
110
+
85
111
### 1. Setup
112
+
86
113
Load libraries, set options, validate and prepare the directories; find and read raw data files; configure based on inputs.
114
+
87
115
### 2. Create Seurat object
116
+
88
117
[Seurat](https://cran.r-project.org/web/packages/Seurat/index.html) is an R package specially designed for the quality control (QC), analysis, and exploration of single-cell RNA-seq data. Thus, it proved to be a suitable choice for the purposes of sPARcRNA_Viz.
118
+
89
119
### 3. Normalize and preprocess the data
120
+
90
121
Normalize (so that data reflects true biological differences); find variable features; scale (to standardize the data); perform PCA (Principal Component Analysis to reduce dimensionality); and cluster cells with similar profiles together.
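
The normalize and scale steps can be sketched in plain Python. The normalization below mirrors the log-normalization commonly used with Seurat (divide by the cell total, multiply by a scale factor, apply `log1p`). Whether sPARcRNA_Viz uses this method or the `10_000` scale factor is an assumption, and both function names are hypothetical.

```python
import math
import statistics
from typing import Dict, List


def log_normalize(cell_counts: Dict[str, int],
                  scale_factor: float = 10_000) -> Dict[str, float]:
    """Normalize one cell's raw counts by its total, then log-transform."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in cell_counts.items()}


def scale_gene(values: List[float]) -> List[float]:
    """Center one gene's values across cells and scale to unit variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


normalized = log_normalize({"A": 1, "B": 3})
scaled = scale_gene([1.0, 2.0, 3.0])
```

Normalizing per cell removes differences in sequencing depth, while scaling per gene keeps highly expressed genes from dominating the PCA that follows.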
122
+
91
123
### 4. t-SNE
124
+
92
125
t-SNE allows us to visualize statistically significant genes based on these clusters. From these, researchers can determine potential gene ontologies arising from their sample(s).
126
+
93
127
### 5. Differential Gene Expression Analysis
94
-
Differential gene expression analysis takes the normalized gene read counts and allows researchers to determine quantitative changes in gene expression.
128
+
129
+
Differential gene expression analysis takes the normalized gene read counts and allows researchers to determine quantitative changes in gene expression.
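
One common way to express such a quantitative change is a log2 fold change between a cluster and all other cells. The sketch below is a simplified definition for illustration and is not Seurat's exact computation; the function names, the pseudocount of `1.0`, and the example threshold of `0.25` are assumptions.

```python
import math
from typing import Sequence


def log2_fold_change(cluster_values: Sequence[float],
                     other_values: Sequence[float],
                     pseudocount: float = 1.0) -> float:
    """Log2 ratio of mean expression inside a cluster versus outside it."""
    mean_in = sum(cluster_values) / len(cluster_values)
    mean_out = sum(other_values) / len(other_values)
    # The pseudocount avoids taking the log of zero for unexpressed genes.
    return math.log2(mean_in + pseudocount) - math.log2(mean_out + pseudocount)


def passes_threshold(lfc: float, threshold: float) -> bool:
    """Keep a gene only if its absolute log fold change reaches the threshold."""
    return abs(lfc) >= threshold


lfc = log2_fold_change([3.0, 3.0], [1.0, 1.0])
keep = passes_threshold(lfc, threshold=0.25)
```

A value of 1 means the gene's pseudocount-adjusted mean expression is doubled in the cluster, so raising the log fold-change threshold keeps only genes with larger shifts.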
130
+
95
131
### 6. GSEA
132
+
96
133
GSEA aids in determining gene groups highly represented in the data.
134
+
97
135
### 7. Combine t-SNE and GSEA results
136
+
98
137
The GSEA results for every cluster are saved, along with the top pathways.
138
+
99
139
### 8. Export and Display Results
100
-
All values from the previous steps and top clusters, pathways, etc are saved in a Seurat object that is later visualized. The user can optionally convert this data into .csv file format.
140
+
141
+
All values from the previous steps, including the top clusters and pathways, are saved in a Seurat object that is later visualized. The user can optionally convert this data into .csv file format. The results are displayed in interactive charts using [d3](https://d3js.org/) and [ApexCharts](https://apexcharts.com/). All the results are fully accessible via the file system and do not require any additional software to view.
101
142
102
143
## Configuring sPARcRNA_Viz
144
+
103
145
sPARcRNA_Viz offers a variety of command options:
104
146
| Option | Description | Default |
105
147
| --- | --- | --- |
@@ -116,15 +158,21 @@ sPARcRNA_Viz offers a variety of command options:
116
158
|`--gsea_min_size`| Minimum gene set size for GSEA |`15`|
117
159
|`--gsea_max_size`| Maximum gene set size for GSEA |`500`|
118
160
|`--category`| MSigDB category for GSEA |`"H"`|
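
As an illustration of how the three GSEA options shown above and their defaults behave, here is a sketch using Python's `argparse`. It only models these three flags. How the actual service parses its options is not shown in this README, so the parser, the `build_parser` name, and the description text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three GSEA options listed above."""
    parser = argparse.ArgumentParser(description="GSEA options (illustrative only)")
    parser.add_argument("--gsea_min_size", type=int, default=15,
                        help="Minimum gene set size for GSEA")
    parser.add_argument("--gsea_max_size", type=int, default=500,
                        help="Maximum gene set size for GSEA")
    parser.add_argument("--category", default="H",
                        help="MSigDB category for GSEA")
    return parser


# Override one option and leave the others at their documented defaults.
args = build_parser().parse_args(["--gsea_min_size", "20"])
```

Any option that is not passed keeps the default from the table, so here `gsea_max_size` stays at `500` and `category` stays at `"H"`.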
161
+
119
162
## Tutorial
163
+
120
164
The [scRNA-seq data](https://sparc.science/datasets/220?type=dataset&datasetDetailsTab=files&path=files/derivative) used in the tutorial is from the SPARC Portal.
165
+
121
166
### 1. Log in to [o²S²PARC](https://osparc.io/)
167
+
122
168
<img width="500" alt="Screenshot 2024-08-12 at 9 36 07 PM" src="https://github.com/user-attachments/assets/5e295cbc-184a-42e9-b85b-8ad82bcc57a0">
123
169
124
170
### 2. Open a new Study
171
+
125
172
<img width="201" alt="Screenshot 2024-08-12 at 9 35 10 PM" src="https://github.com/user-attachments/assets/ced50573-fbf9-458f-8912-29374ad3c26f">
126
173
127
174
### 3. Add 3 File Picker Nodes and upload the required data
175
+
128
176
<img width="500" alt="Screenshot 2024-08-12 at 8 52 18 PM" src="https://github.com/user-attachments/assets/c565355d-8e57-43a1-aade-d653be2d2853">
@@ -134,20 +182,27 @@ The [scRNA-seq data](https://sparc.science/datasets/220?type=dataset&datasetDeta
134
182
(Alternatively, drag and drop the needed files into the workspace.)
135
183
136
184
### 4. Add sPARcRNA_Viz Node
185
+
137
186
<img width="500" alt="Screenshot 2024-08-12 at 9 33 34 PM" src="https://github.com/user-attachments/assets/43b92fb3-44e4-42ea-ada2-bbf2bc866528">
138
187
139
188
### 5. Connect the Nodes
189
+
140
190
<img width="500" alt="Screenshot 2024-08-12 at 9 34 08 PM" src="https://github.com/user-attachments/assets/f49395fe-d51b-4ec1-88f9-b4eae37ce0a9">
141
191
142
192
### 6. Optionally run outputs through JupyterLab R for further analysis
193
+
143
194
<img width="500" alt="Screenshot 2024-08-12 at 9 21 20 PM" src="https://github.com/user-attachments/assets/8c8b0833-90eb-4fea-af5e-13e2947b7d33">
144
195
145
196
## Future Vision
197
+
146
198
In the future, sPARcRNA_Viz could be expanded to include other interactive visualizations and API calls to other gene databases. This would provide more ways to analyze genes and integrate with other websites.
147
199
148
200
# FAIR-Centered Design
149
-
Perhaps the **most important** aspect of sPARcRNA_Viz is its emphasis on the FAIR Data Principles. Summarized below are highlight features of sPARcRNA_Viz supporting the FAIR initiative.
201
+
202
+
Perhaps the **most important** aspect of sPARcRNA_Viz is its emphasis on the FAIR Data Principles. Summarized below are the key features of sPARcRNA_Viz that support the FAIR initiative.
203
+
150
204
## Importance of FAIR Data Principles
205
+
151
206
<p align="left">
152
207
<img width="576" alt="Screenshot 2024-08-12 at 9 58 14 AM" src=https://github.com/user-attachments/assets/fc0112ba-ac4e-41fe-92ac-65e5339a6eb7>
153
208
</p>
@@ -157,32 +212,47 @@ FAIR data is that which is **F**indable, **A**ccessible, **I**nteroperable, and
157
212
Particularly in the case of scRNA-seq data, which is expensive from both a wet and dry lab standpoint, it is very useful to adhere to FAIR standards. For instance, one particularly common phenomenon with respect to scRNA-seq is **dropout**<sup>4</sup>, where portions of RNA are not captured by experimental techniques. scRNA-seq data can also be significantly varied with regard to format; often, differently-labeled matrices may contain raw counts data, or data that has been normalized by a method such as CPM, TPM, or RPKM/FPKM. The FAIR article cited on the SPARC website expands upon this idea further: the licensing of data can also pose a challenge for the analysis of gene regulation and expression. Therefore, the intentional **categorization and stewardship** of data can present a major benefit to transcriptomics researchers, propelling scientific progress.
158
213
159
214
### Summary of FAIR Principles Application
160
-
| FAIR Principle | Other Tools | sPARcRNA_Viz |
161
-
| --- | --- | --- |
162
-
|**F**indable | May not be connected to an existing database such as the SPARC Portal, which could hinder the findability of data | sPARcRNA_Viz is **connected to o²S²PARC**, so it can use the well-organized datasets provided on the SPARC portal, and it is archived on Zenodo with the appropriate metadata |
163
-
|**A**ccessible | May have a user interface that requires a programming background | sPARcRNA_Viz's **friendly user interface and visuals** allow researchers to quickly engage with data and is open, free and universally implementable |
164
-
|**I**nteroperable | May not allow for connections between datasets | Through its use of GSEA, sPARcRNA_Viz allows for the **meaningful connection of datasets**: scRNA-seq data can be used in association with gene ontology. In addition, visualizations generated for each dataset can be compared with each other |
165
-
|**R**eusable | May only support the formatting of one dataset | sPARcRNA_Viz be used with multiple datasets due to the ability to **specify parameters**. Likewise, sPARcRNA_Viz offers a security benefit through its use of **input validation**|
|**F**indable | May not be connected to an existing database such as the SPARC Portal, which could hinder the findability of data | sPARcRNA_Viz is **connected to o²S²PARC**, so it can use the well-organized datasets provided on the SPARC portal, and it is archived on Zenodo with the appropriate metadata |
219
+
|**A**ccessible | May have a user interface that requires a programming background | sPARcRNA_Viz's **friendly user interface and visuals** allow researchers to quickly engage with data, and the tool is open, free, and universally implementable |
220
+
|**I**nteroperable | May not allow for connections between datasets | Through its use of GSEA, sPARcRNA_Viz allows for the **meaningful connection of datasets**: scRNA-seq data can be used in association with gene ontology. In addition, visualizations generated for each dataset can be compared with each other |
221
+
|**R**eusable | May only support the formatting of one dataset | sPARcRNA_Viz can be used with multiple datasets due to the ability to **specify parameters**. Likewise, sPARcRNA_Viz offers a security benefit through its use of **input validation**|
222
+
223
+
To ensure that we are compliant with all the FAIR principles, we have also created a crosswalk between the FAIR principles and the sPARcRNA_Viz tool. This crosswalk can be found in the [CROSSWALK.md](CROSSWALK.md) file.
166
224
167
225
# Additional Information
226
+
168
227
## Issue Reporting
228
+
169
229
Please utilize the **Issues** tab of this repository should you encounter any problems with sPARcRNA_Viz.
230
+
170
231
## How to Contribute
232
+
171
233
Please fork this repository and submit a **Pull Request** to contribute.
234
+
172
235
## Cite Us
236
+
173
237
Please see our [citation](CITATION.cff).
238
+
174
239
## License
240
+
175
241
sPARcRNA_Viz is distributed under the [MIT License](LICENSE).
242
+
176
243
## Team
244
+
177
245
- Mihir Samdarshi (Lead, Sysadmin, Developer)
178
246
- Sanjay Soundarajan (Sysadmin, Developer)
179
247
- Mahitha Simhambhatla (Developer, Writer)
180
248
- Raina Patel (Writer)
181
249
- Ayla Bratton (Writer)
250
+
182
251
## Materials Cited
252
+
183
253
<a id="1">[1]</a>
184
-
Ben Aribi, H., Ding, M., & Kiran, A. (2023).
185
-
Gene expression data visualization tool on the o2S2PARC platform.
254
+
Ben Aribi, H., Ding, M., & Kiran, A. (2023).
255
+
Gene expression data visualization tool on the o2S2PARC platform.
Kim, T. H., Zhou, X., & Chen, M. (2020). Demystifying “drop-outs” in single-cell UMI data. Genome Biology, 21(1).
267
+
Kim, T. H., Zhou, X., & Chen, M. (2020). Demystifying “drop-outs” in single-cell UMI data. Genome Biology, 21(1).
198
268
https://doi.org/10.1186/s13059-020-02096-y <br />
199
269
<a id="5">[5]</a>
200
270
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., & Mesirov, J. P. (2005).
201
271
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
202
-
Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
272
+
Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
203
273
https://doi.org/10.1073/pnas.0506580102 <br />
204
274
<a id="6">[6]</a>
205
275
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., & Gonzalez-Beltran, A. (2016).
206
-
The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1).
276
+
The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1).