Commit 13e4978

Merge pull request #12 from SPARC-FAIR-Codeathon/megasanjay-patch-1
chore: update README.md
2 parents 6e57b31 + 3fe523c commit 13e4978

1 file changed: +89 -17 lines changed

README.md

Lines changed: 89 additions & 17 deletions
@@ -1,4 +1,3 @@
<p align="center">
<img width="488" alt="Screenshot 2024-08-12 at 2 46 32 PM" src="https://github.com/user-attachments/assets/d3281745-1e08-4287-8c0b-207fe80a2c85">
<img width="722" alt="Screenshot 2024-08-12 at 11 35 07 AM" src="https://github.com/user-attachments/assets/f82ae7b8-71c2-490f-aedc-cb359e39762b">
@@ -8,6 +7,7 @@
</p>

# Table of Contents

- [About](#about)
- [Research Purpose](#research-purpose)
- [Introduction](#introduction)
@@ -32,12 +32,19 @@
- [Acknowledgements](#acknowledgements)

# About

Easily generate differential expression results from [SPARC](https://sparc.science) scRNA-seq data in a FAIR manner.

# Research Purpose

## Introduction

sPARcRNA_Viz is an **all-in-one gene expression visualization utility** that integrates with [o²S²PARC](https://osparc.io/). Using sPARcRNA_Viz, researchers can create an interactive t-SNE plot from single-cell RNA-sequencing data and perform in silico GSEA to determine the most highly expressed genes. From these statistically significant genes, researchers can determine potential gene ontologies arising from their sample(s). In addition, the seamless integration of sPARcRNA_Viz with the o²S²PARC computing platform enables data accessibility concordant with **FAIR Data Principles**.

### Notable Features of sPARcRNA_Viz

sPARcRNA_Viz provides the user with the ability to fine-tune multiple **gene expression parameters**:

- Minimum number of cells expressing a gene
- Minimum number of features (genes) per cell
- Maximum number of features (genes) per cell
@@ -47,59 +54,94 @@ sPARcRNA_Viz provides the user with the ability to fine-tune multiple **gene exp
- Log fold-change threshold for FindAllMarkers
- Minimum gene set size for GSEA
- MSigDB category for GSEA
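
To make the first three thresholds concrete, here is a minimal Python sketch of how such filters could be applied to a sparse count table. It is not the sPARcRNA_Viz implementation (which runs in R); the function name `filter_counts`, the dictionary layout, and the order of the two filtering passes are all assumptions made for illustration.

```python
from typing import Dict, Set

Counts = Dict[str, Dict[str, int]]  # gene -> {cell barcode -> raw count}


def filter_counts(counts: Counts, min_cells: int,
                  min_features: int, max_features: int) -> Counts:
    """Apply gene-level and cell-level QC thresholds (hypothetical helper)."""
    # Drop zero entries so that "detected" means count > 0.
    detected = {g: {c: n for c, n in cells.items() if n > 0}
                for g, cells in counts.items()}

    # Keep genes detected in at least `min_cells` cells.
    genes = {g: cells for g, cells in detected.items()
             if len(cells) >= min_cells}

    # Count how many of the kept genes each cell expresses.
    features_per_cell: Dict[str, int] = {}
    for cells in genes.values():
        for cell in cells:
            features_per_cell[cell] = features_per_cell.get(cell, 0) + 1

    # Keep cells whose feature count lies within [min_features, max_features].
    kept: Set[str] = {c for c, k in features_per_cell.items()
                      if min_features <= k <= max_features}
    return {g: {c: n for c, n in cells.items() if c in kept}
            for g, cells in genes.items()}


# Toy data: GeneB is seen in one cell only, cell3 expresses one gene only.
example = {
    "GeneA": {"cell1": 1, "cell2": 2, "cell3": 1},
    "GeneB": {"cell1": 3},
    "GeneC": {"cell1": 1, "cell2": 1},
}
filtered = filter_counts(example, min_cells=2, min_features=2, max_features=5)
```

In this toy example, `GeneB` is dropped by the minimum-cells filter and `cell3` by the minimum-features filter.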

### Technology Stack

- o²S²PARC
- R
- GNU Make
- Python3
- Docker
- HTML
- JavaScript
- Tailwind CSS

## Background

In recent years, **single-cell RNA-sequencing** (scRNA-seq) has emerged as a preeminent method for the analysis of gene expression in biological tissue, providing researchers access to genetic data previously inaccessible. This is largely due to advancements in wet lab and dry lab techniques, as well as computing power, which enable the collection of large datasets often spanning hundreds of millions of entries. With this newfound wealth of data, a need has arisen for high-efficiency bioinformatics pipelines and tools that allow for the analysis of scRNA-seq data. One computational method currently in use is **differential gene expression (DGE) analysis**, which identifies statistically significant genes (i.e., results that are minimally confounded by experimental errors) and determines the expression level of a gene relative to the entire dataset.<sup>2</sup> Using these statistically significant results, it is possible to correlate the most highly expressed genes to their tangible, biological effects through the use of **gene ontology** databases such as the [Gene Ontology Knowledgebase (GO)](https://www.geneontology.org/).
<br></br>
The SPARC Portal currently hosts a rich collection of scRNA-seq data across several different tissues and species. Therefore, the SPARC platform could be further enhanced by the inclusion of data visualization and the aforementioned DGE tools. This is achieved in sPARcRNA_Viz through the use of **t-SNE plotting** and **GSEA**.

### About t-SNE Plots

t-distributed Stochastic Neighbor Embedding (t-SNE) is a plotting and visualization technique that focuses on pairwise similarities among datasets. Like PCA, it is a dimensionality reduction technique. For its utility in comparing large, complex datasets, t-SNE is commonly employed by RNA-seq researchers.
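
As a rough illustration of what "pairwise similarities" means here, the sketch below turns Euclidean distances into row-normalized Gaussian similarities, the quantity t-SNE starts from. It is a simplification: real t-SNE tunes a separate bandwidth per point from a perplexity setting, whereas this example assumes one fixed `sigma`, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def conditional_similarities(points: Sequence[Sequence[float]],
                             sigma: float = 1.0) -> List[List[float]]:
    """Row i holds the similarity of every point j to point i (rows sum to 1)."""
    n = len(points)
    similarities: List[List[float]] = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is not its own neighbor
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(weights)
        similarities.append([w / total for w in weights])
    return similarities


# Two nearby points and one distant point.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
P = conditional_similarities(toy_points)
```

Nearby points receive most of the probability mass in each row, which is why cells with similar expression profiles end up close together in the final embedding.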

### About GSEA

Gene Set Enrichment Analysis (GSEA) is a popular technique for determining statistically significant genes, as well as those that are upregulated and downregulated.<sup>5</sup> This is achieved through a ranking system whereby genes are organized by statistical significance.
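
The ranking idea can be sketched with the unweighted running-sum statistic: walk down the ranked gene list, step up for each gene in the set and down for each gene outside it, and report the largest deviation from zero. This is a simplified illustration only. It omits the rank-metric weighting and permutation-based significance testing of full GSEA, and the function name is an assumption.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Unweighted running-sum enrichment score for one gene set."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up on a hit, down on a miss; both walks end at zero.
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # most to least significant
top_heavy = enrichment_score(ranked, {"g1", "g2"})
bottom_heavy = enrichment_score(ranked, {"g5", "g6"})
```

A set concentrated at the top of the ranking yields a positive score, and one concentrated at the bottom yields a negative score.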

## Current SPARC Portal Tools

As of 8/12/24, the [Transcriptomic_oSPARC](https://github.com/SPARC-FAIR-Codeathon/Transcriptomic_oSPARC) utility<sup>1</sup> appears to be the most prominent SPARC tool for the analysis of gene expression. This tool is very effective in displaying industry-standard static graphical outputs, which can prove quite useful to researchers. However, its customization is limited: changing particular parameters may require editing the code itself. There was also a niche to explore in adding interactivity to the graphs, further enhancing the user experience.

## The Problem

The gene expression data in SPARC is somewhat limited and is in a raw data format, rendering it less interoperable. Our goal was to make it more interoperable and easier to use. Therefore, our team sought to create an RNA-seq visualization utility that supports the specification of **specific parameters**, as well as **interactivity**. There was also room for experimentation in predicting gene ontology with **GSEA**.

## Our Solution: sPARcRNA_Viz

To address this challenge, we present **sPARcRNA_Viz**, an scRNA-seq visualization tool for potential entry alongside Transcriptomic_oSPARC. By incorporating flexible parameters, interactivity, and an additional differential expression analysis (DEA) metric, sPARcRNA_Viz will complement Transcriptomic_oSPARC as part of a growing SPARC gene expression toolkit.

# Using sPARcRNA_Viz

## sPARcRNA_Viz Requirements

- GNU Make
- Python3
- [`Docker`](https://docs.docker.com/get-docker/) (if you wish to build and test the service locally)

### Required Input Format

sPARcRNA_Viz currently supports the following file formats: **.csv/.tsv** (barcode and feature files) and **.mtx** (matrix file) for single-cell matrices, along with R data. All three files are required to run the analysis successfully.
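
For readers unfamiliar with the **.mtx** (Matrix Market) layout, the sketch below parses a tiny coordinate-format matrix and labels its entries with feature and barcode names. It is an illustrative stand-in, not the loader used by sPARcRNA_Viz: `read_mtx` is a hypothetical helper, the sample data is made up, and the convention that rows are features and columns are barcodes is an assumption.

```python
from typing import Dict, Iterable, Tuple

Entries = Dict[Tuple[int, int], float]


def read_mtx(lines: Iterable[str]) -> Tuple[int, int, Entries]:
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries)."""
    it = iter(lines)
    header = next(it, "")
    if not header.startswith("%%MatrixMarket") or "coordinate" not in header:
        raise ValueError("expected a MatrixMarket coordinate header")

    # Skip '%' comment lines; the first data line holds the dimensions.
    size_line = ""
    for line in it:
        if line.strip() and not line.startswith("%"):
            size_line = line
            break
    n_rows, n_cols, n_entries = (int(x) for x in size_line.split())

    entries: Entries = {}
    for line in it:
        parts = line.split()
        if len(parts) < 3:
            continue
        # Indices in the file are 1-based; store them 0-based.
        entries[(int(parts[0]) - 1, int(parts[1]) - 1)] = float(parts[2])
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries


mtx_text = """%%MatrixMarket matrix coordinate integer general
% toy example: 3 features x 2 barcodes
3 2 3
1 1 5
3 1 2
2 2 7
"""
features = ["GeneA", "GeneB", "GeneC"]  # one name per row
barcodes = ["AAAC", "AAAG"]             # one name per column
n_rows, n_cols, entries = read_mtx(mtx_text.splitlines())
by_name = {(features[r], barcodes[c]): v for (r, c), v in entries.items()}
```

The feature and barcode files supply the row and column names, which is why all three files are needed together.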

## sPARcRNA_Viz Pipeline Workflow

For details, refer to [PIPELINE.md](PIPELINE.md).

### 1. Setup

Load libraries, set options, validate and prepare the directories; find and read raw data files; configure based on inputs.

### 2. Create Seurat object

[Seurat](https://cran.r-project.org/web/packages/Seurat/index.html) is an R package specially designed for the quality control (QC), analysis, and exploration of single-cell RNA-seq data. Thus, it proved to be a suitable choice for the purposes of sPARcRNA_Viz.

### 3. Normalize and preprocess the data

Normalize (so that data reflects true biological differences); find variable features; scale (to standardize the data); perform PCA (Principal Component Analysis to reduce dimensionality); and cluster cells with similar profiles together.
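
The normalize and scale steps can be illustrated in a few lines of Python. This sketch assumes a log-normalization of the form log(1 + count / total × 10,000) per cell, followed by per-gene z-scoring. The README does not state which normalization method or scale factor the pipeline uses, so both, along with the function names, are assumptions.

```python
import math
import statistics
from typing import List, Sequence


def log_normalize(cell_counts: Sequence[float],
                  scale_factor: float = 10_000.0) -> List[float]:
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def z_score(values: Sequence[float]) -> List[float]:
    """Center one gene's values across cells and divide by the sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


normalized = log_normalize([1, 1, 2])   # one cell, three genes
scaled = z_score([1.0, 2.0, 3.0])       # one gene, three cells
```

Normalization removes differences in sequencing depth between cells, while scaling keeps highly expressed genes from dominating the PCA that follows.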

### 4. t-SNE

t-SNE allows us to visualize statistically significant genes based on these clusters. From these, researchers can determine potential gene ontologies arising from their sample(s).

### 5. Differential Gene Expression Analysis

Differential gene expression analysis takes the normalized gene read counts and allows researchers to determine quantitative changes in gene expression.
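
One common way to quantify such a change is a log2 fold-change between the cells of one cluster and all other cells, filtered by a threshold. The sketch below shows that idea only. It does not reproduce the exact computation or the statistical testing performed by `FindAllMarkers`, and the pseudocount of 1, the function names, and the data layout are assumptions.

```python
import math
from typing import Dict, Sequence, Set


def log2_fold_change(in_cluster: Sequence[float], rest: Sequence[float],
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of mean expression inside vs. outside a cluster."""
    mean_in = sum(in_cluster) / len(in_cluster)
    mean_rest = sum(rest) / len(rest)
    return math.log2((mean_in + pseudocount) / (mean_rest + pseudocount))


def candidate_markers(expression: Dict[str, Dict[str, float]],
                      cluster_cells: Set[str],
                      threshold: float) -> Dict[str, float]:
    """Return genes whose absolute log2 fold-change meets the threshold."""
    result: Dict[str, float] = {}
    for gene, by_cell in expression.items():
        inside = [v for c, v in by_cell.items() if c in cluster_cells]
        outside = [v for c, v in by_cell.items() if c not in cluster_cells]
        if not inside or not outside:
            continue  # cannot compare without cells on both sides
        lfc = log2_fold_change(inside, outside)
        if abs(lfc) >= threshold:
            result[gene] = lfc
    return result


expression = {
    "GeneA": {"c1": 3.0, "c2": 3.0, "c3": 0.0, "c4": 0.0},
    "GeneB": {"c1": 1.0, "c2": 1.0, "c3": 1.0, "c4": 1.0},
}
markers = candidate_markers(expression, {"c1", "c2"}, threshold=0.25)
```

Here `GeneA` is retained as a candidate marker for the cluster, while `GeneB`, which is expressed equally everywhere, is not.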

### 6. GSEA

GSEA aids in determining gene groups highly represented in the data.

### 7. Combine t-SNE and GSEA results

All the cluster results after running GSEA are saved, and the top pathways are saved as well.

### 8. Export and Display Results

All values from the previous steps, along with the top clusters, pathways, etc., are saved in a Seurat object that is later visualized. The user can optionally convert this data into .csv file format. The results are displayed in interactive charts using [d3](https://d3js.org/) and [ApexCharts](https://apexcharts.com/). All the results are fully accessible via the file system and do not require any additional software to view.

## Configuring sPARcRNA_Viz

sPARcRNA_Viz offers a variety of command options:
| Option | Description | Default |
| --- | --- | --- |
@@ -116,15 +158,21 @@ sPARcRNA_Viz offers a variety of command options:
| `--gsea_min_size` | Minimum gene set size for GSEA | `15` |
| `--gsea_max_size` | Maximum gene set size for GSEA | `500` |
| `--category` | MSigDB category for GSEA | `"H"` |
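
To show how options like these behave, here is a hypothetical Python `argparse` sketch covering only the three GSEA options visible in the table above, with the same defaults. It is not the tool's actual command-line entry point, and the min/max consistency check is an illustrative assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring three rows of the options table."""
    parser = argparse.ArgumentParser(description="GSEA options (illustrative)")
    parser.add_argument("--gsea_min_size", type=int, default=15,
                        help="Minimum gene set size for GSEA")
    parser.add_argument("--gsea_max_size", type=int, default=500,
                        help="Maximum gene set size for GSEA")
    parser.add_argument("--category", type=str, default="H",
                        help="MSigDB category for GSEA")
    return parser


def parse_options(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse options and reject an inconsistent size range."""
    args = build_parser().parse_args(argv)
    if args.gsea_min_size > args.gsea_max_size:
        raise ValueError("--gsea_min_size must not exceed --gsea_max_size")
    return args


defaults = parse_options([])
custom = parse_options(["--gsea_min_size", "20", "--category", "C2"])
```

Any option that is not supplied falls back to the default shown in the table.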

## Tutorial

The [scRNA-seq data](https://sparc.science/datasets/220?type=dataset&datasetDetailsTab=files&path=files/derivative) used in the tutorial is from the SPARC Portal.

### 1. Log in to [o²S²PARC](https://osparc.io/)

<img width="500" alt="Screenshot 2024-08-12 at 9 36 07 PM" src="https://github.com/user-attachments/assets/5e295cbc-184a-42e9-b85b-8ad82bcc57a0">

### 2. Open a new Study

<img width="201" alt="Screenshot 2024-08-12 at 9 35 10 PM" src="https://github.com/user-attachments/assets/ced50573-fbf9-458f-8912-29374ad3c26f">

### 3. Add 3 File Picker Nodes and upload the required data

<img width="500" alt="Screenshot 2024-08-12 at 8 52 18 PM" src="https://github.com/user-attachments/assets/c565355d-8e57-43a1-aade-d653be2d2853">
<br></br>
<img width="500" alt="image" src="https://github.com/user-attachments/assets/c992bcbb-c061-46f0-a611-28753686e0ed">
@@ -134,20 +182,27 @@ The [scRNA-seq data](https://sparc.science/datasets/220?type=dataset&datasetDeta
(Alternatively, drag and drop the needed files into the workspace.)

### 4. Add sPARcRNA_Viz Node

<img width="500" alt="Screenshot 2024-08-12 at 9 33 34 PM" src="https://github.com/user-attachments/assets/43b92fb3-44e4-42ea-ada2-bbf2bc866528">

### 5. Connect the Nodes

<img width="500" alt="Screenshot 2024-08-12 at 9 34 08 PM" src="https://github.com/user-attachments/assets/f49395fe-d51b-4ec1-88f9-b4eae37ce0a9">

### 6. Optionally run outputs through JupyterLab R for further analysis

<img width="500" alt="Screenshot 2024-08-12 at 9 21 20 PM" src="https://github.com/user-attachments/assets/8c8b0833-90eb-4fea-af5e-13e2947b7d33">

## Future Vision

In the future, sPARcRNA_Viz could be expanded to include other interactive visualizations and API calls to other gene databases. This would provide more ways to analyze genes and integrate with other websites.

# FAIR-Centered Design

Perhaps the **most important** aspect of sPARcRNA_Viz is its emphasis on the FAIR Data Principles. Summarized below are the key features of sPARcRNA_Viz that support the FAIR initiative.

## Importance of FAIR Data Principles

<p align="left">
<img width="576" alt="Screenshot 2024-08-12 at 9 58 14 AM" src=https://github.com/user-attachments/assets/fc0112ba-ac4e-41fe-92ac-65e5339a6eb7>
</p>
@@ -157,32 +212,47 @@ FAIR data is that which is **F**indable, **A**ccessible, **I**nteroperable, and
Particularly in the case of scRNA-seq data, which is expensive from both a wet and dry lab standpoint, it is very useful to adhere to FAIR standards. For instance, one particularly common phenomenon with respect to scRNA-seq is **dropout**<sup>4</sup>, where portions of RNA are not captured by experimental techniques. scRNA-seq data can also be significantly varied with regard to format; often, differently-labeled matrices may contain raw counts data, or data that has been normalized by a method such as CPM, TPM, or RPKM/FPKM. The FAIR article cited on the SPARC website expands upon this idea further: the licensing of data can also pose a challenge for the analysis of gene regulation and expression. Therefore, the intentional **categorization and stewardship** of data can present a major benefit to transcriptomics researchers, propelling scientific progress.
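
To illustrate why the normalization method matters when datasets are shared, the sketch below computes CPM (counts per million) and TPM (transcripts per million) for the same toy counts. The counts and gene lengths are made up for the example, and the function names are hypothetical.

```python
from typing import List, Sequence


def cpm(counts: Sequence[float]) -> List[float]:
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


def tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Transcripts per million: correct for gene length, then rescale."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1_000_000 for r in rates]


toy_counts = [10, 30, 60]
toy_lengths = [1000, 1000, 2000]  # hypothetical gene lengths in base pairs
cpm_values = cpm(toy_counts)
tpm_values = tpm(toy_counts, toy_lengths)
```

The third gene has the highest CPM, yet its TPM equals that of the second gene once its greater length is accounted for. A matrix that does not record which method was applied is therefore hard to reuse correctly.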

### Summary of FAIR Principles Application

| FAIR Principle | Other Tools | sPARcRNA_Viz |
| ----------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **F**indable | May not be connected to an existing database such as the SPARC Portal, which could hinder the findability of data | sPARcRNA_Viz is **connected to o²S²PARC**, so it can use the well-organized datasets provided on the SPARC portal, and it is archived on Zenodo with the appropriate metadata |
| **A**ccessible | May have a user interface that requires a programming background | sPARcRNA_Viz's **friendly user interface and visuals** allow researchers to quickly engage with data, and the tool is open, free, and universally implementable |
| **I**nteroperable | May not allow for connections between datasets | Through its use of GSEA, sPARcRNA_Viz allows for the **meaningful connection of datasets**: scRNA-seq data can be used in association with gene ontology. In addition, visualizations generated for each dataset can be compared with each other |
| **R**eusable | May only support the formatting of one dataset | sPARcRNA_Viz can be used with multiple datasets due to the ability to **specify parameters**. Likewise, sPARcRNA_Viz offers a security benefit through its use of **input validation** |

To ensure that we are compliant with all the FAIR principles, we have also created a crosswalk between the FAIR principles and the sPARcRNA_Viz tool. This crosswalk can be found in the [CROSSWALK.md](CROSSWALK.md) file.

# Additional Information

## Issue Reporting

Please utilize the **Issues** tab of this repository should you encounter any problems with sPARcRNA_Viz.

## How to Contribute

Please fork this repository and submit a **Pull Request** to contribute.

## Cite Us

Please see our [citation](CITATION.cff).

## License

sPARcRNA_Viz is distributed under the [MIT License](LICENSE).

## Team

- Mihir Samdarshi (Lead, Sysadmin, Developer)
- Sanjay Soundarajan (Sysadmin, Developer)
- Mahitha Simhambhatla (Developer, Writer)
- Raina Patel (Writer)
- Ayla Bratton (Writer)

## Materials Cited

<a id="1">[1]</a>
Ben Aribi, H., Ding, M., & Kiran, A. (2023).
Gene expression data visualization tool on the o2S2PARC platform.
F1000Research, 11, 1267.
https://www.pnas.org/doi/abs/10.1073/pnas.0506580102<br />
<a id="2">[2]</a>
@@ -194,18 +264,20 @@ GO FAIR.(2017).
FAIR Principles - GO FAIR. GO FAIR.
https://www.go-fair.org/fair-principles/<br />
<a id="4">[4]</a>
Kim, T. H., Zhou, X., & Chen, M. (2020). Demystifying “drop-outs” in single-cell UMI data. Genome Biology, 21(1).
https://doi.org/10.1186/s13059-020-02096-y <br />
<a id="5">[5]</a>
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., & Mesirov, J. P. (2005).
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
https://doi.org/10.1073/pnas.0506580102 <br />
<a id="6">[6]</a>
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., & Gonzalez-Beltran, A. (2016).
The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data, 3(1).
https://www.nature.com/articles/sdata201618 <br />
<br></br>
Logo and figures were created using Microsoft Word; images were formatted using Canva.

## Acknowledgements

We would like to thank the SPARC Codeathon 2024 team for all their guidance and support.
