Pseudomonas validation #57

ajlee21 · 2021-01-22T00:37:53Z

This PR applies the validation previously performed on recount2 data to Pseudomonas data. Here we compare the gene rankings generated by SOPHIE against those from a manually curated dataset, GAPE.

The following changes were made:

Added 0_prepare_reference_gene_file.ipynb notebook that processes the curated ANOVA results to get gene rankings. These gene rankings are what we compare our SOPHIE rankings against.
Updated the existing 2_identify_generic_genes_pathways.ipynb notebook to compare SOPHIE rankings vs the manually curated ones.
Based on the validation results seen below, we added additional notebooks to test a hypothesis (see below). These new notebooks include 0_subset_training_compendium.ipynb to create a new training compendium and 2_identify_generic_genes_pathways_pao1.ipynb to perform the validation analysis on the new training compendium. The code in this notebook is nearly identical to 2_identify_generic_genes_pathways.ipynb and so doesn't need much review. There were some custom edits that needed to be made due to limitations in ponyo, which I have created an issue for.
Updated the supporting functions to use a different reference. Previously the functions were written assuming the validation was only performed on the recount2 dataset, so some things were hard-coded.
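For reviewers who want a concrete picture of the ranking comparison, here is a minimal standard-library sketch of a Spearman rank correlation between two gene score tables. It is an illustration only, not the code in the notebooks: the function names, gene IDs, and scores are all hypothetical, and it assumes each method yields one numeric "genericness" score per gene.

```python
from statistics import mean


def average_ranks(scores):
    """Rank genes ascending by score (1 = smallest); ties share their average rank."""
    order = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of genes tied with the gene at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for gene in order[i:j + 1]:
            ranks[gene] = shared_rank
        i = j + 1
    return ranks


def spearman(sophie_scores, reference_scores):
    """Spearman correlation computed over genes present in both tables."""
    shared = sorted(set(sophie_scores) & set(reference_scores))
    if len(shared) < 2:
        raise ValueError("need at least two shared genes")
    a = average_ranks({g: sophie_scores[g] for g in shared})
    b = average_ranks({g: reference_scores[g] for g in shared})
    mean_a, mean_b = mean(a.values()), mean(b.values())
    cov = sum((a[g] - mean_a) * (b[g] - mean_b) for g in shared)
    var_a = sum((a[g] - mean_a) ** 2 for g in shared)
    var_b = sum((b[g] - mean_b) ** 2 for g in shared)
    # Undefined (division by zero) if every gene is tied in either table.
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores: higher means "more generic".
sophie = {"geneA": 0.9, "geneB": 0.5, "geneC": 0.1, "geneD": 0.7}
reference = {"geneA": 40, "geneB": 12, "geneC": 3, "geneE": 25}
rho = spearman(sophie, reference)  # uses only geneA, geneB, geneC
```

Restricting to shared genes matters here because the two sources need not cover the same gene set. In a notebook, `scipy.stats.spearmanr` on aligned columns would give the same statistic.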

The main result is here:

Overall there is good consistency between SOPHIE and the reference set of experiments.
Genes that SOPHIE says are generic, GAPE doesn't find as strongly generic
Genes that SOPHIE says are not generic, GAPE also says are not generic (genes that consistently didn’t change or genes that changed in a subset of cases).
There is some noise in the bottom right corner (i.e. genes that the reference didn't think were as generic but SOPHIE did). These might be the result of the reference being limited to 73 experiments and differences in data processing (ANOVA vs RMA).
One hypothesis is that the subset of genes that are found to be generic by SOPHIE are not generic using GAPE-curated experiments because SOPHIE is trained on a compendium containing multiple strains, whereas we suspect that the GAPE experiments are only from a single strain (PAO1). So genes that are generic in other strain contexts will not be detected by GAPE. To test this we filtered the SOPHIE training compendium to only include PAO1 strains.
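The filtering step described above can be sketched roughly as follows. This is a hypothetical illustration, not the contents of 0_subset_training_compendium.ipynb: it assumes the compendium is a mapping from sample ID to expression values and that a separate mapping gives each sample's strain annotation. All names and values are made up.

```python
def subset_to_strain(expression, sample_strain, strain="PAO1"):
    """Keep only samples whose strain annotation matches `strain`.

    expression:    dict of sample_id -> dict of gene_id -> expression value
    sample_strain: dict of sample_id -> strain label (may be missing or messy)
    """
    wanted = strain.strip().upper()
    return {
        sample: values
        for sample, values in expression.items()
        # Samples without an annotation are dropped rather than guessed.
        if sample_strain.get(sample, "").strip().upper() == wanted
    }


# Hypothetical compendium with three samples and two genes.
compendium = {
    "sample1": {"geneA": 7.1, "geneB": 3.2},
    "sample2": {"geneA": 6.8, "geneB": 3.9},
    "sample3": {"geneA": 5.5, "geneB": 4.4},
}
strains = {"sample1": "PAO1", "sample2": "PA14", "sample3": " pao1 "}

pao1_only = subset_to_strain(compendium, strains)
```

Normalizing case and whitespace before matching is a design choice for messy metadata. Dropping unannotated samples keeps the subset conservative but shrinks the training set further.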

Using only PAO1 samples we get:

Compared to the correlation plot comparing SOPHIE trained on the full pseudomonas compendium vs GAPE experiments here, it looks like there is more noise in the above correlation plot. Looking up the experiment ids associated with GAPE here, it looks like GAPE does contain a mix of PAO1, PA14 and clinical strains. So subsetting our training compendium to only represent PAO1 patterns adds more noise to the bottom left of the plot, as expected (genes that GAPE finds generic are not found by SOPHIE due to the limited representation in our training compendium).

This inconsistency in genes found to be generic by SOPHIE and not by the manually curated set of experiments appears in this analysis using P. aeruginosa data and also human data. Some other hypotheses to test in the future include:

For Pseudomonas, perhaps this is a difference in how the data was processed. This would require us to re-process the manually curated experiments, which would take some time.
In humans, perhaps this is a difference in the platforms (RNAseq vs array). We would need to find an RNAseq compendium that has DE analysis uniformly processed and available.
Perhaps the difference is due to context differences. Though this notebook compared genes using SOPHIE trained on cancer-specific contexts vs Crow et al. using mixed contexts, and there was noise in the middle but no distinct group on the bottom right.

ajlee21 · 2021-01-22T03:23:23Z

Note: commits ccf38bb and earlier come from a merge conflict with #54 that I resolved, but resolving it left my ajlee master branch out of sync with greenelab master. These commits are already merged into greenelab master but still show up in the history. They do not reflect the changes made in this PR.

ben-heil

Looks good! Sorry for leaving so many comments, it's late (for me) so I'm more confused than usual

generic_expression_patterns_modules/process.py

pseudomonas_analysis/0_prepare_reference_gene_file.ipynb

pseudomonas_analysis/2_identify_generic_genes_pathways_pao1.ipynb

ajlee21 added 29 commits January 8, 2021 09:33

create new directory for enrichment analysis

5422747

move files

e3473c3

update env to include gsva

88ac5d1

add GSVA method

f0cb337

update env to run ROAST

90f0fc0

update scripts and notebooks to run ROAST

76ea2d4

update scripts to use multiple enrichment methods and update test notebook

03ff480

fix error in test

0168256

fix assert statments

3f43242

fix file path for test

c6fd746

update gsa param based on updated scripts

a494ec3

add CAMERA method and start formating

6fbf1c2

add clusterProfiler to env

af2b5d5

update function and nb to run ORA

5849169

update comment about ORA output

cfd2f35

update enrichment scripts that were causing output errors

cda056f

update analysis notebook

713f0ea

format enrichment outputs

9e2d72f

plot summary ranking trend

27bd9fd

run enrichment analyses and add result files

4b00f8a

update comments

42553c6

update comments about takeaway and methods

6acf7f3

fix conflict

7b6533f

Merge branch 'ajlee21-add_enrich'

1328b51

fixed merge conflict

8f04367

Merge remote-tracking branch 'upstream/master'

ccf38bb

update scripts to allow for validation of other datasets

2be744a

add and update pa notebooks to validate against GAPE

d932409

fix one conflict message

8eecbf9

ajlee21 added 9 commits January 22, 2021 16:07

update compare gene rank output in notebook

270164e

update and verify pa analysis run

841ba56

add new notebook to create new training compendium with only pao1 samples

8700aed

update processing script to account for different training set shape

b402f45

update processing of template experiment that was incorrectly removed

dce8f2e

train new compendium

14ad494

add new nb and data files for new analysis using new training compendium

93ed15d

update correlation plot to use different labeling depending on reference

b6ce97f

update result figure and comments

dd5bfd7

ajlee21 requested a review from ben-heil January 25, 2021 14:45

ajlee21 marked this pull request as ready for review January 25, 2021 14:46

ben-heil approved these changes Jan 27, 2021

updated based on PR

2895eb9

ajlee21 merged commit e46f301 into greenelab:master Jan 27, 2021

ajlee21 deleted the pa_validate branch January 27, 2021 19:09
