@ajlee21 ajlee21 commented Jan 22, 2021

This PR applies the validation previously performed on recount2 data to Pseudomonas data. Here we compare the gene rankings generated by SOPHIE against those from a manually curated dataset, GAPE.

The following changes were made:

  1. Added the 0_prepare_reference_gene_file.ipynb notebook, which processes the curated ANOVA results to get gene rankings. These are the rankings we compare our SOPHIE rankings against.
  2. Updated the existing 2_identify_generic_genes_pathways.ipynb notebook to compare SOPHIE rankings against the manually curated ones.
  3. Based on the validation results below, added notebooks to test a hypothesis (see below): 0_subset_training_compendium.ipynb creates a new training compendium, and 2_identify_generic_genes_pathways_pao1.ipynb performs the validation analysis on that new compendium. The code in the latter notebook is nearly identical to 2_identify_generic_genes_pathways.ipynb, so it doesn't need much review. Some custom edits were needed due to limitations in ponyo, which I have created an issue for.
  4. Updated the supporting functions to use a different reference. Previously the functions assumed the validation was only performed on the recount2 dataset, so some things were hard-coded.
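
For reviewers who want a quick picture of what "comparing rankings" means here, below is a minimal, dependency-free sketch. It is not the code in the notebooks: the function names, the per-gene score dictionaries, and the `high`/`low` thresholds are all hypothetical. It assumes each method gives every gene a numeric score where higher means more generic, converts those scores to percentiles over the shared genes, computes a Spearman correlation (Pearson correlation of the ranks), and flags genes that rank high in SOPHIE but low in the reference.

```python
from statistics import mean


def percentile_ranks(scores):
    """Map each gene to a 0-100 percentile; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    ranks = {}
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_pos = (i + j) / 2  # average 0-based position of the tie group
        for k in range(i, j + 1):
            ranks[ordered[k]] = 100 * avg_pos / (n - 1) if n > 1 else 0.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def compare_rankings(sophie_scores, reference_scores, high=80, low=50):
    """Return (Spearman rho, genes ranked high by SOPHIE but low by the reference).

    The thresholds are illustrative assumptions, not values from this PR.
    """
    shared = sorted(set(sophie_scores) & set(reference_scores))
    s = percentile_ranks({g: sophie_scores[g] for g in shared})
    r = percentile_ranks({g: reference_scores[g] for g in shared})
    rho = pearson([s[g] for g in shared], [r[g] for g in shared])
    discordant = [g for g in shared if s[g] >= high and r[g] < low]
    return rho, discordant
```

Genes returned in the second element correspond to the bottom right corner of a SOPHIE (x-axis) vs reference (y-axis) percentile plot.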

The main result is here:
image

  • Overall there is good consistency between SOPHIE and the reference set of experiments.
  • Genes that SOPHIE identifies as generic are not found to be as strongly generic by GAPE.
  • Genes that SOPHIE says are not generic, GAPE also says are not generic (genes that consistently didn’t change or genes that changed in a subset of cases).
  • There is some noise in the bottom right corner (i.e. genes that the reference didn't think were as generic but SOPHIE did). These might be the result of the reference being limited to 73 experiments and differences in data processing (ANOVA vs RMA).
  • One hypothesis is that the subset of genes that are found to be generic by SOPHIE are not generic using GAPE-curated experiments because SOPHIE is trained on a compendium containing multiple strains, whereas we suspect that the GAPE experiments are only from a single strain (PAO1). So genes that are generic in other strain contexts will not be detected by GAPE. To test this we filtered the SOPHIE training compendium to only include PAO1 strains.
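
The filtering step in the last bullet can be pictured with the small sketch below. It is only an illustration under stated assumptions: plain dictionaries stand in for the expression compendium and the sample metadata, and `subset_by_strain`, `expression`, and `sample_strain` are hypothetical names, not the ones used in 0_subset_training_compendium.ipynb.

```python
def subset_by_strain(expression, sample_strain, strain="PAO1"):
    """Keep only the samples annotated with the requested strain.

    expression:    {sample_id: {gene_id: expression_value}}  (hypothetical layout)
    sample_strain: {sample_id: strain_label}                 (hypothetical layout)
    Samples without a strain annotation are dropped.
    """
    keep = {sample for sample, label in sample_strain.items() if label == strain}
    return {sample: profile for sample, profile in expression.items() if sample in keep}


# Toy usage with made-up sample ids and values.
expression = {
    "s1": {"geneA": 5.1, "geneB": 2.0},
    "s2": {"geneA": 4.8, "geneB": 2.4},
    "s3": {"geneA": 6.0, "geneB": 1.1},
}
sample_strain = {"s1": "PAO1", "s2": "PA14", "s3": "PAO1"}
pao1_only = subset_by_strain(expression, sample_strain)
```

SOPHIE would then be retrained on the reduced compendium before repeating the comparison against GAPE.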

Using only PAO1 samples we get:
image

  • Compared to the correlation plot for SOPHIE trained on the full Pseudomonas compendium vs GAPE experiments here, the correlation plot above looks noisier. Looking up the experiment ids associated with GAPE here, it looks like GAPE does contain a mix of PAO1, PA14 and clinical strains. So subsetting our training compendium to represent only PAO1 patterns adds more noise to the bottom left of the plot, as expected (genes that GAPE finds generic are not found by SOPHIE due to the limited representation in our training compendium).

This inconsistency in genes found to be generic by SOPHIE and not by the manually curated set of experiments appears in this analysis using P. aeruginosa data and also human data. Some other hypotheses to test in the future include:

  1. For Pseudomonas, perhaps this is a difference in how the data was processed. This would require us to re-process the manually-curated experiments, which would take some time
  2. In humans, perhaps this is a difference in the platforms (RNAseq vs array). We would need to find an RNAseq compendium that has DE analysis uniformly processed and available.
  3. Perhaps the difference is due to context differences. Though this notebook compared genes using SOPHIE trained on cancer-specific contexts vs Crow et al. using mixed contexts, and there was noise in the middle but no distinct group on the bottom right.


ajlee21 commented Jan 22, 2021

Note: commits ccf38bb and earlier come from a merge conflict with #54 that I resolved, but which left my ajlee master branch out of sync with greenelab master. These commits are already merged into greenelab master but still show up in the history. They do not reflect the changes made in this PR.

@ajlee21 ajlee21 requested a review from ben-heil January 25, 2021 14:45
@ajlee21 ajlee21 marked this pull request as ready for review January 25, 2021 14:46

@ben-heil ben-heil left a comment

Looks good! Sorry for leaving so many comments, it's late (for me) so I'm more confused than usual

@ajlee21 ajlee21 merged commit e46f301 into greenelab:master Jan 27, 2021
@ajlee21 ajlee21 deleted the pa_validate branch January 27, 2021 19:09