Skip to content
This repository was archived by the owner on Jul 20, 2021. It is now read-only.

Commit 282f1ae

Browse files
author
Greg Caporaso
authored
added dynamic tree visualization (#314)
1 parent 48ba00a commit 282f1ae

File tree

2 files changed

+63
-35
lines changed

2 files changed

+63
-35
lines changed

book/applications/biological-diversity.md

Lines changed: 63 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -144,7 +144,34 @@ We could compute this in python as follows:
144144

145145
Imagine that we have the same table, but some additional information about the OTUs in the table. Specifically, we've computed the following phylogenetic tree. And, for the sake of illustration, imagine that we've also assigned taxonomy to each of the OTUs and found that our samples contain representatives from the archaea, bacteria, and eukaryotes (their labels begin with `A`, `B`, and `E`, respectively).
146146

147-
<img src="https://raw.githubusercontent.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/book/applications/images/pd_calc_tree.png">
147+
First, let's define a phylogenetic tree using the Newick format (which is described [here](http://evolution.genetics.washington.edu/phylip/newicktree.html), and more formally defined [here](http://evolution.genetics.washington.edu/phylip/newick_doc.html)). We'll then load that up using [scikit-bio](http://scikit-bio.org)'s [TreeNode](http://scikit-bio.org/generated/skbio.core.tree.TreeNode.html#skbio.core.tree.TreeNode) object, and visualize it with [ete3](http://etetoolkit.org).
148+
149+
```python
150+
>>> import ete3
151+
>>> ts = ete3.TreeStyle()
152+
>>> ts.show_leaf_name = True
153+
>>> ts.scale = 250
154+
>>> ts.branch_vertical_margin = 15
155+
>>> ts.show_branch_length = True
156+
```
157+
158+
```python
159+
>>> from io import StringIO
160+
>>> newick_tree = StringIO('((B1:0.2,B2:0.3):0.3,((B3:0.5,B4:0.3):0.2,B5:0.9):0.3,'
161+
... '((A1:0.2,A2:0.3):0.3,'
162+
... ' (E1:0.3,E2:0.4):0.7):0.55);')
163+
...
164+
>>> from skbio.tree import TreeNode
165+
...
166+
>>> tree = TreeNode.read(newick_tree)
167+
>>> tree = tree.root_at_midpoint()
168+
```
169+
170+
```python
171+
>>> t = ete3.Tree.from_skbio(tree, map_attributes=["value"])
172+
>>> t.render("%%inline", tree_style=ts)
173+
<IPython.core.display.Image object>
174+
```
148175

149176
Pairing this with the table we defined above (displayed again in the cell below), given what you now know about these OTUs, which would you consider the most diverse? Are you happy with the $\alpha$ diversity conclusion that you obtained when computing the number of observed OTUs in each sample?
150177

@@ -166,18 +193,6 @@ Phylogenetic Diversity (PD) is a metric that was developed by Dan Faith in the e
166193

167194
PD is relatively simple to calculate. It is computed simply as the sum of the branch length in a phylogenetic tree that is "covered" or represented in a given sample. Let's look at an example to see how this works.
168195

169-
First, let's define a phylogenetic tree using the Newick format (which is described [here](http://evolution.genetics.washington.edu/phylip/newicktree.html), and more formally defined [here](http://evolution.genetics.washington.edu/phylip/newick_doc.html)). We'll then load that up using [scikit-bio](http://scikit-bio.org)'s [TreeNode](http://scikit-bio.org/generated/skbio.core.tree.TreeNode.html#skbio.core.tree.TreeNode) object.
170-
171-
```python
172-
>>> from io import StringIO
173-
>>> newick_tree = StringIO('(((B1:0.2,B2:0.3):0.3,((B3:0.5,B4:0.3):0.2,B5:0.9):0.3):0.35,(((A1:0.2,A2:0.3):0.3,(E1:0.3,E2:0.4):0.7):0.2):0.05)root;')
174-
...
175-
>>> from skbio.tree import TreeNode
176-
...
177-
>>> tree = TreeNode.read(newick_tree)
178-
>>> tree = tree.root_at_midpoint()
179-
```
180-
181196
I'll now define a couple of functions that we'll use to compute PD.
182197

183198
```python
@@ -213,29 +228,29 @@ And then apply those to compute the PD of our three samples. For each computatio
213228
```python
214229
>>> pd_A = phylogenetic_diversity(tree, table2, 'A', verbose=True)
215230
>>> print(pd_A)
216-
B1 0.2 0.3 0.25
217-
B2 0.3
218-
B3 0.5 0.2 0.3
219-
2.05
231+
B1 0.2 0.3 0.2250000000000001
232+
B2 0.3
233+
B3 0.5 0.2 0.3
234+
2.0250000000000004
220235
```
221236

222237
```python
223238
>>> pd_B = phylogenetic_diversity(tree, table2, 'B', verbose=True)
224239
>>> print(pd_B)
225-
B1 0.2 0.3 0.25
226-
B2 0.3
227-
B3 0.5 0.2 0.3
228-
B4 0.3
229-
2.35
240+
B1 0.2 0.3 0.2250000000000001
241+
B2 0.3
242+
B3 0.5 0.2 0.3
243+
B4 0.3
244+
2.325
230245
```
231246

232247
```python
233248
>>> pd_C = phylogenetic_diversity(tree, table2, 'C', verbose=True)
234249
>>> print(pd_C)
235-
B1 0.2 0.3 0.25
236-
A1 0.2 0.3 0.2 0.05 0.1
237-
E2 0.4 0.7
238-
2.7
250+
B1 0.2 0.3 0.2250000000000001
251+
A1 0.2 0.3 0.32499999999999996
252+
E2 0.4 0.7
253+
2.65
239254
```
240255

241256
How does this result compare to what we observed above with the Observed OTUs metric? Based on your knowledge of biology, which do you think is a better representation of the relative diversities of these samples?
@@ -596,10 +611,7 @@ First, let's look at the analysis presented in panels E and F. Instead of genera
596611
>>> ax.set_xticklabels(['same body habitat', 'different body habitat'])
597612
>>> ax.set_ylabel('Unweighted UniFrac Distance')
598613
>>> _ = ax.set_ylim(0.0, 1.0)
599-
/Users/caporaso/miniconda3/envs/iab/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
600-
warnings.warn(self.msg_depr % (key, alt_key))
601-
/Users/caporaso/miniconda3/envs/iab/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
602-
warnings.warn(self.msg_depr % (key, alt_key))
614+
<Figure size 432x288 with 1 Axes>
603615
```
604616

605617
```python
@@ -610,7 +622,7 @@ test statistic name R
610622
sample size 6
611623
number of groups 3
612624
test statistic 1
613-
p-value 0.054
625+
p-value 0.065
614626
number of permutations 999
615627
Name: ANOSIM results, dtype: object
616628
```
@@ -630,8 +642,7 @@ If we run through these same steps, but base our analysis on a different metadat
630642
>>> ax.set_xticklabels(['same person', 'different person'])
631643
>>> ax.set_ylabel('Unweighted UniFrac Distance')
632644
>>> _ = ax.set_ylim(0.0, 1.0)
633-
/Users/caporaso/miniconda3/envs/iab/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
634-
warnings.warn(self.msg_depr % (key, alt_key))
645+
<Figure size 432x288 with 1 Axes>
635646
```
636647

637648
```python
@@ -641,7 +652,7 @@ test statistic name R
641652
sample size 6
642653
number of groups 2
643654
test statistic -0.333333
644-
p-value 0.889
655+
p-value 0.869
645656
number of permutations 999
646657
Name: ANOSIM results, dtype: object
647658
```
@@ -659,6 +670,7 @@ Next, let's look at a hierarchical clustering analysis, similar to that presente
659670
>>> lm = average(human_microbiome_dm.condensed_form())
660671
>>> d = dendrogram(lm, labels=human_microbiome_dm.ids, orientation='right',
661672
... link_color_func=lambda x: 'black')
673+
<Figure size 432x288 with 1 Axes>
662674
```
663675

664676
Again, we can see how the data really only becomes interpretable in the context of metadata:
@@ -667,12 +679,14 @@ Again, we can see how the data really only becomes interpretable in the context
667679
>>> labels = [human_microbiome_sample_md['body site'][sid] for sid in sample_ids]
668680
>>> d = dendrogram(lm, labels=labels, orientation='right',
669681
... link_color_func=lambda x: 'black')
682+
<Figure size 432x288 with 1 Axes>
670683
```
671684

672685
```python
673686
>>> labels = [human_microbiome_sample_md['individual'][sid] for sid in sample_ids]
674687
>>> d = dendrogram(lm, labels=labels, orientation='right',
675688
... link_color_func=lambda x: 'black')
689+
<Figure size 432x288 with 1 Axes>
676690
```
677691

678692
### Ordination <link src='b1cdbe'/>
@@ -762,6 +776,7 @@ F 0.737954545455
762776
```python
763777
>>> from pylab import scatter
764778
>>> ord_plot = scatter(a1_values, a2_values, s=40)
779+
<Figure size 432x288 with 1 Axes>
765780
```
766781

767782
And again, let's look at how including metadata helps us to interpret our results.
@@ -772,6 +787,7 @@ First, we'll color the points by the body habitat that they're derived from:
772787
>>> colors = {'tongue': 'red', 'gut':'yellow', 'skin':'blue'}
773788
>>> c = [colors[human_microbiome_sample_md['body site'][e]] for e in human_microbiome_dm.ids]
774789
>>> ord_plot = scatter(a1_values, a2_values, s=40, c=c)
790+
<Figure size 432x288 with 1 Axes>
775791
```
776792

777793
And next we'll color the samples by the person that they're derived from. Notice that this plot and the one above are identical except for coloring. Think about how the colors (and therefore the sample metadata) help you to interpret these plots.
@@ -780,6 +796,7 @@ And next we'll color the samples by the person that they're derived from. Notice
780796
>>> person_colors = {'subject 1': 'red', 'subject 2':'yellow'}
781797
>>> person_c = [person_colors[human_microbiome_sample_md['individual'][e]] for e in human_microbiome_dm.ids]
782798
>>> ord_plot = scatter(a1_values, a2_values, s=40, c=person_c)
799+
<Figure size 432x288 with 1 Axes>
783800
```
784801

785802
#### Determining the most important axes in polar ordination <link src='fb483b'/>
@@ -822,14 +839,17 @@ So why do we care about axes being uncorrelated? And why do we care about explai
822839

823840
```python
824841
>>> ord_plot = scatter(data[0][4], data[1][4], s=40, c=c)
842+
<Figure size 432x288 with 1 Axes>
825843
```
826844

827845
```python
828846
>>> ord_plot = scatter(data[0][4], data[13][4], s=40, c=c)
847+
<Figure size 432x288 with 1 Axes>
829848
```
830849

831850
```python
832851
>>> ord_plot = scatter(data[0][4], data[14][4], s=40, c=c)
852+
<Figure size 432x288 with 1 Axes>
833853
```
834854

835855
#### Interpreting ordination plots <link src='40e0a6'/>
@@ -848,10 +868,12 @@ One thing that you may have notices as you computed the polar ordination above i
848868

849869
```python
850870
>>> ord_plot = scatter(a1_values, a2_values, s=40, c=c)
871+
<Figure size 432x288 with 1 Axes>
851872
```
852873

853874
```python
854875
>>> ord_plot = scatter(alt_a1_values, a2_values, s=40, c=c)
876+
<Figure size 432x288 with 1 Axes>
855877
```
856878

857879
Some other important features:
@@ -950,6 +972,7 @@ Next we'll load our distance matrix. This is similar to ``human_microbiome_dm_da
950972
```python
951973
>>> from iab.data import lauber_soil_unweighted_unifrac_dm
952974
>>> _ = lauber_soil_unweighted_unifrac_dm.plot(cmap='Greens')
975+
<Figure size 432x288 with 2 Axes>
953976
```
954977

955978
Does this visualization help you to interpret the results? Probably not. Generally we'll need to apply some approaches that will help us with interpretation. Let's use ordination here. We'll run Principal Coordinates Analysis on our ``DistanceMatrix`` object. This gives us a matrix of coordinate values for each sample, which we can then plot. We can use ``scikit-bio``'s implementation of PCoA as follows:
@@ -964,6 +987,7 @@ What does the following ordination plot tell you about the relationship between
964987

965988
```python
966989
>>> _ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'Latitude', cmap='Greens', title="Samples colored by Latitude", axis_labels=('PC1', 'PC2', 'PC3'))
990+
<Figure size 432x288 with 2 Axes>
967991
```
968992

969993
If the answer to the above question is that there doesn't seem to be much association, you're on the right track. We can quantify this, for example, by testing for correlation between pH and value on PC 1.
@@ -982,6 +1006,7 @@ In the next plot, we'll color the points by the pH of the soil sample they repre
9821006

9831007
```python
9841008
>>> _ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'pH', cmap='Greens', title="Samples colored by pH", axis_labels=('PC1', 'PC2', 'PC3'))
1009+
<Figure size 432x288 with 2 Axes>
9851010
```
9861011

9871012
```python
@@ -1010,6 +1035,7 @@ Imagine you ran three different beta diversity metrics on your BIOM table: unwei
10101035
>>> _ = lauber_soil_unweighted_unifrac_pc.plot(lauber_soil_sample_md, 'pH', cmap='Greens',
10111036
... title="Unweighted UniFrac, samples colored by pH",
10121037
... axis_labels=('PC1', 'PC2', 'PC3'))
1038+
<Figure size 432x288 with 2 Axes>
10131039
```
10141040

10151041
```python
@@ -1020,6 +1046,7 @@ Imagine you ran three different beta diversity metrics on your BIOM table: unwei
10201046
>>> _ = lauber_soil_bray_curtis_pcoa.plot(lauber_soil_sample_md, 'pH', cmap='Greens',
10211047
... title="Bray-Curtis, samples colored by pH",
10221048
... axis_labels=('PC1', 'PC2', 'PC3'))
1049+
<Figure size 432x288 with 2 Axes>
10231050
```
10241051

10251052
```python
@@ -1030,8 +1057,9 @@ Imagine you ran three different beta diversity metrics on your BIOM table: unwei
10301057
>>> _ = lauber_soil_weighted_unifrac_pcoa.plot(lauber_soil_sample_md, 'pH', cmap='Greens',
10311058
... title="Weighted UniFrac, samples colored by pH",
10321059
... axis_labels=('PC1', 'PC2', 'PC3'))
1033-
/Users/caporaso/miniconda3/envs/iab/lib/python3.4/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.010291669756329344 and the largest is 3.8374200744108204.
1060+
/Users/gregcaporaso/miniconda3/envs/iab/lib/python3.5/site-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:102: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -0.010291669756329357 and the largest is 3.8374200744108204.
10341061
RuntimeWarning
1062+
<Figure size 432x288 with 2 Axes>
10351063
```
10361064

10371065
Specifically, what we want to ask when comparing these results is **given a pair of ordination plots, is their shape (in two or three dimensions) the same?** The reason we care is that we want to know, **given a pair of ordination plots, would we derive the same biological conclusions regardless of which plot we look at?**
-17.5 KB
Binary file not shown.

0 commit comments

Comments
 (0)