ENH: adds machine learning chapter as a work in progress (#292)

Greg Caporaso · web-flow · commit 9ed93f4100f1 · 2019-03-11T11:41:30.000-07:00
* ENH: adds new machine learning chapter
diff --git a/book/back-matter/glossary.md b/book/back-matter/glossary.md
@@ -8,3 +8,14 @@ A hypothesis about which bases or amino acids in two biological sequences are de
 ACC---GTAC
 CCCATCGTAG
 ```
+
+## kmer (noun) <link src="C7hMX5"/>
+
+A kmer is simply a word (or list of adjacent characters) of length _k_ found within a sequence. For example, the overlapping kmers of length five in the sequence ``ACCGTGACCAGTTACCAGTTTGACCAA`` are as follows:
+
+```python
+>>> import skbio
+>>> skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
+```
+
+It is common for bioinformaticians to substitute the value of `k` for the letter _k_ in the word _kmer_. For example, you might hear someone say "we identified all seven-mers in our sequence" to mean that they identified all kmers of length seven.
diff --git a/book/fundamentals/database-searching.md b/book/fundamentals/database-searching.md
@@ -413,7 +413,7 @@ Try increasing and decreasing the number of sequences we'll align by increasing
 
 #### kmer content <link src="QblTRV"/>
 
-Another metric of sequence composition is *kmer composition*. A kmer is simply a word (or list of adjacent characters) of length *k* found within a sequence. Here are the kmer frequencies in a short DNA sequence. The ``overlap=True`` parameter here means that our kmers can overlap one another.
+Another metric of sequence composition is *kmer composition*. A [kmer](alias://C7hMX5) is simply a word (or list of adjacent characters) of length *k* found within a sequence. Here are the kmer frequencies in a short DNA sequence. The ``overlap=True`` parameter here means that our kmers can overlap one another.
 
 ```python
 >>> skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
@@ -612,9 +612,9 @@ How does the actual score of aligning the sequence to itself compare to the scor
 >>> plot_score_distribution(actual_score, random_scores)
 ```
 
-What does this tell us about our alignment score and therefore about our alignment? Is it good or bad? 
+What does this tell us about our alignment score and therefore about our alignment? Is it good or bad?
 
-We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let's use this information to quantify the quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value. 
+We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let's use this information to quantify the quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value.
 
 To determine if our alignment is statistically significant, we need to define $\alpha$ before computing the p-value so the p-value does not impact our choice of $\alpha$. Let's define $\alpha$ as 0.05. This choice means if we obtain a p-value less than 0.05 we will consider the alignment statistically significant and accept the hypothesis that the sequences are homologous.
 
@@ -700,7 +700,7 @@ Notice how these sequences are almost identical, but have some differences. Let'
 ...       fraction_better_or_equivalent_alignments(sequence1, sequence1_95))
 ```
 
-You likely got a significant p-value there, telling you that the sequences are homologous. 
+You likely got a significant p-value there, telling you that the sequences are homologous.
 
 Now let's simulate much more distantly related sequences by introducing substitutions at many more sites.
 
diff --git a/book/fundamentals/index.yaml b/book/fundamentals/index.yaml
@@ -4,3 +4,4 @@ contents:
  - multiple-sequence-alignment
  - phylogeny-reconstruction
  - sequence-mapping-and-clustering
+ - machine-learning
diff --git a/book/fundamentals/machine-learning.md b/book/fundamentals/machine-learning.md