This repository was archived by the owner on Jul 20, 2021. It is now read-only.

Commit 9ed93f4

Greg Caporaso authored
ENH: adds machine learning chapter as a work in progress (#292)
* ENH: adds new machine learning chapter
1 parent ba386aa commit 9ed93f4

4 files changed: +323 -4 lines changed

book/back-matter/glossary.md

Lines changed: 11 additions & 0 deletions
@@ -8,3 +8,14 @@ A hypothesis about which bases or amino acids in two biological sequences are de
 ACC---GTAC
 CCCATCGTAG
 ```
+
+## kmer (noun) <link src="C7hMX5"/>
+
+A kmer is simply a word (or list of adjacent characters) of length k in a sequence. For example, the overlapping 5-mers in the sequence ``ACCGTGACCAGTTACCAGTTTGACCAA`` are as follows:
+
+```python
+>>> import skbio
+>>> skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
+```
+
+It is common for bioinformaticians to substitute the value of `k` for the letter _k_ in the word _kmer_. For example, you might hear someone say "we identified all seven-mers in our sequence", to mean they identified all kmers of length seven.
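To make the glossary definition concrete, here is a minimal sketch of how kmers could be counted in plain Python, without scikit-bio. The helper name `count_kmers` and its `overlap` flag are assumptions made for illustration. The flag is modeled on the ``overlap=True`` argument above, assuming that non-overlapping kmers are taken by stepping `k` positions at a time. This is not a description of how `kmer_frequencies` is implemented.

```python
from collections import Counter


def count_kmers(sequence, k, overlap=True):
    """Count the words of length k in sequence (hypothetical helper, not skbio)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Overlapping kmers start at every position; non-overlapping kmers
    # are assumed to start every k positions.
    step = 1 if overlap else k
    # Stop at the last position where a full-length window still fits.
    last_start = len(sequence) - k
    return Counter(sequence[i:i + k] for i in range(0, last_start + 1, step))


counts = count_kmers('ACCGTGACCAGTTACCAGTTTGACCAA', k=5)
for kmer, count in counts.most_common(3):
    print(kmer, count)
```

A sequence of length n contains n - k + 1 overlapping kmers, so the 27-base sequence used here yields 23 five-mers, some of which occur more than once.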

book/fundamentals/database-searching.md

Lines changed: 4 additions & 4 deletions
@@ -413,7 +413,7 @@ Try increasing and decreasing the number of sequences we'll align by increasing
 
 #### kmer content <link src="QblTRV"/>
 
-Another metric of sequence composition is *kmer composition*. A kmer is simply a word (or list of adjacent characters) of length *k* found within a sequence. Here are the kmer frequencies in a short DNA sequence. The ``overlap=True`` parameter here means that our kmers can overlap one another.
+Another metric of sequence composition is *kmer composition*. A [kmer](alias://C7hMX5) is simply a word (or list of adjacent characters) of length *k* found within a sequence. Here are the kmer frequencies in a short DNA sequence. The ``overlap=True`` parameter here means that our kmers can overlap one another.
 
 ```python
 >>> skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
@@ -612,9 +612,9 @@ How does the actual score of aligning the sequence to itself compare to the scor
 >>> plot_score_distribution(actual_score, random_scores)
 ```
 
-What does this tell us about our alignment score and therefore about our alignment? Is it good or bad?
+What does this tell us about our alignment score and therefore about our alignment? Is it good or bad?
 
-We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let's use this information to quantify the quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value.
+We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let's use this information to quantify the quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value.
 
 To determine if our alignment is statistically significant, we need to define $\alpha$ before computing the p-value so the p-value does not impact our choice of $\alpha$. Let's define $\alpha$ as 0.05. This choice means if we obtain a p-value less than 0.05 we will consider the alignment statistically significant and accept the hypothesis that the sequences are homologous.
 
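The counting procedure described in this hunk can be sketched in a few lines. The function name `empirical_p_value` and the example scores are hypothetical, chosen only for illustration. The book's own helper, `fraction_better_or_equivalent_alignments`, is called with sequences rather than precomputed scores and is not reproduced here.

```python
def empirical_p_value(actual_score, random_scores):
    """Fraction of random scores that are at least as high as actual_score."""
    if len(random_scores) == 0:
        raise ValueError("at least one random score is required")
    # Count the random scores that match or beat the observed score.
    n_at_least_as_high = sum(1 for score in random_scores if score >= actual_score)
    return n_at_least_as_high / len(random_scores)


alpha = 0.05  # fixed before the p-value is computed
# Made-up scores for illustration only, not results from the book.
random_scores = [12, 15, 9, 20, 11, 14, 10, 13, 16, 8]
p_value = empirical_p_value(18, random_scores)
is_significant = p_value < alpha
print(p_value, is_significant)
```

With only ten random scores the p-value can change only in steps of 0.1, so resolving a threshold such as 0.05 would require many more random scores.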

@@ -700,7 +700,7 @@ Notice how these sequences are almost identical, but have some differences. Let'
 ... fraction_better_or_equivalent_alignments(sequence1, sequence1_95))
 ```
 
-You likely got a significant p-value there, telling you that the sequences are homologous.
+You likely got a significant p-value there, telling you that the sequences are homologous.
 
 Now let's simulate much more distantly related sequences by introducing substitutions at many more sites.
 

book/fundamentals/index.yaml

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ contents:
 - multiple-sequence-alignment
 - phylogeny-reconstruction
 - sequence-mapping-and-clustering
+- machine-learning
