|
1 | 1 | { |
2 | 2 | "metadata": { |
3 | 3 | "name": "", |
4 | | - "signature": "sha256:296651182a216272a5bda0c9c27a14f21621e84338b38a032feb74d38bb9ddbd" |
| 4 | + "signature": "sha256:63b4f660fec9abeef40537f13e87660973a53d00457c3ddd6e1b61491e720359" |
5 | 5 | }, |
6 | 6 | "nbformat": 3, |
7 | 7 | "nbformat_minor": 0, |
|
793 | 793 | "In the next step we determine the best alignment given the sequences and scoring scheme in what we'll call the **dynamic programming matrix**, and then define programmatically how to transcribe the alignment in what we'll call the **traceback matrix** to yield a pair of aligned sequences. \n", |
794 | 794 | "\n", |
795 | 795 | "\n", |
796 | | - "For the convenience of coding this algorithm, it helps to define the dynamic programming matrix with one extra row and one extra column relative to the score matrix, and make these the first column and row of the matrix. These then represent the beginning of the alignment position $(0, 0)$. The score $F$ for cell $(i, j)$, where $i$ represents the row number and $j$ represents the column number, is defined for the first row and column as follows. \n", |
| 796 | + "For the convenience of coding this algorithm, it helps to define the dynamic programming matrix with one extra row and one extra column relative to the score matrix, and make these the first column and row of the matrix. These then represent the beginning of the alignment position $(0, 0)$. The score $F$ for cell $(i, j)$, where $i$ represents the row number and $j$ represents the column number, is defined for the first row and column as follows:\n", |
797 | 797 | "\n", |
798 | | - "\n", |
799 | | - "$F(0, 0) = 0$\n", |
800 | | - "\n", |
801 | | - "$F(i, 0) = F(i-1, 0) - d$\n", |
802 | | - "\n", |
803 | | - "$F(0, j) = F(0, j-1) - d$\n", |
| 798 | + "$$\n", |
| 799 | + "\\begin{align}\n", |
| 800 | + "& F(0, 0) = 0\\\\\n", |
| 801 | + "& F(i, 0) = F(i-1, 0) - d\\\\\n", |
| 802 | + "& F(0, j) = F(0, j-1) - d\\\\\n", |
| 803 | + "\\end{align}\n", |
| 804 | + "$$\n", |
804 | 805 | "\n", |
805 | 806 | "This matrix, pre-initialization, would look like the following. As an exercise, try computing the values for the cells in the first four rows in column zero, and the first four columns in row zero. As you fill in the value for a cell, for all cells with a score based on another score in the matrix (i.e., everything except for $F(0, 0)$), draw an arrow from that cell to the cell whose score it depends on. \n", |
806 | 807 | "\n", |
|
897 | 898 | "\n", |
898 | 899 | "In a Needleman-Wunsch alignment, the score $F$ for cell $(i, j)$ (where $i$ is the row number and $j$ is the column number, and $i > 0$ and $j > 0$) is computed as the maximum of three possible values.\n", |
899 | 900 | "\n", |
900 | | - "\n", |
901 | | - "$\n", |
| 901 | + "$$\n", |
902 | 902 | "F(i, j) = max \\left(\\begin{align}\n", |
903 | | - "F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
904 | | - "F(i-1, j) - d\\\\\n", |
905 | | - "F(i, j-1) - d)\n", |
906 | | - "\\end{align}\\right)$\n", |
| 903 | + "& F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
| 904 | + "& F(i-1, j) - d\\\\\n", |
| 905 | + "& F(i, j-1) - d\n", |
| 906 | + "\\end{align}\\right)\n", |
| 907 | + "$$\n", |
907 | 908 | "\n", |
908 | 909 | "In this notation, $s$ refers to the substitution matrix, $c_i$ and $c_j$ refers to characters in `seq1` and `seq2`, and $d$ again is the gap penalty. Describing the scoring function in English, we score a cell with the maximum of three values: either the value of the cell up and to the left plus the score for the substitution taking place in the current cell (which you find by looking up the substitution in the substitution matrix); the value of the cell above minus the gap penalty; or the value of the cell to the left minus the gap penalty. In this way, you're determining whether the best (highest) score is obtained by inserting a gap in sequence 1 (corresponding to $F(i-1, j) - d$), inserting a gap in sequence 2 (corresponding to $F(i, j-1) - d$), or aligning the characters in sequence 1 and sequence 2 (corresponding to $F(i-1, j-1) + s(c_i, c_j)$).\n", |
909 | 910 | "\n", |
|
1415 | 1416 | "The Smith-Waterman algorithm is used for performing pairwise local alignment. It is nearly identical to Needleman-Wunsch, with three small important differences. \n", |
1416 | 1417 | "\n", |
1417 | 1418 | "First, initialization is easier:\n", |
1418 | | - "\n", |
1419 | | - "$F(0, 0) = 0$\n", |
1420 | | - "\n", |
1421 | | - "$F(i, 0) = 0$\n", |
1422 | | - "\n", |
1423 | | - "$F(0, j) = 0$" |
| 1419 | + "$$\n", |
| 1420 | + "\\begin{align}\n", |
| 1421 | + "& F(0, 0) = 0\\\\\n", |
| 1422 | + "& F(i, 0) = 0\\\\\n", |
| 1423 | + "& F(0, j) = 0\n", |
| 1424 | + "\\end{align}\n", |
| 1425 | + "$$" |
1424 | 1426 | ] |
1425 | 1427 | }, |
1426 | 1428 | { |
|
1500 | 1502 | "source": [ |
1501 | 1503 | "Next, there is one additional term in the scoring function:\n", |
1502 | 1504 | "\n", |
1503 | | - "$\n", |
| 1505 | + "$$\n", |
1504 | 1506 | "F(i, j) = max \\left(\\begin{align}\n", |
1505 | | - "0\\\\\n", |
1506 | | - "F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
1507 | | - "F(i-1, j) - d\\\\F(i, j-1) - d)\n", |
1508 | | - "\\end{align}\\right)$" |
| 1507 | + "& 0\\\\\n", |
| 1508 | + "& F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
| 1509 | + "& F(i-1, j) - d\\\\\n", |
| 1510 | + "& F(i, j-1) - d)\n", |
| 1511 | + "\\end{align}\\right)\n", |
| 1512 | + "$$\n" |
1509 | 1513 | ] |
1510 | 1514 | }, |
1511 | 1515 | { |
|
1996 | 2000 | "source": [ |
1997 | 2001 | "The second limitation of the our simple alignment algorithm, and one that is also present in our version of Smith-Waterman as implemented above, is that all gaps are scored equally whether they represent the opening of a new insertion/deletion, or the extension of an existing insertion/deletion. This isn't ideal based on what we know about how insertion/deletion events occur (see [this discussion of replication slippage](http://www.ncbi.nlm.nih.gov/books/NBK21114/)). Instead, **we might want to incur a large penalty for opening a gap, but a smaller penalty for extending a gap**. To do this, **we need to make two small changes to our scoring scheme**. When we compute the score for a gap, we should incurr a *gap open penalty* if the previous max score was derived from inserting a gap character in the same sequence. If we represent our traceback matrix as $T$, our gap open penalty as $d^0$, and our gap extend penalty as $d^e$, our scoring scheme would look like the following:\n", |
1998 | 2002 | "\n", |
1999 | | - "$\n", |
| 2003 | + "$$\n", |
2000 | 2004 | "F(i, j) = max \\left(\\begin{align}\n", |
2001 | | - " 0\\\\\n", |
2002 | | - " F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
2003 | | - " \\left\\{\\begin{array}{l l} F(i-1, j) - d^e \\quad \\text{if $T(i-1, j)$ is gap}\\\\ F(i-1, j) - d^o \\quad \\text{if $T(i-1, j)$ is not gap} \\end{array} \\right\\} \\\\\n", |
2004 | | - " \\left\\{\\begin{array}{l l} F(i, j-1) - d^e \\quad \\text{if $T(i, j-1)$ is gap}\\\\ F(i, j-1) - d^o \\quad \\text{if $T(i, j-1)$ is not gap} \\end{array} \\right\\}\n", |
| 2005 | + "& 0\\\\\n", |
| 2006 | + "& F(i-1, j-1) + s(c_i, c_j)\\\\\n", |
| 2007 | + "& \\left\\{\\begin{array}{l l} F(i-1, j) - d^e \\quad \\text{if $T(i-1, j)$ is gap}\\\\ F(i-1, j) - d^o \\quad \\text{if $T(i-1, j)$ is not gap} \\end{array} \\right\\} \\\\\n", |
| 2008 | + "& \\left\\{\\begin{array}{l l} F(i, j-1) - d^e \\quad \\text{if $T(i, j-1)$ is gap}\\\\ F(i, j-1) - d^o \\quad \\text{if $T(i, j-1)$ is not gap} \\end{array} \\right\\}\n", |
2005 | 2009 | " \\end{align}\\right)\n", |
2006 | | - "$\n", |
| 2010 | + "$$\n", |
| 2011 | + "\n", |
2007 | 2012 | "\n", |
2008 | 2013 | "Notice how we only use the gap extend penalty if the previous max score resulted from a gap in the same sequence (which we know by looking in the traceback matrix) because it represents the continuation of an existing gap in that sequence. This is why we check for a specific type of gap in $T$, rather than checking whether $T$ `!= '\\'`. \n", |
2009 | 2014 | "\n", |
|
0 commit comments