7.3: Dynamic Programming
Two reasonably sized sequences cannot be aligned by brute force. Luckily, there is another algorithm borrowed from computer science, dynamic programming, that makes use of a dynamic matrix.
What is needed is a scoring system to judge the quality of an alignment. The goal is to find the alignment that has the maximum score. We assume that the alignment of character \(a_{i}\) with character \(b_{j}\) has the score \(S\left(a_{i}, b_{j}\right) .\) For example, when aligning two DNA sequences, a match (A-A, C-C, T-T, G-G) may be scored as \(+2\) , and a mismatch \((\mathrm{A}-\mathrm{C}, \mathrm{A}-\mathrm{T}, \mathrm{A}-\mathrm{G}\) , etc. \()\) scored as \(-1\) . We also assume that an indel (a nucleotide aligned with a gap) is scored as \(g\) , with a typical value for DNA alignment being \(g=-2\) . In the next section, we develop a better and more widely used model for indel scoring that distinguishes gap openings from gap extensions.
Now, let \(T(i, j)\) denote the maximum score for aligning a sequence of length \(i\) with a sequence of length \(j .\) We can compute \(T(i, j)\) provided we know \(T(i-\) \(1, j-1), T(i-1, j)\) and \(T(i, j-1)\) . Indeed, our logic is similar to that used when counting the total number of alignments. There are again three ways to compute \(T(i, j):(1) i-1\) characters of the first sequence are aligned with \(j-1\) characters of the second sequence with maximum score \(T(i-1, j-1)\) , and the \(i\) th character of the first sequence aligns with the \(j\) th character of the second sequence with updated maximum score \(T(i-1, j-1)+S\left(a_{i}, b_{j}\right) ;(2) i-1\) characters of the first sequence are aligned with \(j\) characters of the second sequence with maximum score \(T(i-\) \(1, j)\) , and the \(i\) th character of the first sequence aligns with a gap in the second sequence with updated maximum score \(T(i-1, j)+g\) , or; (3) \(i\) characters of the first sequence are aligned with \(j-1\) characters of the second sequence with maximum score \(T(i, j-1)\) , and a gap in the first sequence aligns with the \(j\) th character of the second sequence with updated maximum score \(T(i, j-1)+g\) . We then compare these three scores and assign \(T(i, j)\) to be the maximum; that is,
\[T(i, j)=\max \left\{\begin{array}{l} T(i-1, j-1)+S\left(a_{i}, b_{j}\right) \\[4pt] T(i-1, j)+g \\[4pt] T(i, j-1)+g \end{array}\right. \nonumber \]
Boundary conditions give the score of aligning a sequence with a null sequence of gaps, so that
\[T(i, 0)=T(0, i)=i g, \quad i>0 \nonumber \]
with \(T(0,0)=0\) .
The recursion \((7.2)\) , together with the boundary conditions (7.3), can be used to construct a dynamic matrix. The score of the best alignment is then given by the last filled-in element of the matrix, which for aligning a sequence of length \(n\) with a sequence of length \(m\) is \(T(n, m)\) . Besides this score, however, we also want to determine the alignment itself. The alignment can be obtained by tracing back the path in the dynamic matrix that was followed to compute each matrix element \(T(i, j)\) . There could be more than one path, so that the best alignment may be degenerate.
Sequence alignment is always done computationally, and there are excellent software tools freely available on the web (see $7.6). Just to illustrate the dynamic programming algorithm, we compute by hand the dynamic matrix for aligning two short DNA sequences GGAT and GAATT, scoring a match as \(+2\) , a mismatch as \(-1\) and an indel as \(-2\) :
\(\begin{array}{ccccccc} & - & \mathrm{G} & \mathrm{A} & \mathrm{A} & \mathrm{T} & \mathrm{T} \\[4pt] - & 0 & -2 & -4 & -6 & -8 & -10 \\[4pt] \mathrm{G} & -2 & 2 & 0 & -2 & -4 & -6 \\[4pt] \mathrm{G} & -4 & 0 & 1 & -1 & -3 & -5 \\[4pt] \mathrm{~A} & -6 & -2 & 2 & 3 & 1 & -1 \\[4pt] \mathrm{~T} & -8 & -4 & 0 & 1 & 5 & 3\end{array}\)
In our hand calculation, the two sequences to be aligned go to the left and above the dynamic matrix, leading with a gap character ’ \(^{\prime}\) ’. Row 0 and column 0 are then filled in with the boundary conditions, starting with 0 in position \((0,0)\) and incrementing by the gap penalty \(-2\) across row 0 and down column 0 . The recursion relation (7.2) is then used to fill in the dynamic matrix one row at a time moving from leftto-right and top-to-bottom. To determine the \((i, j)\) matrix element, three numbers must be compared and the maximum taken: (1) inspect the nucleotides to the left of row \(i\) and above column \(j\) and add \(+2\) for a match or \(-1\) for a mismatch to the \((i-1, j-1)\) matrix element; \((2)\) add \(-2\) to the \((i-1, j)\) matrix element; \((3)\) add \(-2\) to the \((i, j-1)\) matrix element. For example, the first computed matrix element 2 at position \((1,1)\) was determined by taking the maximum of \((1) 0+2=2\) , since \(G-G\) is a match; \((2)-2-2=-4 ;(3)-2-2=-4\) . You can test your understanding of dynamic programming by computing the other matrix elements.
After the matrix is constructed, the traceback algorithm that finds the best alignment starts at the bottom-right element of the matrix, here the \((4,5)\) matrix element with entry 3. The matrix element used to compute 3 was either at \((4,4)\) (horizontal move) or at \((3,4)\) (diagonal move). Having two possibilities implies that the best alignment is degenerate. For now, we arbitrarily choose the diagonal move. We build the alignment from end to beginning with GGAT on top and GAATT on bottom:
T
\(\mathrm{T}\)
We illustrate our current position in the dynamic matrix by eliminating all the elements that are not on the traceback path and are no longer accessible:
\(\begin{array}{ccccccc} & - & G & A & A & T & T \\[4pt] - & 0 & -2 & -4 & -6 & -8 & \\[4pt] G & -2 & 2 & 0 & -2 & -4 & \\[4pt] G & -4 & 0 & 1 & -1 & -3 & \\[4pt] A & -6 & -2 & 2 & 3 & 1 & \\[4pt] T & & & & & & 3\end{array}\)
We start again from the 1 entry at \((3,4)\) . This value came from the 3 entry at \((3,3)\) by a horizontal move. Therefore, the alignment is extended to
-T
||
TT
where a gap is inserted in the top sequence for a horizontal move. (A gap is inserted in the bottom sequence for a vertical move.) The dynamic matrix now looks like
\(\begin{array}{lcccccc} & - & G & A & A & T & T \\[4pt] - & 0 & -2 & -4 & -6 & & \\[4pt] G & -2 & 2 & 0 & -2 & & \\[4pt] G & -4 & 0 & 1 & -1 & & \\[4pt] A & -6 & -2 & 2 & 3 & 1 & \\[4pt] T & & & & & & 3\end{array}\)
Starting again from the 3 entry at \((3,3)\) , this value came from the 1 entry at \((2,2)\) in a diagonal move, extending the alignment to
A-T
|||
ATT The dynamic matrix now looks like
Continuing in this fashion (try to do this), the final alignment is
where it is customary to represent a matching character with a colon ’ \(:\) ’. The traceback path in the dynamic matrix is
If the other degenerate path was initially taken, the final alignment would be
GAATT
and the traceback path would be
The score of both alignments is easily recalculated to be the same, with \(2-1+2-\) \(2+2=3\) and \(2-1+2+2-2=3\)
The algorithm for aligning two proteins is similar, except match and mismatch scores depend on the pair of aligning amino acids. With twenty different amino acids found in proteins, the score is represented by a \(20 \times 20\) substitution matrix. The most commonly used matrices are the PAM series and BLOSUM series of matrices, with BLOSUM62 the commonly used default matrix.