7.4: Gaps
Empirical evidence suggests that gaps cluster, in both nucleotide and protein sequences. Clustering is usually modeled by different penalties for gap opening \(\left(g_{o}\right)\) and gap extension \(\left(g_{e}\right)\) , with \(g_{o}<g_{e}<0\) . For example, the default scoring scheme for the widely used BLASTN software is \(+1\) for a nucleotide match, \(-3\) for a nucleotide mismatch, \(-5\) for a gap opening, and \(-2\) for a gap extension.
Having two types of gaps (opening and extension) complicates the dynamic programming algorithm. When an indel is added to an existing alignment, the scoring increment depends on whether the indel is a gap opening or a gap extension. For example, the extended alignment
\[\begin{array}{ll} \mathrm{AB} & \mathrm{AB}- \\[4pt] 11 & \text { to } & 111 \\[4pt] \mathrm{ab} & & \mathrm{abc} \end{array} \nonumber \]
adds a gap opening penalty \(g_{o}\) to the score, whereas
\[\begin{array}{ll} \text { A- } & \text { A- } \\[4pt] \text { II to } & \text { ||| } \\[4pt] \text { ab } & & \text { abc } \end{array} \nonumber \]
adds a gap extension penalty \(g_{e}\) to the score. The score increment depends not only on the current aligning pair, but also on the previously aligned pair.
The final aligning pair of a sequence of length \(i\) with a sequence of length \(j\) can be one of three possibilities (top:bottom): (1) \(a_{i}: b_{j} ;\) (2) \(a_{i}:-;(3)-: b_{j}\) . Only for (1) is the score increment \(S\left(a_{i}, b_{j}\right)\) unambiguous. For (2) or (3), the score increment depends on the presence or absence of indels in the previously aligned characters. For instance, for alignments ending with \(a_{i}:-\) , the previously aligned character pair could be one of (i) \(a_{i-1}: b_{j}\) , (ii) \(-: b_{j}\) , (iii) \(a_{i-1}:-\) . If the previous aligned character pair was (i) or (ii), the score increment would be the gap opening penalty \(g_{o}\) ; if it was (iii), the score increment would be the gap extension penalty \(g_{e}\) .
To remove the ambiguity that occurs with a single dynamic matrix, we need to compute three dynamic matrices simultaneously, with matrix elements denoted by \(T(i, j), T_{-}(i, j)\) and \(T^{-}(i, j)\) , corresponding to the three types of aligning pairs. The recursion relations are
\[T(i, j)=\max \left\{\begin{array}{l} T(i-1, j-1)+S\left(a_{i}, b_{j}\right) \\[4pt] T_{-}(i-1, j-1)+S\left(a_{i}, b_{j}\right) \\[4pt] T^{-}(i-1, j-1)+S\left(a_{i}, b_{j}\right) \end{array}\right. \nonumber \]
(2) \(a_{i}:-\)
\[T_{-}(i, j)=\max \left\{\begin{array}{l} T(i-1, j)+g_{o} \\[4pt] T_{-}(i-1, j)+g_{e} \\[4pt] T^{-}(i-1, j)+g_{o} \end{array}\right. \nonumber \]
(3) \(-: b_{j}\)
\[T^{-}(i, j)=\max \left\{\begin{array}{l} T(i, j-1)+g_{o} \\[4pt] T_{-}(i, j-1)+g_{o} \\[4pt] T^{-}(i, j-1)+g_{e} \end{array}\right. \nonumber \]
To align a sequence of length \(n\) with a sequence of length \(m\) , the best alignment score is the maximum of the scores obtained from the three dynamic matrices:
\[T_{\text {opt }}(n, m)=\max \left\{\begin{array}{l} T(n, m), \\[4pt] T_{-}(n, m), \\[4pt] T^{-}(n, m) . \end{array}\right. \nonumber \]
The traceback algorithm to find the best alignment proceeds as before by starting with the matrix element corresponding to the best alignment score, \(T_{\mathrm{opt}}(n, m)\) , and tracing back to the matrix element that determined this score. The optimum alignment is then built up from last-to-first as before, but now switching may occur between the three dynamic matrices.