7.2: Brute Force Alignment
One (bad) approach to sequence alignment is to align the two sequences in all possible ways, score the alignments with an assumed scoring system, and determine the highest-scoring alignment. The problem with this brute-force approach is that the number of possible alignments grows exponentially with sequence length, so that even for sequences of modest length the computation is infeasible. For example, the number of ways to align two sequences of 50 characters each (a rather small alignment problem) is about \(1.5 \times 10^{37}\), already an astonishingly large number. It is nevertheless informative to count the number of possible alignments between two sequences, since a similar algorithm is used for sequence alignment itself.
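Suppose we want to align two sequences. Gaps in either sequence are allowed, but a gap cannot be aligned with a gap. By way of illustration, we demonstrate the three ways that the first character of the upper-case alphabet and the first character of the lower-case alphabet may align:
\(\begin{array}{c} \mathrm{A} \\ \mathrm{a} \end{array} \qquad \begin{array}{cc} \mathrm{A} & - \\ - & \mathrm{a} \end{array} \qquad \begin{array}{cc} - & \mathrm{A} \\ \mathrm{a} & - \end{array}\)
and the five ways in which the first two characters of the upper-case alphabet can align with the first character of the lower-case alphabet:
\(\begin{array}{cc} \mathrm{A} & \mathrm{B} \\ \mathrm{a} & - \end{array} \qquad \begin{array}{cc} \mathrm{A} & \mathrm{B} \\ - & \mathrm{a} \end{array} \qquad \begin{array}{ccc} \mathrm{A} & \mathrm{B} & - \\ - & - & \mathrm{a} \end{array} \qquad \begin{array}{ccc} \mathrm{A} & - & \mathrm{B} \\ - & \mathrm{a} & - \end{array} \qquad \begin{array}{ccc} - & \mathrm{A} & \mathrm{B} \\ \mathrm{a} & - & - \end{array}\)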
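These small cases can also be enumerated by brute force. The following is a minimal sketch, not part of the original text: the function name `alignments` and the use of `-` as the gap symbol are illustrative choices. It builds every alignment column by column, never placing a gap against a gap.

```python
from typing import Iterator, Tuple

GAP = "-"  # gap symbol (an illustrative choice)


def alignments(s: str, t: str) -> Iterator[Tuple[str, str]]:
    """Yield every alignment of s with t as a pair of equal-length gapped strings."""
    if not s and not t:
        # Nothing left to align: exactly one (empty) alignment.
        yield "", ""
        return
    if s and t:
        # First characters of s and t share a column.
        for top, bottom in alignments(s[1:], t[1:]):
            yield s[0] + top, t[0] + bottom
    if s:
        # First character of s sits above a gap in t.
        for top, bottom in alignments(s[1:], t):
            yield s[0] + top, GAP + bottom
    if t:
        # A gap in s sits above the first character of t.
        for top, bottom in alignments(s, t[1:]):
            yield GAP + top, t[0] + bottom


# Print the alignments of "AB" with "a", one pair of rows per alignment.
for top, bottom in alignments("AB", "a"):
    print(top)
    print(bottom)
    print()
```

Listing alignments explicitly like this is only practical for very short sequences, which is exactly the point of the counting argument that follows.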
A recursion relation for the total number of possible alignments of a sequence of \(i\) characters with a sequence of \(j\) characters may be derived by considering the alignment of the last character. There are three possibilities, which we illustrate by assuming the \(i\)th character is '\(\mathrm{F}\)' and the \(j\)th character is '\(\mathrm{d}\)':
(1) \(i-1\) characters of the first sequence are already aligned with \(j-1\) characters of the second sequence, and the \(i\)th character of the first sequence aligns with the \(j\)th character of the second sequence:
\(\begin{array}{cc} \cdots & \mathrm{F} \\ \cdots & \mathrm{d} \end{array}\)
(2) \(i-1\) characters of the first sequence are aligned with \(j\) characters of the second sequence, and the \(i\)th character of the first sequence aligns with a gap in the second sequence:
\(\begin{array}{cc} \cdots & \mathrm{F} \\ \cdots & - \end{array}\)
(3) \(i\) characters of the first sequence are aligned with \(j-1\) characters of the second sequence, and a gap in the first sequence aligns with the \(j\)th character of the second sequence:
\(\begin{array}{cc} \cdots & - \\ \cdots & \mathrm{d} \end{array}\)
If \(C(i, j)\) is the number of ways to align an \(i\)-character sequence with a \(j\)-character sequence, then, from our counting,
\[C(i, j)=C(i-1, j-1)+C(i-1, j)+C(i, j-1) \tag{7.1} \]
This recursion relation requires boundary conditions. Because there is only one way to align an \(i>0\) character sequence against a zero-character sequence (i.e., \(i\) characters against \(i\) gaps), the boundary conditions are \(C(0, j)=C(i, 0)=1\) for all \(i, j>0\). We may also add the boundary condition \(C(0,0)=1\), obtained by requiring the recursion to reproduce the known result \(C(1,1)=3\): since \(C(1,1)=C(0,0)+C(0,1)+C(1,0)=C(0,0)+2\), we must have \(C(0,0)=1\).
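The recursion and its boundary conditions translate almost line for line into code. The sketch below is illustrative only: the name `count_alignments` is not from the text, and memoization through `functools.lru_cache` is added so that each \(C(i, j)\) is evaluated only once.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(i: int, j: int) -> int:
    """Number of ways to align an i-character sequence with a j-character sequence."""
    if i == 0 or j == 0:
        # Boundary conditions: C(0, j) = C(i, 0) = C(0, 0) = 1.
        return 1
    # The last column is a character pair, a gap in the second sequence,
    # or a gap in the first sequence.
    return (count_alignments(i - 1, j - 1)
            + count_alignments(i - 1, j)
            + count_alignments(i, j - 1))


print(count_alignments(1, 1))  # 3, the three alignments of 'A' with 'a'
print(count_alignments(2, 1))  # 5, the five alignments of 'AB' with 'a'
```

Without the cache, the same subproblems would be recomputed many times over. For long sequences, filling a table iteratively also avoids Python's recursion-depth limit.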
Using the recursion relation (7.1), we can construct the following dynamic matrix to count the number of ways to align the two five-character sequences \(a_{1} a_{2} a_{3} a_{4} a_{5}\) and \(b_{1} b_{2} b_{3} b_{4} b_{5}\) :
\(\begin{array}{ccccccc} & - & b_{1} & b_{2} & b_{3} & b_{4} & b_{5} \\[4pt] - & 1 & 1 & 1 & 1 & 1 & 1 \\[4pt] a_{1} & 1 & 3 & 5 & 7 & 9 & 11 \\[4pt] a_{2} & 1 & 5 & 13 & 25 & 41 & 61 \\[4pt] a_{3} & 1 & 7 & 25 & 63 & 129 & 231 \\[4pt] a_{4} & 1 & 9 & 41 & 129 & 321 & 681 \\[4pt] a_{5} & 1 & 11 & 61 & 231 & 681 & 1683\end{array}\)
The size of this dynamic matrix is \(6 \times 6\), and for convenience we label the rows and columns starting from zero (i.e., row 0, row 1, \(\ldots\), row 5). This matrix was constructed by first writing \(-a_{1} a_{2} a_{3} a_{4} a_{5}\) to the left of the matrix and \(-b_{1} b_{2} b_{3} b_{4} b_{5}\) above the matrix, then filling in ones across the zeroth row and down the zeroth column to satisfy the boundary conditions, and finally applying the recursion relation directly by going across the first row from left to right, the second row from left to right, and so on. To demonstrate the filling in of the matrix, we have across the first row \(1+1+1=3\), \(1+1+3=5\), \(1+1+5=7\), etc., and across the second row \(1+3+1=5\), \(3+5+5=13\), \(5+7+13=25\), etc. Finally, the last element entered gives the number of ways to align two five-character sequences: 1683, already a remarkably large number.
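The row-by-row filling procedure can be sketched as follows. The function name `alignment_count_matrix` is a hypothetical choice, and the matrix is stored as a plain list of lists indexed from zero, as in the text.

```python
from typing import List


def alignment_count_matrix(n_rows: int, n_cols: int) -> List[List[int]]:
    """Return the (n_rows + 1) x (n_cols + 1) matrix of alignment counts C(i, j)."""
    # Start with ones everywhere; the zeroth row and zeroth column keep
    # these values and so satisfy the boundary conditions.
    C = [[1] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(1, n_rows + 1):          # rows, top to bottom
        for j in range(1, n_cols + 1):      # each row, left to right
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j] + C[i][j - 1]
    return C


matrix = alignment_count_matrix(5, 5)
for row in matrix:
    print(" ".join(f"{value:5d}" for value in row))
print("C(5, 5) =", matrix[5][5])  # 1683
```

Each entry depends only on its upper-left, upper, and left neighbors, so filling in this order guarantees that all three are available when needed.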
It is possible to solve the recursion relation (7.1) for \(C(i, j)\) analytically using generating functions. Although the solution method is interesting (and in fact was shown to me by a student), the final analytical result is messy, and we omit it here. In general, computation of \(C(i, j)\) is best done numerically by constructing the dynamic matrix.
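As a numerical check on the growth of \(C(i, j)\), the sketch below evaluates the recursion for longer sequences. It keeps only one row of the dynamic matrix in memory at a time, which is an implementation choice made here and not something the text prescribes. The function name `count_alignments_iterative` is illustrative.

```python
def count_alignments_iterative(i: int, j: int) -> int:
    """Compute C(i, j) row by row, storing only the previous row of the matrix."""
    row = [1] * (j + 1)                      # zeroth row: all ones
    for _ in range(i):
        new_row = [1]                        # zeroth column: always one
        for col in range(1, j + 1):
            # upper-left + upper + left neighbors
            new_row.append(row[col - 1] + row[col] + new_row[col - 1])
        row = new_row
    return row[j]


for n in (5, 10, 20, 50):
    print(f"C({n}, {n}) = {count_alignments_iterative(n, n):.3e}")
```

For \(n=5\) this reproduces 1683, and for \(n=50\) it gives a value of about \(1.5 \times 10^{37}\), the figure quoted at the start of the section. Python integers have arbitrary precision, so the counts are exact before formatting.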