7: Sequence Alignment
The software program BLAST (Basic Local Alignment Search Tool) uses sequence alignment algorithms to compare a query sequence against a database to identify other known sequences similar to the query sequence. Often, the annotations attached to the already known sequences yield important biological information about the query sequence. Almost all biologists use BLAST, making sequence alignment one of the most important algorithms of bioinformatics.
The sequence under study can be composed of nucleotides (from the nucleic acids DNA or RNA) or amino acids (from proteins). Nucleic acids chain together four different nucleotides: A, C, T, G for DNA and A, C, U, G for RNA; proteins chain together twenty different amino acids. The sequence of a DNA molecule or of a protein is the linear order of nucleotides or amino acids in a specified direction, defined by the chemistry of the molecule. There is no need for us to know the exact details of the chemistry; it is sufficient to know that a protein has distinguishable ends called the N-terminus and the C-terminus, and that the usual convention is to read the amino acid sequence from the N-terminus to the C-terminus. Specification of the direction is more complicated for a DNA molecule than for a protein molecule because of the double helix structure of DNA, and this will be explained in Section 7.1.
The basic sequence alignment algorithm aligns two or more sequences to highlight their similarity, inserting a small number of gaps (usually denoted by dashes) into each sequence so that identical or similar characters line up wherever possible. For instance, Fig. 7.1 presents an alignment, produced with the software tool ClustalW, of the hemoglobin beta-chain from a human, a chimpanzee, a rat, and a zebrafish. The human and chimpanzee sequences are identical, a consequence of our very close evolutionary relationship. The rat sequence differs from human/chimpanzee at only 27 out of 146 amino acids; we are all mammals. The zebrafish sequence, though clearly related, diverges significantly. Notice the insertion of a gap in each of the mammal sequences at zebrafish amino acid position 122. This permits the subsequent zebrafish sequence to align better with the mammal sequences, and implies either an insertion of a new amino acid in fish or a deletion of an amino acid in mammals. The insertion or deletion of a character in a sequence is called an indel. A mismatch in sequence, such as those occurring between zebrafish and mammals at amino acid positions 2 and 3, is called a mutation. ClustalW places a "*" on the last line to denote exact amino acid matches across all sequences, and a ":" or a "." to denote chemically similar amino acids across all sequences (each amino acid has characteristic chemical properties, and amino acids can be grouped according to similar properties). In this chapter, we detail the algorithms used to align sequences.
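
To make the vocabulary above concrete, the following minimal Python sketch classifies each column of a pairwise alignment as a match, a mismatch (mutation), or an indel. The function name `compare_aligned` and the two toy sequences are hypothetical, invented for illustration; they are not taken from Fig. 7.1, and the sketch assumes gaps are written as dashes.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count matches, mismatches, and indels between two aligned sequences.

    Both inputs are rows of an alignment, so they must have equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for x, y in zip(seq_a, seq_b):
        if x == gap and y == gap:
            # A column of gaps only can appear when two rows are pulled
            # out of a multiple alignment; it carries no information.
            continue
        if x == gap or y == gap:
            counts["indel"] += 1      # character opposite a gap
        elif x == y:
            counts["match"] += 1      # identical characters
        else:
            counts["mismatch"] += 1   # different characters (a mutation)
    return counts


# Toy aligned sequences (illustrative only, not real hemoglobin data).
row_a = "MVHLTPEEK-SAV"
row_b = "MVEWTDAEKRSAI"
print(compare_aligned(row_a, row_b))
# {'match': 7, 'mismatch': 5, 'indel': 1}
```

Applied to the rat and human rows of a real alignment, the same kind of count is what underlies a statement such as "differs at 27 out of 146 amino acids".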
- 7.2: Brute Force Alignment
- One (bad) approach to sequence alignment is to align the two sequences in all possible ways, score each alignment with an assumed scoring system, and keep the highest-scoring alignment. The problem with this brute-force approach is that the number of possible alignments grows exponentially with sequence length, so that even for sequences of reasonable length the computation is infeasible.
- 7.5: Local Alignments
- We have so far discussed how to align two sequences over their entire length, called a global alignment. Often, however, it is more useful to align two sequences over only part of their lengths, called a local alignment. In bioinformatics, the algorithm for global alignment is called "Needleman-Wunsch," and that for local alignment "Smith-Waterman."
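
The two entries above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the chapter's own implementation: the function names `count_alignments` and `best_score` are hypothetical, the scoring values (match +2, mismatch -1, gap -2) are arbitrary, and each gap position is charged the same linear penalty. The first function counts alignments of two sequences when every column holds either two characters or one character opposite a gap; the second fills a dynamic-programming table in the style of Needleman-Wunsch (global) or Smith-Waterman (local) and returns only the best score, not the alignment itself.

```python
def count_alignments(n, m):
    """Number of alignments of sequences of lengths n and m.

    Each column is (char, char), (char, gap), or (gap, char), so the last
    column of any alignment is one of those three cases.
    """
    c = [[1] * (m + 1) for _ in range(n + 1)]  # one way to align with an empty sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i][j] = c[i - 1][j] + c[i][j - 1] + c[i - 1][j - 1]
    return c[n][m]


def best_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best global (default) or local alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            s[i][0] = i * gap
        for j in range(1, cols):
            s[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
            if local:
                # Local alignment: a negative running score restarts at zero,
                # and the best cell anywhere in the table is the answer.
                s[i][j] = max(s[i][j], 0)
                best = max(best, s[i][j])
    return best if local else s[-1][-1]


for n in (2, 3, 10, 20):
    print(n, count_alignments(n, n))       # grows very quickly with n

a, b = "TTTTACGT", "ACGTCCCC"              # toy sequences sharing only "ACGT"
print(best_score(a, b))                    # global: penalized by the unrelated ends
print(best_score(a, b, local=True))        # local: scores only the shared region
```

For the toy pair, the local score counts only the shared `ACGT` segment, while the global score is dragged down by the unrelated ends, which is the situation in which a local alignment is the more useful of the two.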