Biomolecular sequences can be thought of as a long series of letters, four for DNA and twenty for amino acids. Aligning sequences to detect similarities is useful for finding patterns. Common patterns might give insights to common functionality of the sequences, assuming that the important functions were preserved in the genes. We introduce the alignment problem in the context of two sequences and extend to multiple sequence alignment. Three standard approaches are reviewed, comprising clustering techniques, multinomial models using Gibbs sampling, and Hidden Markov Models (HMM).
Motivation
Alignment of two sequences
Alignment of multiple sequences
References
Pharmaceutical companies invest heavily into drug development that targets the actual cause of many diseases like cancer: the human gene. Within the coming years, the entire human genome will be sequenced. From there, it is still a long way to go to understand how genes work, how they interact, and what information is stored where. Statistics in conjunction with biology and computer science will form the core disciplines to help gaining an understanding of the functioning of genes and ultimately identify target genes for curing diseases.
Some genes and even some complete organisms have been identified and the data together with a description of the known attributes are publically available. In analyzing a new gene sequence, a comparison to other sequences might help in obtaining pointers towards the biological functioning of the gene sequence.
The following example of raw data has been taken from the "Human cellular DNA/ Human papilloma virus proviral DNA". The entire sequence contains 1810 A's, 1155 C's, 1279 G's, and 1798 T's, in total 6042 letters.
The sequence begins with
TATGTATGTTGCACTGTCCCGCTTCTGCAGTCCATGCATGTGTGTGTGTATGTGTGGAT ACTTGTGTTTGTGTTTATATTAGTACGTACCACACCATTGGAGGTCTTTGCTGTATATT TACTTTTTTTTTTACTGCCTATGTGGGTATTACACAGTTTTGCTCGTTATAGTATGCCT TAAGTTTTGTATTGTGCATTTGTATTGGTGTATATTTTTATAAATAAATATGGTATCAC ACCGTGCTGCCAGGCGCAAGCGTGCATCTGCAACTGAATTATATAA ...
It is easy to imagine that information extraction like the detection of patterns, repeats, or similarities to another sequence needs some sophisticated tools. This is what the remainder of this overview is about.
Aligning two sequences assumes that there are motifs preserved in both sequences that serve similar functionality. We refer to local alignment in order to identify a part of the two sequences that match best. To calculate the "degree of match", we need to introduce a similarity or distance measure. Based on evolutionary data, scoring matrices like the PAM or the BLOSUM series were derived. The scoring matrices value a pair and assign a score to it. The higher the score the more likely the pair is.
Using this approach, one can directly find the regions of highest similarity between of two sequences. Generalizing it some more, we can allow for insertions and deletions of parts of the sequences.
Aligning is a very computer-intensive task, inserting and deleting parts of sequences of length 500 and computing all scores for all possible local alignments needs substantial computer resources.
It is assumed that functionally important motifs are preserved over time, that is, during evolution. Aligning multiple sequences is a tool to search for common patterns in many (more than two) sequences to detect these patterns. Pattern search means that strands in each sequence that align to strands in other sequences are to be determined.
Aligning two sequences can be extended to aligning many sequences by using clustering techniques that lead to (re-)constructing evolutionary trees. All sequences are assumed to have a position in an evolutionary tree. The more different the sequences are, the more distant they are in the tree.
We reconstruct the tree partially by using a measure of similarity. Using the similarity (or the distance, respectively) one can derive the distance between any two sequences.
Clustal W is a such a method that uses clustering techniques. The essential idea of Clustal W consists of joining sequences subsequently to form clusters. The two sequences that have the smallest distance to each other form a new cluster, whose root is calculated from the nodes the cluster contains. Proceeding iteratively, one finally obtains the entire (sub-)tree of sequences.
A sequence to be aligned to others consists of two parts, the common motif detected, and the part not belonging to the motif. We model the motif positions as multinomial responses arising from position-dependent distributions. The non- motif positions are assigned a position-independent distribution of possible outcomes. Lawrence et. al. (1993) describe a strategy for using Gibbs sampling in multiple alignment problems.
Gibbs sampling is a stochastic approach to simulate samples from the distribution of interest. Iterative samples converge under certain assumptions to the distribution of interest. Inference is then based on the samples generated. Given the sample, quantities like variance or correlation estimates can easily be derived.
As samples are iteratively drawn, the Gibbs sampler does not get stuck in a local optimum, but covers the entire support of the distribution.
Given a set of sequences, the common pattern(s) plus the position(s)
corresponding in each sequence need to be determined, leading to the following
algorithm. To faciliate expressions, we start with a search for one motif of
fixed length.
In detail, the following needs to be calculated:
Finding the optimal alignment is equal to maximizing the log-likelihood. The algorithm can be extended to include a variable width of the pattern. Given the log-likelihood function that we want to maximize, the problem is that with increasing pattern width the log-likelihood tends to increase. On the other hand, modeling longer patterns requires additional parameters, because there are more positions in the pattern. By penalizing extra parameters, an optimum can be found that takes into account the goodness of alignment and the number of parameters (pattern length) required.
If multiple patterns are to be aligned simultaneously, only one of the elements is altered at a time, so that interference between the multiple patterns can be avoided (like overlapping).
Hidden Markov Models (HMM's) consists of two main components,
The hidden part forms a first order Markov chain that models the common underlying motif. Each realization of a sequence that can be observed might in turn deviate to some extent from the motif that cannot be observed directly.
The task in this model frame is to determine the path that most likely generated the set of observations under examination. In order to determine the global maximum likelihood and the corresponding model, the following algorithm is carried out:
To carry out the model estimate, the following calculations need to be done:
Given states y and sequences a,
Calculate
Alternative ways of performing the model identification are simulated annealing and Bayesian approaches.