Statistical Issues in Multiple Sequence Alignment

Andreas Krause

GeneData AG
P.O. Box 254
4016 Basel
Switzerland

Andreas.Krause@genedata.com
http://www.genedata.com


Overview

Biomolecular sequences can be thought of as a long series of letters, four for DNA and twenty for amino acids. Aligning sequences to detect similarities is useful for finding patterns. Common patterns might give insights to common functionality of the sequences, assuming that the important functions were preserved in the genes. We introduce the alignment problem in the context of two sequences and extend to multiple sequence alignment. Three standard approaches are reviewed, comprising clustering techniques, multinomial models using Gibbs sampling, and Hidden Markov Models (HMM).


Contents

Motivation
Alignment of two sequences
Alignment of multiple sequences
References

Motivation

Pharmaceutical companies invest heavily into drug development that targets the actual cause of many diseases like cancer: the human gene. Within the coming years, the entire human genome will be sequenced. From there, it is still a long way to go to understand how genes work, how they interact, and what information is stored where. Statistics in conjunction with biology and computer science will form the core disciplines to help gaining an understanding of the functioning of genes and ultimately identify target genes for curing diseases.

Some genes and even some complete organisms have been identified and the data together with a description of the known attributes are publically available. In analyzing a new gene sequence, a comparison to other sequences might help in obtaining pointers towards the biological functioning of the gene sequence.

The following example of raw data has been taken from the "Human cellular DNA/ Human papilloma virus proviral DNA". The entire sequence contains 1810 A's, 1155 C's, 1279 G's, and 1798 T's, in total 6042 letters.

The sequence begins with

TATGTATGTTGCACTGTCCCGCTTCTGCAGTCCATGCATGTGTGTGTGTATGTGTGGAT ACTTGTGTTTGTGTTTATATTAGTACGTACCACACCATTGGAGGTCTTTGCTGTATATT TACTTTTTTTTTTACTGCCTATGTGGGTATTACACAGTTTTGCTCGTTATAGTATGCCT TAAGTTTTGTATTGTGCATTTGTATTGGTGTATATTTTTATAAATAAATATGGTATCAC ACCGTGCTGCCAGGCGCAAGCGTGCATCTGCAACTGAATTATATAA ...

It is easy to imagine that information extraction like the detection of patterns, repeats, or similarities to another sequence needs some sophisticated tools. This is what the remainder of this overview is about.

Alignment of Two Sequences

Aligning two sequences assumes that there are motifs preserved in both sequences that serve similar functionality. We refer to local alignment in order to identify a part of the two sequences that match best. To calculate the "degree of match", we need to introduce a similarity or distance measure. Based on evolutionary data, scoring matrices like the PAM or the BLOSUM series were derived. The scoring matrices value a pair and assign a score to it. The higher the score the more likely the pair is.

Using this approach, one can directly find the regions of highest similarity between of two sequences. Generalizing it some more, we can allow for insertions and deletions of parts of the sequences.

Aligning is a very computer-intensive task, inserting and deleting parts of sequences of length 500 and computing all scores for all possible local alignments needs substantial computer resources.

Multiple Alignment

It is assumed that functionally important motifs are preserved over time, that is, during evolution. Aligning multiple sequences is a tool to search for common patterns in many (more than two) sequences to detect these patterns. Pattern search means that strands in each sequence that align to strands in other sequences are to be determined.

Cluster methods

Aligning two sequences can be extended to aligning many sequences by using clustering techniques that lead to (re-)constructing evolutionary trees. All sequences are assumed to have a position in an evolutionary tree. The more different the sequences are, the more distant they are in the tree.

We reconstruct the tree partially by using a measure of similarity. Using the similarity (or the distance, respectively) one can derive the distance between any two sequences.

Clustal W is a such a method that uses clustering techniques. The essential idea of Clustal W consists of joining sequences subsequently to form clusters. The two sequences that have the smallest distance to each other form a new cluster, whose root is calculated from the nodes the cluster contains. Proceeding iteratively, one finally obtains the entire (sub-)tree of sequences.

Multinomial Models

A sequence to be aligned to others consists of two parts, the common motif detected, and the part not belonging to the motif. We model the motif positions as multinomial responses arising from position-dependent distributions. The non- motif positions are assigned a position-independent distribution of possible outcomes. Lawrence et. al. (1993) describe a strategy for using Gibbs sampling in multiple alignment problems.

Gibbs sampling is a stochastic approach to simulate samples from the distribution of interest. Iterative samples converge under certain assumptions to the distribution of interest. Inference is then based on the samples generated. Given the sample, quantities like variance or correlation estimates can easily be derived.

As samples are iteratively drawn, the Gibbs sampler does not get stuck in a local optimum, but covers the entire support of the distribution.

Given a set of sequences, the common pattern(s) plus the position(s) corresponding in each sequence need to be determined, leading to the following algorithm. To faciliate expressions, we start with a search for one motif of fixed length.

  • Predictive update step: Choose one of the sequences (randomly or in some order)
  • Calculate pattern and background descriptions from current positions
  • Calculate probabilities for all sub-sequences of length W and assign the weight P/Q to each sub-sequence, with P=probability of generating this subsequence, and Q=background probabilities
  • Draw a random sample of the sub-sequences with the weights as above
  • Repeat ... until convergence in distribution is achieved

In detail, the following needs to be calculated:

  • Qx: Probability of generating a segment x conditioned on (qij)
  • Px: Probability of generating the background of a segment x conditioned on (pj)
  • ai: Starting position of the alignment in sequence i
  • bi: Additional information of the pattern distribution, pseudo-counts in Bayesian context
  • cij: Count of amino acid j in position I
  • qij qij = (cij + bj) / (N-1 + S bj)
  • Probability for amino acid j in position i of the motif Pij calculated analogously with no position dependent information (only over the non-pattern positions)
  • Ax = Qx/Px (normalized).

Probability that pattern belongs at position x in the sequence

Finding the optimal alignment is equal to maximizing the log-likelihood. The algorithm can be extended to include a variable width of the pattern. Given the log-likelihood function that we want to maximize, the problem is that with increasing pattern width the log-likelihood tends to increase. On the other hand, modeling longer patterns requires additional parameters, because there are more positions in the pattern. By penalizing extra parameters, an optimum can be found that takes into account the goodness of alignment and the number of parameters (pattern length) required.

If multiple patterns are to be aligned simultaneously, only one of the elements is altered at a time, so that interference between the multiple patterns can be avoided (like overlapping).

Hidden Markov Models

Hidden Markov Models (HMM's) consists of two main components,

  • A hidden part, the hidden states that cannot be observed and
  • An observable part generating the data

The hidden part forms a first order Markov chain that models the common underlying motif. Each realization of a sequence that can be observed might in turn deviate to some extent from the motif that cannot be observed directly.

The task in this model frame is to determine the path that most likely generated the set of observations under examination. In order to determine the global maximum likelihood and the corresponding model, the following algorithm is carried out:

  • Choose an initial (prior) model
  • Align each sequence to the model
  • Re-estimate the model parameters based on the alignments made
  • Repeat steps 2 and 3 until convergence is reached

To carry out the model estimate, the following calculations need to be done:

Given states y and sequences a,

Calculate

  • n(y) = # of paths through y
  • n(y'|y) = # transitions (y' -> y)
  • m (a|y) = # transitions (y -> a)
  • Estimate P(y'|y) = n(y'|y) / n(y)
  • Estimate P(a|y) = m(a|y) / n(y)
  • Until convergence is reached.

Alternative ways of performing the model identification are simulated annealing and Bayesian approaches.


References

General Texts

  • Boguski, M.S.: Hunting for Genes in Computer Data Bases, The New England Journal of Medicine, 1995, Vol. 333, No. 10, 645-47. Online copy
  • Boguski, M.S.: The Turning Point in Genome Research. Trends in Biochemical Sciences 20:295-6. Online copy
  • Casey, Denise: Primer on Molecular Genetics. Online copy
  • Casey, Denise: Glossary (of Terms in molecular biology): Online copy
  • Waterman, M., Introduction to Computational Biology, Chapman & Hall, 1995. Textbook on the mathematics of computational biology.

Newspaper Articles and General Reviews

  • The Genomics Gamble, Science, 1997, Vol.275, 767-774
  • The Real Biotech Revolution, Fortune, March 31 1997, 28-41
  • Bioinformatics - principles and potential of a new multidisciplinary tool. Tibtech, Aug 1996, Vol. 14, 261-321
  • Scientists link a gene to Colon Cancer. The Wall Street Journal Europe, August 27, 1997.

Alignments

  • Smith, T.F., Waterman, M.S. (1981): The identification of common molecular subsequences. Journal of Molecular Biology, 147, 195-197.
  • Vingron, M.; Waterman, M.: Statistical Significance of Local Alignments with Gaps.

Cluster analysis approaches (Clustal W program)

  • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680. Software and documentation are available from http://www.infobiogen.fr/docs/ClustalW/clustalw.html

Multinomial models (Gibbs Sampling, EM Algorithm approaches)

  • Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science, 1993, Vol. 262, 208-14.
  • Lawrence, C.E., Reilly, A.A.: ?. Proteins, 1990, Vol. 7, No. 41
  • Cardon, L.R., Stormo, G.D., 1992, Journal of Molecular Biology, 223, 159

Hidden Markov Models

  • Eddy, S.R.: Hidden Markov Models, 1996. Current Opinion in Structural Biology, 6:361-365. Review of the use of HMMs for sequence and structure profiles. (online journal.)
  • Eddy, S.R. (1995): Multiple Alignment Using Hidden Markov Models. Proc. Third Int. Conf. Intelligent Systems for Molecular Biology, edited by C. Rawlings et al. AAAI Press, Menlo Park. pp. 114-120. Online copy describes a simulated annealing algorithm for HMM training and a probabilistic suboptimal alignment algorithm. Compares HMM-based multiple alignment to CLUSTALW.
  • Brown, M., Hughey, R, et.al.: Using Dirichlet mixture priors to derive hidden Markov models for protein families. Proc. First Int. Conf. on Intelligent Systems for Molecular Biology, L. Hunter et al., eds. AAAI Press, Menlo Park, 1993. 47-55.