advertisement: compare things at compare-stuff.com!
next up previous contents
Next: Results Up: Methods and Data Previous: Evaluation of alignment quality   Contents

Sequence alignment algorithm

At each position in the query and library sequences, various properties and quantities (discussed in Section 4.1.1) can be calculated from the multiple sequence alignment. These can be combined into a property vector $P$ (although initially we will use only one property at a time, i.e. $P$ will be one dimensional). For query sequence A and library sequence B, the individual vectors corresponding to residues $i$ and $j$ are designated $P^A_i$ and $P^B_j$ respectively. The standard dynamic programming alignment algorithm requires a measure of suitability, $S_{i,j}$ for each residue $i$ in the query sequence (A) at each position $j$ in the library sequence (B), as described in more detail in Section 1.2.1. No penalties are made for terminal gaps. In order to align using property vectors, the value of $S_{i,j}$ is calculated as the normalised and negated Euclidean distance between property vectors as follows:

\begin{displaymath}
S_{i,j} = \frac{\mu - \vert\vert P^A_i - P^B_j \vert\vert}{\sigma}
\end{displaymath} (5)

where $\mu$ and $\sigma$ are the mean and standard deviation of all pairwise Euclidean distances between $i$ and $j$. Thus highly similar property vectors yield high (positive) scores for $S_{i,j}$, and the most different property vectors yield the lowest (negative) scores.

In order to measure local similarity, $S_{i,j}$ is calculated over a small local window of $w$ residues each side of the central residue as follows:

\begin{displaymath}
S_{i,j} = \frac{\mu - \sum_{k=-w}^{k<=w}\vert\vert P^A_{i+k} - P^B_{j+k} \vert\vert}{\sigma}
\end{displaymath} (6)

where the mean and standard deviation now refer to the distribution of the summed distances. In the following results a window of size $w=3$, a total of 7 residues, is used. When the window overlaps with the end of the sequence, Equation 4.2 uses the data for terminal residues in place of the non-existent ones.

To speed up the alignments of longer sequences, only values of $S_{i,j}$ within a certain distance of the central diagonal are calculated. This distance (also, unfortunately, called a window) is set to $50+\vert len(A)-len(B)\vert$. Thus when $len(A)=len(B)$, a maximum shift of more than 50 residues between the two sequences is not possible. The restriction could be removed at a later stage in the development or application of the method.

Gap open and gap extension penalties of 3 and 0.5 respectively were used unless otherwise stated. These parameters are usually optimised empirically according to the requirements of a particular method. In this case, gap penalties were set early on in this investigation on the basis of a few trials, with the intention to refine them later.

Many threading methods use the dynamic programming algorithm in various forms, including local alignment[Jones et al., 1992], global alignment[Bowie et al., 1990,Matsuo & Nishikawa, 1995], and the so-called global-local alignment[Fischer & Eisenberg, 1996,Rice & Eisenberg, 1997]. These protocols basically differ in the scoring of terminal gaps and the extent of the alignment. The processing of scores in fold recognition is something of a black art. Theoretical proof exists to show that the scores from local alignments follow a Poisson-like distribution from which reasonable estimates of biological significance can be drawn[Henikoff, 1996,Bryant & Altschul, 1995, for reviews]. The widely used sequence database searching methods BLAST[Altschul et al., 1990], FASTA[Pearson, 1990] and Smith-Waterman[Smith & Waterman, 1981] all use local alignments. However, the distributions of global alignment scores are less well understood. Some methods use the global method to generate the alignment, and then calculate an energy score based on mounting the query sequence onto the library structure[Matsuo & Nishikawa, 1995], thus avoiding the direct use of the global score. Using global or local scores, Z-scores for each query-library pair can be calculated independently using scores from the alignments of randomised sequences[Rice & Eisenberg, 1997].

We found, however, that the global alignment algorithm[Needleman & Wunsch, 1970] gave the best results when combined with a simple score normalisation step. Before the calculation of Z-scores, the dynamic programming score is divided by the sum of the lengths of the two protein sequences. Without this correction, longer alignments (from longer library sequences) rank higher than they should.


next up previous contents
Next: Results Up: Methods and Data Previous: Evaluation of alignment quality   Contents
Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.