
Secondary structure prediction

The ab initio methods discussed so far have predicted only global structural features. The next level in the hierarchy of protein structure is the description of the secondary structure of each residue. Currently, each residue can be assigned to one of three states (helix, strand or coil) with at best about 70-75% per-residue accuracy (the $Q_3$ measure) when multiple sequence alignments are used as input. This makes secondary structure prediction a widely used tool in the analysis of novel sequences. Secondary structure predictions have been used in fold recognition (discussed above), in the manual assembly of tertiary structure models, and in folding simulations (see below). Methods to predict secondary structure have been in development for several decades and consequently the literature is extensive. Here, its history is briefly summarised and the most widely used methods and their limitations are discussed.
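As an illustration of the $Q_3$ measure mentioned above, the following minimal sketch computes the percentage of residues whose three-state label is predicted correctly. The function name `q3`, the single-letter state codes (H, E, C) and the two example strings are assumptions made for illustration only; they are not taken from any of the cited programs.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("Q3 is undefined for empty sequences")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical example: one helix shortened by two residues, one strand by one.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"
print(f"Q3 = {q3(predicted, observed):.1f}%")
```

Note that the score only counts per-residue agreement: in this toy case both elements are found, yet three boundary residues are counted as errors.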

The earliest attempts to predict secondary structure were severely limited by the small number of available structures and minimal computing resources. Nevertheless, predictions with around 60-65% $Q_3$ accuracy were achieved using simple statistical or theoretical estimates of helix- and strand-forming propensity[Periti et al., 1967,Low et al., 1968,Ptitsyn, 1969,Nagano, 1973,Chou & Fasman, 1974,Garnier et al., 1978] and complex manual analyses of patterns of physico-chemical side-chain properties[Lim, 1974a,Lim, 1974b].
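A simple statistical propensity of the kind used by these early methods can be sketched as follows. One common definition, assumed here, is the fraction of a given amino acid found in a state divided by the fraction of all residues found in that state, so that values above 1 indicate a preference for the state. The function name and the toy data are hypothetical, and the resulting numbers are not the published parameters of any cited method.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Return {amino acid: propensity} for one secondary structure state."""
    aa_total = Counter()      # occurrences of each amino acid
    aa_in_state = Counter()   # occurrences of each amino acid in `state`
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_total = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError(f"state {state!r} does not occur in the data")
    background = n_state / n_total  # fraction of all residues in `state`
    return {aa: (aa_in_state[aa] / aa_total[aa]) / background
            for aa in aa_total}


# Toy data set (an assumption): two short 'proteins' with known structure.
seqs = ["AAGAAG", "GAGGAG"]
strs = ["HHCHHC", "CHCHHC"]
for aa, p in sorted(propensities(seqs, strs, "H").items()):
    print(aa, round(p, 3))
```

With so little data the values are meaningless; the point is only that such propensities require nothing more than counting, which is why they were feasible with the few structures then available.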

In the intervening years, the increased availability of structural information has enabled the analysis of sequence/structure correlations for pairs (or larger groups) of amino acids[Gibrat et al., 1987,Rooman & Wodak, 1991,Han & Baker, 1995,Han & Baker, 1996]. However, the predictive power of these correlations is not sufficient to improve the overall accuracy of secondary structure prediction. The secondary structure of a particular residue is determined partly by its local sequence environment, but also by its environment with respect to the rest of the folded (or folding) protein, which may be separated from it by tens or hundreds of residues along the chain[Kabsch & Sander, 1984].

Although Robson and colleagues[Garnier et al., 1978] recognised that aligned protein sequences would provide valuable evolutionary information relevant to secondary structure prediction using their algorithm (known as the GOR method), it was not until relatively recently that the databases contained enough related sequences to put this into practice[Zvelebil et al., 1987]. The improvement was significant: many workers have repeated the experiment with a variety of algorithms, and in each case a 5-10% increase in $Q_3$ was reported. Multiple sequence alignments provide information about core secondary structures through the conservation of amino acids and the location of insertions and deletions. Sequence conservation also highlights patterns of key physico-chemical features which relate to secondary structures, reducing the noise problem encountered with single sequences.

The most widely used methods are currently the statistics-based GOR method, chosen for its algorithmic simplicity and ease of implementation, and the PHD program of Rost and Sander[rost:better,rost:phd3]. The latter has a $Q_3$ accuracy of around 72%; the top 40% of residue predictions, ranked by a reliability index, are 88% accurate. The PHD method employs feed-forward neural networks to perform the non-linear mapping from a window of amino-acid profiles to one of the three secondary structural states. The structural preferences for the various local sequence environments are learnt by the network during training on sequence alignments with known structures. A PHD prediction service is available via email and the World-Wide Web, but the program is not generally distributed in stand-alone form.

An interesting recent development has been the DSC program of King and Sternberg[king:dsc]. This multiple sequence method combines a number of previously used predictive attributes: GOR-like residue propensities, helix- and strand-like patterns of hydrophobicity and conservation, sequence-edge effects and amino acid composition. A linear discrimination function[Weiss & Kulikowski, 1991,Michie et al., 1994] is used to determine the relative contribution of each sequence-based attribute to the final prediction, which is 70% accurate ($Q_3$) using the same test set as Rost and Sander. The DSC algorithm is reported to perform marginally better than PHD on sequences of 90-170 residues. The algorithm is fast, easily implemented and requires little computer memory; most importantly, the source code is freely available to the scientific community. This work also demonstrates that neural networks are not unique in being able to extract effectively the information relevant to secondary structure formation. Furthermore, simpler methods such as this allow a better understanding of the underlying principles. This point has also been stressed by Benner and colleagues[Benner & Gerloff, 1991,Benner et al., 1994], who prefer to perform secondary structure predictions (including the location of elements with respect to the core) by manual analysis of patterns of conservation and residue types.
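The idea of presenting a window of amino-acid profiles to a classifier, as described for PHD above, can be sketched as follows. This is not the PHD encoding itself: the window half-width, the zero padding beyond the chain termini, the handling of gap characters and all function names are assumptions made for illustration. The resulting flat vector is the kind of input that a feed-forward network (or a linear discriminant) could map to helix, strand or coil.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment):
    """Per-column amino-acid frequencies from equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        # Gaps and unknown characters are ignored (an assumption).
        residues = [s[col] for s in alignment if s[col] in AMINO_ACIDS]
        n = len(residues)
        profile.append([residues.count(a) / n if n else 0.0
                        for a in AMINO_ACIDS])
    return profile


def window_features(profile, centre, half_width=6):
    """Flatten the profile rows around `centre` into one input vector.

    Positions beyond either end of the chain are filled with zeros.
    """
    blank = [0.0] * len(AMINO_ACIDS)
    features = []
    for pos in range(centre - half_width, centre + half_width + 1):
        row = profile[pos] if 0 <= pos < len(profile) else blank
        features.extend(row)
    return features


# Hypothetical three-sequence alignment.
alignment = ["ACDA", "ACEA", "GCDA"]
profile = column_profile(alignment)
vector = window_features(profile, centre=0)
print(len(vector))  # (2 * 6 + 1) positions x 20 amino acids
```

One vector is produced per residue, so a protein of length N yields N training or prediction examples, each labelled with the state of the central residue.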

Despite redundancy in the mapping between local sequence and structure[Kabsch & Sander, 1984], 'nearest neighbour' methods[Yi & Lander, 1993,Salamov & Solovyev, 1995,Frishman & Argos, 1996,Salamov & Solovyev, 1997] also have around 70% $Q_3$ accuracy. For short fragments of query sequence, these methods search a database of sequences with known structure and allocate secondary structure according to that of the nearest neighbours (or top hits). The PREDATOR program of Frishman and Argos[frishman:predator,frishman:predator2] additionally uses amino acid pair statistics to predict hydrogen bonds between neighbouring $\beta $-strands and between residues $i,i+4$ in helices, and has a reported $Q_3$ of 75% using multiple sequences (but not multiple sequence alignments). It is not yet clear whether this extra 3% is attributable to the hydrogen bonding predictions. Long-range contacts cannot be usefully predicted using statistics-based methods[Thomas et al., 1996,Gobel et al., 1994,Olmea & Valencia, 1997]. The source code for PREDATOR is available to academics, and an internet-based prediction service is also available.
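The nearest neighbour idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of any cited program: fragments are scored by plain residue identity (an assumption standing in for the substitution or environment-based scores a real method would use), and the window size, the value of `k`, the function names and the two-entry database are all hypothetical.

```python
from collections import Counter


def window(seq, centre, half_width, pad="-"):
    """Fragment of length 2*half_width+1 centred on `centre`, padded at the ends."""
    padded = pad * half_width + seq + pad * half_width
    return padded[centre:centre + 2 * half_width + 1]


def identity(a, b):
    """Number of positions with identical residues (padding never matches)."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


def predict(query, database, half_width=2, k=3):
    """Assign each query residue the majority state of its k nearest fragments.

    `database` is a list of (sequence, structure) pairs of equal length.
    """
    # Every database fragment is stored with the state of its central residue.
    library = [(window(seq, i, half_width), ss[i])
               for seq, ss in database
               for i in range(len(seq))]
    prediction = []
    for i in range(len(query)):
        q = window(query, i, half_width)
        best = sorted(library, key=lambda item: identity(q, item[0]),
                      reverse=True)[:k]
        votes = Counter(state for _, state in best)
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)


# Toy database (an assumption): one helical and one strand-like fragment.
database = [("AEELLKKA", "CHHHHHHC"), ("GVTVTVYG", "CEEEEEEC")]
print(predict("EELLKK", database, k=1))
print(predict("VTVTV", database, k=1))
```

The exhaustive search over all fragments is quadratic in practice and only suitable for a toy database; it is used here to keep the voting step easy to follow.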

Using only single query sequences, the SSPAL method of Salamov and Solovyev[salamov:jmb97] uses multiple overlapping local alignments with sequences of known secondary structure in a nearest neighbour-like way and achieves 71% $Q_3$ accuracy, which is uncharacteristically high for a single sequence method (their multiple sequence nearest neighbour method NNSSP[Salamov & Solovyev, 1995] is 72% accurate by the same measures). Local sequence space seems to be well populated in sequences related to known structures. Thus pairwise local alignments or comparisons using multiple sequences (as in SSPAL or PREDATOR) may detect local sequence-structure correlations with slightly more sensitivity than the usual approach of looking for patterns in multiple sequence alignments.

Can secondary structure prediction ever be expected to surpass 70-75% $Q_3$ accuracy? Even assuming that long-range tertiary interactions can be incorporated into these algorithms, the best possible $Q_3$ will not, on average, be 100%. Related structures do not share identical strings of secondary structure assignments[Russell & Barton, 1993,Rost et al., 1994]. Even when a query sequence can be aligned confidently to a sequence of known structure, the alignment will produce a secondary structure 'prediction' with a $Q_3$ of only 88% on average[Rost et al., 1994]. Predictions based on multiple sequence alignments cannot be expected to perform better than this. This ceiling would obviously rise as more structures become known. Why then is $Q_3$ used so universally when 75% does not really mean three-quarters correct? Other measures based on the quality of secondary structure element prediction have been proposed[Presnell et al., 1992,Rost et al., 1994,Wang, 1994,Zhu, 1995], but $Q_3$ has survived due to its simplicity and universality.

Frishman and Argos[frishman:future] recently tested PREDATOR using current and previous releases of the SWISS-PROT[Bairoch & Boeckmann, 1991,Bairoch & Apweiler, 1997] sequence database as the source of multiple sequences for a jack-knifed test set of 125 proteins. By extrapolation they expect an increase in $Q_3$ of around 5-10% given a ten-fold increase in sequence database size. It is claimed that the accuracy of the PREDATOR method is not limited by the variation of observed secondary structures amongst different family members in a global multiple alignment[Russell & Barton, 1993] because it uses pairwise local alignments instead. Thus the projected 80-85% accuracy cannot be compared to the 88% ceiling discussed above; there would still be room for significant improvement.

Further improvements in secondary structure prediction may require more attention to specific sequential and structural motifs[Han & Baker, 1996], such as turns[Hutchinson & Thornton, 1994,Yang et al., 1996], $\beta $-sheets, helix termini[Jimenez et al., 1994,Aurora et al., 1994,Donnelly et al., 1994,Elmasry & Fersht, 1994] and super-secondary structures[Taylor & Thornton, 1984]. It may be, however, that substantial advances will be made only when tertiary interactions are fully taken into account. This is discussed in the following section.


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.