The ab initio methods discussed so far have predicted only global
structural features. The next level in the hierarchy of protein structure
is the description of the secondary structure of each residue. Currently,
each residue can be assigned to one of three states (helix, strand or coil)
with at best about 70-75% per-residue accuracy (the Q3 measure) when multiple
sequence alignments are used as input, and such predictions have become a
widely used tool in the analysis of novel sequences. Secondary structure
predictions have been used in fold
recognition (discussed above), in the manual assembly of tertiary structure
models, and in folding simulations (see below). Methods to predict
secondary structure have been in development for several decades and
consequently the literature is extensive. Here, its history is briefly
summarised and the most widely used methods and their limitations are
discussed.
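The three-state per-residue accuracy referred to above (commonly written Q3) is simply the percentage of residues whose helix, strand or coil label is predicted correctly. A minimal sketch follows; the function name `q3`, the one-letter state codes H, E and C, and the toy strings are assumptions made for illustration and do not come from any of the cited programs.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly.

    Both arguments are strings over the (assumed) alphabet H = helix,
    E = strand, C = coil, one character per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty protein")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy strings invented for illustration; they are not real data.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"
print(f"Q3 = {q3(predicted, observed):.1f}%")
```

Because every residue counts equally, a prediction that shortens or fragments a helix or strand can still score well, which is one reason element-based measures have also been proposed.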
The earliest attempts to predict secondary structure were severely limited
by the small number of available structures and minimal computing
resources. Nevertheless, predictions with around 60-65% accuracy
were achieved using simple statistical or theoretical estimates of helix-
and strand-forming propensity[Periti et al., 1967,Low et al., 1968,Ptitsyn, 1969,Nagano, 1973,Chou & Fasman, 1974,Garnier et al., 1978]
and complex manual analyses of patterns of physico-chemical side-chain
properties[Lim, 1974a,Lim, 1974b].
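A statistical propensity of the kind used by these early methods can be sketched as the frequency with which an amino acid occurs in a given state, divided by the overall frequency of that state. The code below is a simplified illustration only: it does not reproduce the published rules of any cited method, and the function names, the window size, the threshold of 1.0 and the toy training data are all assumptions.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Return {amino acid: propensity} for one state (e.g. 'H' or 'E').

    Propensity = (fraction of that amino acid found in `state`)
                 / (fraction of all residues found in `state`).
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError(f"state {state!r} never occurs in the training data")
    state_fraction = n_state / sum(aa_total.values())
    return {aa: (aa_in_state[aa] / aa_total[aa]) / state_fraction
            for aa in aa_total}


def predict(seq, helix_p, strand_p, window=5):
    """Label each residue H, E or C from window-averaged propensities.

    A residue is called coil unless the better of the two averages exceeds
    1.0 (an assumed threshold); unseen amino acids get a neutral value of 1.0.
    """
    half = window // 2
    labels = []
    for i in range(len(seq)):
        win = seq[max(0, i - half): i + half + 1]
        h = sum(helix_p.get(a, 1.0) for a in win) / len(win)
        e = sum(strand_p.get(a, 1.0) for a in win) / len(win)
        if max(h, e) <= 1.0:
            labels.append("C")
        else:
            labels.append("H" if h >= e else "E")
    return "".join(labels)


# Toy training data invented for illustration; not taken from any database.
train_seqs = ["AAEELLKKGGVVIIVV"]
train_ss = ["HHHHHHHHCCEEEEEE"]
helix_p = propensities(train_seqs, train_ss, "H")
strand_p = propensities(train_seqs, train_ss, "E")
print(predict("AAEELLKK", helix_p, strand_p))
```

With so little training data the propensities are meaningless; the sketch only shows why the small number of known structures limited the reliability of such estimates.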
In the intervening years, the increased availability of structural information has enabled the analysis of sequence/structure correlations for pairs (or larger groups) of amino acids[Gibrat et al., 1987,Rooman & Wodak, 1991,Han & Baker, 1995,Han & Baker, 1996]. However, the predictive power of these correlations is not sufficient to improve the overall accuracy of secondary structure prediction. The secondary structure of a particular residue is determined partly by its local sequence environment, but also by its environment with respect to the rest of the folded (or folding) protein, which may be separated from it by tens or hundreds of residues along the chain[Kabsch & Sander, 1984].
Although Robson and colleagues[Garnier et al.,
1978] recognised that
aligned protein sequences would provide valuable evolutionary
information relevant to secondary structure prediction using their
algorithm (known as the GOR method), it was not until relatively
recently that the databases contained enough related sequences to put
this into practice[Zvelebil et al.,
1987]. The improvement was a
significant one: many workers have repeated the experiment with a
variety of algorithms, and in each case a 5-10% increase in Q3 accuracy was
reported. Multiple sequence alignments provide information about core
secondary structures through the conservation of amino acids and the
location of insertions and deletions. Sequence conservation also
highlights patterns of key physico-chemical features which relate to
secondary structures, reducing the noise problem in single sequences.
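The kind of information a multiple sequence alignment contributes can be illustrated by turning each alignment column into a residue-frequency profile and a simple conservation score. This is a hedged sketch rather than the input encoding of any particular program: the function names, the entropy-based score normalised over 21 symbols (20 amino acids plus gap) and the toy alignment are assumptions.

```python
import math
from collections import Counter


def column_profiles(alignment):
    """Per-column symbol frequencies for a list of equal-length aligned sequences.

    Gaps ('-') are counted as a symbol of their own, so columns rich in
    insertions and deletions remain visible in the profile.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profiles = []
    for column in zip(*alignment):
        counts = Counter(column)
        profiles.append({sym: n / len(column) for sym, n in counts.items()})
    return profiles


def conservation(profile):
    """1.0 for an invariant column, smaller for more variable columns.

    Uses Shannon entropy normalised by log2(21); this scoring choice is an
    assumption made for illustration.
    """
    entropy = -sum(p * math.log2(p) for p in profile.values())
    return 1.0 - entropy / math.log2(21)


# Toy alignment invented for illustration.
alignment = ["MKVLA-G",
             "MKILAEG",
             "MRVLS-G"]
profiles = column_profiles(alignment)
for i, prof in enumerate(profiles):
    print(i, f"{conservation(prof):.2f}", prof)
```

Columns with high conservation and few gaps are candidates for core secondary structure, whereas gap-rich columns are more likely to lie in loops.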
The most widely used methods are currently the statistics-based GOR method,
for its algorithmic simplicity and easy implementation, and the PHD program
of Rost and Sander[rost:better,rost:phd3]. The latter has a Q3
accuracy of around 72%, and the top 40% of residue
predictions, ranked by a reliability index, are 88% accurate. The PHD
method employs feed-forward neural networks to perform the non-linear
mapping from a window of amino-acid profiles to one of the three secondary
structural states. The structural preferences for the various local
sequence environments are learnt by the network during training on sequence
alignments with known structures. A PHD prediction service is available
via email and the World-Wide Web, but is not generally distributed as a
stand-alone program. An interesting recent development has been the DSC
program of King and Sternberg[king:dsc]. This multiple sequence
method combines a number of previously used predictive attributes: GOR-like
residue propensities, helix- and strand-like patterns of hydrophobicity and
conservation, sequence-edge effects and amino acid composition. A linear
discrimination function[Weiss & Kulikowski, 1991,Michie et al.,
1994] is used to
determine the relative contributions of each sequence-based attribute to
the final prediction, which is 70% accurate (Q3) using the same test
set as Rost and Sander. The DSC algorithm is reported to perform
marginally better than PHD on sequences of 90-170 residues. The algorithm
is fast, easily implemented and requires little computer memory, but most
importantly, the source code is freely available to the scientific
community. This work also demonstrates that neural networks are not unique
in being able to extract effectively the information relevant to secondary
structure formation. Furthermore, simpler methods such as this allow a
better understanding of the underlying principles. This point has also
been stressed by Benner and colleagues[Benner & Gerloff, 1991,Benner et al.,
1994] who
prefer to perform secondary structure predictions (including location of
elements with respect to the core) with a manual analysis of patterns of
conservation and residue types.
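The statement that the most reliable subset of residue predictions is more accurate than the whole can be made concrete with a small calculation. In the sketch below the reliability of each residue is taken to be the margin between the two highest state scores; that definition, the function name `accuracy_at_coverage` and the toy scores are assumptions for illustration and are not taken from the PHD or DSC programs.

```python
def accuracy_at_coverage(scores, observed, coverage):
    """Accuracy (%) over the most reliable fraction of residue predictions.

    scores   : list of dicts, one per residue, mapping 'H', 'E', 'C' to a score
    observed : string of observed states, one character per residue
    coverage : fraction of residues to keep, e.g. 0.4 for the top 40%
    """
    ranked = []
    for residue_scores, obs in zip(scores, observed):
        ordered = sorted(residue_scores.items(), key=lambda kv: kv[1], reverse=True)
        (best_state, best), (_, second) = ordered[0], ordered[1]
        # Reliability = margin between the best and second-best state.
        ranked.append((best - second, best_state == obs))
    ranked.sort(key=lambda item: item[0], reverse=True)
    n_keep = max(1, round(coverage * len(ranked)))
    kept = ranked[:n_keep]
    return 100.0 * sum(is_correct for _, is_correct in kept) / n_keep


# Toy scores invented for illustration.
scores = [
    {"H": 0.90, "E": 0.05, "C": 0.05},
    {"H": 0.50, "E": 0.10, "C": 0.40},
    {"H": 0.10, "E": 0.80, "C": 0.10},
    {"H": 0.20, "E": 0.35, "C": 0.45},
    {"H": 0.10, "E": 0.20, "C": 0.70},
]
observed = "HCEEC"
print(accuracy_at_coverage(scores, observed, 1.0))  # all residues
print(accuracy_at_coverage(scores, observed, 0.4))  # most reliable 40%
```

Reporting accuracy as a function of coverage in this way lets a user decide how much of a prediction to trust.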
Despite redundancy in the mapping between local sequence and
structure[Kabsch & Sander, 1984], `nearest neighbour'
methods[Yi & Lander, 1993,Salamov & Solovyev, 1995,Frishman & Argos, 1996,Salamov & Solovyev, 1997] also
have around 70% accuracy. For short fragments of query sequence,
these methods search a database of sequences with known structure and
allocate secondary structure according to that of the nearest neighbours
(or top hits). The PREDATOR program of Frishman and
Argos[frishman:predator,frishman:predator2] additionally uses
amino acid pair statistics to predict hydrogen bonds between neighbouring
β-strands and between residues
in helices, and has a reported Q3
of 75% using multiple sequences (but not multiple sequence
alignments). It is not yet clear whether this extra 3% is attributable to
the hydrogen bonding predictions. Long-range contacts cannot be usefully
predicted using statistics-based
methods[Thomas et al., 1996,Göbel et al., 1994,Olmea & Valencia, 1997]. The source code
for PREDATOR is available to academics, and an internet-based prediction
service is also available.
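The nearest neighbour idea described above can be sketched in a few lines: each short fragment of the query is compared with every fragment of the same length in a database of sequences with known structure, and the central residue takes the majority state of the best-matching fragments. The sketch scores fragments by simple residue identity, which is an assumption; the cited methods use more elaborate scoring. The function names, window size, value of k and toy database are likewise hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))


def nn_predict(query, database, window=5, k=3):
    """Predict a three-state string for `query` by nearest-neighbour voting.

    database : list of (sequence, structure) pairs of equal length
    window   : fragment length (odd), centred on the residue being predicted
    k        : number of top-scoring database fragments that vote
    """
    half = window // 2
    # Every database fragment, stored with the state of its central residue.
    fragments = []
    for seq, ss in database:
        for i in range(half, len(seq) - half):
            fragments.append((seq[i - half: i + half + 1], ss[i]))

    prediction = []
    for i in range(len(query)):
        if i < half or i >= len(query) - half:
            # No full window at the chain ends: default to coil (assumption).
            prediction.append("C")
            continue
        frag = query[i - half: i + half + 1]
        nearest = sorted(fragments,
                         key=lambda f: identity(frag, f[0]),
                         reverse=True)[:k]
        votes = Counter(state for _, state in nearest)
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)


# Toy database invented for illustration; not real structures.
database = [("AEELLKKAEELLKK", "HHHHHHHHHHHHHH"),
            ("GVTVTVGSGVTVTV", "CEEEEECCCEEEEE")]
print(nn_predict("AEELLKK", database))
print(nn_predict("GVTVTVG", database))
```

The exhaustive search shown here scales with the size of the database, so a practical implementation would need a faster way of finding the top hits.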
The SSPAL method of Salamov and
Solovyev[salamov:jmb97] takes only a single query sequence, builds multiple
overlapping local alignments with sequences of known secondary structure in
a nearest neighbour-like way, and achieves 71% accuracy, which is
uncharacteristically high for a single sequence method (their multiple sequence
nearest neighbour method NNSSP[Salamov & Solovyev, 1995] is 72% accurate by the
same measures). Local sequence space seems to be well populated in
sequences related to known structures. Thus pairwise local alignments or
comparisons using multiple sequences (as in SSPAL or PREDATOR) may detect
local sequence-structure correlations with slightly more sensitivity than
the usual approach of looking for patterns in multiple sequence alignments.
Can secondary structure prediction ever be expected to surpass 70-75%
accuracy? Even assuming that long-range tertiary interactions can be
incorporated into these algorithms, the best possible Q3 will not, on
average, be 100%. Related structures do not share identical strings of
secondary structure assignments[Russell & Barton, 1993,Rost et al., 1994]. Even
when a query sequence can be aligned confidently to a sequence of known
structure, the alignment will produce a secondary structure `prediction'
with a Q3 of only 88% on average[Rost et al., 1994]. Predictions
based on multiple sequence alignments cannot be expected to perform better
than this. This ceiling would obviously rise as more structures become
known. Why then is Q3 used so universally when 75% does not really
mean three-quarters correct?
Other measures based on the quality of secondary structure
element prediction have been
proposed[Presnell et al., 1992,Rost et al., 1994,Wang, 1994,Zhu, 1995],
but Q3 has survived due to its simplicity and universality.
Frishman and Argos[frishman:future] recently tested PREDATOR
using current and previous
SWISS-PROT[Bairoch & Boeckmann, 1991,Bairoch & Apweiler, 1997] sequence database
releases from which to draw multiple sequences for the jack-knifed test set
of 125 proteins. By extrapolation they expect an increase in Q3 of
around 5-10% given a ten-fold increase in sequence database size. It is
claimed that the accuracy of the PREDATOR method is not limited by the
variation of observed secondary structures amongst different family members
in a global multiple alignment[Russell & Barton, 1993] because it uses pairwise
local alignments instead. Thus the projected 80-85% accuracy cannot be
compared to the 88% ceiling discussed above; there would still be room for
significant improvement.
Further improvements in secondary structure prediction may require more
attention to specific sequential and structural motifs[Han & Baker, 1996],
such as turns[Hutchinson & Thornton, 1994,Yang et al., 1996], β-sheets, helix
termini[Jimenez et al., 1994,Aurora et al., 1994,Donnelly et al., 1994,Elmasry & Fersht, 1994]
and super-secondary structures[Taylor & Thornton, 1984]. But maybe the only
substantial advances will be made when tertiary interactions are fully
taken into account. This is discussed in the following section.