Is there any justification for yet another body of work on secondary structural class prediction from amino acid composition? We have made modifications to the method of Nakashima et al. which take account of multiple sequences and local sequence patterns and result in modest improvements in prediction accuracy. The main results and conclusions from this work are summarised below:
It has been shown that certain global features, such as the presence of
helix, can be predicted with usable accuracy and reliability using a simple
method. The sequence information being used is not identical to that used
by per-residue secondary structure prediction algorithms such as
PHD[Rost & Sander, 1994] and DSC[King & Sternberg, 1996]. The PHD algorithm makes use of
global amino acid composition, but the relative importance of the amino
acids has not been determined. For DSC, it was found that the relative
proportions of His, Glu, Gln, Asp and Arg were beneficial to overall
prediction accuracy. However, of these, only Glu and Arg show significant
differences in frequency across the three main classes in our study, and
both are over-abundant in the two helix-containing classes. DSC, by
contrast, finds that the global occurrence of Glu and Arg favours
β-strand prediction. We have shown that the over-abundant amino
acids in the various classes cannot be explained purely in terms of helix
and strand forming propensity. Secondary structure predictions rely most
heavily on this local information. The results from PHD suggest that the
alternative, global information can improve per-residue accuracy by no more than 1%.
It is unlikely that the use of a discrete three-state class prediction from
a method such as ours would be significantly better. The incorporation of
reliability information, and maybe even architecture predictions, into a
PHD-like algorithm may be more successful, however.
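As an illustration of what "global amino acid composition" means in practice, the following minimal Python sketch computes the fractional composition of a sequence and averages it over a set of related sequences. The function names, the unweighted averaging and the toy sequences are assumptions made for illustration only; this is not the procedure of Nakashima et al., nor necessarily the one used in this work.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code

def composition(sequence):
    """Return the fractional composition over the 20 standard amino acids."""
    # Non-standard symbols (gaps, X, etc.) are ignored.
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def mean_composition(sequences):
    """Unweighted mean composition over several sequences (an assumption)."""
    comps = [composition(s) for s in sequences]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}

# Hypothetical toy sequences standing in for a set of aligned homologues.
family = ["AEELLKRAEE", "AEELLKKAEE"]
profile = mean_composition(family)
```

The resulting 20-component vector is the kind of global descriptor discussed above; how the components are weighted for class prediction is not specified here.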
Class in CATH is defined using 3D geometric
criteria[Michie et al., 1996]; it is assigned automatically for 90% of
domains and with manual intervention for the remaining 10%. A recent
analysis of secondary structural class prediction by Eisenhaber et
al. [eisenhaber:sscp2] used class definitions based upon
thresholds of secondary structural content chosen by Nakashima et
al. [nakashima:cpred]. They found that the upper limit to class
prediction accuracy using amino acid composition was around 60% with four
classes (mainly-α, mainly-β, mixed-αβ and
irregular). PHD class predictions are reported to be 75% accurate (using
the thresholds of Zhang and Chou [zhang:protsci92]). These
methods are restricted to the prediction of classes defined without
recourse to 3D information. A direct comparison between class
predictions generated indirectly from per-residue predictions and those from our
method is not possible because we use the CATH class definitions.
The reliability measure used here may not be particularly sophisticated or original, but this work shows for the first time that it can be put into practice, particularly in the context of hierarchical predictions. The heuristic dissection and hierarchical prediction of fold space according to CATH classifications and amino acid composition may not be easily cross-validated, but it appears to be a valid approach. Once the hierarchical approach has been automated, it would be informative to compare its results with those from the fold recognition methods presented in Chapter 4.
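The idea of gating a hierarchical prediction on reliability can be sketched as follows. This is an illustrative sketch only: the reliability measure shown (the margin between the best and second-best score), the threshold value, the function names and the scores are all assumptions, and are not taken from the method described in this work.

```python
def top_with_reliability(scores):
    """Return (best label, margin between best and second-best score).

    The margin is a hypothetical stand-in for a reliability measure.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_label, best_score - runner_up

def hierarchical_predict(class_scores, arch_scores_by_class, threshold=0.2):
    """Predict class first; predict architecture only if the class call is reliable.

    The threshold of 0.2 is arbitrary and chosen for illustration.
    """
    cls, reliability = top_with_reliability(class_scores)
    if reliability < threshold or cls not in arch_scores_by_class:
        return cls, None, reliability  # stop at class level
    arch, _ = top_with_reliability(arch_scores_by_class[cls])
    return cls, arch, reliability

# Hypothetical scores for one domain.
class_scores = {"mainly-alpha": 0.15, "mainly-beta": 0.60, "alpha-beta": 0.25}
arch_scores = {"mainly-beta": {"ribbon": 0.7, "sandwich": 0.2, "barrel": 0.1}}
result = hierarchical_predict(class_scores, arch_scores)
```

With these toy numbers the class call is accepted and an architecture is returned; with a narrower margin the function stops at class level and returns `None` for the architecture.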
Global sequence features have been identified which appear to discriminate
between the architectures of the mainly-β class. Cysteine, not
surprisingly, is a common building block of many domains with ribbon
architectures. The ease with which the ribbon domains can be distinguished
may be, in part, due to the commonality of function in this architecture;
hormones predominate and many of these domains may be distant homologues.
In the sandwich, barrel, and distorted sandwich architectures, the
Asp-X-X-Gly pattern is found to be significantly favoured in turns, and
most frequently occurs in barrel domains. A quantitative analysis of the
frequencies of certain sequence patterns in particular secondary structural
environments between different fold types has not been performed, but would
provide valuable information. It is possible that the principles of
folding and organisation are qualitatively different from one type of
protein structure to another.
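Locating the Asp-X-X-Gly pattern in a sequence is straightforward; a minimal Python sketch is given below. The function name and the toy sequence are hypothetical. The sketch searches the sequence alone: restricting hits to turns, as in the analysis above, would additionally require per-residue secondary structure assignments, which are not modelled here.

```python
import re

# Asp-X-X-Gly in one-letter code: D, any two residues, then G.
# The lookahead allows overlapping occurrences to be reported.
DXXG = re.compile(r"(?=(D[A-Z]{2}G))")

def find_dxxg(sequence):
    """Return (0-based start, matched tetrapeptide) for each Asp-X-X-Gly."""
    return [(m.start(), m.group(1)) for m in DXXG.finditer(sequence.upper())]

# Hypothetical toy sequence, not a real protein.
hits = find_dxxg("MKDAEGTTDGSGDDKGG")
# hits == [(2, "DAEG"), (8, "DGSG"), (12, "DDKG"), (13, "DKGG")]
```

Counting such hits per domain, and normalising by domain length, would be one way to begin the quantitative comparison between fold types suggested above.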
Non-β architectures did not exhibit any strong compositional bias. Could
more sophisticated patterns be devised to distinguish more reliably the
helix-containing architectures? This is an area worthy of investigation.
Analyses of sequence-structure
correlations[Kabsch & Sander, 1984, Han & Baker, 1995, Han & Baker, 1996, Rooman & Wodak, 1991]
and approaches to secondary structure prediction[Barton, 1995, for a
review] have focused on local structural features. The
CATH classification is a unique resource which provides easily accessible
global architecture information for all domains in the PDB. Whilst CATH
may mis-classify a small percentage of domains through errors in domain definition,
structure comparisons, the structures themselves or human judgement, it
still allows surveys such as this, which were not possible a few years ago.
Furthermore, as our understanding of domain structure improves, so will
methods of classification.