Discussion

Is there any justification for yet another body of work on secondary structural class prediction from amino acid composition? We have made modifications to the method of Nakashima et al. which take account of multiple sequences and local sequence patterns and result in modest improvements in prediction accuracy. The main results and conclusions from this work are summarised below:

Test sets must be representative of the sequences expected in bona fide predictions.
No prior knowledge of protein structure should be used when constructing data sets. However, our trials use sequences of structural domains.
Data sets must not contain pairs of sequences with significant sequence similarity.
Jack-knifed testing is essential.
Three-class (mainly-$\alpha$, mainly-$\beta$ and mixed-$\alpha\beta$) accuracy is 57%, based on trials of 470 domains.
Sparse sequence information compromises prediction accuracy. Accuracy can be increased by about 5% using data sets screened on the basis of sequence length or the number of available multiple sequences.
The use of short sequence-word composition (duplets and triplets) improves prediction accuracy by a further 6-7% (to 69%).
83% of domains are correctly predicted into one of two classes: mainly-$\beta$ and helix-containing (mainly-$\alpha$ and mixed-$\alpha\beta$).
A useful reliability measure has been incorporated into the method.
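
The word-composition and jack-knife points in the summary above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names (`word_composition`, `centroid`, `jackknife_accuracy`) are hypothetical, and a nearest-centroid rule with Euclidean distance is assumed purely for illustration in place of the actual classification scheme. Multiple-sequence information and the reliability measure are not modelled.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def word_composition(seq, k=1):
    """Fractional frequency of every k-letter word (k=1 singlets, 2 duplets, 3 triplets).

    Words containing non-standard characters (e.g. X) are ignored.
    """
    words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def jackknife_accuracy(dataset, k=1):
    """Leave-one-out accuracy for a list of (sequence, class_label) pairs.

    Each sequence is removed in turn, class centroids are rebuilt from the
    remaining sequences, and the held-out sequence is assigned to the class
    with the nearest centroid (an assumed, illustrative decision rule).
    """
    features = [(word_composition(s, k), c) for s, c in dataset]
    correct = 0
    for i, (vec, label) in enumerate(features):
        by_class = {}
        for j, (v, c) in enumerate(features):
            if j != i:  # the held-out sequence never contributes to a centroid
                by_class.setdefault(c, []).append(v)
        centroids = {c: centroid(vs) for c, vs in by_class.items()}
        predicted = min(centroids, key=lambda c: math.dist(vec, centroids[c]))
        correct += predicted == label
    return correct / len(features)
```

With `k=2` each sequence is described by 400 duplet frequencies, and with `k=3` by 8000 triplet frequencies, so short sequences yield very sparse vectors. This is one way to see why sparse sequence information compromises accuracy.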

It has been shown that certain global features, such as the presence of helix, can be predicted with usable accuracy and reliability using a simple method. The sequence information being used is not identical to that used by per-residue secondary structure prediction algorithms such as PHD [Rost & Sander, 1994] and DSC [King & Sternberg, 1996]. The PHD algorithm makes use of global amino acid composition, but the relative importance of the amino acids has not been determined. For DSC, it was found that the relative proportions of His, Glu, Gln, Asp and Arg were beneficial to overall prediction accuracy. However, of these, only Glu and Arg show significant differences in frequency across the three main classes in our study, and both are over-abundant in the two helix-containing classes. DSC, by contrast, finds that the global occurrence of Glu and Arg favours $\beta$-strand prediction. We have shown that the over-abundant amino acids in the various classes cannot be explained purely in terms of helix- and strand-forming propensity, which is the local information on which secondary structure predictions rely most heavily. The results from PHD suggest that the alternative, global information can improve per-residue accuracy by no more than 1%. It is unlikely that the use of a discrete three-state class prediction from a method such as ours would be significantly better. The incorporation of reliability information, and perhaps even architecture predictions, into a PHD-like algorithm may be more successful, however.

The definition of class in CATH is based on 3D geometric criteria [Michie et al., 1996], and is defined automatically for 90% of domains and with manual intervention for the remaining 10%. A recent analysis of secondary structural class prediction by Eisenhaber et al. [eisenhaber:sscp2] used class definitions based upon thresholds of secondary structural content chosen by Nakashima et al. [nakashima:cpred]. They found that the upper limit to class prediction accuracy using amino acid composition was around 60% with four classes (mainly-$\alpha$, mainly-$\beta$, mixed-$\alpha\beta$ and irregular). PHD class predictions are reported to be 75% accurate (using the thresholds of Zhang and Chou [zhang:protsci92]). These methods are restricted to the prediction of classes defined without recourse to 3D information. A direct comparison between class predictions generated indirectly from per-residue predictions and those from our method is not possible because we use the CATH class definitions.

The reliability measure used here may not be particularly sophisticated or original, but it has been shown here for the first time that it can be put into practice, particularly in the context of hierarchical predictions. The heuristic dissection and hierarchical prediction of fold space according to CATH classifications and amino acid composition may not be easily cross-validated, but it appears to be a valid approach. Once the hierarchical approach has been automated, it would be informative to investigate the results alongside those from the fold recognition methods presented in Chapter 4.

Global sequence features have been identified which appear to discriminate between the architectures of the mainly-$\beta$ class. Cysteine, not surprisingly, is a common building block of many domains with ribbon architectures. The ease with which the ribbon domains can be distinguished may be due, in part, to the commonality of function in this architecture; hormones predominate and many of these domains may be distant homologues. In the sandwich, barrel, and distorted sandwich architectures, the Asp-X-X-Gly pattern is found to be significantly favoured in turns, and occurs most frequently in barrel domains. A quantitative comparison across fold types of the frequencies of certain sequence patterns in particular secondary structural environments has not been performed, but would provide valuable information. It is possible that the principles of folding and organisation are qualitatively different from one type of protein structure to another.
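
As a starting point for such a pattern survey, occurrences of Asp-X-X-Gly can be located with a regular expression. The sketch below is illustrative only: the function names `find_dxxg` and `dxxg_per_100_residues` are hypothetical, sequences are assumed to be in one-letter code (Asp = D, Gly = G), and no secondary structure assignment is consulted, so it cannot by itself say whether a match lies in a turn.

```python
import re

# Asp-X-X-Gly in one-letter code: D, any two residues, then G.
# The zero-width lookahead lets overlapping occurrences be reported.
DXXG = re.compile(r"(?=D[A-Z]{2}G)")


def find_dxxg(seq):
    """Return the 0-based start positions of every Asp-X-X-Gly match in seq."""
    return [m.start() for m in DXXG.finditer(seq.upper())]


def dxxg_per_100_residues(seq):
    """Number of Asp-X-X-Gly matches per 100 residues (0.0 for an empty sequence)."""
    return 100.0 * len(find_dxxg(seq)) / len(seq) if seq else 0.0
```

Normalising by sequence length, as in `dxxg_per_100_residues`, is one simple way to compare domains of different sizes; restricting the count to turn residues would require per-residue structure assignments in addition to the sequence.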

Non-$\beta$ architectures did not exhibit any strong compositional bias. Could more sophisticated patterns be devised to distinguish the helix-containing architectures more reliably? This is an area worthy of investigation. Analyses of sequence-structure correlations [Kabsch & Sander, 1984; Han & Baker, 1995; Han & Baker, 1996; Rooman & Wodak, 1991] and approaches to secondary structure prediction [for a review, see Barton, 1995] have focused on local structural features. The CATH classification is a unique resource which provides easily accessible global architecture information for all domains in the PDB. Whilst CATH may mis-classify a small percentage of domains through errors in domain definition, structure comparisons, the structures themselves or human judgement, it still allows surveys such as this one, which were not possible a few years ago. Furthermore, as our understanding of domain structure improves, so will methods of classification.
