Introduction

Early attempts at structural class prediction from amino acid composition, made with small datasets [Nakashima et al., 1986, Klein & Delisi, 1986b, Chou, 1989], claimed accuracies of around 70-80% (into three or four classes). In this study, we find that the size and makeup of the dataset crucially affect prediction accuracy: large and comprehensive datasets give accuracies as low as 57% for three classes (mainly-$\alpha$, mainly-$\beta$ and mixed-$\alpha\beta$) using a jack-knifed implementation of the method of Nakashima et al. [Nakashima et al., 1986]. Short sequences are shown to be the cause of inaccurate predictions, because the small amount of information they contain is subject to noise. The use of multiple sequences, where available, overcomes this problem. The global structural information content of n-tuple (sequence word) composition is also investigated. We find that three-class accuracies of around 68-69% can be achieved from multiple sequence alignments using the composition of either duplets or triplets (with a reduced amino-acid alphabet). The top 50% of predictions, as ranked by a reliability measure, have an accuracy of 80%. The use of a reliability measure appears to be novel in the area of class prediction.
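To make the terms above concrete, the following is a minimal sketch of n-tuple composition over a reduced alphabet, together with a jack-knifed (leave-one-out) accuracy estimate. It is not the implementation used in this study. The three-letter grouping in `REDUCED_GROUPS` is a hypothetical choice made only for illustration, and the nearest-centroid rule with Euclidean distance is an assumed stand-in for a composition-based classifier.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet (illustration only, not the grouping of this study).
REDUCED_GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYHGP",  # polar or small
    "c": "DEKR",      # charged
}
RESIDUE_TO_GROUP = {aa: g for g, members in REDUCED_GROUPS.items() for aa in members}


def ntuple_composition(seq, n=2):
    """Return the relative frequency of every reduced-alphabet word of length n."""
    # Residues outside the 20 standard amino acids are dropped before windowing.
    reduced = "".join(RESIDUE_TO_GROUP[aa] for aa in seq.upper() if aa in RESIDUE_TO_GROUP)
    words = ["".join(w) for w in product(sorted(REDUCED_GROUPS), repeat=n)]
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in words}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def jackknife_accuracy(vectors, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed stand-in)."""
    correct = 0
    for i, (query, true_label) in enumerate(zip(vectors, labels)):
        # Rebuild the class centroids without the held-out example i.
        by_class = {}
        for j, (vec, lab) in enumerate(zip(vectors, labels)):
            if j != i:
                by_class.setdefault(lab, []).append(vec)
        predicted = min(by_class, key=lambda lab: math.dist(query, centroid(by_class[lab])))
        correct += predicted == true_label
    return correct / len(vectors)
```

A sequence would be turned into a feature vector with `list(ntuple_composition(seq, n=2).values())`. With three groups, duplets give 9 features and triplets 27, which illustrates why a reduced alphabet keeps the word counts tractable for short sequences.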

The quality of the predictions is also assessed using the Matthews correlation coefficient [Matthews, 1975], which penalises the over- and under-prediction that can arise from unequally partitioned datasets. High values of the Matthews coefficient for the mainly-$\beta$ class suggest that the n-tuple approach is best used to distinguish helix-containing domains from non-helix-containing domains. The results agree: two-class (mainly-$\beta$ and not mainly-$\beta$) predictions are 83% accurate, with the top 50% of predictions approximately 95% correct. Hierarchical prediction is then investigated, including the prediction of mainly-$\beta$ architectures, which appears to be feasible using the simple methods described here. The underlying compositional differences which allow these predictions are then discussed.
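For reference, the Matthews correlation coefficient for a single class can be computed from the one-vs-rest confusion counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below assumes class labels are given as plain strings; the function name `matthews_coefficient` is illustrative, and returning 0.0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def matthews_coefficient(true_labels, predicted_labels, target):
    """Matthews correlation coefficient for `target` versus all other classes."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target and truth == target:
            tp += 1
        elif pred == target:
            fp += 1  # over-prediction of the target class
        elif truth == target:
            fn += 1  # under-prediction of the target class
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: the coefficient is 0.0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

A classifier that assigns every domain to the largest class can reach a respectable raw accuracy on an unbalanced dataset, but under the convention above its coefficient is 0.0 for every class, which is why the coefficient is a useful complement to percentage accuracy.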
