Hierarchical class and architecture prediction

In Section 3.2.10 we observed that helices provided the clearest signal to our compositional approach to class prediction. Two-class predictions (between helix-containing and mainly-$\beta$) are remarkably accurate (83% correct overall, 95% correct in the most confident 50% of predictions). In this section we investigate the multi-step hierarchical approach to class and architecture prediction. In the following section the underlying compositional differences between the classes and architectures are discussed.

From this point onwards, a more recent version (February 1997) of the CATH classification of domains is used. The main results from the previous section are therefore repeated with this data and presented briefly below. All predictions are performed using amino acid duplets at (i, i+3), using domains with 160 or more multiple sequence residues in total. In this section, gapped regions of multiple sequence alignments have not been replaced by the letter 'J', so there are $20\times 20$ possible duplets. The composition vector normalisation originally described by Nakashima et al. [nakashima:cpred] is not performed in this section, but this does not result in an appreciable drop in performance.
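As an illustration of the duplet representation, the following minimal sketch counts residue pairs at (i, i+3) for a single sequence. The function name `duplet_composition`, the alphabet ordering, and the choice to skip any pair that contains a gap or non-standard character are assumptions made for this example; the handling of multiple sequence alignments is not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def duplet_composition(sequence, separation=3):
    """Count residue pairs at positions (i, i + separation) in one sequence.

    Returns raw counts for all 20 x 20 possible duplets (no normalisation).
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(sequence) - separation):
        pair = sequence[i] + sequence[i + separation]
        # Assumption for this sketch: pairs involving gaps or
        # non-standard letters are simply ignored.
        if pair in counts:
            counts[pair] += 1
    return counts


example = duplet_composition("ACDEA")
print(len(example))               # number of possible duplets
print(example["AE"], example["CA"])
```

The resulting 400-element vector of raw counts is the kind of input a composition-based classifier would compare between classes.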

Three-class predictions were made for 515 CATH domain sequences (99 mainly-$\alpha$, 129 mainly-$\beta$, 287 mixed-$\alpha \beta$) with an overall accuracy of 66%. The Matthews correlation coefficients are $C_\alpha=0.41$, $C_{mixed}=0.41$ and $C_\beta=0.58$. The accuracies in the four quartiles based on the reliability measure are (most reliable first): 80%, 69%, 68% and 49%. The lower overall accuracy (66% compared with 69%, Result 20 in Table 3.1) could be attributed to the CATH dataset, which is larger and may contain fewer weak homologous pairs, or to changes to the method (not removing gapped regions of multiple sequence alignment). However, the Matthews correlation coefficients are comparable. Randomised predictions give an accuracy of 41%. These predictions are made by assigning the class of a domain picked at random from the dataset, thus making use of the proportions of each class in the dataset. This approach can be described as being based on prior probabilities.
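The 41% figure for randomised predictions can be checked with a short calculation. If a domain of class $c$ (with dataset proportion $p_c$) is assigned the class of a randomly drawn domain, the prediction is correct with probability $p_c$, so the expected accuracy is $\sum_c p_c^2$. The sketch below assumes the random domain is drawn from the whole dataset; the function name `prior_baseline_accuracy` is introduced here for illustration only.

```python
def prior_baseline_accuracy(class_counts):
    """Expected accuracy when each prediction copies the class of a
    domain drawn at random from the dataset: sum of squared proportions."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())


# Class sizes for the three-class dataset quoted in the text
three_class = {"mainly-alpha": 99, "mainly-beta": 129, "mixed-alpha-beta": 287}
print(round(prior_baseline_accuracy(three_class), 2))  # 0.41
```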

Two-class predictions (as in Section 3.2.10) are 83% accurate for 523 domains: 129 mainly-$\beta$, 394 helix-containing (including 8 irregular domains). This is the same as the result obtained previously. The Matthews correlation coefficient (for both classes) is slightly higher, at 0.61. The reliability quartile accuracies are 98%, 89%, 80% and 66%. The accuracies of mainly-$\beta$ predictions in these quartiles are 100%, 78% (sic), 95% and 71%, indicating that the prediction does not simply guess at helix-containing (the random prediction based on prior probabilities is about 62% accurate overall). The Matthews coefficients also provide this information.
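To illustrate why a single Matthews coefficient serves for both classes in a two-class problem, and why it exposes a predictor that always guesses the majority class, the sketch below uses the standard formula. The confusion-matrix counts in the first two calls are hypothetical (only the class sizes 129 and 394 come from the text), and returning 0 for a degenerate table is a convention assumed for this example.

```python
from math import sqrt


def matthews_coefficient(tp, fp, fn, tn):
    """Standard Matthews correlation coefficient for a 2 x 2 table."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, NOT taken from the text: 129 mainly-beta and
# 394 helix-containing domains, with an invented split of errors.
beta_as_positive = matthews_coefficient(tp=90, fp=50, fn=39, tn=344)
helix_as_positive = matthews_coefficient(tp=344, fp=39, fn=50, tn=90)
print(abs(beta_as_positive - helix_as_positive) < 1e-12)  # same value

# A predictor that always answers "helix-containing" scores zero,
# even though it would be correct for 394 of 523 domains.
always_helix = matthews_coefficient(tp=0, fp=0, fn=129, tn=394)
print(always_helix)
```

Swapping which class is treated as positive exchanges the roles of the four counts but leaves both the numerator and the denominator unchanged, which is why one value is quoted for both classes.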
