In Section 3.2.10 we observed that helices provided the clearest
signal to our compositional approach to class prediction. Two-class
predictions (between helix-containing and mainly-β) are remarkably
accurate (83% correct overall, 95% correct in the most confident 50% of
predictions). In this section we investigate the multi-step hierarchical
approach to class and architecture prediction. In the following section
the underlying compositional differences between the classes and
architectures are discussed.
From this point onwards, a more recent version (February 1997) of the CATH
classification of domains is used. Therefore the main results from the
previous section are repeated with this data, and are presented briefly
below. All predictions use amino acid duplets at positions (i, i+3), and are
restricted to domains with 160 or more residues in total across their
multiple sequence alignment. In
this section, gapped regions of multiple sequence alignments have
not been replaced by the letter `J', and there are
possible duplets. The composition vector normalisation originally
described by Nakashima et al. [nakashima:cpred] is not
performed in this section, but this does not result in an appreciable drop
in performance.
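
To make the duplet composition concrete, the following is a minimal sketch of counting residue pairs at positions (i, i+3) over the sequences of one alignment. It is an illustration only, not the implementation used for the results reported here: the function names `duplet_counts` and `duplet_composition`, the restriction to the 20 standard residues, and the choice to skip any pair that touches a gap or non-standard symbol are all assumptions made for the example.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids form valid duplets.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def duplet_counts(sequences, separation=3):
    """Count ordered residue pairs found at positions (i, i + separation)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - separation):
            a, b = seq[i], seq[i + separation]
            # Assumption: pairs touching a gap or unknown symbol are skipped.
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[a + b] += 1
    return counts


def duplet_composition(counts):
    """Convert raw counts to fractions of all observed duplets."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    aligned = ["ACDEFG", "AC-EFG"]  # toy aligned sequences, '-' marks a gap
    counts = duplet_counts(aligned)
    print(counts)
    print(duplet_composition(counts))
```

Pooling the counts over every sequence in an alignment, as done here, is one plausible reading of "multiple sequence residues in total"; a per-sequence weighting would be an equally simple variation.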
Three-class predictions were made for 515 CATH domain sequences (99
mainly-α, 129 mainly-β, 287 mixed-αβ) with an
overall accuracy of 66%. The Matthews correlation coefficients are
,
and
. The accuracies in
the four quartiles based on the reliability measure are (most reliable
first): 80%, 69%, 68% and 49%. The lower overall accuracy (66%
compared with 69%, Result 20 in Table 3.1) could be
attributed to the CATH dataset, which is larger and may contain fewer weakly
homologous pairs, or to changes to the method (not removing gapped regions
of the multiple sequence alignments). However, the Matthews correlation
coefficients are comparable. Randomised predictions give an accuracy of
41%. These predictions are made by assigning the class of a domain picked
at random from the dataset, thus making use of the proportions of each
class in the dataset. This approach can be described as being based on
prior probabilities.
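
The 41% figure can be checked with a short calculation. If both the true class and the guessed class are drawn from the same class proportions, the expected accuracy is the sum of the squared proportions. The sketch below applies this to the three class sizes quoted above; the function name `prior_baseline_accuracy` and the dictionary keys are illustrative assumptions.

```python
def prior_baseline_accuracy(class_counts):
    """Expected accuracy when the predicted class is that of a randomly
    chosen domain, i.e. the sum over classes of (class proportion)^2."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())


# Class sizes of the 515-domain three-class dataset quoted in the text.
three_class = {"mainly-alpha": 99, "mainly-beta": 129, "mixed-alpha-beta": 287}

if __name__ == "__main__":
    print(round(prior_baseline_accuracy(three_class), 2))  # 0.41
```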
Two-class predictions (as in Section 3.2.10) are 83% accurate for
523 domains: 129 mainly-β, 394 helix-containing (including 8
irregular domains). This is the same as the result obtained previously.
The Matthews correlation coefficient (for both classes) is slightly higher,
at 0.61. The reliability quartile accuracies are 98%, 89%, 80% and
66%. The accuracies for mainly-β protein prediction in these
quartiles are 100%, 78% (sic), 95% and 71%, indicating that the
prediction does not simply guess at helix-containing (the random prediction
based on prior probabilities is about 62% accurate overall). The Matthews
coefficients also provide this information.
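
For reference, the two evaluation measures used in this section can be sketched as follows. The Matthews correlation coefficient is computed per class from the counts of true and false positives and negatives, and the quartile accuracies are obtained by ranking predictions from most to least reliable and splitting them into four groups. This is a hedged illustration: the function names, the label strings, and the way the quartile boundaries are rounded are assumptions, and the reliability measure itself is simply taken as an input number per prediction.

```python
import math


def matthews_coefficient(true_labels, predicted_labels, target):
    """Matthews correlation coefficient for one class versus the rest."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target:
            if truth == target:
                tp += 1
            else:
                fp += 1
        else:
            if truth == target:
                fn += 1
            else:
                tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def quartile_accuracies(reliabilities, correct):
    """Accuracy in each reliability quartile, most reliable quartile first."""
    ranked = sorted(zip(reliabilities, correct),
                    key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    bounds = [round(k * n / 4) for k in range(5)]
    accuracies = []
    for k in range(4):
        chunk = ranked[bounds[k]:bounds[k + 1]]
        hits = sum(1 for _, ok in chunk if ok)
        accuracies.append(hits / len(chunk) if chunk else float("nan"))
    return accuracies


if __name__ == "__main__":
    truth = ["beta", "beta", "helix", "helix"]
    preds = ["beta", "helix", "helix", "helix"]
    # In a two-class problem both classes give the same coefficient.
    print(matthews_coefficient(truth, preds, "beta"))
    print(matthews_coefficient(truth, preds, "helix"))
```

A predictor that always guessed helix-containing would yield a coefficient of 0.0 under this convention, which is why the coefficient complements the raw accuracy.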