The dataset used for Result 4 is biased by the exclusion of borderline or `difficult' folds. If one assumes that the different fold types occupy fuzzy but distinguishable regions in amino acid composition space, then filtering the dataset in this way will sharpen the edges of these regions and improve the results. The improved figures will not, however, be applicable to real-life predictions for sequences of unknown structure.
Each domain in the 1996 CATH classification is assigned to one of three
classes (mainly-α, mainly-β, and mixed-αβ); the two mixed classes (α/β and
α+β) are combined because the differences between them are better described
in terms of architecture, and also because many α+β proteins split into
mainly-α and mainly-β domains. The prediction using the
entire 3-class 1996 CATH database is 59% accurate (Result 5 in
Table 3.1) whilst the prediction using domains with
definite 3-class assignments (using the program by Alex Michie) is 62%
accurate with consistently higher Matthews coefficients (Result 6).
These results confirm the possibility of bias in incomplete datasets.
Comprehensive datasets must therefore be used when testing prediction
methods.
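
As a minimal sketch of how an overall accuracy and per-class Matthews coefficients of the kind quoted above can be computed, the following Python treats each class in turn as `positive' and all other classes as `negative'. The function names and the one-versus-rest treatment are illustrative assumptions, not necessarily the exact procedure used for Table 3.1.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of domains whose predicted class equals the true class."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(t == p for t, p in pairs) / len(pairs)


def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts.

    Returns 0.0 when the denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def per_class_matthews(true_labels, predicted_labels):
    """One-versus-rest Matthews coefficient for every class seen."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    result = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        result[c] = matthews(tp, tn, fp, fn)
    return result
```

Reporting the coefficient per class alongside the overall accuracy guards against a predictor that scores well simply by favouring the most populous class.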
Our use of sequences belonging to domains whose boundaries have been
determined through structural analysis is also questionable. The reliable
prediction of domain boundaries from sequence information alone is not yet
possible; ideally whole (potentially multi-domain) protein chains should be
used to test the methods. However, it makes little sense to test an
algorithm on long chains possibly containing a mixture of mainly-α
and mainly-β domains. In any case, long sequences should generally
be split into shorter fragments during blind predictions.
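
The splitting of long sequences into shorter fragments could be sketched as follows. This is an illustration only: the function name, the use of fixed-length overlapping windows, and the `window` and `step` parameters are all assumptions, since no fragment length or splitting scheme is specified here.

```python
def split_into_fragments(sequence, window, step):
    """Split a sequence into fixed-length, possibly overlapping fragments.

    `window` and `step` are hypothetical parameters chosen by the caller.
    Sequences no longer than `window` are returned whole.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(sequence) <= window:
        return [sequence]
    starts = list(range(0, len(sequence) - window + 1, step))
    # Add a final window flush with the C-terminus if the tail is uncovered.
    if starts[-1] + window < len(sequence):
        starts.append(len(sequence) - window)
    return [sequence[s:s + window] for s in starts]
```

Each fragment can then be predicted independently, so that a chain containing domains of different classes is not forced into a single class assignment.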
The class of irregular domains has so far been ignored. Excluding these domains from datasets is another source of bias. The number of irregular domains in CATH is small, at around 1.5% of the total number of homologous superfamily representatives. All of these domains have fewer than 80 residues, however, and so have been excluded from the majority of predictions in this work in order to make results directly comparable. When irregular domains are not excluded, four-class predictions are about 1% less accurate than the equivalent three-class predictions (data not shown). This may be the result of a disproportionately large number of domains being predicted as irregular (nearly twice as many as expected). Centroids carry equal `weight' in this method, so if the centroid for irregular domains happened to lie close to the centroid for another class, it is easy to see how this degradation of performance might occur.
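
To make the role of equally weighted centroids concrete, here is a minimal sketch of a nearest-centroid classifier in amino acid composition space. It assumes a 20-dimensional fractional composition vector and Euclidean distance; both choices, and all function names, are illustrative assumptions rather than a description of the exact implementation used in this work.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fractional composition over the 20 standard amino acids."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of composition vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_class(vector, centroids):
    """Name of the class whose centroid is closest to `vector`.

    Every centroid is treated identically: class size plays no part.
    """
    return min(centroids, key=lambda name: math.dist(vector, centroids[name]))
```

Because `nearest_class` takes no account of how many domains contributed to each centroid, a rare class whose centroid lies near that of a common class can attract far more predictions than its true frequency would suggest.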