

Dataset completeness and class definitions

The dataset used for Result 4 is biased by the exclusion of borderline or `difficult' folds. If one assumes that the different fold types occupy fuzzy but distinguishable regions in amino acid composition space, then filtering the dataset in this way sharpens the edges of these regions. The results improve, but they no longer apply to real-life predictions for sequences of unknown structure.
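The argument can be illustrated with a toy calculation. The sketch below is not the method or the data used in this work: it assumes a one-dimensional feature in place of amino acid composition, a simple nearest-centroid rule, and entirely made-up values in which two hypothetical classes `A` and `B` overlap between 8 and 11.

```python
def nearest_centroid_accuracy(train, test):
    """Fit one centroid per class on `train`, return the accuracy on `test`.

    Both arguments map a class label to a list of 1-D feature values.
    """
    centroids = {label: sum(xs) / len(xs) for label, xs in train.items()}
    total = correct = 0
    for label, xs in test.items():
        for x in xs:
            # Assign the point to the class with the closest centroid.
            predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
            correct += predicted == label
            total += 1
    return correct / total


# Toy, made-up data: two classes whose values overlap between 8 and 11.
full = {"A": list(range(0, 12)), "B": list(range(8, 20))}

# "Filtered" dataset: the overlapping (borderline) points are removed.
filtered = {
    label: [x for x in xs if not 8 <= x <= 11] for label, xs in full.items()
}

acc_full = nearest_centroid_accuracy(full, full)          # complete dataset
acc_filtered = nearest_centroid_accuracy(filtered, filtered)  # filtered dataset
acc_transfer = nearest_centroid_accuracy(filtered, full)  # filtered model, real data
```

In this toy case the filtered dataset is classified perfectly, whereas the same centroids applied to the complete dataset do no better than centroids fitted to the complete dataset. The apparent gain comes from the evaluation set, not from a better model.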

Each domain in the 1996 CATH classification is assigned to one of three classes (mainly-$\alpha$, mainly-$\beta$, and mixed-$\alpha\beta$); the two mixed classes ($\alpha/\beta$ and $\alpha+\beta$) are combined because the differences between them are better described in terms of architecture, and also because many $\alpha+\beta$ proteins split into mainly-$\alpha$ and mainly-$\beta$ domains. The prediction using the entire 3-class 1996 CATH database is 59% accurate (Result 5 in Table 3.1), whilst the prediction using only domains with definite 3-class assignments (made using the program by Alex Michie) is 62% accurate, with consistently higher Matthews coefficients (Result 6). This three-percentage-point gap confirms that incomplete datasets can bias results. Comprehensive datasets must therefore be used when testing prediction methods.
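For reference, the two measures quoted above can be computed as in the following minimal sketch. It assumes the usual one-versus-rest definition of the Matthews coefficient for each class, $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$; the function name and the label strings are illustrative and are not taken from the software used in this work.

```python
import math


def accuracy_and_matthews(true_labels, predicted_labels):
    """Return (overall accuracy, {class: one-vs-rest Matthews coefficient})."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    coefficients = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        # Treat class c as "positive" and every other class as "negative".
        tp = sum(t == c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # By convention the coefficient is set to 0 when it is undefined.
        coefficients[c] = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, coefficients


# Made-up labels, for illustration only.
true = ["alpha", "alpha", "beta", "beta", "mixed", "mixed"]
pred = ["alpha", "beta", "beta", "beta", "mixed", "alpha"]
acc, mcc = accuracy_and_matthews(true, pred)
```

Unlike the overall accuracy, the per-class coefficient penalises both over-prediction and under-prediction of a class, which is why it is reported alongside the percentage figures.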

Our use of sequences belonging to domains whose boundaries have been determined through structural analysis is also questionable. The reliable prediction of domain boundaries from sequence information alone is not yet possible; ideally, whole (potentially multi-domain) protein chains should be used to test the methods. However, it makes little sense to test an algorithm on long chains that may contain a mixture of mainly-$\alpha$ and mainly-$\beta$ domains. In any case, long sequences should generally be split into shorter fragments during blind predictions.

The class of irregular domains has so far been ignored. Excluding these domains from datasets is another source of bias. The number of irregular domains in CATH is small, at around 1.5% of the total number of homologous superfamily representatives. All these domains have fewer than 80 residues, however, and so have been excluded from the majority of predictions in this work in order to make results directly comparable. When irregular domains are not excluded, four-class predictions are about 1% less accurate than the equivalent three-class predictions (data not shown). This may be the result of a disproportionately large number of domains being predicted as irregular (nearly twice as many as expected). Centroids carry equal `weight' in this method, so if the centroid for irregular domains happens to lie close to the centroid of another class, it is easy to see how this degradation in performance might occur.
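The point about equal weighting can be made concrete with a minimal nearest-centroid sketch. It is an illustration under stated assumptions rather than the implementation used in this work: Euclidean distance in 20-dimensional composition space is assumed, the function names are hypothetical, and the training sequences are invented toy strings, not CATH domains.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def train_centroids(labelled_sequences):
    """Average the composition vectors of each class into one centroid."""
    by_class = {}
    for label, sequence in labelled_sequences:
        by_class.setdefault(label, []).append(composition(sequence))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_class.items()
    }


def predict(sequence, centroids):
    """Assign the class whose centroid is nearest (Euclidean distance assumed).

    The number of training domains behind each centroid plays no part here:
    a class with one member competes on equal terms with a class of hundreds.
    """
    vector = composition(sequence)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Invented toy sequences, for illustration only.
training = [
    ("mainly-alpha", "AAAALLLLEEEEKKKK"),
    ("mainly-alpha", "AALLEEKKAALLEEKM"),
    ("mainly-beta", "VVVVIIIITTTTYYYY"),
    ("mainly-beta", "VVIITTYYVVIITTYW"),
    ("irregular", "CCCCGGGGPPPPCCGG"),
]
centroids = train_centroids(training)
prediction_a = predict("AELKAELK", centroids)
prediction_i = predict("CGPC", centroids)
```

Because `predict` ignores class sizes, a rare class claims as large a share of composition space as a common one, which is consistent with irregular domains being predicted more often than their frequency would suggest.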


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.