
Dataset size

The published results of Nakashima et al. [nakashima:cpred] state an overall accuracy of 70% for a five-class prediction ($\alpha, \beta, \alpha/\beta, \alpha+\beta$ and irregular) using a dataset of 135 proteins. Chou and Zhang [chou:critreview] repeated this method on a dataset of 120 proteins and obtained a four-class accuracy ($\alpha, \beta, \alpha/\beta, \alpha+\beta$) of 63%. Applying our implementation of the method to 113 of those 120 proteins (7 were obsolete) gives 69% accuracy (see Result 3 in Table 3.1). Clearly the outcome of the prediction is sensitive to the choice of dataset. With a larger dataset these effects should be diluted, and the results should reflect more accurately the amount of global structural information in the amino acid composition of protein sequences.
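
One rough way to see why small datasets give unstable figures is the binomial standard error of an accuracy estimate. The sketch below assumes that each prediction is an independent trial, which is a simplification not made in the text; the function name `accuracy_standard_error` is illustrative only.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n proteins.

    Assumes independent predictions (a simplification).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# (accuracy, dataset size) pairs quoted in the text
reported = {
    "Nakashima et al.": (0.70, 135),
    "Chou and Zhang": (0.63, 120),
    "This work": (0.69, 113),
}

for label, (acc, n) in reported.items():
    se = accuracy_standard_error(acc, n)
    print(f"{label}: {acc:.0%} +/- {se:.1%} (n={n})")
```

Under this assumption, accuracies measured on 113 to 135 proteins carry an uncertainty of roughly four percentage points from sampling alone.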

Using a program by Alex Michie [Michie et al., 1996], 403 of the 470 homologous superfamily representatives (86%) of the 1996 CATH domain classification can be classified automatically from 3D structural information into one of four classes ($\alpha, \beta, \alpha/\beta, \alpha+\beta$). The remaining 14% are borderline cases requiring visual inspection. Fully automated sequence-based prediction methods are therefore not expected to predict more than 86% of domains correctly. With the dataset of 403 unambiguously classified domains, the overall accuracy is 52% (Result 4 in Table 3.1), clearly much worse than the original published figures of 70-80% [Nakashima et al., 1986].
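
The 86% ceiling follows directly from the counts quoted above. A short check, in which the variable names are illustrative:

```python
total_representatives = 470      # homologous superfamily representatives, CATH 1996
automatically_classified = 403   # assigned to one of four classes from 3D structure

coverage = automatically_classified / total_representatives
borderline = total_representatives - automatically_classified

print(f"automatically classified: {coverage:.0%}")
print(f"borderline: {borderline} domains ({borderline / total_representatives:.0%})")
```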


Table 3.1: Secondary structural class prediction accuracies

No.  Prediction summary                                     n[1]  c[2]  $Q_c$ (%)[3]  $C_\alpha$[4]  $C_\beta$  $C_{\alpha/\beta}$  $C_{\alpha+\beta}$
------------------------------------------------------------------------------------------------------------------------------------------------------
1    Nakashima et al. [nakashima:cpred] published results   135   5     70            -              -          -                   -
2    Method as 1. Results of Chou et al. [chou:critreview]  120   4     63            -              -          -                   -
3    This work. 7 proteins removed from dataset 2.          113   4     69            -              -          -                   -
4    CATH 96, 4 classes (not comprehensive)                 403   4     52            0.42           0.52       0.34                0.21
5    CATH 96, 3 classes (comprehensive)                     470   3     59            0.34           0.49       0.30
6    CATH 96, 3 classes (not comprehensive)                 421   3     62            0.38           0.50       0.36
7    As 5, with jack-knifing                                470   3     57            0.28           0.47       0.26
8    As 7, minimum sequence length 80 res.                  394   3     56            0.26           0.45       0.25
9    As 7, minimum sequence length 160 res.                 224   3     62            0.26           0.60       0.30
10   As 7, minimum sequence length 240 res.                 101   3     54            0.25           0.51       0.22
11   As 7, minimum sequence length 320 res.                 52    3     62            0.19           0.57       0.26
12   As 7, with core multiple sequences                     470   3     56            0.30           0.48       0.22
13   As 12, total number of residues $>$ 80                 450   3     58            0.30           0.47       0.25
14   As 12, total number of residues $>$ 160                383   3     62            0.36           0.51       0.32
15   As 12, total number of residues $>$ 240                312   3     64            0.47           0.44       0.33
16   As 12, total number of residues $>$ 320                258   3     66            0.34           0.47       0.31
17   As 12, using 21 letter duplets (i,i+1)                 470   3     63            0.32           0.47       0.28
18   As 17, total number of residues $>$ 160                383   3     68            0.41           0.52       0.38
19   As 18, using 21 letter duplets (i,i+2)                 383   3     68            0.42           0.50       0.37
20   As 18, using 21 letter duplets (i,i+3)                 383   3     69            0.43           0.55       0.42
21   As 18, using 21 letter duplets (i,i+4)                 383   3     67            0.42           0.53       0.39
22   As 12, using 8 letter[5] triplets (i,i+1,i+2)          470   3     63            0.35           0.44       0.29
23   As 22, total number of residues $>$ 160                383   3     69            0.43           0.52       0.39
24   As 23, using 8 letter triplets (i,i+2,i+4)             383   3     68            0.40           0.52       0.38
25   As 23, using 8 letter triplets (i,i+3,i+6)             383   3     68            0.43           0.49       0.38
26   As 23, using 8 letter triplets (i,i+4,i+8)             383   3     68            0.43           0.49       0.40

[1] n = size of dataset
[2] c = number of secondary structural classes predicted
[3] $Q_c$ = mean prediction accuracy
[4] $C_x$ = Matthews correlation coefficient for class $x$
[5] see text
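
The two measures reported in the table can be computed from paired lists of observed and predicted classes. The following is a minimal sketch, not the code used for this work: it assumes that $Q_c$ is the percentage of proteins assigned to the correct class, it uses the standard one-versus-rest form of the Matthews correlation coefficient, and the function names and toy labels are hypothetical.

```python
import math
from typing import Sequence


def q_accuracy(observed: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of items whose predicted class equals the observed class."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def matthews(observed: Sequence[str], predicted: Sequence[str], cls: str) -> float:
    """One-versus-rest Matthews correlation coefficient for class `cls`."""
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if p == cls and o == cls:
            tp += 1          # predicted cls, observed cls
        elif p == cls:
            fp += 1          # predicted cls, observed another class
        elif o == cls:
            fn += 1          # observed cls, predicted another class
        else:
            tn += 1          # neither observed nor predicted as cls
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy labels for illustration only (not data from this work)
observed = ["alpha", "alpha", "beta", "beta", "mixed", "mixed"]
predicted = ["alpha", "beta", "beta", "beta", "mixed", "alpha"]

print(f"Q = {q_accuracy(observed, predicted):.0f}%")
for cls in ("alpha", "beta", "mixed"):
    print(f"C_{cls} = {matthews(observed, predicted, cls):.2f}")
```

For the toy labels, `q_accuracy` gives about 67% and the per-class coefficients are 0.25, 0.71 and 0.63. A coefficient of 0 indicates a prediction no better than random, and 1 indicates perfect agreement.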


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.