Dataset size

The published results of Nakashima et al. [nakashima:cpred] state an overall accuracy of 70% for a five-class prediction ($\alpha, \beta, \alpha/\beta, \alpha+\beta$ and irregular) using a dataset of 135 proteins. Chou and Zhang [chou:critreview] repeated this method on a dataset of 120 proteins and obtained a four-class accuracy ($\alpha, \beta, \alpha/\beta, \alpha+\beta$) of 63%. Applying our implementation of the method to 113 of these 120 proteins (7 were obsolete), we obtain 69% accuracy (see Result 3 in Table 3.1). The six-point difference between 63% and 69% suggests that the outcome of the prediction is sensitive to the choice of dataset. With a larger dataset these effects should be diluted, and the results should reflect more accurately the amount of global structural information in the amino acid composition of protein sequences.
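To give a rough sense of how much an accuracy measured on so few proteins can move by chance alone, the sketch below computes a binomial standard error for Results 1 to 4 of Table 3.1. It assumes each protein is an independent trial, which is a simplification not made in the text (related proteins violate it), and the function name `accuracy_standard_error` is purely illustrative.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n predictions.

    Assumes the n predictions are independent (an assumption of this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# (result number, dataset size n, accuracy) copied from Table 3.1
results = [
    (1, 135, 0.70),
    (2, 120, 0.63),
    (3, 113, 0.69),
    (4, 403, 0.52),
]

for number, n, accuracy in results:
    se = accuracy_standard_error(accuracy, n)
    # 1.96 standard errors gives an approximate 95% interval
    print(f"Result {number}: n={n}, accuracy={accuracy:.0%}, "
          f"approx. 95% half-width = {1.96 * se:.1%}")
```

Under this simplified model the uncertainty shrinks with the square root of the dataset size, which is consistent with the expectation that larger datasets give more stable figures.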

Using a program by Alex Michie [Michie et al., 1996], 403 of the 470 homologous superfamily representatives (86%) of the 1996 CATH domain classification can be classified automatically from 3D structural information into one of four classes ($\alpha, \beta, \alpha/\beta, \alpha+\beta$). The remaining 14% are borderline cases requiring visual inspection. Fully automated sequence-based prediction methods are therefore not expected to predict more than 86% of domains correctly. With the dataset of 403 unambiguously classified domains, the overall accuracy is 52% (Result 4 in Table 3.1), much worse than the originally published figures of 70-80% [Nakashima et al., 1986].

**Table 3.1:** Secondary structural class prediction accuracies
| No. | Prediction summary | n¹ | c² | Accuracy (%)³ | $C_\alpha$ ⁴ | $C_\beta$ | $C_{\alpha/\beta}$ | $C_{\alpha+\beta}$ |
|---|---|---|---|---|---|---|---|---|
| 1 | Nakashima et al. [nakashima:cpred] published results | 135 | 5 | 70 | - | - | - | - |
| 2 | Method as 1. Results of Chou et al. [chou:critreview] | 120 | 4 | 63 | - | - | - | - |
| 3 | This work. 7 proteins removed from dataset 2. | 113 | 4 | 69 | - | - | - | - |
| 4 | CATH 96, 4 classes (not comprehensive) | 403 | 4 | 52 | 0.42 | 0.52 | 0.34 | 0.21 |
| 5 | CATH 96, 3 classes (comprehensive) | 470 | 3 | 59 | 0.34 | 0.49 | 0.30 | |
| 6 | CATH 96, 3 classes (not comprehensive) | 421 | 3 | 62 | 0.38 | 0.50 | 0.36 | |
| 7 | As 5, with jack-knifing | 470 | 3 | 57 | 0.28 | 0.47 | 0.26 | |
| 8 | As 7, minimum sequence length 80 res. | 394 | 3 | 56 | 0.26 | 0.45 | 0.25 | |
| 9 | As 7, minimum sequence length 160 res. | 224 | 3 | 62 | 0.26 | 0.60 | 0.30 | |
| 10 | As 7, minimum sequence length 240 res. | 101 | 3 | 54 | 0.25 | 0.51 | 0.22 | |
| 11 | As 7, minimum sequence length 320 res. | 52 | 3 | 62 | 0.19 | 0.57 | 0.26 | |
| 12 | As 7, with core multiple sequences | 470 | 3 | 56 | 0.30 | 0.48 | 0.22 | |
| 13 | As 12, total number of residues 80 | 450 | 3 | 58 | 0.30 | 0.47 | 0.25 | |
| 14 | As 12, total number of residues 160 | 383 | 3 | 62 | 0.36 | 0.51 | 0.32 | |
| 15 | As 12, total number of residues 240 | 312 | 3 | 64 | 0.47 | 0.44 | 0.33 | |
| 16 | As 12, total number of residues 320 | 258 | 3 | 66 | 0.34 | 0.47 | 0.31 | |
| 17 | As 12, using 21 letter duplets (i,i+1) | 470 | 3 | 63 | 0.32 | 0.47 | 0.28 | |
| 18 | As 17, total number of residues 160 | 383 | 3 | 68 | 0.41 | 0.52 | 0.38 | |
| 19 | As 18, using 21 letter duplets (i,i+2) | 383 | 3 | 68 | 0.42 | 0.50 | 0.37 | |
| 20 | As 18, using 21 letter duplets (i,i+3) | 383 | 3 | 69 | 0.43 | 0.55 | 0.42 | |
| 21 | As 18, using 21 letter duplets (i,i+4) | 383 | 3 | 67 | 0.42 | 0.53 | 0.39 | |
| 22 | As 12, using 8 letter⁵ triplets (i,i+1,i+2) | 470 | 3 | 63 | 0.35 | 0.44 | 0.29 | |
| 23 | As 22, total number of residues 160 | 383 | 3 | 69 | 0.43 | 0.52 | 0.39 | |
| 24 | As 23, using 8 letter triplets (i,i+2,i+4) | 383 | 3 | 68 | 0.40 | 0.52 | 0.38 | |
| 25 | As 23, using 8 letter triplets (i,i+3,i+6) | 383 | 3 | 68 | 0.43 | 0.49 | 0.38 | |
| 26 | As 23, using 8 letter triplets (i,i+4,i+8) | 383 | 3 | 68 | 0.43 | 0.49 | 0.40 | |

¹ n = size of dataset

² c = number of secondary structural classes predicted

³ (%) = mean prediction accuracy

⁴ $C_x$ = Matthews correlation coefficient for class $x$

⁵ see text
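
As an illustration of the two kinds of figure reported in Table 3.1, the sketch below computes an overall accuracy and a per-class Matthews correlation coefficient from a confusion matrix. It assumes the per-class coefficient is obtained by treating each class in turn as "positive" against all the others (the text does not spell out the exact procedure), and the example counts are invented for demonstration only; they are not data from this work.

```python
import math
from typing import List, Sequence


def overall_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Fraction of correct predictions; confusion[i][j] counts domains of
    true class i that were predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def matthews_per_class(confusion: Sequence[Sequence[int]]) -> List[float]:
    """One-vs-rest Matthews correlation coefficient for each class."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    coefficients = []
    for c in range(k):
        tp = confusion[c][c]                               # true positives
        fn = sum(confusion[c]) - tp                        # missed members of class c
        fp = sum(confusion[r][c] for r in range(k)) - tp   # wrongly assigned to c
        tn = total - tp - fn - fp                          # everything else
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # Convention used here: return 0.0 when the coefficient is undefined
        coefficients.append((tp * tn - fp * fn) / denom if denom else 0.0)
    return coefficients


# Hypothetical three-class confusion matrix (rows: true, columns: predicted)
example = [
    [30, 10, 10],
    [5, 40, 5],
    [10, 10, 30],
]

print(f"accuracy = {overall_accuracy(example):.2f}")
for label, value in zip(["alpha", "beta", "mixed"], matthews_per_class(example)):
    print(f"C_{label} = {value:.2f}")
```

Unlike the overall accuracy, the per-class coefficient penalises both over-prediction and under-prediction of a class, which is why a class can score a low coefficient even when the mean accuracy looks reasonable.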