Prediction of mainly-$\beta$ architectures

The compositional approach is suited to the prediction of any pre-defined sub-grouping of sequences, not only secondary structural class. For example, the prediction of sub-cellular location from amino-acid composition was reviewed in Chapter 1. Here, the CATH architectures of the mainly-$\beta$ domains (see Figure 3.3) have been predicted using exactly the methods used above. The dataset contains 124 domains (the domains of five architectures with single representatives were removed).

Overall, 50% of the 124 domains were predicted correctly, compared with the 25% accuracy expected from random prediction based on prior probabilities. The reliability quartile accuracies are 61%, 58%, 48% and 32%. The detailed results for each architecture are given in Table 3.3. The best predictions are for the ribbon architecture, with a Matthews correlation coefficient of $C_{ribbon}=0.634$ and 71% correct. The barrel and sandwich architectures are also predicted better than random ($C_{barrel}=0.393$ and $C_{sandwich}=0.345$).

None of the 10 distorted sandwiches is predicted correctly; 4 of them are predicted as (normal) sandwiches. This result is interesting because it suggests that similarities in architecture are reflected by similarities in amino-acid pattern composition. The remaining architectures have few representatives and are likely to compromise the overall prediction accuracy (see Section 3.2.3).
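The summary figures quoted above can be re-derived from the counts in Table 3.3. The following Python sketch is illustrative only and is not the code used to make the predictions. It assumes that "random prediction based on prior probabilities" means guessing each architecture with probability equal to its frequency in the dataset, so that the expected accuracy is $\sum_i p_i^2$. The names `TABLE_3_3`, `class_sizes`, `random_baseline` and `quartile_accuracy` are hypothetical.

```python
# Per-architecture (correct, total) counts for reliability quartiles 1..4,
# transcribed from Table 3.3.
TABLE_3_3 = {
    "Ribbon": [(3, 3), (1, 2), (3, 5), (5, 7)],
    "Barrel": [(4, 7), (7, 8), (4, 8), (0, 2)],
    "Sandwich": [(11, 13), (10, 15), (8, 14), (5, 10)],
    "2 Solenoid": [(1, 2), (0, 0), (0, 0), (0, 1)],
    "Single Sheet": [(0, 0), (0, 0), (0, 1), (0, 2)],
    "Trefoil": [(0, 1), (0, 1), (0, 0), (0, 2)],
    "Complex": [(0, 1), (0, 1), (0, 0), (0, 0)],
    "7 Propellor": [(0, 0), (0, 1), (0, 1), (0, 0)],
    "Distorted Sandwich": [(0, 4), (0, 3), (0, 0), (0, 3)],
    "Roll": [(0, 0), (0, 0), (0, 1), (0, 1)],
    "Aligned Prism": [(0, 0), (0, 0), (0, 1), (0, 1)],
    "4 Propellor": [(0, 0), (0, 0), (0, 0), (0, 2)],
}


def class_sizes(table):
    """Number of domains per architecture (the n column of the table)."""
    return {arch: sum(total for _, total in cells) for arch, cells in table.items()}


def random_baseline(sizes):
    """Expected accuracy when class i is guessed with probability p_i.

    This reading of the random baseline is an assumption, not stated in the text.
    """
    n = sum(sizes.values())
    return sum((count / n) ** 2 for count in sizes.values())


def quartile_accuracy(table):
    """Fraction correct in each reliability quartile, pooled over architectures."""
    result = []
    for q in range(4):
        correct = sum(cells[q][0] for cells in table.values())
        total = sum(cells[q][1] for cells in table.values())
        result.append(correct / total)
    return result


sizes = class_sizes(TABLE_3_3)
print(sum(sizes.values()))                                      # 124
print(round(random_baseline(sizes), 3))                         # 0.245
print([round(100 * a) for a in quartile_accuracy(TABLE_3_3)])   # [61, 58, 48, 32]
```

Under this assumption the baseline comes to about 25%, and the pooled quartile counts reproduce the quoted 61%, 58%, 48% and 32%.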

Figure 3.3: Common mainly-$\beta$ architectures. Figures generated using Molscript [Kraulis, 1991].

(a) Ribbon (2tgi00)

(b) Barrel (1mjc00)

(d) Distorted Sandwich (1bcx00)

**Table 3.3:** Architecture prediction within the mainly-$\beta$ class.

| Architecture | CATH | n² | Q%³ | $C_{arch}$⁴ | Quartile 1¹ | Quartile 2¹ | Quartile 3¹ | Quartile 4¹ |
|---|---|---|---|---|---|---|---|---|
| Ribbon | 2.10 | 17 | 71 | 0.634 | 3/3 | 1/2 | 3/5 | 5/7 |
| Barrel | 2.40 | 25 | 60 | 0.393 | 4/7 | 7/8 | 4/8 | 0/2 |
| Sandwich | 2.60 | 52 | 65 | 0.345 | 11/13 | 10/15 | 8/14 | 5/10 |
| 2 Solenoid | 2.150 | 3 | 33 | 0.158 | 1/2 | 0/0 | 0/0 | 0/1 |
| Single Sheet | 2.20 | 3 | 0 | -0.014 | 0/0 | 0/0 | 0/1 | 0/2 |
| Trefoil | 2.80 | 4 | 0 | -0.016 | 0/1 | 0/1 | 0/0 | 0/2 |
| Complex | 2.170 | 2 | 0 | -0.020 | 0/1 | 0/1 | 0/0 | 0/0 |
| 7 Propellor | 2.130 | 2 | 0 | -0.020 | 0/0 | 0/1 | 0/1 | 0/0 |
| Distorted Sandwich | 2.70 | 10 | 0 | -0.027 | 0/4 | 0/3 | 0/0 | 0/3 |
| Roll | 2.30 | 2 | 0 | undef | 0/0 | 0/0 | 0/1 | 0/1 |
| Aligned Prism | 2.100 | 2 | 0 | undef | 0/0 | 0/0 | 0/1 | 0/1 |
| 4 Propellor | 2.110 | 2 | 0 | undef | 0/0 | 0/0 | 0/0 | 0/2 |

¹ Predictions are sorted by the reliability measure and split into quartiles (1 = most reliable, 4 = least reliable). Each entry gives the number of correct predictions out of the number of domains of that architecture falling in the quartile.

² Number of domains of the architecture in the dataset.

³ Overall accuracy: percentage of the architecture's domains predicted correctly.

⁴ Matthews correlation coefficient for each architecture
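For reference, a per-architecture Matthews coefficient can be computed from one-versus-rest counts as sketched below. This assumes the standard two-class definition, $C = (TP \cdot TN - FP \cdot FN)/\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$, with each architecture treated as the positive class in turn; the exact formulation used for Table 3.3 is not restated here. The function name `matthews` is hypothetical, and the false-positive counts in the usage lines are assumed for illustration, since the table does not report them.

```python
import math
from typing import Optional


def matthews(tp: int, fp: int, fn: int, tn: int) -> Optional[float]:
    """One-versus-rest Matthews correlation coefficient.

    Returns None when the denominator is zero, for example when no domain
    at all is assigned to the architecture (tp + fp == 0).
    """
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return None
    return (tp * tn - fp * fn) / math.sqrt(denom)


# Illustrative counts only: a 3-domain architecture in a 124-domain dataset,
# never predicted correctly, with one assumed false positive.
print(round(matthews(tp=0, fp=1, fn=3, tn=120), 3))   # -0.014

# A 2-domain architecture that is never predicted at all: coefficient undefined.
print(matthews(tp=0, fp=0, fn=2, tn=122))             # None
```

An architecture that is never predicted gives a zero denominator, which is one possible reading of the "undef" entries in the table.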