
Class and architecture

For each type of domain $t$ at a given level in the CATH classification (e.g. class, architecture) and each amino acid or duplet type $a$ (from now on referred to as a pattern), the number of observations $O_{a,t}$ in the dataset is counted. The counts from multiple sequences are averaged and rounded down. The number of expected occurrences $E_{a,t}$ of a particular pattern for a domain type is calculated as follows:

\begin{displaymath}
E_{a,t} = \frac{\sum{O_a}\sum{O_t}}{\sum{O_{a,t}}}
\end{displaymath} (2)

where $\sum{O_a}$ is the total number of occurrences of pattern $a$ in all fold types, $\sum{O_t}$ is the total number of patterns in fold type $t$, and $\sum{O_{a,t}}$ is the total number of patterns in the entire dataset. The commonly used chi-squared ($\chi^2$) measure of heterogeneity is calculated as the sum of $\frac{(O-E)^2}{E}$ over the different classes of observation. Given $\chi^2$ and the number of degrees of freedom $f$ (see below), the probability that the difference between observed and expected values has occurred by chance can be estimated from suitable tables [Bailey, 1981, for example]. The number of degrees of freedom for $n$ classes of observation is $n-1$, whilst for a two-way classification of $n$ by $m$ classes (in our case, $n$ patterns and $m$ fold types) it is $(m-1)(n-1)$.
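As an illustration of equation 2 and the $\chi^2$ sum, the following is a minimal Python sketch, not the code used in this work. It assumes the counts are held in a table `observed[a][t]` (rows are patterns, columns are fold types); the function names and the example counts are hypothetical.

```python
from typing import List, Tuple


def expected_counts(observed: List[List[float]]) -> List[List[float]]:
    """Expected counts E[a][t] = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]          # sum of O_a per pattern
    col_totals = [sum(col) for col in zip(*observed)]    # sum of O_t per fold type
    grand_total = sum(row_totals)                        # all patterns in the dataset
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed: List[List[float]]) -> Tuple[float, int]:
    """Return the chi-squared sum and the degrees of freedom (n-1)(m-1)."""
    expected = expected_counts(observed)
    chi2 = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    n_patterns = len(observed)
    m_types = len(observed[0])
    return chi2, (n_patterns - 1) * (m_types - 1)


# Made-up counts: two patterns (rows) in two fold types (columns).
observed = [[30, 10], [20, 40]]
expected = expected_counts(observed)
chi2, dof = chi_squared(observed)
print(expected, chi2, dof)
```

The sketch assumes every expected count is non-zero; a pattern or fold type with no observations at all would need to be dropped before the sum is taken.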

When different patterns are compared with each other according to their $\chi^2$ deviations across multiple folding types, a further adjustment is required. To illustrate this, consider the 20 amino acids distributed between three folding types A, B and C. We rank the amino acids by their individual $\chi^2$ sums across the three types. Each $\chi^2$ score has 2 degrees of freedom, and we may find that proline has the highest $\chi^2$ of 6.0. This is significant at the 95% level (the critical value for 2 degrees of freedom is 5.99), i.e. a $\chi^2$ value of at least this magnitude is expected to occur by chance about 1 time in 20 (also written as $P<0.05$). Remember, however, that there are 20 amino acids, so in effect we have performed 20 experiments. Therefore the result for proline could easily have arisen by chance. $P$ must be adjusted in accordance with the number of `replicates', $n$, as follows [Martin et al., 1995]:

\begin{displaymath}
P' = 1 - (1-P)^{n-1}
\end{displaymath} (3)

Note that if the study had been concerned with just proline from the very outset, this adjustment would not be necessary.
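The proline example can be worked through numerically. The sketch below is illustrative only: the function names are hypothetical, and it relies on the fact that for exactly 2 degrees of freedom the $\chi^2$ tail probability has the closed form $e^{-\chi^2/2}$, so no tables are needed. The exponent $n-1$ follows equation 3 as written; the closely related Šidák correction uses $n$ instead, which gives a slightly larger adjusted value.

```python
import math


def chi2_p_value_2dof(chi2: float) -> float:
    """Tail probability of a chi-squared value with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)


def adjust_p(p: float, n: int) -> float:
    """Adjusted probability P' = 1 - (1 - P)^(n - 1), as in equation 3."""
    return 1.0 - (1.0 - p) ** (n - 1)


# Proline: chi-squared of 6.0, ranked first among n = 20 amino acids.
p = chi2_p_value_2dof(6.0)
p_adjusted = adjust_p(p, 20)
print(f"P = {p:.4f}, adjusted P' = {p_adjusted:.2f}")
```

The unadjusted $P$ is just under 0.05, whereas the adjusted $P'$ is roughly 0.6, which is why the proline result on its own cannot be regarded as significant.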

Estimates of statistical significance are not the primary concern of this work. The $\chi^2$ analysis is used mainly to rank the likely contributions of each pattern to the prediction accuracy. $\chi^2$ values resulting from the analysis of sequence pattern frequency between different classes or architectures (as described above) will be referred to as $\chi _t^2$ (`t' for type). In addition, we calculate $(O_{a,t}-E_{a,t})$ for each pattern and fold type combination to determine which patterns are over- and under-represented in each folding type. It does not make sense to divide this difference by $E_{a,t}$ because the prediction algorithm does not currently use normalised pattern compositions.
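A minimal Python sketch of this ranking step is given below, assuming a count table with one row per pattern and one column per fold type. It is not the original implementation; the function name, pattern labels and counts are made up for illustration. For each pattern it returns $\chi_t^2$ together with the raw, unnormalised differences $(O_{a,t}-E_{a,t})$.

```python
from typing import Dict, List, Tuple


def pattern_scores(
    observed: List[List[float]], patterns: List[str]
) -> Dict[str, Tuple[float, List[float]]]:
    """Map each pattern to (chi-squared across fold types, list of O - E)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    scores = {}
    for name, o_row, r in zip(patterns, observed, row_totals):
        e_row = [r * c / grand_total for c in col_totals]
        diffs = [o - e for o, e in zip(o_row, e_row)]   # raw O - E, not divided by E
        chi2_t = sum(d * d / e for d, e in zip(diffs, e_row))
        scores[name] = (chi2_t, diffs)
    return scores


# Made-up counts: three patterns (rows) across fold types A, B and C (columns).
patterns = ["P", "G", "A"]
observed = [[30, 10, 20], [25, 15, 20], [5, 35, 20]]
scores = pattern_scores(observed, patterns)

# Rank patterns by decreasing chi-squared.
ranked = sorted(scores, key=lambda name: scores[name][0], reverse=True)
for name in ranked:
    chi2_t, diffs = scores[name]
    print(name, round(chi2_t, 2), diffs)
```

A positive difference marks a pattern as over-represented in that fold type and a negative difference as under-represented.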


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.