
Sequence length and noise

The composition vectors of short sequences are susceptible to large changes as the result of just a few amino acid substitutions. The structural class information content of these vectors is therefore unstable and is likely to render predictions less reliable. To explore this, Results 8 to 11 in Table 3.1 were obtained using datasets with minimum sequence length cutoffs of between 80 and 320 residues. Since the structural class composition of the datasets is not constant, predictions are best compared using the Matthews correlation coefficient for each class, plotted in Figure 3.1(a). $C_\beta$ improves consistently as the cutoff is raised, but the predictions for the mainly-$\alpha$ and mixed-$\alpha\beta$ protein domains are slightly worse using longer sequences. The use of sequence length-filtered datasets thus greatly improves the accuracy of predicting the presence or absence of $\alpha$-helix (hence the high value for $C_\beta$) but is less effective at distinguishing between the two $\alpha$-containing classes. The length-filtered datasets are, however, generally too small to be of practical use (see $n$ in Table 3.1).
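The sensitivity of short sequences to substitutions can be illustrated with a minimal sketch. It is not the code used to produce the results in Table 3.1: the function names, the toy sequences and the use of Euclidean distance between composition vectors are all assumptions made here for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_vector(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]

def composition_shift(seq, substitutions):
    """Euclidean distance (an assumed measure) between the composition
    vectors before and after applying substitutions, given as
    {position: new_residue}."""
    mutated = list(seq)
    for pos, residue in substitutions.items():
        mutated[pos] = residue
    return math.dist(composition_vector(seq),
                     composition_vector("".join(mutated)))

# Hypothetical toy sequences of 40 and 320 residues.
short_seq = "AG" * 20
long_seq = "AG" * 160
subs = {0: "W", 2: "W", 4: "W"}  # the same three A->W substitutions in both

print(composition_shift(short_seq, subs))
print(composition_shift(long_seq, subs))
```

With the same three substitutions, the shift in the composition vector scales as the inverse of the sequence length, so in this toy case the 40-residue sequence moves eight times further than the 320-residue one.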

Figure 3.1: Reducing noise in the prediction datasets. (a) Matthews coefficients, C(class), vs. the lower sequence length cutoff for datasets of single sequences. (b) Matthews coefficients vs. the lower cutoff for the number of residues in multiple sequences. Consistent improvements for all classes are seen only in (b) where the size of the dataset is also less drastically affected (101 single sequences have over 240 residues, 312 domains have core multiple sequences totalling over 240 residues).
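For reference, a per-class Matthews coefficient of the kind plotted in Figure 3.1 can be computed by treating each class in turn as "positive" and all other classes as "negative". The sketch below assumes this one-vs-rest reading of C(class); the class labels and the toy predictions are hypothetical and are not taken from Table 3.1.

```python
import math

def matthews_per_class(true_labels, predicted_labels, target):
    """One-vs-rest Matthews correlation coefficient for the class `target`."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target and truth == target:
            tp += 1
        elif pred == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 for an undefined coefficient is a convention assumed here.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical toy data: six domains, three structural classes.
truth = ["alpha", "alpha", "beta", "beta", "alphabeta", "alphabeta"]
preds = ["alpha", "alphabeta", "beta", "beta", "alphabeta", "alpha"]

for cls in ("alpha", "beta", "alphabeta"):
    print(cls, matthews_per_class(truth, preds, cls))
```

In this toy case the mainly-$\beta$ class is predicted perfectly while the two $\alpha$-containing classes are confused with each other, giving a high coefficient for the first and low coefficients for the other two.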

At this point we should note that the filtering of the datasets applied here does not unfairly bias the prediction. The prediction accuracies quoted here are valid for any query sequence longer than a prescribed number of residues. The same is not true for datasets of protein domains with idealised class assignments (discussed above); in this case it is impossible to filter out borderline query sequences without first knowing their structures.


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.