Early attempts at structural class prediction from amino acid composition
with small datasets [Nakashima et al., 1986; Klein & Delisi, 1986b;
Chou, 1989] claimed
accuracies of around 70-80% (into three or four classes). In this study,
we find that the size and makeup of the dataset crucially affect the
prediction accuracies: large and comprehensive datasets give accuracies
as low as 57% for three classes (mainly-α, mainly-β and mixed-α/β) using
a jack-knifed implementation of the method of Nakashima et al. [1986].
Short sequences are
shown to be the cause of inaccurate predictions because the small amount of
information they contain is subject to noise. The use of multiple
sequences, where available, overcomes this problem. The global structural
information content of n-tuple (sequence word) composition is also
investigated. We find that three-class accuracies of around 68-69% can
be achieved from multiple sequence alignments using the composition of
either duplets or triplets (using a reduced amino-acid alphabet). The top
50% of predictions, as ranked by a reliability measure, have an accuracy
of 80%. The use of a reliability measure appears to be novel in the area
of class prediction.
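
As an illustration of the n-tuple (sequence word) composition referred to above, the following minimal Python sketch counts duplets or triplets over a reduced amino-acid alphabet and pools the counts over several sequences. The residue grouping in `REDUCED_GROUPS`, the function names, and the pooling of counts across sequences are assumptions made for the example only; they are not taken from the study.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: this grouping is an illustrative
# assumption, not the alphabet used in the study.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar
    "c": "DEKR",    # charged
    "s": "GP",      # glycine and proline
}
RESIDUE_TO_GROUP = {aa: g for g, aas in REDUCED_GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Map a sequence onto the reduced alphabet.

    Characters outside the 20 standard residues (e.g. alignment gaps)
    are dropped, so residues on either side of a gap become adjacent.
    """
    return "".join(RESIDUE_TO_GROUP[aa] for aa in seq.upper()
                   if aa in RESIDUE_TO_GROUP)


def ntuple_composition(seqs, n=2):
    """Return the relative frequency of every n-tuple, pooled over seqs."""
    counts = Counter()
    for seq in seqs:
        reduced = reduce_sequence(seq)
        # Slide a window of width n along the reduced sequence.
        for i in range(len(reduced) - n + 1):
            counts[reduced[i:i + n]] += 1
    total = sum(counts.values())
    # Enumerate all possible words so the vector has a fixed length.
    words = ["".join(w) for w in product(sorted(REDUCED_GROUPS), repeat=n)]
    return {w: (counts[w] / total if total else 0.0) for w in words}


# Two toy sequences standing in for the rows of an alignment.
composition = ntuple_composition(["AVDE", "GP"], n=2)
```

With a five-letter alphabet the duplet vector has 25 components and the triplet vector 125, which keeps the feature space small relative to the 400 and 8000 words of the full 20-letter alphabet.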
The quality of the predictions is also assessed using the Matthews
correlation coefficient [Matthews, 1975], which penalises the
over- and under-prediction which can arise from unequally partitioned
datasets. High values of the Matthews coefficient for the mainly-β
class suggest that the n-tuple approach is best used to distinguish
helix-containing domains from non-helix-containing domains. The results
agree: two-class (mainly-β and not mainly-β) predictions are 83% accurate,
with the top 50% of predictions approximately 95%
correct. Hierarchical prediction is then investigated, including the
prediction of mainly-β architectures, which appears to be feasible using
the simple methods described here. The underlying compositional
differences which allow these predictions are then discussed.
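
To make concrete how the Matthews correlation coefficient penalises over- and under-prediction, the sketch below computes it for one class against all others from lists of true and predicted labels. The function name and the label strings ("alpha", "beta") are placeholders chosen for the example; the convention of returning 0 when the coefficient is undefined is likewise an assumption.

```python
import math


def matthews_coefficient(true_labels, predicted_labels, target):
    """Matthews correlation coefficient for `target` versus all other classes."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == target and truth == target:
            tp += 1   # correctly predicted as target
        elif pred == target:
            fp += 1   # over-prediction of target
        elif truth == target:
            fn += 1   # under-prediction of target
        else:
            tn += 1   # correctly predicted as not target
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Predicting every domain as "alpha" gives 50% accuracy on this toy set,
# yet the coefficient is 0 because the class is grossly over-predicted.
truth = ["beta", "beta", "alpha", "alpha"]
always_alpha = ["alpha"] * 4
mcc_overpredicted = matthews_coefficient(truth, always_alpha, "alpha")
```

A coefficient of 1 indicates perfect prediction of the class, 0 indicates no better than random assignment, and negative values indicate anti-correlation.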