Nakashima et al. [nakashima:cpred] suggested that the use of amino-acid duplets might improve the separation of proteins in composition space, but their unpublished attempts showed no such trend. However, more recent work [Wu et al., 1992; Hobohm & Sander, 1995; Dubchak et al., 1995] suggests that local ordered information is relevant to the classification of protein sequences. We were therefore interested in investigating the use of n-tuples in global structure prediction.
Result 17 in Table 3.1 is obtained using the composition of
amino acid/indel specifier duplets (21-letter alphabet, 441-dimensional
vector). The prediction accuracy is 7% higher than the equivalent result using singlet
composition (Result 12). The Matthews coefficients also indicate a
meaningful increase in prediction accuracy. Encouragingly, this increase
is seen again when the dataset is restricted to multiple sequences with
more than 160 residues in total (Result 18), and again the
Matthews coefficients indicate a real improvement. The introduction of
gaps into the n-tuples (Results 19-21) does not produce any dramatic
improvements, but i,i+3 duplets give the best overall prediction
(Result 20). The quality of the i,i+3 results may be due
to the proximity of these residues in helices. Specific pairs of residues
might be preferred at these positions because of the side-chain
interactions they make when in a helix. In strands, residues at i and i+2
can also make contact, but the results suggest that helical patterns are
more important. Further physical explanations for these results are
discussed in Section 3.5.
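The duplet composition described above can be illustrated with a minimal sketch. It assumes a single sequence in which indel positions have already been marked with the letter J; the function name, the ordering of the alphabet, and the normalisation to unit sum are illustrative assumptions rather than details given in the text, and the pooling of counts over multiple aligned sequences is omitted.

```python
from itertools import product

# 20 standard amino acids plus 'J' as the indel specifier.
# The ordering of the letters is an assumption made for this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYJ"

# Map each ordered pair of letters to one of the 21 * 21 = 441 dimensions.
PAIR_INDEX = {a + b: k for k, (a, b) in enumerate(product(ALPHABET, repeat=2))}


def duplet_composition(sequence, separation=1):
    """Return the 441-dimensional composition of (i, i+separation) duplets.

    separation=1 gives adjacent duplets; separation=3 gives i,i+3 duplets.
    Counts are normalised to sum to 1 (an assumed convention).
    """
    counts = [0.0] * len(PAIR_INDEX)
    total = 0
    for i in range(len(sequence) - separation):
        pair = sequence[i] + sequence[i + separation]
        if pair in PAIR_INDEX:  # skip pairs containing unknown symbols
            counts[PAIR_INDEX[pair]] += 1
            total += 1
    if total == 0:
        return counts
    return [c / total for c in counts]
```

Calling `duplet_composition(seq, separation=3)` gives the gapped i,i+3 variant; the vector length stays at 441 whatever the separation.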
In order to use n-tuples where n > 2, the amino acid alphabet must be
reduced so that the composition vectors do not have too many dimensions for
the data they represent. In order to calculate triplet composition, we
have divided the amino acids into 7 groups based on the Venn diagram of
amino acid properties by Taylor [taylor:venndiag]. Together with the indel
specifier, this gives a total of 8 'amino acid' types: FILM, AV, TC, YWHK, REQ, SDN, PG, and J
(indel specifier). The triplet composition vector has 512 dimensions. The
results using triplet composition (Results 22-26) are very similar to those
obtained using duplets.
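As an illustration of the reduced alphabet, the following sketch maps each residue to one of the 8 types listed above and counts contiguous triplets, giving 8 x 8 x 8 = 512 dimensions. The function name, the indexing scheme, and the normalisation are assumptions made for the example; gapped triplets and multiple-sequence input are not covered.

```python
# The 8 residue types from the text: 7 amino acid groups plus 'J' (indel).
GROUPS = ["FILM", "AV", "TC", "YWHK", "REQ", "SDN", "PG", "J"]

# Map each one-letter code to the index of its group.
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def triplet_composition(sequence):
    """Return the 512-dimensional composition of contiguous group triplets.

    Counts are normalised to sum to 1 (an assumed convention).
    """
    n = len(GROUPS)
    counts = [0.0] * n ** 3
    total = 0
    for i in range(len(sequence) - 2):
        window = sequence[i:i + 3]
        if all(ch in GROUP_OF for ch in window):  # skip unknown symbols
            a, b, c = (GROUP_OF[ch] for ch in window)
            counts[(a * n + b) * n + c] += 1  # base-8 index of the triplet
            total += 1
    if total == 0:
        return counts
    return [c / total for c in counts]
```

With the full 21-letter alphabet the same vector would need 21^3 = 9261 dimensions, which is the motivation for grouping the residues first.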
Clearly there are a bewildering number of alternative ways to construct
n-tuple compositions, using different lengths, gaps, amino acid groupings,
and simultaneous combinations of these. An attempt was made to search for
a better solution using a genetic algorithm. Whilst the algorithm was
capable of improving upon random starting material, no improvement beyond
the results reported above could be shown in either test or training
datasets (data not shown).