N-tuple composition

Nakashima et al. [nakashima:cpred] suggested that the use of amino-acid duplets might improve the separation of proteins in composition space, but their unpublished attempts showed no such trend. However, more recent work [Wu et al., 1992; Hobohm & Sander, 1995; Dubchak et al., 1995] suggests that local ordered information is relevant to the classification of protein sequences. We therefore investigated the use of n-tuples in global structure prediction.

Result 17 in Table 3.1 is obtained using the composition of amino acid/indel specifier duplets (21-letter alphabet, 441-dimensional vector). $Q_c$ is 7% higher than the equivalent result using singlet composition (Result 12). The Matthews coefficients also indicate a meaningful increase in prediction accuracy. Encouragingly, this increase is seen again when the dataset is restricted to multiple sequences with more than 160 residues in total (Result 18): $Q_c = 68\%$, and again the Matthews coefficients indicate a real improvement. The introduction of gaps into the n-tuples (Results 19-21) does not produce any dramatic improvements, but i,i+3 duplets give the best overall prediction, with $C_\beta = 0.55$ (Result 20). The quality of the i,i+3 results may be due to the proximity of these residues in helices: specific pairs of residues might be preferred at these positions because of the side-chain interactions they make when in a helix. In strands, residues at i and i+2 can also make contact, but the results suggest that helical patterns are more important. Further physical explanations for these results are discussed in Section 3.5.
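The duplet composition described above can be sketched as follows. This is a minimal illustration and not the code used to produce the results: the function name `duplet_composition`, the use of `J` as the indel symbol, the skipping of unrecognised characters, and the normalisation of counts to frequencies that sum to 1 are all assumptions. The `gap` parameter selects i,i+1 duplets (`gap=1`) or gapped duplets such as i,i+3 (`gap=3`).

```python
from itertools import product

# 20 standard amino acids plus 'J' as the indel specifier (assumed symbol).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYJ"

# Map every ordered pair of letters to a position in the 441-dimensional vector.
PAIR_INDEX = {pair: k for k, pair in enumerate(product(ALPHABET, repeat=2))}


def duplet_composition(seq, gap=1):
    """Return the frequency of each (i, i+gap) letter pair in seq.

    Pairs containing a character outside ALPHABET are skipped (assumption).
    Frequencies are normalised to sum to 1 (assumption); an all-zero
    vector is returned if no valid pair is found.
    """
    counts = [0] * len(PAIR_INDEX)
    total = 0
    for i in range(len(seq) - gap):
        k = PAIR_INDEX.get((seq[i], seq[i + gap]))
        if k is not None:
            counts[k] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

For an aligned family, one plausible choice (again an assumption) would be to accumulate the counts over every sequence in the alignment before normalising, so that the vector describes the family as a whole.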

In order to use n-tuples where $n > 2$, the amino acid alphabet must be reduced so that the composition vectors do not have too many dimensions for the data they represent. To calculate triplet composition, we have divided the amino acids into 7 groups based on the Venn diagram of amino acid properties by Taylor [taylor:venndiag]. Together with the indel specifier, this gives a total of 8 'amino acid' types: FILM, AV, TC, YWHK, REQ, SDN, PG, and J (indel specifier). The triplet composition vector therefore has $8^3 = 512$ dimensions. The results using triplet composition (Results 22-26) are very similar to those obtained using duplets.
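A minimal sketch of the reduced-alphabet triplet composition is given below. The eight groups are taken from the text; the function name `triplet_composition`, the restriction to contiguous triplets, the skipping of triplets that contain unrecognised characters, and the normalisation to frequencies are assumptions made for illustration.

```python
# The 8 residue types from the text: 7 amino acid groups plus the indel specifier J.
GROUPS = ["FILM", "AV", "TC", "YWHK", "REQ", "SDN", "PG", "J"]

# Map each one-letter code to the index of its group.
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def triplet_composition(seq):
    """Return the 512-dimensional composition of contiguous group triplets.

    Triplets containing a character outside the 8 groups are skipped
    (assumption). Frequencies are normalised to sum to 1 (assumption).
    """
    n = len(GROUPS)
    counts = [0] * (n ** 3)
    total = 0
    for i in range(len(seq) - 2):
        try:
            a, b, c = (GROUP_OF[x] for x in seq[i:i + 3])
        except KeyError:
            continue  # unknown residue code in this window
        counts[(a * n + b) * n + c] += 1
        total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [k / total for k in counts]
```

Each triplet of group indices (a, b, c) is treated as a three-digit number in base 8, which gives a unique position between 0 and 511 in the vector.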

Clearly there is a bewildering number of alternative ways to construct n-tuple compositions: using different lengths, gaps, amino acid groupings, and simultaneous combinations of these. An attempt was made to search for a better solution using a genetic algorithm. Whilst the algorithm was capable of improving upon random starting material, no improvement beyond $Q_c = 70\%$ could be shown in either test or training datasets (data not shown).
