Next: N-tuple composition Up: Class prediction Previous: Sequence length and noise   Contents

Multiple sequences

It is well known that protein domains with similar sequences share the same global structure. Many structure prediction techniques have been developed or enhanced to incorporate the evolutionary information available in this 'mapping' of multiple sequences to a single structure (see Chapter 1). Surprisingly, there seems to be no documented attempt to use multiple sequences in the direct prediction of secondary structural class.

Our extension of the method of Nakashima et al. [nakashima:cpred] to incorporate multiple sequences is, in essence, very simple: the amino acid composition vector for each single sequence is replaced by the mean of the composition vectors of its homologues. The methods of multiple sequence retrieval and processing are described in Appendix A. Multiple sequences were made non-redundant at 50% sequence identity. To eliminate potential noise from less conserved surface loops of variable length and amino acid composition, all regions of inserted sequence were removed: each run of adjacent columns in the multiple sequence alignment containing one or more gaps was collapsed into a single column of 'J's. This process results in gapless 'blocks'. The prediction method now uses a 21-letter alphabet (the 20 amino acids and the indel or join specifier 'J').
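The processing described above can be illustrated with a minimal Python sketch. It is not the original implementation: the function names, the `-` gap character, the ordering of the 21-letter alphabet and the choice to normalise each composition vector over all 21 letters (including 'J') are assumptions made for illustration. Redundancy filtering at 50% identity is not shown, and the input is assumed to be an aligned set of equal-length sequences containing only the 20 standard amino acids and gaps.

```python
from collections import Counter

# Assumed ordering: the 20 standard amino acids followed by the join symbol 'J'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYJ"


def collapse_gapped_columns(alignment, gap="-"):
    """Replace each run of adjacent gap-containing columns with one 'J' column."""
    length = len(alignment[0])
    # A column is 'gapped' if any sequence has a gap at that position.
    gapped = [any(seq[i] == gap for seq in alignment) for i in range(length)]
    blocks = ["" for _ in alignment]
    for i in range(length):
        if gapped[i]:
            # Emit a single 'J' only at the start of a run of gapped columns.
            if i == 0 or not gapped[i - 1]:
                blocks = [b + "J" for b in blocks]
        else:
            blocks = [b + seq[i] for b, seq in zip(blocks, alignment)]
    return blocks


def composition(sequence):
    """Fractional composition of one sequence over the 21-letter alphabet."""
    counts = Counter(sequence)
    return [counts[letter] / len(sequence) for letter in ALPHABET]


def mean_composition(alignment):
    """Mean of the per-sequence composition vectors after gap collapsing."""
    blocks = collapse_gapped_columns(alignment)
    vectors = [composition(seq) for seq in blocks]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy alignment (invented for illustration, not taken from the dataset).
toy_alignment = ["ACD-EF", "AC--EF", "GCDKEF"]
toy_blocks = collapse_gapped_columns(toy_alignment)
toy_vector = mean_composition(toy_alignment)
```

In the toy alignment, the third and fourth columns both contain gaps and are adjacent, so they collapse into a single 'J' column and every sequence becomes a gapless five-letter block. The resulting mean vector would then take the place of the single-sequence composition vector in the class prediction.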

Whilst the prediction using multiple sequences (Result 12 in Table 3.1) is of similar quality to the equivalent prediction with single sequences (Result 7), there is a clear improvement in the prediction of all three classes when the dataset is filtered by the total number of residues in the multiple sequence alignments (Results 13-16 and Figure 3.1(b)). Furthermore, as the cutoff is increased, the sizes of the filtered datasets do not decrease as rapidly as those of the length-filtered datasets, making the multiple sequence method applicable to a larger number of real-life prediction targets. Using only protein domains with more than 160 multiple sequence residues, for example, the prediction accuracy, $Q_c$, is 62%, and this would be applicable to an expected 81% ($383/470$) of real-life sequences.
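The filtering step and the coverage figure can be sketched as follows. This is an illustration only: the function names and the dictionary layout are hypothetical, and counting residues as all non-'J' characters summed over every sequence in the gapless blocks is an assumption about how the total was obtained.

```python
def total_residues(blocks, join="J"):
    """Residues summed over all sequences, ignoring the join symbol (assumed)."""
    return sum(len(seq) - seq.count(join) for seq in blocks)


def filter_by_residues(dataset, cutoff):
    """Keep domains whose alignment holds more than `cutoff` residues in total."""
    return {name: blocks for name, blocks in dataset.items()
            if total_residues(blocks) > cutoff}


def coverage(n_kept, n_total):
    """Fraction of the dataset that survives the filter."""
    return n_kept / n_total


# Hypothetical two-domain dataset with a small cutoff, for illustration only.
toy_dataset = {
    "domain_a": ["ACJEF", "ACJEF", "GCJEF"],  # 12 residues
    "domain_b": ["ACJ", "GCJ"],               # 4 residues
}
kept = filter_by_residues(toy_dataset, cutoff=10)

# The figures quoted in the text: 383 of 470 domains pass the 160-residue cutoff.
print(f"{coverage(383, 470):.0%}")  # 81%
```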


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.