A protocol for fold recognition



A summary of the fold recognition results obtained by aligning mean hydrophobicity and DSC secondary structure predictions follows. The results are estimated from a set of CATH domains with 100-300 residues and at least 5 non-redundant sequence homologues; results from `easy' queries have been discarded.

When the fold of the query can be recognised (that is, it is present in the library), the probability of making a confident () top-ranking prediction is 34% (the coverage); of these predictions, roughly 59% are likely to be correct.
Z-scores higher than 1.6 indicate `easy' targets, which may also be detected using simpler methods (there is no harm in using this method, however).
13% of query sequences from novel folds will be confidently (and wrongly) predicted as having a similar fold in the library.

From these estimates, we project the results for 100 hypothetical queries with no `easily' detectable pairwise identity to our library folds. Each query sequence represents a whole domain of between 100 and 300 residues, and has at least 5 non-redundant sequence homologues. Thirty of the queries represent novel folds; more precisely, their folds are not in our library of 197 folds. The method described above will make 28 confident predictions (): 24 for the 70 queries whose fold is in the library (34% of 70) and 4 for the novel folds (13% of 30). Of these 28, 14 will be correct, 10 will be mis-predicted, and 4 will be over-predicted novel folds. Of the 72 unpredicted queries, only 26 actually have no known fold. Thus in this example the probability of making a confident prediction (coverage) is 28%, and of these predictions, 50% will be correct.
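The arithmetic behind this example can be reproduced with a short Python sketch. It is only an illustration: the function name, the parameter names, and the choice to round each count to the nearest whole query are assumptions made here, not part of the original protocol.

```python
def expected_outcomes(n_queries=100, n_novel=30,
                      coverage=0.34, precision=0.59, novel_fp_rate=0.13):
    """Project prediction counts from the estimated rates.

    coverage      -- P(confident prediction | fold is in the library)
    precision     -- P(correct | confident prediction, fold in the library)
    novel_fp_rate -- P(confident prediction | fold is novel)
    Counts are rounded to whole queries (an assumption of this sketch).
    """
    n_in_library = n_queries - n_novel                      # 70
    confident_in_library = round(n_in_library * coverage)   # 23.8 -> 24
    correct = round(confident_in_library * precision)       # 14.16 -> 14
    mispredicted = confident_in_library - correct           # 10
    overpredicted = round(n_novel * novel_fp_rate)          # 3.9 -> 4
    confident = confident_in_library + overpredicted        # 28
    return {
        "confident": confident,
        "correct": correct,
        "mispredicted": mispredicted,
        "overpredicted_novel": overpredicted,
        "unpredicted": n_queries - confident,
        "unpredicted_novel": n_novel - overpredicted,
        "overall_coverage": confident / n_queries,
        "overall_precision": correct / confident,
    }


if __name__ == "__main__":
    for name, value in expected_outcomes().items():
        print(f"{name}: {value}")
```

Changing `n_novel` shows how strongly the overall figures depend on the assumed proportion of novel folds among the queries.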

How does this relate to real-life predictions? Splitting query sequences into structural domains is difficult when the structure is not known, so our trials using domain sequences are far from ideal. Some idea of domain boundaries may be inferred from multiple sequence alignments, from the literature (if the protein or its homologues have been studied), or from sequence searches against the PRODOM database [Sonnhammer & Kahn, 1994]. In the last of these, a local sequence alignment tool is used to identify internal repeats and sub-sequences which occur in other proteins. Hence multi-domain sequences will not always be split completely by PRODOM if their sequences have not been adequately shuffled by evolution. Lengthy query sequences may not need splitting, since they may be a single domain. In this case, the sequence should be queried against a library of larger domains (for example 250-500 residues), for which the reliability and accuracy should be estimated as above. Shorter sequences should be treated likewise. Effective pre- and post-processing of queries and predictions is therefore important, and much of this is best done by hand. When making multiple predictions for different sequence fragments, the compounding of errors should be taken into account. It is hoped that the SIVA approach will be applied publicly to blind targets at the CASP3 meeting (December 1998), using both automated and manually interpreted protocols.
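As a rough illustration of how errors compound across fragments, the sketch below multiplies per-fragment probabilities of a correct prediction. It assumes that the fragment predictions are statistically independent, which may not hold for fragments of the same protein; the function name and the 50% figure (taken from the worked example above) are used here only for illustration.

```python
import math


def prob_all_correct(per_fragment_precision):
    """Probability that every confident fragment prediction is correct,
    assuming the predictions are independent (an assumption of this sketch)."""
    return math.prod(per_fragment_precision)


if __name__ == "__main__":
    # Three fragments, each with a 50% chance of a correct confident prediction.
    print(prob_all_correct([0.5, 0.5, 0.5]))
```

Under this independence assumption, the chance that all predictions for a multi-fragment query are correct falls off quickly with the number of fragments.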
