Next: Can we recognise all Up: Significance estimates and null Previous: Null predictions   Contents

A protocol for fold recognition

A summary of the fold recognition results obtained by aligning mean hydrophobicity and DSC secondary structure predictions follows. The results are estimated from a set of CATH domains with 100-300 residues and at least 5 non-redundant homologous sequences; results from `easy' queries have been discarded.

From these estimates, we project the results for 100 hypothetical queries with no `easily' detectable pairwise identity to our library folds. Each query sequence represents a whole domain of between 100 and 300 residues and has at least 5 non-redundant sequence homologues. Thirty of the queries represent novel folds; more precisely, their folds are not in our library of 197 folds. The method described above will make 28 confident predictions ($1.4<z<1.6$). Of these, 14 will be correct, 10 will be mis-predicted, and 4 will be over-predictions for novel folds. Of the 72 queries without a confident prediction, only 26 actually have no known fold; the remaining 46 have a fold in the library that goes undetected. Thus in this example the probability of making a confident prediction (coverage) is 28%, and 50% of the confident predictions (14 of 28) will be correct.
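The bookkeeping in this example can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative only, and every count is taken directly from the paragraph above.

```python
# Tally of the 100-query example. All counts come from the text;
# the variable names are hypothetical and chosen for readability.
total_queries = 100
novel_queries = 30          # queries whose fold is not in the library

correct = 14                # confident predictions that are right
mispredicted = 10           # fold is in the library, wrong fold predicted
overpredicted_novel = 4     # novel-fold queries given a confident prediction

confident = correct + mispredicted + overpredicted_novel
unpredicted = total_queries - confident
unpredicted_novel = novel_queries - overpredicted_novel
unpredicted_known = unpredicted - unpredicted_novel

coverage = confident / total_queries   # fraction of queries predicted
accuracy = correct / confident         # fraction of predictions that are correct

print(f"confident predictions: {confident}")
print(f"unpredicted queries:   {unpredicted} "
      f"({unpredicted_novel} novel, {unpredicted_known} with a library fold)")
print(f"coverage: {coverage:.0%}, accuracy: {accuracy:.0%}")
```

The tally makes explicit that the counts are mutually consistent: the 4 over-predicted novel folds and the 26 unpredicted novel folds together account for all 30 novel queries.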

How does this relate to real-life predictions? The splitting of query sequences into structural domains is difficult when the structure is not known, hence our trials using domain sequences are far from ideal. Some idea of domain boundaries may be inferred from multiple sequence alignments, from the literature (if the protein or its homologues have been studied), or from sequence searches against the PRODOM database [Sonnhammer & Kahn, 1994]. In the last of these, a local sequence alignment tool is used to identify internal repeats and sub-sequences which occur in other proteins. Consequently, multi-domain sequences will not always be split completely by PRODOM if their sequences have not been adequately shuffled by evolution. Lengthy query sequences may not need splitting; they may be a single domain. In this case, the sequence should be queried against a library of larger domains (for example 250-500 residues), for which the reliability and accuracy should be estimated as above. Shorter sequences should be treated likewise. Effective pre- and post-processing of queries and predictions is therefore important, and much of this is best done by hand. When making multiple predictions for different sequence fragments, the compounding of errors should be taken into account. It is hoped that the SIVA approach will be applied publicly to blind targets at the CASP3 meeting (December 1998), using both automated and manually interpreted protocols.
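To give a feel for the compounding of errors mentioned above, the sketch below computes the chance that every one of several fragment predictions is correct. It rests on two assumptions that are not made in the text: that the predictions are statistically independent, and that each has the 50% accuracy of the worked example. Errors for fragments of the same protein may well be correlated, so this is an illustration only. The function name is hypothetical.

```python
def prob_all_correct(per_prediction_accuracy: float, n_fragments: int) -> float:
    """Probability that all n confident predictions are correct.

    Assumes independent predictions with equal accuracy (an assumption
    made for illustration; it is not claimed in the text).
    """
    if not 0.0 <= per_prediction_accuracy <= 1.0:
        raise ValueError("accuracy must lie between 0 and 1")
    if n_fragments < 1:
        raise ValueError("need at least one fragment")
    return per_prediction_accuracy ** n_fragments


# Accuracy of 0.5 is taken from the 100-query example (14 correct of 28).
for n in (1, 2, 3):
    p = prob_all_correct(0.5, n)
    print(f"{n} fragment(s): P(all correct) = {p:.3f}")
```

Under these assumptions, confidence in a complete multi-domain assignment falls off quickly with the number of fragments, which is why per-fragment reliability should not be quoted as the reliability of the whole prediction.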


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.