The importance of coordinated testing for fold recognition algorithms has
already been stressed above. Fischer et al. [Fischer et al., 1996] have
developed a number of benchmark sets of
query and library folds. During the final stages of this work, SIVA was
applied to the first benchmark in the
series.
This benchmark consists of 68 query folds and 301 library folds. These are
whole chain structures taken directly from the PDB (unlike the CATH domains
used in this work). Using secondary structure predictions, hydrophobicity
and multiple sequences exactly as described above, SIVA gave 32 correct top
hits according to an assessment performed by Daniel Fischer. The authors
of the benchmark have reported 3D-1D methods which obtain 50 or more
correct top hits (see their web pages for more details). Furthermore,
their application of sequence-only alignments using established
substitution matrices (PAM250, BLOSUM62, GONNET) on this benchmark results
in around 40 correct top hits. How can this be possible, when it has been
shown in this work that SIVA is far superior to standard sequence methods?
One possible explanation is to be found in the alignment algorithms used. Our comparison between Smith-Waterman and SIVA may not be entirely fair because of the length-dependencies of our method (see Section 4.4.2); the Smith-Waterman local alignment algorithm does not make effective use of sequence length information. The approach of Fischer and colleagues is to use the so-called global-local alignment and to rank by raw scores [Fischer et al., 1996; Fischer & Eisenberg, 1996]. The global-local alignment ensures that the entire library sequence is aligned and penalised for any gaps (including at its termini), whilst the termini of the query sequence need not be aligned (or penalised), as in a standard local alignment. This approach is suited to scanning multi-domain queries against a library of single domains, and it should not be as length-dependent as SIVA. To some extent, however, the global-local algorithm may be particularly suited to the benchmark folds. In particular, the optimisation of gap penalties by Fischer et al. [Fischer et al., 1996] appears to be more severe than in this work.
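To make the asymmetric treatment of the termini concrete, the following is a minimal sketch of a global-local (semi-global) alignment score under simple linear gap penalties. The scoring function, gap penalty and sequences shown are illustrative placeholders only; they are not the substitution matrices or (affine) penalties actually optimised by Fischer et al.

    def glocal_score(query, library, score, gap=1.0):
        """Global-local alignment score with linear gap penalties.

        Every library residue must be aligned: gaps against the library,
        including at its termini, are penalised.  Unaligned termini of the
        query are free, as in a local alignment.
        """
        m, n = len(query), len(library)
        # H[i][j] = best score for query[:i] against library[:j]
        H = [[0.0] * (n + 1) for _ in range(m + 1)]
        for j in range(1, n + 1):
            H[0][j] = -gap * j                      # library overhangs are penalised
        for i in range(1, m + 1):                   # query overhangs (H[i][0]) stay at 0
            for j in range(1, n + 1):
                H[i][j] = max(
                    H[i - 1][j - 1] + score(query[i - 1], library[j - 1]),
                    H[i - 1][j] - gap,              # query residue against a gap
                    H[i][j - 1] - gap,              # library residue against a gap
                )
        # The full library (column n) must be consumed; any query suffix is free.
        return max(H[i][n] for i in range(m + 1))

    # Toy usage with an identity score; a real run would use a matrix such as
    # BLOSUM62 and affine gap penalties.
    identity = lambda a, b: 2.0 if a == b else -1.0
    print(glocal_score("GSSMKTAYIAKQRGSS", "AYIAKQR", identity))

Ranking library folds by these raw scores, rather than by a length-corrected statistic, is the procedure described by Fischer and colleagues above.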
The most compatible library fold identified by SIVA in the benchmark for
query sequence 1tahA0 [Noble et al., 1993; triacylglycerol hydrolase from
Pseudomonas glumae, 318 residues] was 2liv00 [Sack et al., 1989;
leucine/isoleucine/valine periplasmic binding protein, 344 residues]. The
`correct' answer 1tca00 (a yeast triacylglycerol hydrolase, 317 residues)
ranked second. However, 1tahA0 and 2liv00 share 66 Cα atoms which can be
superposed with an RMSD of 2.5 Å. The equivalenced residues encompass five
strands and two helices. None of the other `incorrect' top hits in the
benchmark trials had such clear structural similarity. The exclusion of
1mbc00 (sperm whale myoglobin) as a correct hit for 1cpcL0 (phycocyanin, a
light-harvesting protein from cyanobacteria) in the benchmark evaluation is
debatable: although the two proteins have completely different functions and
share minimal sequence identity (10%), both adopt globin-like folds. In our
benchmark results for 1cpcL0, 1mbc00 ranks top, whilst the `correct' answer
1colA0 (colicin) ranks 23rd.
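For reference, the 2.5 Å figure quoted above is the root mean square deviation of the 66 equivalenced Cα positions after optimal rigid-body superposition. A minimal sketch of that calculation (the Kabsch algorithm) follows; the coordinate arrays are assumed to hold already-equivalenced Cα atoms (e.g. extracted from the 1tahA0 and 2liv00 PDB entries), and that extraction step is not shown here.

    import numpy as np

    def superposed_rmsd(P, Q):
        """RMSD (in the units of the coordinates, here Angstroms) between two
        (N, 3) coordinate sets after optimal superposition (Kabsch algorithm).
        P and Q must contain equivalenced atoms in corresponding order."""
        P = np.asarray(P, dtype=float)
        Q = np.asarray(Q, dtype=float)
        P = P - P.mean(axis=0)                   # centre both sets on the origin
        Q = Q - Q.mean(axis=0)
        V, S, Wt = np.linalg.svd(P.T @ Q)        # SVD of the 3x3 covariance matrix
        d = np.sign(np.linalg.det(V @ Wt))       # correct for a possible reflection
        R = V @ np.diag([1.0, 1.0, d]) @ Wt      # rotation taking P onto Q
        diff = P @ R - Q
        return np.sqrt((diff ** 2).sum() / len(P))

    # e.g. superposed_rmsd(ca_1tahA0_matched, ca_2liv00_matched) -> ~2.5,
    # where each argument is a 66 x 3 array of matched C-alpha coordinates.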
Of the 340 benchmark folds, 125 (37%) have fewer than 5 multiple sequences
(gathered using our methods; see Appendix A), unlike the testing domains
used in this chapter, which have at least 5 or 10 multiple
sequences. The detrimental effect of this lack of multiple sequence
information has not yet been measured directly on the benchmark folds.
When the trial of 78 vs. 197 folds (see Section 4.4.1) is repeated using
single sequences (with hydrophobicity and DSC prediction information), the
number of correct top hits drops sharply: the single-sequence result is not
a large improvement over the Smith-Waterman results and is substantially
worse than the multiple sequence results. The existence of a large number
of folds in the benchmark with poor evolutionary information is the most
likely explanation for the poor performance of SIVA in this instance.
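The 37% figure above is simply a count of how many benchmark folds have shallow multiple alignments. As an illustration only, a sketch of that bookkeeping is given below; the FASTA-file-per-fold layout and the function names are assumptions for this sketch, not the actual scripts of Appendix A.

    import os

    def count_sequences(fasta_path):
        """Number of sequences in a FASTA alignment (one '>' header each)."""
        with open(fasta_path) as handle:
            return sum(1 for line in handle if line.startswith(">"))

    def shallow_folds(fold_ids, alignment_dir, min_depth=5):
        """Benchmark folds whose gathered alignment contains fewer than
        `min_depth` sequences (the query itself included)."""
        return [fold for fold in fold_ids
                if count_sequences(os.path.join(alignment_dir, fold + ".fasta"))
                   < min_depth]

    # e.g. len(shallow_folds(benchmark_folds, "alignments/")) / len(benchmark_folds)
    # would reproduce the 37% quoted above.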