Using a rather limited dataset, consistent improvements beyond a baseline
set by the Smith Waterman method have been obtained in the number of
correctly identified top ranking folds. Comparisons with other
methods are difficult to make, except in a fully blind trial such as CASP.
Rost et al.[Rost et al., 1997] recently published work which
compared all combinations of alignments of sequence, predicted secondary
structure and observed secondary structure, using log-odds matrices. Their
baseline, using the Smith Waterman algorithm with a matrix from
McLachlan[McLachlan et al.,
1984] produced 16% correct first hits from 89
queries and a library of 723. Using a combination of sequence and PHD
secondary structure predictions for both query and library sequences, they
obtained 27% correct first hits. Introducing known secondary structure to
the library sequence information improved the performance by only a few
percent. Our baseline was higher, at 26%, and was increased to 48% using a
combination of hydrophobicity and secondary structure
prediction probabilities. Rost et al. showed, however, that the
percentage of correct first hits was around 50% using only query folds
which structurally aligned with library sequences with more than 70%
overlap. The detection of partial matches is a much harder problem, as was
also found at the CASP2 meeting[Marchler-Bauer & Bryant,
1997]. In this
respect, our query set using domains (not whole chains) of between 100 and
300 residues is `easy' (but see below), which probably explains the agreement
of our results with those of Rost et al. Our results were obtained
without structural information, however, and with a different set of
proteins.
How do the results of sequence-only profile methods (including hidden Markov models and position specific scoring methods) compare to our results and fold recognition methods as a whole? As discussed in Section 4.1, the benchmarks for sequence methods are generally quite different to those for fold recognition methods. Blind (or at least coordinated) testing on the same data is the only fair comparison. At CASP2 there were unfortunately too few targets to properly compare accuracies between methods. Most methods did better on easier targets: those with extensive overlap with known folds, and/or slightly more related sequences (judged by sequence identity) or common sequence motifs related to function. Sequence-only hidden Markov methods also did well on these, but not so well on the harder targets (S. Bryant, personal communication; full evaluation to be published in Proteins: Structure, Function and Genetics). Many sequence-only `profilers' did not take part in the experiment.
The alignment of sequence property vectors described in this work
is quite similar to profile methods in two ways. Firstly, evolutionary
information is incorporated from multiple sequence alignments and secondary
structure predictions. Secondly, amino acid substitution matrices (used in
profile and non-profile methods) indirectly encode much of the information
that we have used, hydrophobicity in particular. The method does not
employ position specific gap penalties, however (although they could be
added quite easily). Considering the simplicity of the method, why does it
seem to perform so well on this small dataset? The small number of queries
(27) clearly has some bearing on the results. It is widely accepted that
results on small datasets, regardless of how unbiased they are, tend to
overestimate the accuracy that can be expected in blind trials. Furthermore, our
dataset is biased in favour of success. Domains were selected
for having 10 or more non-redundant (to 70% pairwise identity) multiple
sequences. The optimised evolutionary information content of the dataset
will inevitably improve the sensitivity of our sequence searches. Of
course, we can state that our method achieves the accuracy reported here if
there are enough suitable multiple sequences for the query sequence to meet
this selection criterion, and if a
similar fold exists in the library (null predictions will be discussed
below).
It was simple enough to apply SIVA (using a 1:1 combination of
hydrophobicity and two-state DSC predictions) to a larger set of query and
library folds. Allowing domains of 100-300 residues with at least 5
multiple sequences made 78 queries and 197 library domains available.
This set contains many more domains with lower quality evolutionary
information. Furthermore, with a larger library, more false hits are
expected to occur by chance. With our method, 45% of top hits are
correct, compared to 18% for the Smith Waterman control. Roughly the same
increase in performance is therefore observed with the
larger trial. It should be stressed that the figure of 45% is not an
estimate of the accuracy of distant homologue detection, since the dataset
contained a number of easily detectable homologues (14, using Smith
Waterman).
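The measure quoted throughout this discussion, the percentage of correct first hits, can be illustrated with a minimal sketch. The function name, the dictionary layout and the toy fold labels below are assumptions made for illustration only; they are not the data or the code used in this work.

```python
def top_hit_accuracy(rankings, query_folds, library_folds):
    """Fraction of queries whose top-ranked library domain has the query's fold.

    rankings      : query id -> list of library ids, best score first
    query_folds   : query id -> fold label of the query
    library_folds : library id -> fold label of the library domain
    """
    if not rankings:
        raise ValueError("no queries to score")
    correct = sum(
        1
        for query, ranked_hits in rankings.items()
        if ranked_hits and library_folds[ranked_hits[0]] == query_folds[query]
    )
    return correct / len(rankings)


# Toy data, invented for illustration (not taken from the experiments above).
query_folds = {"q1": "globin", "q2": "tim_barrel", "q3": "rossmann", "q4": "globin"}
library_folds = {"d1": "globin", "d2": "tim_barrel", "d3": "rossmann"}
rankings = {
    "q1": ["d1", "d2", "d3"],  # top hit has the right fold
    "q2": ["d3", "d2", "d1"],  # right fold only ranked second
    "q3": ["d3", "d1", "d2"],  # top hit has the right fold
    "q4": ["d2", "d1", "d3"],  # right fold only ranked second
}
accuracy = top_hit_accuracy(rankings, query_folds, library_folds)
```

Only the first-ranked hit counts under this measure, so a correct fold ranked second contributes nothing. This is one reason why a larger library, with more opportunities for chance false hits, is expected to lower the figure.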
Log-odds and position specific matrix methods require the discretisation of sequence data. Amino acid sequences are inherently discretised into 20 classes, but predictions of secondary structure or accessibility require discretisation[Fischer & Eisenberg, 1996, Rice & Eisenberg, 1997, Rost et al., 1997], leading to the loss of information. Even position specific scoring matrices can be rather `lumpy' with sparse data (few multiple alignments). From our results, the direct comparison (using the Euclidean distance) of mean hydrophobicities and secondary structure prediction probabilities appears to be very effective. It is not necessary to decide where to set the boundaries for different classes of hydrophobicity or prediction probability; hence none of this information is lost (although the hydrophobicity scales have, by definition, 20 discrete values). We suggest that the magnitude of sequence-derived information may be as important as the patterns of discrete states, which have been the basis for most methods in sequence comparison and structure prediction.
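The contrast between direct comparison and discretisation can be made concrete with a small sketch. It assumes that each alignment position is described by a vector of continuous properties (here a mean hydrophobicity and a helix prediction probability) and that the properties are combined with equal weights. The function names, the optional `weights` argument, the 0.5 class boundary and the example values are all hypothetical and are not taken from the implementation used in this work.

```python
import math


def position_distance(a, b, weights=None):
    """Weighted Euclidean distance between two per-position property vectors."""
    if len(a) != len(b):
        raise ValueError("property vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(a)  # equal weighting of all properties (assumed)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def discretise(value, boundaries):
    """Class index of a continuous value: the step that direct comparison avoids."""
    return sum(value >= b for b in boundaries)


# Illustrative positions: (mean hydrophobicity, helix prediction probability).
p = (0.30, 0.49)
q = (0.30, 0.51)
r = (0.30, 0.99)

d_pq = position_distance(p, q)  # very small: p and q are nearly identical
d_qr = position_distance(q, r)  # much larger: q and r differ strongly

# With a single class boundary at 0.5, p and q fall into different classes
# although they are nearly identical, while q and r share a class.
classes = [discretise(v[1], [0.5]) for v in (p, q, r)]
```

In this toy case the discrete classes separate the two most similar positions and merge the two most different ones, whereas the distances preserve the magnitude of the difference in both cases.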