
Prediction confidence

The prediction method is based on simple Euclidean geometry and relies upon protein domains of different structural classes occupying different, but overlapping, regions in composition space. The prediction accuracy for a particular query should therefore fall as the query composition vector approaches an intersection of these fuzzy regions. A simple reliability measure, $R$, based on this assumption has been calculated as follows:


\begin{displaymath}R = 1 - \frac{D_{min}}{\sum{D_{!min}}/(c-1)} \end{displaymath}

where $D_{min}$ is the Euclidean distance between the query sequence composition and the closest (i.e. predicted) class centroid, $\sum{D_{!min}}$ is the sum of the non-minimal distances between query sequence composition and class centroids, and $c$ is the total number of class centroids (3 in this case). For query sequences approximately equidistant to each centroid, $R \rightarrow 0$. Larger values for $R$ occur when the query sequence is closer to one centroid than the others, on average.
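
The calculation of $R$ can be illustrated with a minimal Python sketch. This is not the original implementation: the function name `reliability`, the two-dimensional toy centroids and the zero-distance guard are assumptions made purely for illustration (real composition vectors have many more dimensions).

```python
import math


def reliability(query, centroids):
    """Return (index of the closest centroid, R) for one composition vector.

    R = 1 - D_min / (sum of the non-minimal distances / (c - 1)).
    """
    distances = [math.dist(query, centroid) for centroid in centroids]
    d_min = min(distances)
    predicted = distances.index(d_min)

    # Mean of the c - 1 distances to the centroids that were not chosen.
    others = [d for i, d in enumerate(distances) if i != predicted]
    mean_other = sum(others) / len(others)

    # Assumption: if the query coincides with every centroid, report R = 0.
    if mean_other == 0:
        return predicted, 0.0
    return predicted, 1.0 - d_min / mean_other


# Hypothetical 2-D centroids for three classes (toy values, not real data).
toy_centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

near_one_class = reliability((1.0, 0.0), toy_centroids)
equidistant = reliability((2.0, 2.0), toy_centroids)
```

In this toy case the query at (1, 0) is assigned to the first centroid with $R \approx 0.72$, whereas the query at (2, 2), which is equidistant from all three centroids, gives $R = 0$.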

Figure 3.2: Reliability measures for class predictions. Prediction accuracy improves using higher cutoffs for $R$.
\begin{figure}\epsfig{file=chap3/figs/confidence.ps, width=3in}\\
\end{figure}

In Figure 3.2 the prediction accuracy, $Q_c$, and the prediction coverage (i.e. the fraction of sequences for which a prediction is made) are shown for a range of thresholds of $R$ (prediction details as for Result 20 in Table 3.1). As predictions are filtered using higher thresholds for $R$, the fraction of correct predictions increases. The coverage decreases quite rapidly, however. $Q_c$ is 93% for predictions that satisfy $R>0.06$, but only 7% of predictions reach this level of reliability (see coverage in Figure 3.2). The top 44% of predictions ($R>0.03$) are 83% accurate. In Section 3.4 prediction accuracies are given for four quartiles based on ranking the predictions by $R$.
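
The trade-off between accuracy and coverage can be sketched in Python as follows. The function name `accuracy_and_coverage` and the five toy predictions are invented for illustration only; they are not the data behind Figure 3.2.

```python
def accuracy_and_coverage(predictions, threshold):
    """Filter predictions by R > threshold and return (Q_c, coverage).

    `predictions` is a list of (R, is_correct) pairs. Q_c is the fraction of
    retained predictions that are correct; coverage is the fraction retained.
    Q_c is None when no prediction passes the threshold.
    """
    kept = [is_correct for r, is_correct in predictions if r > threshold]
    coverage = len(kept) / len(predictions)
    accuracy = sum(kept) / len(kept) if kept else None
    return accuracy, coverage


# Made-up (R, is_correct) pairs, purely illustrative.
toy_predictions = [
    (0.08, True),
    (0.05, True),
    (0.04, False),
    (0.02, True),
    (0.01, False),
]

loose = accuracy_and_coverage(toy_predictions, 0.03)   # keeps 3 of 5
strict = accuracy_and_coverage(toy_predictions, 0.06)  # keeps 1 of 5
```

With these toy values, raising the threshold from 0.03 to 0.06 raises the accuracy from 2/3 to 1 while the coverage drops from 0.6 to 0.2, which mirrors the qualitative behaviour described above.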


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.