The `standard sequence search techniques' referred to in this thesis are the current, widely used BLAST[Altschul et al., 1990], FASTA[Pearson, 1990] and Smith-Waterman[Smith & Waterman, 1981] searches of sequence databases, which have very good accuracy when used with care. However, when applied to the complete proteomes of Haemophilus influenzae, Mycoplasma genitalium and Methanococcus jannaschii[Fleischmann et al., 1995, Fraser et al., 1995, Bult et al., 1996], these methods confidently find similar sequences of known function in current sequence databases for only 58%, 79% and 78% of the sequences respectively. In other words, for a randomly selected gene from organisms like these, the chance that standard sequence methods fail to find a significantly similar sequence of known function is around 20-40% (42%, 21% and 22% for these three proteomes). Does this mean that 20-40% of proteins have hitherto unknown functions? Not necessarily: structural comparisons and manual sequence alignments of proteins with similar functions have identified many pairs of proteins whose sequence identity is well below the limit of around 30%, above which standard sequence methods can reliably detect relationships[Levitt & Chothia, 1976, Orengo et al., 1994]. It is estimated that only 15% of these evolutionary relationships can be detected with the Smith-Waterman single sequence method[Brenner, 1996, Hubbard, 1997]. Increasing the coverage of sequence search methods, so that function can be inferred from currently undetectable similarities, is the goal of the sequence and structure based methods discussed below.