Baseline comparison: Smith-Waterman searches of the fold library

We hope to show that the new method developed here performs better than current methods, both threading and sequence-based. Starting with sequence-based methods, the best control experiment is to apply the Smith-Waterman [Smith & Waterman, 1981] local sequence alignment search program, SSEARCH, which is part of the FASTA package [Pearson, 1990]. Unlike FASTA, which initially screens sequences using a fast but approximate method for scoring diagonals in the alignment matrix, SSEARCH performs a complete local alignment for every comparison. Whilst SSEARCH is only a single-sequence method, we improved its chances of success by creating a sequence library from the multiple sequence homologues of the fold library domains (1421 sequences in total). We also used multiple query sequences: for each query domain, each sequence homologue was scanned against all 1421 sequences (except those derived from the query), the resulting local alignment scores were ranked, and the top alignment was selected. Gap penalties of 12 (opening) and 2 (extension) were used with the BLOSUM50 matrix, according to the recommendations in the literature [Henikoff, 1996; Pearson, 1995].
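The search protocol described above can be sketched in a few lines of Python. This is an illustration only, not the SSEARCH implementation: the function names `sw_score`, `top_hit` and `toy_subst` are hypothetical, `toy_subst` is a simple match/mismatch stand-in for BLOSUM50, ranking is assumed to be by raw local alignment score, and a gap of length k is assumed to cost `gap_open + k * gap_extend` (under the alternative convention, where the first gap residue costs `gap_open` alone, every gap would be cheaper by `gap_extend`).

```python
from typing import Callable, List, Tuple


def toy_subst(x: str, y: str) -> int:
    """Stand-in substitution score (NOT BLOSUM50): +10 match, -10 mismatch."""
    return 10 if x == y else -10


def sw_score(a: str, b: str,
             subst: Callable[[str, str], int] = toy_subst,
             gap_open: int = 12, gap_extend: int = 2) -> int:
    """Best local alignment score with affine gaps (Gotoh-style recurrences).

    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    neg = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)      # best score of an alignment ending at (i-1, j)
    f_prev = [neg] * (n + 1)    # same, but ending with a gap that skips a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg] * (n + 1)
        e = neg                 # ending with a gap that skips b[j-1]
        for j in range(1, n + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f_cur[j] = max(f_prev[j] - gap_extend,
                           h_prev[j] - gap_open - gap_extend)
            diag = h_prev[j - 1] + subst(a[i - 1], b[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])   # 0 restarts the local alignment
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def top_hit(query_seqs: List[str],
            library: List[Tuple[str, str]],
            query_domain: str,
            subst: Callable[[str, str], int] = toy_subst) -> Tuple[str, int]:
    """Scan every homologue of the query against the library.

    `library` holds (domain id, sequence) pairs. Entries derived from the
    query domain are skipped, and the single best (domain id, score) is returned.
    """
    best_domain, best_score = "", -1
    for q in query_seqs:
        for domain, seq in library:
            if domain == query_domain:
                continue        # non-self hits only
            score = sw_score(q, seq, subst)
            if score > best_score:
                best_domain, best_score = domain, score
    return best_domain, best_score
```

With a real substitution matrix, `subst` would be replaced by a BLOSUM50 lookup; the exhaustive double loop mirrors the fact that every query homologue is fully aligned against every library sequence, with no heuristic pre-screening.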

Using this protocol, the number of correct non-self top-ranking folds was 7. Details of these hits are presented in Table 4.2. Mean adjusted rank and alignment shifts were not calculated. It is clear from these results that the dataset is not strictly non-homologous. Whilst none of the pairs of domains are more than 20% identical by global alignment methods, some are clearly detectable using standard local alignment methods with a small library of similarly sized domain sequences. Six of the seven hits are pairs which recognise each other. Each of these pairs shares a similar function, as judged by either E.C. number or SCOP classification. 2ohx and 1gdh are both oxidoreductases acting on the CH-OH group of donors with NAD$^+$ or NADP$^+$ as the acceptor. 1tpf and 1pii are both isomerases which interconvert aldoses and ketoses. 5p21 and 1hur are both in the G-protein family of SCOP. These two pairs of domains are placed in the same homologous superfamilies in the latest release of CATH. The recognition of 1exg00 by 1cgt04 is not so easily explained, since in SCOP they have different fold classifications; yet 1exg is in the "carbohydrate-binding domain" superfamily, and the equivalent SCOP domain for 1cgt04 is in the "Starch-binding domain" superfamily. In CATH, both domains belong to the immunoglobulin-like topology. These domains may be more closely related than SCOP suggests.

**Table 4.2:** Details of correct top-ranking fold recognition results using the Smith-Waterman local alignment sequence search of the fold library of 82 domains.

| query domain | query length | library domain | library length | CATH topology | global $\%_{id}$ | local $\%_{id}$ | local overlap |
|---|---|---|---|---|---|---|---|
| 5p2100 | 166 | 1hurA0 | 180 | 3.40.330 | 19.8 | 30.4 | 125 |
| 1hurA0 | 180 | 5p2100 | 166 | 3.40.330 | 19.8 | 29.6 | 125 |
| 2ohxA2 | 139 | 1gdhA2 | 184 | 3.40.330 | 12.0 | 32.8 | 64 |
| 1gdhA2 | 184 | 2ohxA2 | 139 | 3.40.330 | 12.0 | 32.8 | 64 |
| 1tpfA0 | 250 | 1pii02 | 191 | 3.20.40 | 14.8 | 17.2 | 215 |
| 1pii02 | 191 | 1tpfA0 | 250 | 3.20.40 | 14.8 | 17.2 | 215 |
| 1cgt04 | 104 | 1exg00 | 110 | 2.60.40 | 13.4 | 25.8 | 62 |
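
The observation that six of the seven hits are pairs which recognise each other can be checked directly from the query and library columns of Table 4.2. The sketch below is illustrative only; the helper name `split_reciprocal` is hypothetical, and the (query, library) pairs are copied from the table.

```python
from typing import List, Tuple

Hit = Tuple[str, str]  # (query domain, top-ranking library domain)

# Query/library pairs copied from Table 4.2.
HITS: List[Hit] = [
    ("5p2100", "1hurA0"),
    ("1hurA0", "5p2100"),
    ("2ohxA2", "1gdhA2"),
    ("1gdhA2", "2ohxA2"),
    ("1tpfA0", "1pii02"),
    ("1pii02", "1tpfA0"),
    ("1cgt04", "1exg00"),
]


def split_reciprocal(hits: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """Separate hits whose reverse (library -> query) is also a top hit."""
    hit_set = set(hits)
    reciprocal = [(q, t) for q, t in hits if (t, q) in hit_set]
    one_way = [(q, t) for q, t in hits if (t, q) not in hit_set]
    return reciprocal, one_way


reciprocal, one_way = split_reciprocal(HITS)
print(len(reciprocal), "reciprocal hits;", "one-way:", one_way)
```

Only the 1cgt04 to 1exg00 hit lacks a reverse counterpart, which is the case discussed separately in the text.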