Next: Multiple sequence alignments Up: Gathering multiple sequences for Previous: Domain sequences   Contents

Database search

FASTA3 [Pearson, 1990] was used to search SWISS-PROT [Bairoch & Boeckmann, 1991, Bairoch & Apweiler, 1997] release 34 (release 33 was used in the first half of Chapter 3) for sequences similar to each fragment extracted from the PDB file. Default settings were used and the output was filtered as follows. Locally aligned fragments were extracted and retained (as `hits') if their percentage identity was above a threshold, $\tau=100x^{-0.25}$, where $x$ is the length of the alignment (overlap). This length-dependent function was derived empirically and gives a conservative threshold for percent sequence identity: for lengths of 50, 100 and 200, $\tau$ is 38%, 32% and 27% respectively.
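The threshold can be checked numerically with a minimal sketch such as the one below. The function names `identity_threshold` and `is_hit` are illustrative assumptions, not part of FASTA3 or of the original filtering scripts; only the formula $\tau=100x^{-0.25}$ and the "above the threshold" rule come from the text.

```python
def identity_threshold(overlap_length: int) -> float:
    """Return tau = 100 * x**-0.25, the percent-identity threshold
    for a local alignment (overlap) of length x."""
    if overlap_length <= 0:
        raise ValueError("overlap length must be positive")
    return 100.0 * overlap_length ** -0.25


def is_hit(percent_identity: float, overlap_length: int) -> bool:
    """True if the alignment's percent identity is above the threshold.
    (Hypothetical helper; the name is not from the original work.)"""
    return percent_identity > identity_threshold(overlap_length)


# Reproduce the three values quoted in the text.
for x in (50, 100, 200):
    print(x, round(identity_threshold(x)))
# prints:
# 50 38
# 100 32
# 200 27
```

Because the threshold falls as the overlap grows, a 35% identical alignment would be rejected at length 50 but accepted at length 100.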

Once the database searches have been completed for every sequence fragment, the hits are reassembled in order according to the SWISS-PROT identifier of the sequence from which they originated. Fragments are joined with a single `-' character, which is interpreted as a gap in the following steps. The fragments of the original probe sequence are joined in the same way.
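The reassembly step might be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the `(swissprot_id, fragment_index, sequence)` record layout is hypothetical, "in order" is taken to mean the order of the probe fragments that produced each hit, and the identifiers in the example are placeholders rather than real SWISS-PROT entries.

```python
from collections import defaultdict


def reassemble_hits(hits):
    """Group hit fragments by SWISS-PROT identifier and join each group
    with a single '-' character, which later steps treat as a gap.

    `hits` is an iterable of (swissprot_id, fragment_index, sequence)
    tuples; this layout is an assumption made for the sketch.
    """
    grouped = defaultdict(list)
    for swissprot_id, fragment_index, sequence in hits:
        grouped[swissprot_id].append((fragment_index, sequence))
    # Sort each group by fragment index (assumed ordering), then join.
    return {
        swissprot_id: "-".join(seq for _, seq in sorted(parts))
        for swissprot_id, parts in grouped.items()
    }


# Placeholder identifiers and sequences, for illustration only.
example = [("ID_A", 2, "GHK"), ("ID_B", 1, "MKV"), ("ID_A", 1, "ACD")]
print(reassemble_hits(example))
# prints: {'ID_A': 'ACD-GHK', 'ID_B': 'MKV'}
```

The probe fragments can be joined with the same `"-".join(...)` call so that probe and hits share one gap convention.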

Finally, the reassembled hits are filtered so that only those whose lengths are within +/- 20% of the length of the probe sequence are retained, in order to improve the quality of the multiple sequence alignments described below.
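A minimal sketch of this length filter is given below. The text does not say whether the joining `-' characters count towards the length, so the sketch assumes that only residues are counted and that the 20% bounds are inclusive; the function names are likewise hypothetical.

```python
def residue_length(sequence: str) -> int:
    """Length of a sequence, ignoring '-' joiners (an assumption)."""
    return len(sequence.replace("-", ""))


def within_length_tolerance(hit: str, probe: str,
                            tolerance: float = 0.20) -> bool:
    """True if the hit length is within +/- tolerance of the probe length."""
    probe_length = residue_length(probe)
    return abs(residue_length(hit) - probe_length) <= tolerance * probe_length


def filter_by_length(reassembled: dict, probe: str,
                     tolerance: float = 0.20) -> dict:
    """Keep only the reassembled hits whose length is close to the probe's."""
    return {
        identifier: sequence
        for identifier, sequence in reassembled.items()
        if within_length_tolerance(sequence, probe, tolerance)
    }


# Probe of 100 residues in two fragments; placeholder hit identifiers.
probe = "A" * 50 + "-" + "C" * 50
hits = {"ID_A": "A" * 85, "ID_B": "A" * 70, "ID_C": "A" * 130}
print(sorted(filter_by_length(hits, probe)))
# prints: ['ID_A']
```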


Copyright Bob MacCallum - DISCLAIMER: this was written in 1997 and may contain out-of-date information.