Domain sequences

The aim here is to create a sequence for each domain in the CATH database. When a domain does not correspond to the entire chain of its original PDB structure, the residue ranges and chain identifiers that specify it are available in CATH. In this laboratory, PDB files containing just the ATOM records of these domains are also available (these are known as chopped domains). For every domain (whole chain or chopped), we extract the sequence from the carbon-alpha ATOM records of its PDB format file.

A number of chopped domains have large amounts of 'missing' sequence, which would compromise the sensitivity of sequence database searches (most alignment algorithms are not able to cope with large insertions and deletions). It is therefore necessary to extract contiguous sequence fragments as separate entities and to perform a separate search with each fragment, as described below. Fragment boundaries are defined where neighbouring carbon-alphas (in the PDB file) are more than 4Å apart. This approach also deals with missing residues, which are common in PDB files because of a lack of data during structure determination (in surface loops, for example).
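The fragment extraction can be illustrated with a minimal Python sketch. It is not the code used in this work: the function name `ca_fragments`, the `max_gap` parameter, the handling of alternate locations and the mapping of non-standard residues to `X` are all assumptions made for illustration. It reads the fixed-column ATOM records of a PDB format file, keeps the carbon-alpha atoms, and starts a new fragment whenever two successive carbon-alphas are more than 4Å apart (carbon-alphas joined by a trans peptide bond lie roughly 3.8Å apart, so a larger distance indicates a break).

```python
import math

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def ca_fragments(pdb_lines, max_gap=4.0):
    """Return the contiguous sequence fragments of a domain.

    pdb_lines: iterable of PDB format lines for one domain.
    max_gap:   carbon-alpha distance (in Angstroms) above which a
               fragment boundary is placed (hypothetical parameter name).
    """
    fragments = []
    current = []
    prev_xyz = None
    for line in pdb_lines:
        # Keep only carbon-alpha ATOM records (atom name in columns 13-16).
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        # Assumption: where alternate locations exist, keep only the first.
        if line[16] not in (" ", "A"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        # A large jump between successive carbon-alphas ends the fragment.
        if prev_xyz is not None and math.dist(prev_xyz, xyz) > max_gap:
            fragments.append("".join(current))
            current = []
        # Assumption: non-standard residue names are written as 'X'.
        current.append(THREE_TO_ONE.get(line[17:20], "X"))
        prev_xyz = xyz
    if current:
        fragments.append("".join(current))
    return fragments
```

Each string returned by `ca_fragments` would then be used as a separate query sequence. Because the test is on distance alone, it treats a break inside a chopped domain and a run of unobserved residues in exactly the same way, without consulting the residue numbering.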
