9. SWOT MATRIX AND ANALYSIS OF THE LIGA RISARALDENSE DE RUGBY
9.2 SWOT MATRIX
Each query sequence was searched against nrdb90 (Holm and Sander, 1998) using PSI-BLAST to detect related sequences. PSI-BLAST parameters were set as described by Altschul and Koonin (1998), with an E-value threshold of 0.001 for inclusion in the next round of searching (-h option). Each search was run for at most 4 iterations, or until convergence. Matches identical to the query sequence were removed. Each pairwise sequence alignment generated by PSI-BLAST with an E-value less than 0.01 (see section 5.3.1.1) was then analysed.
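The hit selection described above can be sketched as follows. This is a minimal illustration, not the original implementation: the `Hit` record, its field names, and the definition of an "identical match" (100% identity over the full query length) are assumptions made for the example.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.01  # alignments must score below this to be analysed


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST pairwise alignment."""
    subject_id: str
    evalue: float
    percent_identity: float  # 0.0 to 100.0 over the aligned region
    q_start: int             # first aligned query residue (1-based)
    q_end: int               # last aligned query residue (1-based)


def significant_hits(hits, query_len, cutoff=E_VALUE_CUTOFF):
    """Drop identical matches to the query, then keep hits with E < cutoff."""
    kept = []
    for h in hits:
        covers_query = h.q_start == 1 and h.q_end == query_len
        # Assumption: "identical" means 100% identity across the whole query.
        if h.percent_identity == 100.0 and covers_query:
            continue
        if h.evalue < cutoff:
            kept.append(h)
    return kept
```

A stricter or looser definition of identity would change only the `continue` condition.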
Next, the positions in the query sequence to which the aligned termini of each database sequence had been matched were recorded. This was repeated for all significant PSI-BLAST matches, such that a count was made at a residue position in the query sequence whenever a terminal residue of an aligned database sequence had been matched to it. For example, if the first residue (i.e. the N-terminal aligned residue) of a database sequence alignment was matched to residue 100 of the query chain, the count for residue 100 was incremented by 1. In cases where the alignment terminus was within 15 residues of the real N- or C-terminus of the database sequence, the match was counted twice, since such aligned residues are likely to represent genuine domain termini of the database sequence. Two separate counts were made: one for residues matched to database sequence N-termini and one for residues matched to C-termini.
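The counting step can be illustrated with a short sketch. The `Alignment` record and its field names are hypothetical, and "within 15 residues of the real terminus" is read here as an unaligned overhang of at most 15 residues, which is one possible interpretation of the text.

```python
from dataclasses import dataclass

NEAR_TERMINUS = 15  # overhang (in residues) that triggers double counting


@dataclass
class Alignment:
    """Hypothetical record of one significant pairwise alignment."""
    q_start: int  # query residue matched to the first aligned subject residue (1-based)
    q_end: int    # query residue matched to the last aligned subject residue (1-based)
    s_start: int  # first aligned residue of the database sequence (1-based)
    s_end: int    # last aligned residue of the database sequence (1-based)
    s_len: int    # full length of the database sequence


def count_termini(query_len, alignments):
    """Return (n_counts, c_counts), one count per query residue."""
    n_counts = [0] * query_len
    c_counts = [0] * query_len
    for a in alignments:
        # Unaligned residues before and after the aligned region of the subject.
        n_overhang = a.s_start - 1
        c_overhang = a.s_len - a.s_end
        # Count twice when the aligned end lies close to the real terminus.
        n_counts[a.q_start - 1] += 2 if n_overhang <= NEAR_TERMINUS else 1
        c_counts[a.q_end - 1] += 2 if c_overhang <= NEAR_TERMINUS else 1
    return n_counts, c_counts
```

Keeping the N- and C-terminal counts in separate lists mirrors the two separate distributions described in the text.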
Once all significantly aligned sequences had been counted, a smoothing window of 15 residues (section 5.3.1.1) was moved over the query sequence, once for the N-terminal distribution and once for the C-terminal distribution. This procedure averaged the values of all residues lying within the window and assigned the average to the central residue. Smoothing was used to take into account variability in the length of homologous sequences.
The two smoothed distributions were then combined, taking into account those regions in which both N-terminal and C-terminal residues had been found to align. Such regions were more likely to indicate the end of one domain and the beginning of the next, and therefore a domain boundary in the query sequence. In such situations it might be expected that the C-terminal aligned region of one homologous domain sequence is closely followed by the N-terminal aligned region of another homologous sequence. Regions that contained counts for both N- and C-terminal aligned residues were given a weighting, such that half the maximal count at the N- or C-terminal residue position was added to the sum of the N- and C-terminal counts to form a combined value. For positions at which only N-terminal or only C-terminal alignments (or neither) had been observed, the combined profile was simply the sum of the two distributions. The combined smoothed distribution now represented a profile of the aligned database sequences.
The elevated regions of the profile should represent putative domain boundaries, i.e. regions in which a number of database domain termini have been found. To assign domain boundaries from such elevated regions, or profile peaks, their significance was assessed against the distribution over the remainder of the query chain. To do this, the mean and standard deviation of the distribution were calculated and a Z-score was calculated for each residue in the query sequence, where the Z-score is given by:
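\[ Z\text{-score} = \frac{x - \bar{x}}{\sigma(x)} \]
where x is the profile value at a given residue position, and \bar{x} and \sigma(x) are the mean and standard deviation of the profile values.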
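The smoothing and combination steps that produce the profile can be sketched as below. This is an illustration under stated assumptions: the window is truncated at the ends of the sequence (the text does not say how the ends are handled), and "half the maximal count at the N- or C-terminal residue position" is read as half of the larger of the two smoothed values at that position. The function names are hypothetical.

```python
def smooth(counts, window=15):
    """Moving average: each residue receives the mean of the window centred on it."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        # Assumption: the window is truncated at the sequence ends.
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


def combine(n_smooth, c_smooth):
    """Sum the two smoothed distributions, adding a bonus where both are non-zero."""
    combined = []
    for n, c in zip(n_smooth, c_smooth):
        total = n + c
        if n > 0 and c > 0:
            # Assumed reading of the weighting: half of the larger value.
            total += 0.5 * max(n, c)
        combined.append(total)
    return combined
```

For example, `combine(smooth(n_counts), smooth(c_counts))` would give the combined profile for a pair of count lists of equal length.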
The alignment of nrdb90 sequences can result in termini hits that are equivalent to the N- and C-termini of the query chain. Because homologous sequences vary naturally in length, the aligned positions of these termini hits may not exactly match the termini of the query sequence. This can lead to profile 'edge effects' in both the first and last 40 residues of the query sequence. Such edge effects may appear as significant peaks in the profile, so it was important to prevent them from producing incorrect boundary predictions. To address this issue, the first and last 40 residues of the boundary profile were flattened, i.e. they were given a value of 0. In cases where edge effects were found to extend beyond the first or last 40 residues of the sequence, the corresponding residues were also flattened. The mean and standard deviation of the smoothed N- and C-termini alignment counts were therefore calculated over all residues except those considered to be part of an edge effect.
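A minimal sketch of the Z-score calculation with edge flattening is given below. It makes several assumptions not stated in the text: the population standard deviation is used, flattened residues are given a Z-score of 0 so that they can never be selected, and edge effects extending beyond the fixed 40-residue margin are not detected. The function name is hypothetical.

```python
import statistics

EDGE = 40  # residues flattened at each end of the profile


def z_scores(profile, edge=EDGE):
    """Z-score per residue; mean and deviation exclude the flattened edges."""
    n = len(profile)
    # Interior residues are the only ones used for the mean and deviation.
    interior = list(profile[edge:n - edge])
    z = [0.0] * n  # assumption: flattened edge residues keep a Z-score of 0
    if not interior:
        return z  # sequence too short to have any interior residues
    mean = statistics.fmean(interior)
    sd = statistics.pstdev(interior)  # assumption: population standard deviation
    if sd == 0:
        return z  # a flat profile has no significant peaks
    for i in range(edge, n - edge):
        z[i] = (profile[i] - mean) / sd
    return z
```

Handling edge effects that run past the margin would require widening `edge` for that sequence, which is left out of this sketch.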
All peaks in the profile with a Z-score greater than or equal to 1.5 (section 5.3.1.1) were assigned as putative domain boundaries. Domain boundaries were assigned to the highest peak first. Whenever a domain boundary was assigned, residues within ± 30 amino acids of the central residue of the profile boundary peak were given a Z-score of 0. This prevented domain boundaries from being assigned within 30 residues of one another, since 30 residues is the smallest domain size permitted in the CATH database (Orengo et al., 1997).
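The highest-peak-first assignment can be sketched as a greedy loop. This is an illustrative simplification: the residue with the maximum Z-score is taken as the centre of each peak (for a flat-topped peak the first such residue is used), and the function name and 1-based output are assumptions.

```python
def assign_boundaries(z, threshold=1.5, exclusion=30):
    """Return 1-based boundary positions, picking the highest peak first."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    z = list(z)  # work on a copy so the caller's scores are untouched
    boundaries = []
    while z:
        best = max(range(len(z)), key=z.__getitem__)
        if z[best] < threshold:
            break  # no remaining peak is significant
        boundaries.append(best + 1)
        # Zero the scores within +/- exclusion residues of the assigned peak.
        lo = max(0, best - exclusion)
        hi = min(len(z), best + exclusion + 1)
        for i in range(lo, hi):
            z[i] = 0.0
    return sorted(boundaries)
```

Because each assignment zeroes its own neighbourhood, no two reported boundaries can lie within 30 residues of one another.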