Preparing the sequence alignment
For the co-evolution analysis, the Pfam families and structures from the solvent exposure analysis discussed in Chapter 4 were used. Pfam families with members known to exist as homo-oligomers were identified and removed using the database discussed in Chapter 3. This step was taken because, in a homo-oligomer, it would be impossible to determine whether a co-substitution detected by this analysis resulted from long-range intra-domain interactions or from short-range inter-domain interactions between neighbouring sites of the two domains in the biological unit, as illustrated in Figure 5.1. The analysis was performed independently on 46 Pfam families with a sequence and structural reference taken from Eukaryota and 51 Pfam families with a sequence and structural reference taken from Prokaryota.
Figure 5.1: Co-substitution events in a homo-oligomer. The residue at site x could influence the residue type at site y in both A and A′, i.e. the residue type at x in monomer A might influence y in A and in A′.
Each Pfam family has a sequence alignment consisting of known examples of the domain; these are segments of the protein sequences stored in the UniProt data-bank. The alignment for each selected family was processed as follows. The sequences in the alignment were compared with the sequences from the structural examples of the domain. The structural example with the most residues and the highest sequence identity to a sequence in the alignment was selected. Since crystal structure data is sometimes taken from a mutated variant of the wild type, it is not always possible to obtain a perfect match between the structure sequence and the protein sequence derived from genomic data. These two sequences formed a pair: the structure sequence and the alignment-reference-sequence. As in the solvent exposure analysis, the separation into taxonomic groups (Eukaryota/Prokaryota) was maintained. All sequences in the Pfam alignment that did not belong to the taxonomic group of the selected reference structure were removed, and a new alignment was compiled from the remaining sequences. A threshold of at least 35% sequence identity with the reference sequence was also applied, since it has been shown that structural similarity is maintained between homologous sequences with a sequence identity of c. 35% and higher [20].
Two further filtration steps were applied to the alignment. Firstly, the percentage of gapped entries in each column was determined. Columns consisting of 45% gaps or more were excluded from the analysis, as these columns probably represented poor-quality regions of the alignment and would most likely reduce the reliability of our results. Secondly, near-identical sequences were removed. This was done because of the weakness of the Henikoff weighting described in Section 2.5. The discussion of the weighting showed that the Henikoff weighting method does not reduce the weight of two equal sequences by one half, three equal sequences by one third, and so on. Due to this limitation we decided to remove sequences which were close to identical, with the threshold set at 95% sequence identity. The sequence identity for every possible pair of sequences in the alignment was calculated. If a sequence pair was found to have an identity at or above the threshold, the first of the pair was added to a list of excluded sequences. This is not the ideal solution, which means that the best possible data set for analysis was not necessarily obtained. However, given the time constraints on developing an optimal solution, this was the simplest solution available. Importantly, the 95% threshold removes little co-substitution data, while the reduction speeds up the algorithm that locates co-substitutions. This completed the preparation of the sequence alignment data for the search for co-substitution events.
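The two filtration steps can be sketched in Python as follows. This is a minimal illustration rather than the code used for the analysis: the function names are hypothetical, and the definition of sequence identity (matches counted over columns in which both sequences have a residue) is an assumption, since the text does not specify how identity was computed.

```python
def percent_identity(a, b):
    """Percent identity over columns where both sequences have a residue.

    Assumption: gapped columns are ignored; the thesis does not state
    which definition of identity was used.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def gap_filtered_columns(seqs, max_gap_percent=45.0):
    """Return the columns whose gap percentage is below the cut-off."""
    rows = list(seqs.values())
    length = len(rows[0])
    return [
        col for col in range(length)
        if 100.0 * sum(row[col] == "-" for row in rows) / len(rows) < max_gap_percent
    ]


def redundant_sequences(seqs, threshold=95.0):
    """For every pair at or above the threshold, exclude the first of the pair."""
    names = list(seqs)
    excluded = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if percent_identity(seqs[first], seqs[second]) >= threshold:
                excluded.add(first)
    return excluded


# Toy alignment (hypothetical data).
seqs = {"a": "ACDE-", "b": "ACDE-", "c": "ACDQF", "d": "GH-K-"}
columns = gap_filtered_columns(seqs)
excluded = redundant_sequences(seqs)
```

In the toy alignment the last column is 75% gaps and is dropped, and sequence "a" is excluded because it is identical to "b". A real pipeline would also need to ensure that the alignment-reference-sequence itself is never placed on the exclusion list.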
Mapping the structure to the alignment
The structure sequence and the alignment-reference-sequence described above were used to create a map between positions in the reference structure and columns in the sequence alignment. This was done by first recording the positions of all the gaps in the alignment-reference-sequence. The gaps were then removed and ClustalW [82] was used to align the structure sequence with the alignment-reference-sequence. Positions in the structure sequence which had to be gapped in order to align properly with the alignment-reference-sequence were recorded. The positions in the alignment-reference-sequence matching positions in the structure sequence were also recorded, as these are the only column positions in the alignment for which distance information is available in the distance matrix, and thus the only ones relevant to our analysis.
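The mapping step can be illustrated with a short Python sketch. It assumes the pairwise alignment has already been produced (for example by ClustalW) and is supplied as two gapped strings of equal length; the function names and the toy sequences are hypothetical.

```python
def ungapped_to_column(aligned_row):
    """Alignment column index for each residue of a gapped alignment row."""
    return [col for col, ch in enumerate(aligned_row) if ch != "-"]


def structure_to_column_map(aligned_row, pair_structure, pair_reference):
    """Map structure-sequence indices to columns of the family alignment.

    aligned_row    -- the alignment-reference-sequence as it appears in the
                      family alignment (with gaps)
    pair_structure -- structure sequence from the pairwise alignment
    pair_reference -- ungapped reference sequence from the pairwise alignment
                      (may contain new gaps introduced by the pairwise aligner)
    """
    columns = ungapped_to_column(aligned_row)
    mapping = {}
    s_idx = r_idx = 0
    for s_ch, r_ch in zip(pair_structure, pair_reference):
        if s_ch != "-" and r_ch != "-":
            # Both sequences have a residue here: the position is mapped.
            mapping[s_idx] = columns[r_idx]
        if s_ch != "-":
            s_idx += 1
        if r_ch != "-":
            r_idx += 1
    return mapping


# Toy example: the structure lacks the residue 'E' of the reference.
mapping = structure_to_column_map("A-CD-EF", "ACD-F", "ACDEF")
```

Only positions present in both sequences receive an entry, which mirrors the statement that these are the only alignment columns with distance information.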
Generating the Distance Matrix
The inter-residue distances between pairs of residues in the domain structure were calculated using the PDB module of BioPython [75]. All inter-residue distances were measured between Cβ atoms. Glycine, having no Cβ atom, posed a problem in this regard. To address this, all the glycine positions in the structure sequence were first relabelled as alanine using a Bash shell script. Secondly, the CORALL command of the program WHATIF [74] was used to correct the atomic positions of the structure so as to incorporate a Cβ atom for each glycine changed to alanine. This inserted the missing Cβ atoms for the relabelled glycines, but the structure was not optimised to accommodate the additional atoms; thus the inter-residue distances remained unchanged for all other residue pairs, although CORALL did add any atoms missing from residues in the structure. The inter-residue distances were then determined. The glycine positions were recorded to ensure that there was no confusion between the actual alanine residues in the structure and those representing the replaced glycines. As such, the correct residue is recorded for each position in the distance matrix. However, the inter-residue distances recorded between any residue and a glycine are measured with respect to the pseudo Cβ position.
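As a minimal sketch of the distance calculation, the following Python code computes a symmetric Cβ distance matrix from coordinates and records which positions were originally glycine. It assumes the Cβ (or pseudo Cβ) coordinates have already been extracted from the corrected structure file; the input layout, the function name and the toy coordinates are hypothetical, and the standard library is used here in place of BioPython to keep the example self-contained.

```python
import math


def cbeta_distance_matrix(residues):
    """Build a symmetric C-beta distance matrix.

    residues -- list of (original_residue_name, (x, y, z)) tuples, where the
                coordinate is the C-beta atom (a pseudo C-beta for glycine).
    Returns the matrix and the set of positions that were originally glycine.
    """
    n = len(residues)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(residues[i][1], residues[j][1])
            dist[i][j] = dist[j][i] = d
    glycine_positions = {i for i, (name, _) in enumerate(residues) if name == "GLY"}
    return dist, glycine_positions


# Toy coordinates in angstroms (hypothetical data).
residues = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.0, 4.0, 0.0)),
    ("SER", (0.0, 0.0, 12.0)),
]
dist, glycine_positions = cbeta_distance_matrix(residues)
```

Keeping the original residue name alongside each coordinate is one simple way to avoid confusing true alanines with relabelled glycines.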
Permissible co-substitution events in a given environment may be constrained by the requirements for packing chains in the tertiary structure, and they may also be constrained by the requirements for the formation of secondary structures. Given that accuracies above 90% have been reported for secondary structure prediction [83], it makes sense to ignore interactions that may be intra-secondary structure and to focus on packing interactions associated with the tertiary structure. To avoid intra-secondary structure interactions, substitutions between residue pairs 15 residues or fewer apart from each other in the sequence were not considered. The choice of this separation is supported by Brunak et al. [84], who studied inter-atomic distances between Cα atoms. They reported that the distribution of inter-atomic distances in the tertiary structure reflects the secondary structure units within the protein structure. They saw the strongest indication of secondary structure effects between residues separated by fewer than 10 residues, although they claim to have seen some effect up to a separation of 20 residues. This is further supported by the work of Mr. Welland, who calculated P(d|s), the probability of a distance d between Cβ atoms given a sequence separation s. Figures 5.2 and 5.3 show P(d|s) against the distance of separation in the structure. The first figure illustrates the distribution at short sequence separations, up to 16 residues apart, and shows that secondary structure effects disappear at a separation of about 15 residues. The second figure illustrates that the secondary structure effects do not reappear at any separation greater than 15 and that, at a separation of 15 residues and above, the distribution of P(d|s) very strongly resembles a Poisson distribution. For these reasons, co-substitutions were considered for analysis only for pairs of residues separated by 15 residues or more in the sequence alignment.
Figure 5.3: P(d|s) vs. inter-residue separation.
The distance matrix was built taking into account the solvation states of the residue pairs. The inter-residue distances were measured between residues in the reference structure and sorted according to the solvent exposure of each residue in the pair. Three sets of residue pairs were generated: surface pairs, where both residues were on the surface; buried pairs, where both residues were in the protein interior; and mixed pairs, where one residue was buried and the other was on the surface. The solvent exposure used to delimit the boundary between the surface and the protein interior was a measure of 20 HSEu13, as determined by the solvent exposure analysis discussed in Chapter 4. The reference structure had been selected from the structures used for the analysis of solvent exposure, with the solvent exposure values stored in the B-factor column of the PDB structure file. The inter-residue distances in the reference structures (which represent inter-column distances in the alignment) were assigned to distance range-bins of width 3 Å. Thus, the final distance matrix was divided, firstly, into three sets according to the solvation state of the residue pairs, and secondly, by the range-bin assigned to each residue pair.
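The sorting and binning described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the text gives the boundary value (20 HSEu13) but not which side counts as buried, so the sketch assumes that values at or above the boundary are buried; the minimum sequence separation is applied here to structure positions rather than alignment columns; and all names and data are hypothetical.

```python
def classify_pair(exp_i, exp_j, boundary=20.0):
    """Classify a residue pair as 'surface', 'buried' or 'mixed'.

    Assumption: an exposure value at or above the boundary means buried.
    """
    buried_i = exp_i >= boundary
    buried_j = exp_j >= boundary
    if buried_i and buried_j:
        return "buried"
    if not buried_i and not buried_j:
        return "surface"
    return "mixed"


def bin_pairs(distances, exposure, bin_width=3.0, min_separation=15):
    """Sort residue pairs by solvation state and assign distance range-bins."""
    sets = {"surface": {}, "buried": {}, "mixed": {}}
    n = len(exposure)
    for i in range(n):
        # Skip pairs closer in sequence than the minimum separation.
        for j in range(i + min_separation, n):
            state = classify_pair(exposure[i], exposure[j])
            sets[state][(i, j)] = int(distances[i][j] // bin_width)
    return sets


# Toy data: three residues; separation filter relaxed to 1 for the example.
toy_distances = [
    [0.0, 5.0, 12.0],
    [5.0, 0.0, 13.0],
    [12.0, 13.0, 0.0],
]
toy_exposure = [25.0, 10.0, 30.0]
pair_sets = bin_pairs(toy_distances, toy_exposure, min_separation=1)
```

With a bin width of 3 Å, bin index k covers distances from 3k up to, but not including, 3(k + 1) Å.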
Bootstrap generation
The distance data was used to generate randomised distance matrices for the purpose of performing a bootstrap analysis. The residue-pair and measured inter-residue distance data for a given solvation state was taken. From this data a list of all residue pairs and a list of all inter-residue distances were compiled. An inter-residue distance was then randomly selected from the list, with replacement, and assigned to a residue pair. This was repeated for all residue pairs. One hundred randomised distance matrices were generated for each set of position pairs in each Pfam family selected for the analysis. Due to time considerations, the residue-position pairs in the “mixed” set (i.e. one residue on the surface and one buried) were excluded; these will need to be analysed at a later date to determine the co-substitution behaviour between the surface and the protein interior.
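The resampling procedure can be sketched in a few lines of Python. The function name, the seed parameter and the toy data are hypothetical; the sketch follows the description above by drawing, for every residue pair, one distance at random with replacement from the pooled list of observed distances.

```python
import random


def randomised_matrices(pair_distances, n_replicates=100, seed=None):
    """Generate randomised distance assignments for a bootstrap analysis.

    pair_distances -- dict mapping (i, j) residue pairs to observed distances
                      for one solvation state
    Returns a list of dicts with the same pairs but resampled distances.
    """
    rng = random.Random(seed)
    pairs = list(pair_distances)
    distances = list(pair_distances.values())
    return [
        # Sampling with replacement: each pair draws independently.
        {pair: rng.choice(distances) for pair in pairs}
        for _ in range(n_replicates)
    ]


# Toy observed distances for one solvation state (hypothetical data).
observed = {(0, 20): 6.2, (0, 35): 14.8, (5, 40): 9.1}
replicates = randomised_matrices(observed, seed=0)
```

Fixing the seed makes the set of randomised matrices reproducible, which is convenient when the bootstrap needs to be rerun.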