Investigations of co-evolution and correlated mutations in protein structures have focused on predicting residues in direct contact, because the driver for co-evolution was assumed to be structural pressure on a local environment. As discussed in Section 1.2, correlated mutation analysis identifies pairs of columns in a sequence alignment whose amino acids vary in correlation with each other. However, the relationship between inter-residue distance and correlated mutations has not been explicitly studied, because long-range interactions were not considered important. We argue that correlated mutation analysis is limited in that it does not consider the residues or residue types involved in the mutations, and that the lack of physical distance considerations is an oversight. By analysing the propensity for different pairs of residues to be jointly involved in co-substitutions, it is possible to determine characteristics of the underlying physics governing protein structures. Consider two positions that appear to be mutating in a correlated fashion, where the columns contain exclusively charged residues and the mutations appear to conserve the attractive or repulsive nature of the electrostatic potential. If a specific attractive interaction, e.g. a hydrogen bond, is of primary importance, it could perhaps be replaced by a salt bridge. However, if the attraction is important but the specificity is not, a hydrophobic interaction may also be suitable. In other words, residues may need to be conserved to maintain the folding pathway but do not necessarily need to be close together or interact in the folded structure.
Figure 1.4: Co-substitutions due to long-range interactions. The structural exemplar for the sequences in the alignment on the right is shown on the left. Interactions across some physical distance, e.g. between points 3 and n, can be determined from the co-substitution behaviour shown in columns 3 and n of the sequence alignment.
Here we present an analysis of the propensity for residues to “co-substitute” at different physical distances, to explicitly investigate the role of distance in the propensity to co-substitute. Unlike a correlated mutation analysis, where the emphasis is on determining the correlation of mutations between columns, we investigate individual co-substitutions. Consider Figure 1.4, which is similar to Figure 1.1 and illustrates long-range interactions between positions in the protein structure. In the sequence alignment on the right-hand side, columns 3 and n show a conservation of electrostatic repulsion across some distance, while columns 3 and l show a conservation of electrostatic attraction. These are both examples of an electrostatic interaction being maintained. The objective of the work presented in this thesis has been to develop a method that can elucidate the distance preferences of different types of amino-acid co-substitutions.
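The distinction between column-level correlation and individual co-substitutions can be illustrated with a short sketch. It is not the method developed in this thesis, and it rests on a working assumption: a co-substitution is counted whenever two aligned sequences differ in both of the chosen columns. The function name `count_co_substitutions`, the gap symbol `-` and the toy alignment are hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import combinations


def count_co_substitutions(alignment, col_i, col_j, gap="-"):
    """Count individual co-substitutions between two alignment columns.

    Assumption (illustrative only): a co-substitution occurs when a pair of
    sequences differs in BOTH columns. Each event is stored as an unordered
    pair of residue pairs, e.g. ("KE", "RD").
    """
    counts = Counter()
    for seq_a, seq_b in combinations(alignment, 2):
        a_i, a_j = seq_a[col_i], seq_a[col_j]
        b_i, b_j = seq_b[col_i], seq_b[col_j]
        # Skip sequence pairs with a gap in either column.
        if gap in (a_i, a_j, b_i, b_j):
            continue
        # Both positions must change for the event to be a co-substitution.
        if a_i != b_i and a_j != b_j:
            key = tuple(sorted((a_i + a_j, b_i + b_j)))
            counts[key] += 1
    return counts


# Toy alignment: columns 0 and 2 hold charged residues.
toy_alignment = ["KAE", "RAD", "KAE", "KAD"]
co_subs = count_co_substitutions(toy_alignment, 0, 2)
print(co_subs)
```

In the toy alignment, the change from K/E to R/D keeps a positive and a negative residue paired, which is the kind of conserved electrostatic attraction described above. In a full analysis each event would additionally be associated with the physical distance between the two positions in the structural exemplar.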
Steps have been taken in the analysis to separate co-substitutions between residues on the surface from those in the protein interior, which first required defining the boundary between the surface and the interior. This has been done to investigate whether the co-substitution behaviour differs between the two solvation states. An investigation like this has not previously been reported in the literature.
To allow a rigorous classification of residues as being either surface or buried, a statistical analysis was conducted of the propensity for each of the proteinogenic amino acids to be solvent exposed. This has led to a number of interesting observations regarding the relationship between the solvent exposure measures ASA and HSEu, and some evidence of a correlation between solvent exposure preferences and substitution propensities.
The thesis has five chapters beyond the introduction. Chapter 2 introduces the main conceptual ideas of the analytical method, providing definitions of terms and a derivation of the analytical methods developed and used for this work. Chapter 3 details a bioinformatics project to build a database that merged the data from the Pfam-A, SwissProt and PDB/PiQSi databases. The database was used to ensure that the sequence and structure data used were from the same cellular environment. Chapter 4 presents the analysis of solvent exposure preferences for amino-acid types, seeking to determine a crossover point that can be used to define a set of surface residues and a set of buried residues. Chapter 5 presents the co-substitution analysis, indicating evidence for different co-substitution patterns among surface residues compared to buried residues. Chapter 6 closes the thesis with a summary of the main conclusions of this work.
Chapter 2
Development of analytical methods
2.1 Introduction of the statistical functions
The OE ratio is used in the methods of this thesis to determine the statistical preference for residue co-substitution with respect to distance, and the statistical preference for each of the twenty standard amino acids to be solvent exposed. In this chapter, an overview of the OE statistical method is given first. Secondly, the statistical phenomenon known as Simpson’s Paradox is discussed to explain and justify the approach used to conduct the statistical analysis. Thirdly, it is explained how the OE statistical method can be applied to the co-substitution and solvent exposure analyses respectively. Finally, the weighting of protein sequences is explained, with a discussion of Henikoff weighting, and a new method of weighting pairs of sequences developed for this work is presented.
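As orientation for the weighting discussion, the following is a minimal sketch of the standard position-based sequence weighting scheme of Henikoff and Henikoff. It does not cover the pair-weighting method developed for this work. The function name `henikoff_weights`, the treatment of gaps as an ordinary symbol and the normalisation of the weights to sum to one are assumptions made for the illustration.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for a list of aligned sequences.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct symbols in the column and n is the number of sequences
    sharing this sequence's symbol. Contributions are summed over columns.
    Assumptions: gaps are treated as an ordinary symbol, and the final
    weights are normalised to sum to one.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)  # distinct symbols in this column
        for idx, symbol in enumerate(column):
            weights[idx] += 1.0 / (k * counts[symbol])

    total = sum(weights)
    return [w / total for w in weights]


# Two identical sequences share weight; the divergent one is up-weighted.
example_weights = henikoff_weights(["AAA", "AAA", "CCC"])
print(example_weights)
```

The effect is that redundant sequences in an alignment are down-weighted, so that a heavily sampled subfamily does not dominate the observed counts.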