6.1.1 Nested Sampling of Proteins and Peptides
Nested sampling (NS) is a Bayesian algorithm designed to be particularly efficient at sampling systems that undergo a first-order phase transition. In Chapter 2 we parallelized the algorithm and, for the first time, used it to sample a biophysical system: the coarse-grained protein model CRANKITE.
We explored the potential energy landscapes of three small proteins and generated energy landscape charts, which give a large-scale visualization of the potential energy surface and show the protein folding funnel. We considered how the simulations behave when the NS algorithm parameters are changed. We also compared nested sampling to parallel tempering, using both methods to calculate the heat capacity of polyalanine.
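As an illustration of how a heat capacity can be obtained from nested sampling output, the following is a minimal sketch on a toy system, not the CRANKITE simulations described above. It assumes a two-dimensional potential E = x² + y² with a uniform prior on the disk E < e_max, so that the constrained prior can be sampled exactly (E is then uniform on [0, E_limit)). The function names, parameter values and the choice k_B = 1 are all hypothetical choices made for this example.

```python
import numpy as np

def nested_sampling_energies(n_live, n_iter, e_max, rng):
    """Toy NS run: returns the sequence of discarded ('dead') energies."""
    # For points uniform on a disk, E = r^2 is uniform on [0, e_limit).
    live = rng.uniform(0.0, e_max, size=n_live)
    dead = np.empty(n_iter)
    for i in range(n_iter):
        worst = np.argmax(live)                       # highest-energy live point
        dead[i] = live[worst]                         # record it as the new limit
        live[worst] = rng.uniform(0.0, dead[i])       # resample below the limit
    return dead

def heat_capacity(dead, n_live, beta):
    """Estimate C_v = beta^2 (<E^2> - <E>^2), in units of k_B."""
    k = np.arange(len(dead))
    # Phase-space weight of the k-th dead point, up to a constant factor:
    # w_k is proportional to (K / (K + 1))^k for K live points.
    log_w = k * np.log(n_live / (n_live + 1.0))
    log_p = log_w - beta * dead                       # add the Boltzmann factor
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    e_mean = np.sum(p * dead)
    e_var = np.sum(p * dead**2) - e_mean**2
    return beta**2 * e_var

rng = np.random.default_rng(0)
dead = nested_sampling_energies(n_live=200, n_iter=4000, e_max=100.0, rng=rng)
cv = heat_capacity(dead, n_live=200, beta=1.0)
```

For this toy potential the analytic configurational heat capacity is 1 (in units of k_B) whenever beta times e_max is large, which gives a simple check on the estimate. Because the weights do not depend on temperature, the same list of dead energies can be reused to evaluate the heat capacity at any beta.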
For more complicated protein models, which have more degrees of freedom per residue, the MC move set must also include moves such as angle bending and side-chain rotations. These moves, especially at low temperatures, are often inefficient when compared to the more widely used molecular dynamics (MD) approach. This will be particularly noticeable in cases where explicit solvent molecules are included, which is often the case for biophysical systems.
When using explicit solvent molecules with MD, collisions between solvent molecules exchange energy and can enable energy barriers to be crossed, whereas with MC, a large number of moves cause molecules to overlap, which can lead to a low acceptance rate and long decorrelation times. Therefore, in order for NS to gain popularity within the computational structural biology community, it is necessary to adapt the algorithm to work within an MD framework.
To that end, in Chapter 5, we adapted the nested sampling algorithm to be used within an MD framework by implementing Galilean exploration. We demonstrated the application of the algorithm by calculating heat capacity curves for an all-atom model of alanine dipeptide and compared the results to the standard replica exchange approach. We calculated the dihedral angle free energy surface of alanine dipeptide both in vacuo and in implicit solvent, and used the surface to compare the latest Amber force field to previous computational and experimental work.
Finally, we discussed the theoretical behaviour of Galilean nested sampling, REMD and an alternative nested sampling algorithm, which uses canonical trajectories, for systems which undergo a first-order phase transition. After incorporating an appropriate semimetric, Galilean exploration should allow NS to be used with more realistic force fields where there is often no efficient MC move set.
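The basic idea of Galilean exploration can be sketched as follows. This is a generic version of the move (free flight, with a reflection off the energy contour when the constraint is violated), written for a hypothetical harmonic potential; it is not the MD implementation of Chapter 5, and the function names, step size and potential are assumptions made for illustration.

```python
import numpy as np

def energy(x):
    # Hypothetical toy potential: isotropic harmonic well.
    return 0.5 * np.dot(x, x)

def gradient(x):
    return x

def galilean_step(x, v, e_limit, dt):
    """One Galilean move subject to the hard constraint energy(x) < e_limit."""
    x_new = x + dt * v
    if energy(x_new) < e_limit:
        return x_new, v                        # free flight stays inside
    # Outside the constraint: reflect the velocity off the energy contour.
    n = gradient(x_new)
    n = n / np.linalg.norm(n)
    v_ref = v - 2.0 * np.dot(v, n) * n
    x_ref = x_new + dt * v_ref
    if energy(x_ref) < e_limit:
        return x_ref, v_ref                    # reflected move is accepted
    return x, -v                               # otherwise stay and reverse

def explore(x0, e_limit, n_steps, dt, rng):
    """Run n_steps Galilean moves and return the visited positions."""
    x = x0.copy()
    v = rng.normal(size=x.shape)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x, v = galilean_step(x, v, e_limit, dt)
        trajectory.append(x.copy())
    return np.array(trajectory)

rng = np.random.default_rng(0)
traj = explore(np.full(3, 0.1), e_limit=1.0, n_steps=500, dt=0.1, rng=rng)
```

Every accepted position satisfies the energy constraint by construction, and the reflection preserves the speed. In practice the velocity would also be resampled periodically; that is omitted here to keep the sketch short.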
6.1.2 Contrastive Divergence and Protein Force Field Parameter Optimization
In this work we have substantially improved CRANKITE, a coarse-grained protein model. In Chapter 2 we added side-chain γ-atoms to the model, together with an MC side-chain dihedral angle rotation move. We also improved the energy function by adding a hydrophobic energy term and tuning the functional forms of existing energy terms.
In Chapter 3 we focussed on optimizing the parameters of the CRANKITE force field. We used a maximum likelihood approach, optimizing the force field parameters such that the likelihood of a training set, consisting of experimentally-derived protein crystal structures, is maximized.
In order to avoid the expensive calculation of ensemble averages, we used a statistical machine learning technique, contrastive divergence. In comparison to other maximum likelihood approaches, the efficiency of our algorithm allows a larger training set to be used, and we have shown that the optimized force field is transferable to a protein not included in the training set.
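The way contrastive divergence sidesteps full ensemble averages can be illustrated with a minimal sketch. It assumes a hypothetical one-parameter model with energy E(x; θ) = θx²/2, not the CRANKITE force field: the model average of ∂E/∂θ in the maximum likelihood gradient is replaced by an average over configurations obtained from a few Metropolis steps started at the training data. All names, step sizes and the learning rate are assumptions chosen for this example.

```python
import numpy as np

def metropolis_steps(x, theta, n_steps, step_size, rng):
    """Apply a few Metropolis moves to every sample in x at once."""
    x = x.copy()
    for _ in range(n_steps):
        proposal = x + rng.normal(0.0, step_size, size=x.shape)
        delta_e = 0.5 * theta * (proposal**2 - x**2)
        # Accept with probability min(1, exp(-delta_e)).
        accept = rng.random(x.shape) < np.exp(-np.maximum(delta_e, 0.0))
        x = np.where(accept, proposal, x)
    return x

def train_cd(data, theta0, n_epochs, learning_rate, rng, k=5, step_size=0.5):
    """Fit theta by contrastive divergence with k Metropolis steps (CD-k)."""
    theta = theta0
    for _ in range(n_epochs):
        # Short chains started from the data stand in for the model ensemble.
        perturbed = metropolis_steps(data, theta, k, step_size, rng)
        # Approximate log-likelihood gradient:
        # <dE/dtheta>_perturbed - <dE/dtheta>_data, with dE/dtheta = x^2 / 2.
        grad = 0.5 * np.mean(perturbed**2) - 0.5 * np.mean(data**2)
        theta += learning_rate * grad
    return theta

rng = np.random.default_rng(1)
data = rng.normal(0.0, 0.5, size=5000)     # synthetic 'training set'
theta = train_cd(data, theta0=1.0, n_epochs=800, learning_rate=1.0, rng=rng)
```

In this toy model the maximum likelihood solution is known in closed form, θ = 1/var(data), so the fitted value can be checked directly. The update vanishes on average once the short chains no longer drift away from the data, which is the sense in which the training set itself serves as an approximate sample from the model.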
In Chapter 3 we placed particular emphasis on the van der Waals energy term. We optimized parameter values for both a cheap, hard cutoff function and a more expensive 12–6 LJ functional form, and we compared them to parameters taken from ‘standard’ molecular dynamics force fields. The comparison was based on the observed distributions of bond angles, atomic distances, backbone dihedral angles and hydrogen bonding patterns. We also calculated the heat capacities of polyalanine and observed the different turn types found when folding a β-hairpin. We demonstrated the importance of optimizing the parameters of the force field rather than taking values found in the literature.
In Chapter 3 we also discussed the contrastive divergence procedure as applied to force field parameter inference, its behaviour, the assumptions it relies upon and the effect of changing the quality of the training set.
6.1.3 β-Contact Prediction and Correlated Mutation Analysis
In Chapter 4 we developed a protein β-contact prediction algorithm whose predictions can be used as inputs to CRANKITE when the native protein structure is unknown. We developed an empirical Bayes β-sheet model which encodes the strong constraints and prior knowledge associated with β-contacts. We coupled the model to the direct information (DI), a powerful maximum entropy-based correlated mutation statistic.
Unlike the majority of correlated mutation analysis research, we did not specifically choose proteins with large, high-quality multiple sequence alignments (MSAs) for analysis; instead we used a standard dataset of 916 proteins used to benchmark β-contact prediction algorithms. We show that the DI statistic contains useful information even when smaller autogenerated MSAs are used, and that, according to our benchmarked results, the DI is as informative as inputting the entire MSA into a neural network or Markov random field when predicting β-contacts.
Finally, tying this work in with the rest of the thesis, we show that the β-contact predictions can be used within a tertiary structure prediction pipeline by using them as inputs to CRANKITE, enabling it to successfully determine the folds of two previous CASP targets.