Typically, a protein is useless, and sometimes harmful, unless it folds into its generally unique shape. In nature, folding can complete on a timescale as short as microseconds even though the number of possible conformations is astronomically large; this apparent paradox remains an open question (Dill & Chan, 1997; Dill & MacCallum, 2012; Levinthal, 1968; Zwanzig, Szabo, & Bagchi, 1992). Although the final structure is the most crucial outcome, the folding pathway has also been studied thoroughly because it may reveal important clues, mainly to help computational biologists mimic the real trajectory towards the native structure (Dill, 1985; Voelz, Bowman, Beauchamp, & Pande, 2010). Probably the first finding in this regard for globular proteins is that hydrophobic amino acids tend to sit in the centre of the structure, away from the surrounding water molecules, whilst hydrophilic ones prefer to stay in contact with the external aqueous environment (see Figure 1.1).
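The size of the conformational search space mentioned above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the number of states per residue (3) and the sampling rate (10^13 conformations per second) are assumptions chosen for the example, not values taken from the cited works.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Estimate the cost of an exhaustive conformational search.

    Both default parameters are illustrative assumptions.
    Returns (number_of_conformations, years_needed).
    """
    conformations = states_per_residue ** n_residues
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_estimate(100)
print(f"{conformations:.2e} conformations, roughly {years:.2e} years to enumerate")
```

Even under these generous assumptions, exhaustive enumeration for a 100-residue chain would take vastly longer than the observed folding times, which is why random search cannot explain folding.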
Studies and theories on protein folding have been published from different scientific perspectives: chemical, physical and biological (Dill, Ozkan, Shell, & Weikl, 2008; Luo, 2014; Scheraga, 2015). Computational techniques have played a key role by simulating the process time step by time step, modelling how atoms interact according to Newton’s second law. Although successful attempts have been recorded, only very powerful supercomputers and grid computing systems have been able to complete such experiments (Tyka et al., 2011; Voelz et al., 2010).
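To illustrate what simulating motion under Newton’s second law involves, the following minimal sketch integrates F = ma for a single particle on a spring using the velocity Verlet scheme. This is a toy one-dimensional system, not a protein simulation; the function names, the spring constant and the time step are assumptions made for the example.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's second law (a = F / m) for one particle in 1D.

    Returns the list of visited positions and the final velocity.
    """
    a = force(x) / mass
    trajectory = [x]
    for _ in range(n_steps):
        # Advance the position using the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Recompute the acceleration at the new position.
        a_new = force(x) / mass
        # Advance the velocity using the average of old and new accelerations.
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append(x)
    return trajectory, v


SPRING_CONSTANT = 1.0  # hypothetical value, arbitrary units


def spring_force(x):
    """Hooke's law, standing in for a real force field."""
    return -SPRING_CONSTANT * x


positions, final_velocity = velocity_verlet(
    x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, n_steps=1000
)
total_energy = 0.5 * final_velocity ** 2 + 0.5 * SPRING_CONSTANT * positions[-1] ** 2
print(f"final position {positions[-1]:.4f}, total energy {total_energy:.6f}")
```

A real simulation applies the same update to every atom in three dimensions, with forces derived from a force field, which is where the computational cost comes from.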
Figure 1.1: Pictorial description of globular protein folding. The left part represents the primary sequence, i.e. the linear chain of amino acids, whereas the right part shows the folded structure. The black-filled, white-filled, dark grey-filled and light grey-filled spheres represent the hydrophobic amino acids, the hydrophilic amino acids, the C terminus and the N terminus respectively. For the sake of simplicity, the figure is shown in 2D.
Christian Anfinsen, one of the pioneers in the field of protein structure, formulated two notable theories: the first states that the native structure is the one with the lowest free energy (Anfinsen, Haber, Sela, & White, 1961); the second describes protein folding as a purely physical process, i.e. the tertiary structure is determined solely by the sequence of amino acids (Anfinsen, 1973), see Figure 1.2. These two principles form the basis of the most challenging computational technique, known as ab initio protein structure prediction. From the perspective of the first theory, Protein Structure Prediction (PSP) is an optimisation problem in which the energy function acts as a heuristic that guides the search towards the global energy minimum in a tremendous search space. Anfinsen’s second theory has paved the way to computing an approximate value of the interactions that take place amongst the atoms and amino acids without taking any external effects into consideration.
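Viewing structure prediction as energy minimisation can be illustrated with a deliberately tiny model. The sketch below uses a two-dimensional hydrophobic-polar (HP) lattice model, which is a swapped-in toy and not the energy function used by any method discussed here: each residue is either H (hydrophobic) or P (polar), and every pair of H residues that are lattice neighbours without being chain neighbours contributes -1 to the energy. All names are hypothetical, and exhaustive enumeration is only feasible for very short chains.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def hp_energy(sequence, coords):
    """Return -1 per pair of non-consecutive H residues on adjacent lattice sites."""
    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 skips chain neighbours and counts each pair once.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def best_fold(sequence):
    """Enumerate all self-avoiding walks and return (lowest_energy, coordinates)."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two residues")
    best = {"energy": float("inf"), "coords": None}

    def extend(coords):
        if len(coords) == len(sequence):
            e = hp_energy(sequence, coords)
            if e < best["energy"]:
                best["energy"], best["coords"] = e, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in coords:  # keep the walk self-avoiding
                coords.append(nxt)
                extend(coords)
                coords.pop()

    # Fix the first bond to remove rotational symmetry.
    extend([(0, 0), (1, 0)])
    return best["energy"], best["coords"]


energy, coords = best_fold("HPHPPHHPHH")
print(energy, coords)
```

The number of walks grows exponentially with chain length, so practical methods replace the exhaustive loop with a guided search over the same kind of energy landscape.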
Computational approaches for determining a protein’s structure fall into two main groups: template-based and template-free modelling. The first relies mainly on the known protein structures deposited in the world’s largest repository, the Protein Data Bank (PDB), and tries to find either some level of sequence-sequence similarity or sequence-structure compatibility between a template conformation and the target in question. The second group, template-free modelling (also known as ab initio), relies solely on both of Anfinsen’s theories. Ab initio approaches are closer to the natural case than template-based methods, where most advances have focused on detecting more remote homologues and on better modelling of sequence-structure compatibility. Template-based techniques are in turn divided into two sub-categories: homology modelling and threading.
Figure 1.2: A depiction of Anfinsen’s experiment; the native structure (top left) was denatured to form two inactive shapes (bottom left and top right). Both were again biologically activated (renatured) and the protein returned to its native shape. Taken from (Amani & Naeem, 2013).
Homology modelling, or comparative modelling, is considered the simplest way to build the target in question and is based on a long-standing hypothesis: similar sequences imply similar structures (Browne et al., 1969). Whenever sequence similarity exceeds 30%, models with good accuracy are typically expected (Lam, Das, Sillitoe, & Orengo, 2017). Figure 1.3 shows the process of comparative modelling; the sequence alignment was performed using ClustalW (Larkin et al., 2007) and the model was built using MODELLER (Webb & Sali, 2014).
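As a rough illustration of the 30% rule of thumb above, the sketch below computes percent identity from two pre-aligned sequences and checks it against a threshold. It uses identity as a simple stand-in for similarity and divides by the number of gap-free aligned columns; other conventions for the denominator exist, so the exact value depends on that assumption. The function names are hypothetical and this is not how ClustalW or MODELLER report their scores.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def is_template_candidate(aligned_target, aligned_template, threshold=30.0):
    """Apply the rule of thumb: identity above the threshold suggests a usable template."""
    return percent_identity(aligned_target, aligned_template) > threshold


print(percent_identity("ACDE-", "ACDFG"))  # 3 matches over 4 gap-free columns
print(is_template_candidate("ACDE-", "ACDFG"))
```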
Fold recognition, or threading, is a more complicated and computationally expensive form of template-based modelling and is typically used whenever comparative modelling fails: even if no significant sequence similarity has been detected, a target may still fit one of the known structures (Chothia, 1992). Threading techniques place the target’s amino acids onto known structures and use fitness scores to evaluate how compatible those structures are. Most threading techniques model only the core regions rather than the whole target; see Figure 1.4.
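The idea of scoring how well a target fits known structures can be sketched with a toy example. Here each template core is reduced to a string of environments (B for buried, E for exposed), the target is slid along it without gaps, and a placement scores +1 whenever a hydrophobic residue lands in a buried site or a polar residue in an exposed one, and -1 otherwise. The scoring rule, the hydrophobic residue set and the template names are assumptions for illustration; real threading methods use far richer energy functions and allow gaps.

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed classification; scales differ


def position_score(residue, environment):
    """+1 if the residue type matches the site environment, -1 otherwise."""
    is_hydrophobic = residue in HYDROPHOBIC
    is_buried = environment == "B"
    return 1 if is_hydrophobic == is_buried else -1


def thread(target, template_profile):
    """Return (best_score, offset) over all gapless placements of the core."""
    n, m = len(target), len(template_profile)
    if m > n:
        raise ValueError("template core is longer than the target")
    best_score, best_offset = None, None
    for offset in range(n - m + 1):
        score = sum(
            position_score(target[offset + k], template_profile[k]) for k in range(m)
        )
        if best_score is None or score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


def best_template(target, templates):
    """Pick the template whose best placement has the highest score."""
    results = {name: thread(target, profile) for name, profile in templates.items()}
    name = max(results, key=lambda key: results[key][0])
    return name, results[name]


templates = {"core_a": "BBBB", "core_b": "EEEE"}  # hypothetical template library
print(best_template("KKLLVVKK", templates))
```

Target residues outside the chosen window are left unmodelled here, which mirrors the point that only the core region is built by threading.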
Figure 1.3: A simplified pictorial illustration of the homology modelling process. The top left part shows the target sequence as well as a template structure that was chosen due to the high sequence alignment similarity shown on the top right. The native structure of the target T0295 is shown between square brackets whereas the built one using the template is displayed next to it. Taken from (di Luccio & Koehl, 2011).
Figure 1.4: Simplified threading process. The best core template is chosen based on the score of the energy function. The size of the query sequence is n, whereas the size of the core template used for threading is m. Since n is larger than m, the remaining regions are built using different techniques such as ab initio. Taken from (Ngom, 2006).
Ab initio methods are by far the hardest approach to PSP. As their name implies, these approaches build proteins from scratch. “Standard” ab initio algorithms mimic the natural folding process using Force Fields (FF), an approximation of the quantum mechanical representation of the interactions amongst atoms; however, due to their high computational cost, their usage has been limited to small proteins (Khoury, Smadbeck, Kieslich, & Floudas, 2014).
An in-between approach that combines the strengths of both template-based and ab initio modelling is fragment-based protein structure prediction. It can predict template-free targets but is not as computationally expensive as “pure” ab initio methods. Instead of a single amino acid, the unit of construction is a short sequence of amino acids treated as a rigid part. Such approaches have been the subject of very active research aimed at improving them, as they ranked best in the latest blind competition: round 12 of the Critical Assessment of protein Structure Prediction (CASP12) in 2016.
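The fragment-as-rigid-unit idea can be sketched as a Monte Carlo search in torsion-angle space: at each step a fragment of backbone (phi, psi) angles is copied over a random window of the chain, and the move is accepted with the Metropolis criterion. Everything below is an illustrative assumption: the fragment library, the placeholder energy (which simply counts non-helical residues) and all names are hypothetical, and real methods select fragments per position from known structures and score with much richer energy functions.

```python
import math
import random

HELIX = (-60.0, -45.0)
EXTENDED = (-150.0, 150.0)


def toy_energy(angles, tolerance=30.0):
    """Placeholder energy: number of residues outside a box around helical angles."""
    return sum(
        1
        for phi, psi in angles
        if abs(phi - HELIX[0]) > tolerance or abs(psi - HELIX[1]) > tolerance
    )


def fragment_assembly(n_residues, library, energy, n_steps=500, temperature=1.0, seed=0):
    """Metropolis fragment insertion; assumes all fragments share one length."""
    rng = random.Random(seed)
    frag_len = len(library[0])
    angles = [EXTENDED] * n_residues  # start from an extended chain
    current_e = energy(angles)
    best_angles, best_e = list(angles), current_e
    for _ in range(n_steps):
        fragment = rng.choice(library)
        start = rng.randrange(n_residues - frag_len + 1)
        # Insert the whole fragment as one unit.
        trial = angles[:start] + list(fragment) + angles[start + frag_len:]
        delta = energy(trial) - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            angles, current_e = trial, current_e + delta
            if current_e < best_e:
                best_angles, best_e = list(angles), current_e
    return best_angles, best_e


LIBRARY = [
    [HELIX] * 3,
    [(-120.0, 130.0)] * 3,
    [(-60.0, -45.0), (-90.0, 0.0), (60.0, 40.0)],
]

best_angles, best_e = fragment_assembly(9, LIBRARY, toy_energy, seed=1)
print(best_e, best_angles)
```

Because each move changes several residues at once, the search takes larger steps through conformational space than a residue-by-residue approach would.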