
The main aim of this study is to improve fragment-based protein structure prediction, taking advantage of the state-of-the-art Rosetta tool (Leaver-Fay et al., 2011), by incorporating new parameters and criteria whenever template structures and fragments are chosen to build complete conformation models. The fundamental objectives are concisely presented as follows.

• Creating smaller but more relevant template structure sets, so that the search focuses on the most promising parts of the search space:

To sample as many decoys as possible, and therefore cover as much of the search space as possible, fragment-based approaches rely on a relatively large set of PDB structures. For instance, Rosetta uses a group of 16,800 template structures with an average size of 257 amino acids; this number has allowed Monte Carlo simulations to cover a large number of possible conformations (Gront, Kulp, Vernon, Strauss, & Baker, 2011). However, many of the decoys produced are quite far from the native structures even though they represent local energy minima (Kim, Blum, Bradley, & Baker, 2009). In this thesis, we will show that in many cases using only 20% of Rosetta’s standard set of template structures not only produces decoys of better quality, but also reduces the number of irrelevant regions where search trajectories end.

• Incorporating protein structural class prediction as an additional and valuable criterion for the selection of fragments:

In all state-of-the-art fragment-based tools such as I-TASSER (Y. Zhang, 2008), FragFold (Jones, 2001) and Rosetta (Das & Baker, 2008), the procedure for the selection of fragments relies mainly on different techniques of sequence alignment and on additional criteria, such as secondary structure prediction and Ramachandran map probabilities. In this thesis, we present a new factor that restricts the usage of fragments based on the structural class of their sources, which should match the structural class predicted for the target. This factor plays the role of a “preliminary filter” applied to all fragments before Rosetta’s standard filters; tangible improvements were recorded over state-of-the-art prediction methods.

• Taking advantage of the protein sequence-structure correlation associated with the various secondary structures to create customised fragment files, where the number of candidate fragments varies based on the predicted secondary structure so that the effort in modelling each region of the target is customised according to its needs:

Rosetta uses 25 9-mers and 200 3-mers for each position in the target, from which fragments are chosen randomly whilst the conformations are being built. Such a strategy, i.e. adopting the same number of fragments at each position, is common amongst all popular fragment assembly methods. However, independent and thorough studies have shown that the sequence-structure relationship is not uniform across secondary structures, particularly for short sequences (Bystroff, Simons, Han, & Baker, 1996; de Oliveira, Shi, & Deane, 2015; Fiser et al., 2000; Sibanda & Thornton, 1985; Vanhee et al., 2011). Owing to that known fact, we have developed a novel approach to build fragment files so that, for instance, the number of candidates falls sharply whenever an alpha helix, an easier protein substructure to predict, is predicted to occur along the length of the fragment.

• Applying an appropriate “amount” of corrections and tuning to an initial model to prevent excessive changes which may have a damaging effect:

Rosetta uses 200 3-mers for each position in the target of interest in an attempt to explore neighbouring regions. However, we will demonstrate in this thesis that, for a category of targets, such a large number of 3-mers “damages” parts of the conformation that had already reached a decent accuracy during the coarser structure prediction phase.

• Tackling energy function inaccuracies by narrowing the size of the explored area:

The exploration-exploitation trade-off is a common issue in all optimisation problems, especially in protein structure prediction using fragment assembly techniques (Simoncini, Schiex, & Zhang, 2017). Reaching a fair compromise between the two not only raises the probability of reaching “good” regions in the search space but also narrows the gap between the decoys with low energy scores and the decoys with high accuracy. In this research, we have decreased the level of exploration in each of our three contributions, each time in a different way, which makes the selection of the first models (the models associated with the lowest energy scores) a more accurate process.
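The “preliminary filter” described in the second objective above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis or any part of Rosetta: the `Template` type, the class labels and the template identifiers are all hypothetical, and the sketch assumes that each template structure already carries a structural class label and that the target’s class has been predicted beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical record for one template structure (not a Rosetta type)."""
    template_id: str
    structural_class: str  # assumed labels: "alpha", "beta", "alpha-beta"


def prefilter_templates(templates, predicted_class):
    """Keep only templates whose structural class matches the class
    predicted for the target; fragments would then be picked from the
    surviving templates using the usual selection criteria."""
    return [t for t in templates if t.structural_class == predicted_class]


# Toy data with made-up identifiers, for illustration only.
templates = [
    Template("tmpl_a", "alpha"),
    Template("tmpl_b", "beta"),
    Template("tmpl_c", "alpha-beta"),
    Template("tmpl_d", "alpha"),
]

kept = prefilter_templates(templates, "alpha")
kept_ids = [t.template_id for t in kept]
print(kept_ids)
```

Because the filter runs before any fragment is scored, it shrinks the template set itself rather than re-ranking fragments afterwards, which is how the text describes the reduction in exploration.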


1.3 Scientific Contribution

This thesis presents novel ideas that lead to improvements over standard Rosetta protein structure predictions. They result in the following scientific contributions:

• A novel fragment selection process where the usage of template structures is restricted to those that share the same structural class prediction as the target (chapter 4). This novel idea was the basis of our contribution to CASP12, under the group name “Rosetta_at_Kingston”, where we had the opportunity to compete against the official Rosetta research group and were able to show better results for 40% of the targets despite the huge gap between their computational and human resources and ours.

• A structure refinement process depending on a target’s structural class prediction (chapter 5). We have shown that the standard number of 3-mers, i.e. 200, which is used in the structure refinement phase, is not only unnecessary but also destructive for alpha and alpha-beta proteins. Indeed, for those classes, the main 9-mer insertion phase is sufficient to “deliver” conformations close to the native-like structure. As a consequence, refinement only needs a light touch.

• A protein structure prediction process which takes advantage of the sequence-structure correlation present amongst the three different secondary structures to select the number and diversity of possible fragment alternatives (chapter 6). This correlation has been established for a long time (Sibanda & Thornton, 1985): alpha helices are very “conservative”, loops tend to have a large range of structural variety, and beta strands are somewhere in between. As a consequence, it is proposed that fragments predicted to be pure helices, pure strands or pure loops should have, in that order, an increasing number of available candidates. Adopting this novel approach to create fragment files allows the structure prediction process to focus on complex regions by conducting extensive fragment insertions, while limiting the number of insertions in the simpler regions.
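As a rough illustration of the last contribution, the following Python sketch assigns a number of candidate 9-mers to each fragment start position from a predicted secondary structure string. The per-class counts are hypothetical values chosen only to respect the ordering described above (helix < strand < loop); the thesis’ actual values are not given in this chapter. Falling back to the standard 25 candidates for windows of mixed secondary structure is also an assumption of this sketch.

```python
STANDARD_9MERS = 25  # per-position count quoted in the text for Rosetta

# Hypothetical counts for windows predicted to be a single secondary
# structure: H = helix, E = strand, L = loop. Illustrative values only.
PURE_WINDOW_COUNTS = {"H": 5, "E": 15, "L": 25}


def candidates_per_position(ss_prediction, frag_len=9, default=STANDARD_9MERS):
    """Return one candidate count per fragment start position.

    A window made of a single predicted state gets the count for that
    state; a mixed window falls back to the default (an assumption).
    """
    counts = []
    for start in range(len(ss_prediction) - frag_len + 1):
        window = ss_prediction[start:start + frag_len]
        if len(set(window)) == 1:
            counts.append(PURE_WINDOW_COUNTS.get(window[0], default))
        else:
            counts.append(default)
    return counts


# Toy prediction: 10 helix residues, 9 loop residues, 9 strand residues.
ss = "H" * 10 + "L" * 9 + "E" * 9
counts = candidates_per_position(ss)
print(counts)
```

With these illustrative numbers, the two pure-helix windows at the start of the toy target receive far fewer candidates than the loop and mixed windows, which mirrors the idea of spending the insertion effort on the harder regions.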


1.4 Thesis Outline

This thesis is divided into 7 chapters as follows:

In this chapter, we have presented a concise description of the novelty of our findings after a short definition of the problem.

Chapter 2 is dedicated to a thorough literature review of protein folding and protein structure prediction. An earlier version of this chapter was published as a book chapter by John Wiley and Sons (Abbass, Nebel, & Mansour, 2013).

Chapter 3 describes a popular and challenging fragment-based protein structure prediction method called Rosetta (Lyskov et al., 2013) that has been ranked the best amongst its competitors.

Chapter 4 proposes the adoption of a small but customised template structure set for each group of targets that share some “global” properties. We have demonstrated that, based on the target’s structural class prediction, an ad hoc fragment library should be built to produce better decoys by exploring less but exploiting more deeply the regions that are likely to be near the native-like structure. This work has been published as an article in BMC Bioinformatics (Abbass & Nebel, 2015).

Chapter 5 relies on the same principle as chapter 4, i.e. the structural class of the target is predicted first. Here, however, the standard fragment library is kept and the number of 3-mers is adjusted according to the predicted class. This contribution has been published as a journal paper in Protein and Peptide Letters (Abbass & Nebel, 2017).

Chapter 6 illustrates how secondary structure prediction can play an additional role besides its original one as a factor in choosing a fragment: the number of available fragments varies between the target’s positions based on the secondary structure that the target is likely to adopt starting at each position. This work will be the core of a future publication in a bioinformatics journal.
