Sequence and structure-based bioinformatic tools to the characterization, clustering and

(1)

ADVERTIMENT. Lʼaccés als continguts dʼaquesta tesi queda condicionat a lʼacceptació de les condicions dʼús establertes per la següent llicència Creative Commons: http://cat.creativecommons.org/?page_id=184 ADVERTENCIA. El acceso a los contenidos de esta tesis queda condicionado a la aceptación de las condiciones de uso establecidas por la siguiente licencia Creative Commons: http://es.creativecommons.org/blog/licencias/

WARNING. The access to the contents of this doctoral thesis it is limited to the acceptance of the use conditions set by the following Creative Commons license: https://creativecommons.org/licenses/?lang=en

Santiago Rios Azuara

(2)

Doctoral Thesis

Sequence and structure-based bioinformatic tools to the characterization, clustering and

modeling of G-protein-coupled receptors (GPCRs)

Santiago Rios Azuara Laboratory of Computational Medicine, Universitat Autònoma de Barcelona

(3)

(4)

Doctoral thesis

PhD program in Biochemistry, Molecular Biology and Biomedicine.

Sequence and structure-based bioinformatic tools to the characterization, clustering and

modelling of G-protein-coupled receptors (GPCRs)

Report submitted by Santiago Rios Azuara to obtain the degree of doctor in Biochemistry, Molecular Biology and Biomedicine.

Directors: Gianluigi Caltabiano, Angel Gonzalez Wong Tutor: Mireia Duñach Masjuan

Laboratory of Computational Medicine (LMC), Medicine Faculty,

Universitat Autònoma de Barcelona (UAB) Bellaterra, 2017

(5)

(6)

Abstract

In this work, we developed new bioinformatic tools for the study of G- protein-coupled receptors (GPCRs). The pharmacological importance of these receptors motivates the development of alternative methods to assist their classification, pharmacological identification and comparative modeling. Based on the recent advances in GPCRs crystalization, a new multiple sequence alignment strategy that incorporates irregularities observed on the receptor structures was proposed. The developed structure-based sequence alignment was used to update the GPCRs classification with significant advantages compared to previous studies.

The recent structural data was also used to improve the analysis of the orthosteric binding site through the classification of the receptors in function of the ligand binding similarity. In specific, as part of this thesis we have developed a novel substitution matrix specifically derived from GPCRs (GPCRtm) and the web application (GPCR-Browser) that permits easier comparison of receptor sequence within subfamilies.

(7)

(8)

Acknowledgements

During the last years, I had the opportunity to participate at the research group under the leadership of Prof. Leonardo Pardo and Prof. Mercedes Campillo. I want as well to thank the rest of LMC group mates. With them, I enjoyed many good moments and I spent many other significant experiences, especially Arnau for “others moments”. On my future, I will feel awkward when I will do the lunch out of the lab. In particular, I want to thank my thesis directors, Dr. Gianluigi Caltabiano and Dr. Angel Gonzalez for their collaboration. I learned many thinks with them; this project would not be possible without you.

It is important to mention Prof. Mousseau and the rest of the colleagues in Montréal, Louis, Mickael and Daniel. Thanks for your company during my Canadian experience. I also want to thank my chemist teacher at the high school, Encarna, who introduced me in research. Finally, I want to thank my family for their support.

(9)

(10)

1 INTRODUCTION 1

B7B #)&)!#&(*",')&)!#&#'*),.() 8*,).#(8)/*&,*.),- C

B7C "-#!(&.,(-/.#)(#(-#."&& D

B7D -&--# #.#)( E

1.3.1 Class A GPCRs E

1.3.2 Other GPCR classes F

B7E .,/./,&#(-#!".-)(- H

1.4.1 Class A GPCRs H

1.4.2 Other GPCR classes AA

B7F )&/&,-1#."-#(- BD

B7G &/-.,#(!'.")- ),- BF

1.6.1 Pharmacological property approaches AE

1.6.2 Phylogenetic analysis approaches AF

1.6.3 Profile and pattern approaches AH

1.6.4 Chemogenomic of the TM binding cavity approaches AI

1.6.5 Protein-ligand fingerprint approaches BA

B7H )'*/..#)(&.))&- ),."-./3)  CB 2 OBJECTIVES 23

3 ^{METHODS 25}

D7B '#()#-+/(,.,#0&(&#!('(.) ."&-- '#&3 CF

3.1.1 Sequence datasets employed BE

C;A;A;A '... BE

C;A;A;B 0()'..)*)<*'!/*-4- +/*-. BE

3.1.2 Post-processing of multiple sequence alignments sets according to structural

information BF

D7C )(-.,/.#)() '#()#-/-.#./.#)('.,#2 CG

3.2.1 Construction of GPCRtm BF

3.2.2 Evaluation of the GPCRtm in database searching and pairwise alignments BH

D7D 0&)*'(.) ."&/-.,#(!'.")- ),- CJ

C;C;A Clustering GPCRs according to TM regions BI

C;C;B Clustering GPCRs according to ligand-binding site residues C@

D7E -#!((#'*&'(..#)() .",)1-,1**&#.#)( DB

(11)

4 RESULTS AND DISCUSSION 33

E7B -) -.,/./,&%()1&!#(."#'*,)0'(.) -+/(&#!('(.-)

- DD

E7C )(-.,/.#)() ('#()#-/-.#./.#)('.,#2 ),."&--- DF

4.2.1 Amino acid compositional bias in the class A GPCRs CF

4.2.2 Development of the GPCRtm matrix CH

D;B;B;A 0)/$*)'.$($'-$/$ .*!($)*$.$)/(;*(+-$.*)2$/#*/# -

.0./$/0/$*)(/-$ . CI

D;B;B;B 1'0/$*)*!/# /((/-$3 DD

4.2.3 Conclusions and perspectives DH

E7D &/-.,#(!) &---/-#(!-.,/./,&,#0#( ),'.#)( EJ

4.3.1 Clustering based on phylogenetic reconstruction of TM domains DI

D;C;B Clustering based on ligand binding pocket residues EB

D;C;B;A !$)$/$*)*!" ) -$'$")$)$)"!*-'... EC

D;C;B;B (+-*1 ( )/.*!/# " ) -$'$")$)$)".$/ !$)$/$*) ED

D;C;B;C )'4.$.*!/# '$")<$)$)".$/ - .$0 . EE

D;C;B;D '0./ -$)"4.$($'-$/4(/-$3*!/# $)$)".$/ - .$0 . EF

4.3.3 Conclusion and Perspectives FA

E7E 0&)*'(.) )'*/..#)(&.))&- ),."-./3) &--- GC

4.4.1 GPCR Browser FB

D;D;A;A *)/ )/)0/$'$/4 FB

D;D;A;B . ,0 ) '..$!$/$*) FD

D;D;A;C $")<$)$)".$/ - +- . )//$*) FE

D;D;A;D $")<$)$)".$/ '..$!$/$*) FF

D;D;A;E #4'*" ) /$<. / (+'/ . ' /$*)/**'!*-#*(*'*"4(* ''$)" FG

4.4.2 Applications in pharmacological studies FI

4.4.3 Applications in other web-developed bioinformatics resources GA

4.4.4 Conclusion and perspectives GA

5 SUMMARY OF THE NOVELTIES DERIVED FROM THIS WORK 73

REFERENCES 75

APPENDIX 87

#-.) /&#.#)(- IH

#!/,-(.&- II

LIST OF ABBREVIATIONS 97

(12)

1 Introduction

The great advances in the genome sequencing, as well as in X-Ray structure determination techniques produced during the last decade, have brought to the scientific community a large amount of data about the sequences and structures of thousands of proteins. This information can effectively be used for medical and biological research with the development of adequate tools for their analysis and interpretation. In this regard, computational techniques may help us reach this goal.

Bioinformatics methods are among the most powerful technologies available in life sciences. The use of statistical analysis on protein sequences and structures gives us a better understanding of their biological function, physiological roles and evolutionary relationships.

There is a vast literature on successful applications of bioinformatic methodologies in the solution of biological problems, and what is more, there is an increasing need to integrate all the new knowledge in more accurate computational tools.

The main aim of this work is the development of bioinformatics techniques to characterize and study proteins with therapeutic importance. We focused our attention on the G protein-coupled receptors (GPCRs), one of the most relevant proteins families in drug discovery, with a well- recognized importance in clinical medicine. These receptors are essential in cell physiology, and their malfunction is commonly translated into pathological outcomes.

(13)

1.1 Biological and pharmacological importance of G-protein- coupled receptors

Cells are able to detect chemical signals present in their external environments by means of different classes of plasma membrane proteins, being the superfamily of G-protein-coupled receptors (GPCRs) one of the largest and most studied [1]. The origin of the GPCRs is presumed ancestral on account of their presence in most eukaryotic organisms including insects and plants. These heptahelical membrane-spanning receptors are highly diversified in mammalian genomes with current estimates of about one thousand genes, depending on the species [2].

GPCRs are involved in the translations of several endogenous and exogenous signals in cellular responses, modulating physiological processes as diverse as neurotransmission, cellular metabolism, inflammatory and immune responses, secretion, differentiation and vision among others [3]. These receptors are activated by a vast chemical diversity of natural and synthetic ligands including biogenic amines, neuropeptides, phospholipids, glycoproteins, nucleosides, nucleotides, amino acids, polypeptide hormones, odorants, ions and photons [4].

Considering the vast amount of cellular processes regulated by the GPCRs system, its wide tissue distribution and accessibility from the extracellular environment [5], it constitutes an attractive pharmaceutical target and accounts for around 30% of current drugs in market [6].

(14)

1.2 The GPCR signal transduction inside the cell

GPCRs control the activity of enzymes, ion channels and transport of vesicles via intracellular signalling cascade (Figure 1). Actions of GPCRs are driven inside the cell through heterotrimeric guanine nucleotide-binding proteins (G-proteins), ß-arrestins and G-protein-coupled receptor kinases (GRKs) among others. The activation of the G-protein (the most common secondary messenger) triggers the GTP-GDP exchange associated with the Gα-subunit and leads the βγ-dimer dissociation from Gα.

Figure 1. Scheme of the principal components of the GPCRs signal transduction machinery (GPCR coloured in purple). The G-protein, which primary effectors are adenylate cyclase (AC), phospholipase C (PLC) and ion channels, is coloured in cyan (α-subunit), magenta (ß-subunit) and green (γ-subunit). ß-arrestin, which primary effectors are tyrosine-protein kinase (TK), MAP kinase (MAPK) and E3 ubiquitin ligase (E3 ligase) is coloured in yellow. G-protein-coupled receptor kinase (GRK) is coloured in orange.

GPCR

Arrestin α

β γ

extracellular

intracellular

G-protein

GRK

AC PLC Ion

channel

TK

TK MAPK E3 ligase

(15)

Subsequently, G-protein sub-units modulate the activity of diverse effectors like ion channels or enzymes [5] and the uncoupled GPCRs became substrates for G-protein-coupled receptor kinases (GRKs), which phosphorylate the intracellular part of the GPCRs. Phosphorylated GPCRs increase their binding affinity to ß-arrestin molecules. The GPCR - ß- arrestin complex drives the membrane protein desensitization and sequestration via clathrin-coated pits endocytosis [7]. ß-arrestins also function as G-protein independent signal transducers for TK, MAPK and E3 ubiquitin ligases effectors [8].

1.3 GPCRs classification

Early attempts to classify GPCRs had been based on phylogenetic relationships, the chemical nature of their ligands, pharmacological properties or by the design of fingerprints that encodes motifs on the seven transmembrane domains (7 TM) [9-11]. According to sequence- based approaches, the GPCR superfamily is organized on five to seven classes. Class A Rhodopsin-like, which account for over 90% of all GPCRs, class B Secretin-like, class C Metabotropic glutamate receptors, class D Pheromone receptors, class E cAMP receptors and the class F Frizzled/Smoothened receptors. They constitute the A-F system, comprising known GPCRs from both vertebrate and invertebrate species [9]. In humans, there are more than 800 GPCR genes [10]. They are divided into five large families: Glutamate, Rhodopsin, Adhesion, Frizzled/Taste2 and Secretin receptors (GRAFS) (Figure 2).

(16)

Figure 2. A phylogenetic relationship of the GPCR classes A, B, C and F (GRAFS classification) adapted from [10]. Rhodopsin family classification (lower inset) is described in detail in the Figure 3.

1.3.1 Class A GPCRs

The Rhodopsin or class A GPCR family has undergone a large evolutionary success in the bilateria species. This family is the biggest and the most studied of all. According to the GRAFS classification system [10], class A GPCRs are subdivided into four main branches α, β, δ and γ, and 13 sub-branches: olfactory, aminergic, peptide, chemokine-like, purine- like, somatostatin/opioid/galanin, opsin-like, glycoprotein binding,

(17)

prostaglandin, MECA (melanocortin, endoglin, adenosine, cannabinoid), MRG receptors, melatonin and melanocyte-concentrating hormone receptors. 460 out of the 710 class A GPCRs are olfactory receptors (Figure 3).

Figure 3. Phylogenetic relationships of the class A receptors according to the GRAFS classification. Adapted from [10].

1.3.2 Other GPCR classes

The class B is divided into Secretin and Adhesion receptors (15 and 24 members respectively). Class C, with 15 receptors, comprises the metabotropic Glutamate (mGlus), γ-aminobutyric acid B-type (GABAB), calcium-sensing (CaS), taste 1 (TAS1) receptors, and several orphan receptors. Class F contains the Frizzled/Taste2 receptors (24 in total). The

(18)

other two classes (class D corresponds to Fungal mating pheromone and class E to cyclic AMP receptors), which are not present on humans.

1.4 Structural insights on GPCRs

Crystal structures of several GPCRs reveal an overall transmembrane fold preserved across the whole superfamily (Figure 4). These receptors display a highly conserved molecular architecture composed by seven α- helical transmembrane segments (7TM), which span the cell membrane, connected to each other by three extracellular loops (ECL1-3) and three intracellular loops (ICL1-3). In addition, an eighth α-helix (H8) is usually found at the beginning of the intracellular C-terminal, lying parallel to the membrane plane.

Figure 4. Comparison of the TM segments in the crystal structures from GPCRs of class A/Rhodopsin family (protein name: ADRB2, PDBid: 2RH1 [12] in cyan), class B/Secretin (GLR, 4L6R [13] in red), class C/Glutamate (GRM1, 4OR2 [14] in orange) and class F/Frizzled (SMO, 4JKV [15] in green). A lipid bilayer (in pink) is included on the representation.

(19)

In general, the interaction with ligands is produced on a binding pocket lying in the extracellular side of the receptors. Specific domains on N- terminus of several GPCR families would mediate these interactions, whereas in the majority of Class A receptors their effector molecules bind a cavity formed by the N-terminus, ECL2 and TM helices.

1.4.1 Class A GPCRs

One of the most important limitations in the study of GPCRs is the low similarity between their sequences, which in many cases is below the twilight region significant for homology detection [16]. Major sequence and structural divergences occur in the non-transmembrane regions, mostly in N- and C- terminus, ECL2 and ICL3. Interestingly, despite very low sequence conservation among class A GPCRs, they exist highly conserved residues, at least one per helix: N in TM1 (present in 98% of the sequences), D in TM2 (93%), R in TM3 (95%), W in TM4 (96%), P in TM5 (76%), P in TM6 (98%), and P in TM7 (93%) (Table A1 in Appendix).

These residues have been used by Ballesteros and Weinstein to define a general numbering scheme consisting of two digits: the first (1 through 7) corresponds to the TM segment in which the amino acid of interest is located; the second pinpoints the position relative to the most conserved residue in the helix, arbitrarily assigned to 50 [17]. Significantly, the position of these highly conserved amino acids in each helix is the same in the superimposition of the currently available crystal structures (Figure 5).

Thus, this finding validates their use as reference points in TM sequence alignments and for the construction of homology models of GPCRs with unknown structure [18].

(20)

Figure 5. Comparison of the TM segments in the crystal structures of class A GPCRs: OPSD (PDBid: 1GZM), ADRB2 (2RH1), DRD3 (3PBL), S1PR1 (3V2Y), PAR1 (3VW7), NTR1 (4BUO), P2Y12 (4PXZ), OX2R (4S0V), EDNRB (5GLH), AA2AR (5IU4) and CCR9 (5LWE). The color code of the TM helices is 1 in white, 2 in yellow, 3 in red, 4 in black, 5 in green, 6 in dark blue, 7 in light red, and C-terminal in grey. The highly conserved N1.50, D2.50, R3.50, W4.50, P5.50, P6.50, P7.50 are shown in spheres.

Class A GPCRs 3D structures reveal different spatial conformations of the N-terminus and ECL2 that maintain the binding site rather accessible from the extracellular environment. Thus, each receptor subfamily has probably developed, during evolution, a specific N-terminus/ECL2 complex to adjust the structural characteristics of its cognate ligands, and to modulate the ligand binding/unbinding events [19-21]. Analysis of crystal structures shows that ligand binding mostly occurs in a major crevice located on the upper site of TMs 3, 5, 6 and 7. In addition, a minor cavity also exists on

(21)

the TMs 1, 2, 3 and 7 (Figure 6). This minor binding pocket is usually associated with ligand selectivity [22].

Figure 6.The ligand-binding cavities in class A GPCRs. Superimposition of ligands co-crystallized with the receptors ADBR2 (PDBid: 2RH1, 3PDS, 3D4S), ADRB1 (2VT4, 2Y00, 2Y01, 2Y02, 2Y03), AA2AR (2YD0, 2YDV, 3EML, 3PWH), CXCR4 (3ODU), DRD3 (3PBL), HRH1 (3RZE), ACM2 (3UON), ACM3 (4DAJ), S1PR1 (3V2Y) and OPRK (4DJH), all showed over ADRB2 crystal structure (PDBid: 2RH1) in cylinders helices. Ligands populating the major and the minor crevices are colored in grey and black VdW spheres respectively.The color code of the helices is the same than Figure 5.

Structural alignment of class A GPCRs has revealed changes in the α- helical scaffold in the form of tight and wide turns at some TM helices.

These distortions are related to residue insertion or deletion events (known as indels) accumulated during the evolution and challenge the established

(22)

paradigm of avoiding gaps in alignments of the TM regions [23]. Gonzalez et al [24] show how sequence gaps of one or two amino acids size has to be introduced in TM2 and in TM5 sequences for some class A GPCRs in order to reflect their spatial superposition in crystal structures. These structural anomalies in TMs 2 and 5 could have played an important role in the diversification and evolutionary success of GPCRs given that a part from amino acid substitutions, indels are among the most common events in protein evolution [25]. These findings have direct implications in sequence alignments and in homology modeling as well as in phylogenetic reconstruction. Nonetheless to-date most numbering schemes of Class A GPCRs do not accurately reflect amino acid sequence position with their tertiary structure location.

1.4.2 Other GPCR classes

To date, 19 structures of the transmembrane domain of receptors from classes B, C and F have been determined in complex with ligands of varied pharmacology. The Class B Secretin receptors interact with endogenous ligands with the extracellular domain (with 100-150 residues) and part of the extracellular regions (ECL1-3) [26-28]. Whereas Adhesion receptors are characterized by the presence of an extracellular GPCR- Autoproteolysis-INducing (GAIN) domain located immediately N-terminal to the 7TM [29]. Class C GPCRs generally form dimers or higher order oligomers, with extracellular domains composed by Venus Flytrap modules (VFTM), cysteine-rich domains (CRD) and heptahelical domains (HD) [14, 30, 31]. Finally, the Class F activated by the lipoprotein WNT, contain a extracellular cysteine-rich domain (CRD), which binds

(23)

endogenous ligands, and a linker domain [15, 32] (Figure 7, Table A2 in Appendix).

Figure 7. Representation of the structures and ligand binding regions of Class A, B, C and F GPCRs. Endogenous ligands bind transmembrane (TM) cavity in class A. In the secretin class B receptors, endogenous peptides bind on the N-terminal and extracellular loops (ECLs) regions.

Class C makes dimers and ligands binds on the long N-terminal named Venus flytrap domain (VFD). In class F, ligand binds the cysteine-rich domain (CRD) in the N-terminal. Figure taken from [14].

Alternatively, small molecules could bind different allosteric sites in these receptors, modulating, in many cases the pharmacological properties of orthosteric ligands [33]. (Figure 8).

(24)

Figure 8. Schematic overview of known binding positions of class A, B and C GPCR ligands.

Figure taken from [34].

1.5 Molecular switches in GPCRs

Recent advances in experimental crystallization techniques [35, 36] had allowed the crystallization of GPCRs in different conformational states.

Analysis of the atomic-level information retrieved for these receptors in complex with agonists, antagonists, inverse agonists, and accessory proteins unveiled differences among states of the activation process [37].

However, despite the enormous chemical diversity of their ligands, all GPCRs shared a common activation mechanism. In class A GPCRs, this activation mechanism make use of small molecular micro-switches [38]

(Figure 9), which are highly conserved across the family.

(25)

Figure 9. Micro-switches of class A GPCRs. (a) Conserved residues are indicated with orange and red (highly conserved) symbols in a serpentine model of a 7TM receptor and are listed below using both the structural generic numbering system and the mathematical generic system (in brackets below). (b) The 7TM global toggle switch with micro-switches and extra- and intracellular ligand- binding pockets, showed over ADRB2 structure. Figures taken from [38].

(26)

1.6 Clustering methods for GPCRs

Conventional strategies for identifying and classifying proteins involve similarity searches looking for common biological functions or using sequence database search tools (as example: BLAST). Unfortunately, biological data are often unknown and methods based on pairwise alignment only appreciate generic similarities between sequences. Then, alternative techniques, as sequence profiles alignment or phylogenetic analyses among others, have been developed in order to overcome such problems. The most widely accepted GPCRs classifications are listed below.

1.6.1 Pharmacological property approaches

Traditional GPCR’s classification is based on receptor pharmacology, which is used as reference by all computer-generated classification. As previously mentioned, GPCRDb [11] and NC-IUPHAR [39] are the most common databases that label and classify receptors based on their endogenous ligand and pharmacological properties. However, similar ligands do not always bind to similar receptors, and similar receptors do not always recognize similar compounds. As example, all LPARs binds the same endogenous ligand [40] despite a mere 25% of identity, while other receptors (APJ and AGTRs) with higher identity, do not always share the same binding partners [41]. These observations reflect a complex evolutionary background with GPCR sequences converging or diverging before they come up with their current functional profile.

(27)

1.6.2 Phylogenetic analysis approaches

The phylogenetic analysis was originally based on phenotypes distinction.

Currently, the new technologies permit more reliable classifications using the sequence alignments of DNA, RNA, proteins or non-biological data.

The methodology to build the sequence classification is chosen in order to optimize the repartition of character states and the result is displayed as a phylogenetic tree or dendrogram (the graphic representation of the computed results). Most of the classification methods can be classified as i) distance- or ii) character-based methods. i) Distance (or algorithm) methods convert sequence data into a distance matrix. Distance values are stated from an evolution model algorithm and they reflect the number of differences between each pair of sequences. The tree is then constructed from these distance values by progressive clustering. ii) Character-based (or tree-searching) methods search the branch and bound tree topology that better fit the set of taxa [42]. In practice, both distance based and tree-searching methods are combined. For example, an initial tree may be estimated by distance-method Neighbour Joining (NJ) and subsequently, the maximum likelihood (ML) method maximizes the likelihood of the tree topology parameters from a given data [43].

Possibly the first and surely the benchmark of all GPCRs classifications was presented in 2003 by Fredriksson et al. (GRAFS classification, Figure 3), who classified GPCRs using fingerprints motifs and phylogenetic algorithms. Of the 800 human sequences classified as GPCRs, the 241 non-olfactory class A receptors were clustered in four main groups (α, ß, γ, δ) with thirteen sub-branches. α-branch clusters prostaglandins (or

(28)

prostanoid), amine (5-Hydroxytryptamine, acetylcholine, dopamine, histamine, adrenergic and trace amine receptors), opsin, melatonin and MECA receptors (Melanocortin, Lysophospholipid LPAR1-3 and S1PR, Cannabinoid, Adenosine and the orphan GPR3, GPR6 and GPR12 receptors). ß-branch includes peptide receptors: vasopresin, endothelin, bombesin, cholecystokinin, ghrelin, gonadotrophin-releasing hormone, orexin, neurotensin, neuromedin U, oxytocin, neuropeptide FF, neuropeptide Y, tachykinin, prolactin-releasing peptide and the orphan GPR83 receptors. In γ-branch, receptors are sub-grouped on the three clusters: SOG receptor cluster (somatostatin, opioid and galanin receptors), melanin-concentrating hormone “MCH” receptors and chemokine receptors: ACKRs, CCRs, CXCRs, XCR1, angiotensin II, bradykinin, chemerin, complement C5a peptide, formylpeptide, leukotriene B4, relaxin-3 and the orphan GPR1, GPR15, GPR25, GPR32 receptors.

The last branch (δ) clusters MAS-related, glycoprotein (hormone receptors) and purin receptors (P2Y, Lysophospholipid LPAR4-6, platelet- activating factor, N-arachidonyl glycine, complement anaphylatoxin chemotactic, proteinase-activated, cysteinyl leukotriene, oxoglutarate, succinate, hydroxycarboxylic acid, thyrotropin-releasing hormone, ovarian cancer, QRFP, RPE-retinal and the orphan P2Y10, GPR17, GP132, GP174, GPR35, GPR55, GPR4, GP171, GPR82, GPR161 and GPR101 receptors). Olfactory and other 7TM receptors are separately grouped.

More recently, the GRAFS classification was updated [2] and new classifications have been proposed. Chabbert and colleagues trace a molecular evolution path driven by specific residues at TM2 [44], TM5 and the WXFG motif at the ECL1 [45]. From these evolution markers, class A

(29)

receptors are proposed to cluster in four groups (G0, G1, G2 and G3). G0 includes peptide, opsin and melatonin receptors. The G1 includes somatostatin/opioid, chemotactic and purinergic receptors. The G2 is composed by amine and adenosine receptors and the G3 include leucine- rich repeat, melanocortin, S1P, cannabinoid, prostaglandin and MAS- related receptors. Finally, Kakarala et al. developed a sequence-structure based phylogeny to identify potential ligand association for class A orphans [46].

1.6.3 Profile and pattern approaches

Many methods classify protein sequences using learning statistical models obtained from the various protein classes. Profile hidden Markov models (profile HMMs) are one of the most common, building statistics from the sequence alignment consensus. However, such profiles fail when query proteins lack significant similarity on the database sequences. Then, more accurate HMM profiles are made for classification performance. T-HMM method [47] clusters GPCRs in function of a phylogenetic tree-based profile hidden Markov model. PRED-GPCR [48] application pretends to progress the method with an exhaustive discrimination of the low selective and sensitive HMM profiles.

In 2002, Lapinsh et al. [49] developed an alignment-independent method for GPCRs classification according to the chemical properties of the amino acids. These types of Support Vector Machines (SVM) methods extract physicochemical properties of the proteins. In GDS classification [50], primary amino acid sequence are described by 26 physicochemical

(30)

properties as hydrophobicity, bulky and polarity among others. In SVMtree [51], Karchin et al. combine HMM and SVM methods to produce a hierarchical multi-class SVM classification. In GPCRpred [52] and GPCRsclass [53], another SVM method determines fixed-length vectors from the dipeptide composition of the proteins.

1.6.4 Chemogenomic of the TM binding cavity approaches

The particular architecture of class A GPCRs, where ligands directly contact the transmembrane bundle residues, makes feasible the classification of these receptors only using the physicochemical properties of the binding cavity residues. In 2001, Jacoby created one of the first chemogenomic strategies where monoamine receptors were clustered in base of different ligands and the related binding site [54]. The lack of the rest of the sequences in these type of analysis permits a more ligand addressed classification and helps the ligand discovery research.

In 2006, Surgand and colleagues analyzed the residues of the ligand- binding site for all GPCRs by phylogenetic classification scheme. The residues subset was obtained considering amino acids of 30 discontinuous positions supposedly involved in ligand-binding [55]. These 30 critical positions were selected by the X-ray structure of the bovin rhodopsin receptor (PDB code: 1F88 [56]). The clustering of the ligand-binding database permits an easy detection of similarity and selectivity between different ligand-receptor interactions. According to this classification, class A receptors were grouped in 19 clusters: Prostanoids, glycoproteins, MAS- related, SREB, opsins, lipids, peptides, melatonin, vasopeptides,

(31)

adenosine, amines, melanocortins, brain-gut peptides, acids, chemokines, opiates, chemoattractants and purine receptors, and three more clusters for non-class A receptors (Figure 10).

Figure 10. TM cavity-derived phylogenetic tree for human GPCRs. Taken from [55].

Thanks to the increasing number of GPCR 3D structures, more details about their ligand-binding interactions become available, allowing a more accurate classification. In 2009, a new reference set for class A GPCR binding pocket (with seven crystal structures available at that moment) defined 44 positions important in ligand binding [57]. These methods has been compared against sequence phylogenetic analysis of TM segments [58] and against classifications shemes obtained by similarity ensemble approach (SEA) [59].

In 2016, Ngo et al created the “pocketome” database in order to classify class A GPCRs by a combination of all receptor-ligand interaction pattern

(32)

extracted from crystal structures [60]. This new methodology helps the identification of related properties between receptors on different subfamilies.

1.6.5 Protein-ligand fingerprint approaches

Protein-ligand fingerprint methods introduce novel low-dimensionality fingerprint encoding both ligand and receptor physicochemical properties which is suitable to mine protein−ligand chemogenomic space. Whereas ligand properties have been represented by standard chemical descriptors, protein cavities are encoded bit strings describing pharmacophoric properties of a definite number of binding site residues.

This concept has been applied to G protein-coupled receptors with a homogeneous cavity description [61-64].

1.7 Computational tools for the study of GPCR

The importance of the GPCRs in cellular physiology has inspired the development of numerous computational tools and databases for their study over the years. GPCRdb [65] and Pocketome [60] are web servers that manage high quality curated sequence and structural GPCRs information. PRED-GPCR [48], GPCRpred [52], GPCRsclass [53] and SEVENS [66] predict protein classification from a query sequence. On the other hand, FoldGPCR [67], GOMoDo [68], GPCR-ModSim [69] servers allow online homology modeling, docking and molecular dynamic simulations.

(33)

Despite all these available tools, a lot of work is still needed. Currently, there are an important number of unclassified receptors (nearly one hundred [40], mostly in class A sub-family) which encourages the development of new tools to assist the identification and deorphanization of GPCRs. In addition, an improvement of the current classifications for the GPCR superfamily could be feasible taking into account the new knowledge obtained from structural data.

In this work, we propose an improved methodology for the clustering of class A GPCRs, where the new structural information derived from X-Ray crystallography studies are considered in the construction of multiple sequence alignments (MSAs) for this protein family. As a result of this approach, we propose modifications to the existing GPCRs classifications systems. In addition, this information has been used in the development of computational tools to compare GPCRs sequences, conduct similarity searches and propose structural templates for homology modeling studies.

(34)

2 Objectives

The aim of this project is to make use of the state-of-the-art information about the sequence-structure relationships in GPCRs to develop bioinformatic tools that assist in classification, pharmacological identification and comparative modeling within this receptor family.

To achieve these goals, we proposed to develop:

• A substitution matrix specific for GPCR sequences. This tool would be useful for sequence alignment, BLAST searches and phylogenetic analysis.

• A web application server to integrate structural information derived from GPCRs in the comparison of the sequences from TM regions, ligand-binding sites and for template selection in comparative modelling studies.

These new generated bioinformatic tools could be of interest in projects related to phylogenetic studies, comparative modeling, molecular docking and other pharmacological applications in GPCRs.

(35)

(36)

3 Methods

3.1 Amino acid sequence retrieval and alignment of the GPCR class A family

3.1.1 Sequence datasets employed

*&(&(&(

Class A GPCR protein sequences from different biological sources were obtained from Uniprot [70]. The initial dataset was extended with the inclusion of 314 sequences functional human olfactory GPCR repertoire [71]. The final dataset is composed by 1019 sequences.

*&(&(&) !' %

Human class A GPCR sequences were obtained from the UniProt database using the following syntax: as field name organism (OS) “Human [9606]”, as family and domains: “G protein coupled receptor 1 family”. In this set, odorant receptors were excluded with the field “NOT name:

olfactory. The acquired 297 reviewed entries, were manually revised and sequences without seven TM domains were removed. The final dataset is composed by 292 sequences.

(37)

3.1.2 Post-processing of multiple sequence alignments sets according to structural information

UniProt and GPCRdb annotations were used to identify TM segments and boundaries of the TM helices were defined according to the available crystal structures of class A GPCRs (Table A1 in Appendix) [21, 72].

Sequences corresponding to TMs 1– 7 were aligned using the Win32 version of ClustalW 2.1 [73]. ClustalW was used with a gap open/extention penalty value of 40/0.1. The resulting alignment was manually curated according to the consensus signatures of class A GPCR: GN1.50, LAxxD2.50, DR3.50Y, W4.50, P5.50, Y5.58, CWxP6.50, NP7.50xxY [37], including the ECL1 WxFG motif [74] and the highly conserved cysteines in TM3 and ECL2 involved in a disulfide bridge in most receptors [75]. The disulphide bond between TM3 and ECL2 was considered as formed when both cysteine (in position 3.25 and another cysteine in ECL2) were detected and then, the cysteine at ECL2 was aligned at position 45.50 (GPCRDB numbering scheme is used at the ECL2 region [76]). Finally, gaps were inserted in the TM2 and 5 according to previous studies [24].

This resulted in two multiple sequence alignments (MSA) of 292 and 1019 non-redundant TM GPCR sequences (see Figure A1 and A2 in Appendix).

3.2 Construction of a GPCR amino acid substitution matrix

3.2.1 Construction of GPCRtm

The alignment of the TM regions of the 1019 GPCR class A sequences database was used to generate a substitution matrix representing changes

(38)

on this protein family using an implementation of the methodology described by Henikoff et al. [77]. In this regard, extracellular and intracellular regions are removed and the corresponding TM segments (1- 7), which consist of multiple alignments of short regions (< 40 amino acids), were treated as sequence blocks. As initial step, a transition count (frequency) table was computed to determine the total number of amino acid transitions pairs from each column of the alignment. After the transition count table was completed, observed and expected probability of transition were computed for each pair. The observed probability (O) for the amino acid pair (i,j) is the total number of transitions observed (from the frequency table) divided by the total number of transitions for the entire alignment.

The expected probability (e) of occurrence for each (i,j) pair was calculated from the observed probabilities for the pair.

For a single residue:

for an (i,j) pair:

for ij when i = j,

(39)

Using the expected (e) and observed (O) probabilities of transitions, the substitution values were calculated from the odds ratio matrix, as the logarithm of odds, where each entry is obtained according to:

The scaling factor of 2 is taken from Henikoff et al. [77] in order to facilitate comparisons. In the final 20 × 20 amino acid matrix, substitutions values where rounded to the nearest integer value. In addition, we calculate the average mutual information per amino acid pair or relative entropy (H) according to:

3.2.2 Evaluation of the GPCRtm in database searching and pairwise alignments

One hundred random sequences from different GPCR subfamilies, including the four main groups α, β, δ and γ [10], were used as queries in BLASTP searches executed with the AB-BLAST software (http://blast.advbiocomp.com/) against the pdbaa database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). Parameters to the customized gapped alignment score system for the GPCRtm were computed with the ALP program [78] (see Table S3). All BLASTP results were conducted with a gap existence = 15 and a gap extension = 2 scoring parameters, except for the BLOSUM62 matrix (gap existence = 11 and a gap extension = 1, default parameters). Matched comparisons of GPCRtm against JTTtm,

(40)

PHAT, BLOSUM62 and BLOSUM45 matrices were calculated with the IBM SPSS Statistics for Macintosh, Version 22.0 using the exact McNemar 2-tailed tests (p-values). Pairwise sequence alignments were generated with the MAFFT (L-INS-i) software using default parameters [79, 80].

3.3 Development of the clustering methods for GPCRs

 Clustering GPCRs according to TM regions

The human class A non-olfactory GPCR sequence dataset, was used to build an unrooted tree using PhyML software 20120412 version [81] (see results). To claim confident alignment regions, only TM blocks were used.

As mentioned earlier, two methodologies were combined in order to obtain a better phylogenetic tree. Five starting trees were built using the maximum parsimony method (pilot tests had shown better results than distance-based methods), these were optimized with Subtree Pruning and Regrafting (SPR) [82] and Nearest Neighbor Interchange (NNI) algorithms [83]. Maximum likelihood method compares the different trees, parameters and models and computes the most probable hypothesis. The amino acid equilibrium frequencies were obtained from the GPCRtm substitution matrix [24]. Bootstrap method “aLRT” based on SH and Chi-square criteria [84, 85] were used to support branch nodes (Figure A3 in Appendix). The obtained phylogenetic tree was rendered by FigTree 1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/) for an easier visualization.

(41)

 Clustering GPCRs according to ligand-binding site residues

40 amino acids involved in ligand-receptor interactions were selected based on the analysis of 111 crystal structures of class A GPCRs ligand- receptor complexes deposited in Protein Data Bank (Table A1 in Appendix). Residues within a distance ≤ 5 Å of the ligand were selected in every crystal structure and annotated according to Weinstein-Ballesteros scheme. In order to decrease the bias of the available crystallographic information only positions observed in at least two crystal structures from different receptors were included in the consensus binding pocket sequence database (Table A4 in Appendix). Finally, residues of the selected positions were extracted from the human class A dataset of 292 sequences to construct ligand-binding site residue alignment (Figure A7 in Appendix). All 40-amino-acid-long sequences were compared using a similarity matrix and were visually expound using heatmap representation.

Similarity scores for every receptor pairwise were weighted by GPCRtm substitution matrix [86] and normalized using equation 1.

Equation 1

Where Sij is the initial similarity value for every pairwise receptor.

Similarity values were converted to distance by maximum metric function and clustered by average algorithm. Heatmap plots were made using the gplots library of the R software (http://CRAN.R- project.org/package=gplots). Color intervals were manually adjusted according to: strong dissimilarities (below the value -0.25) colored in dark

(42)

blue, weak dissimilarities (between -0.25 and 0) colored in white, weak similarities (between 0 and 0.15) in light blue, medium similarities (between 0.15 and 0.25) in yellow, strong similarities (between 0.25 and 0.5) in orange and very strong similarities (between 0.5 and 1) in red.

3.4 Design and implementation of the GPCR Browser web application

The GPCR-Browser (http://lmc.uab.cat/gpcr-browser/) is an easy-to-use web server built to show and take advantage of all the information retrieved in this study. Web-accessible tools were implemented in python programming language and the web page interface was developed with php code. Users can either 1) input a Uniprot protein codes or 2) the fasta sequences of the class A GPCR they are interested in. The validity of the input is always checked. In the former by validating the input code with our pre-compiled class A receptor-name list. In case a fasta sequence is pasted as input, a blast research against the “full human” GPCR database is run in order to confirms that the input sequence belongs to class A receptor (blast search has an e-value lower than 10^-30). Once input is checked and considered valid, GPCR-Browser run three independent analysis by mean of four python scripts and it displays the results on a user-friendly interface. The following paragraph explains the detail of each analysis.

1) The phylogenetic tree obtained by the TM database is parsed using Phylo library of the Biopython open source tools [87]. Then, the

(43)

sub-branch tree with the selected input (from protein code or blast result) is shown as a jpg picture.

2) A ligand-binding cavity representation is created showing the residues of the input sequence considered as part of the binding site. Residues are extracted from the fasta sequence by a previous alignment and a posteriori binding position identification. The protein sequence is aligned against the GPCR database with MAFFT program version 7.215 [80] using “Align full length sequences to an MSA”, a gap penalty value of 5, and the GPCRtm substitution matrix are implemented by default [86]. Otherwise, residues are extracted from our database on the protein code option. Similarity scores are calculated as mentioned in 3.7 and then the twenty best similarity values are selected and sorted.

3) Finally, GPCR-Browser build a phylogenetic tree between the selected target and all receptors with known crystal structures.

Sequence of all available crystal structures class A GCPR and user query are used to run a phylogenetic reconstruction using PhyML software 20120412 version with default parameters. For protein code as input, the protein sequence is extracted from our database otherwise, if fasta sequence is introduced as input, it is aligned to crystal sequences using MAFFT (as previously explained).

The GPCR-Browser tool is updated every six month. The sequence database is checked from Uniprot and the crystal structure list is renovated from pocketome website. The Uniprot and pocketome web servers are parsed by python scripts.

(44)

4 Results and Discussion

4.1 Use of structural knowledge in the improvement of sequence alignments of GPCRs

Multiple sequence alignments (MSAs) constitute an essential tool for the study of protein families. In GPCRs, the quality of an MSA could be enhanced using the available structural information (currently, 210 PDB entries for 43 unique GPCRs are available).

Figure 11. Multiple sequence alignment of TM2 and TM5 regions of crystalized class A GPCRs.

The conserved residues are highlighted in black. Sequence gaps were included to amend spatial correspondences observed in crystal structures. Sequence logos are built with weblogo (http://weblogo.berkeley.edu/logo.cgi).

5HT1B(4IAR) 5HT2B(4IB4) AA2AR(4EIY) ACM2(3UON) ACM3(4U15) ADRB1(4BVN) ADRB2(2RH1) CCR5(4MBS) CXCR4(3ODU) DRD3(3PBL) FFAR1(4PHU) HRH1(3RZE) LPAR1(4Z34) NTR1(4BUO) OPRD(4N6H) OPRK(4DJH) OPRM(4DKL) OPRX(4EA3) OPSD(1GZM) OX2R(4S0V) P2RY1(4XNW) P2Y12(4PXZ) PAR1(3VW7) S1PR1(3V2Y)

L A V TDL L V S I L V M PI L A V ADL L V G L F V M PI L A A AD IA V G V L A IPF L A C ADL I IG V F S MNL L A C ADL I IG V I S MNL L A S ADL V M G L L V VPF L A C ADL V M G L A V VPF L A I SDL F F L L - T VPF L S V ADL L F V I - T LPF L A V ADL L V A T L V M PW L G C SDL L L T V - S LPL L S V ADL I V G A V V M PM L A A ADF F A G L A Y FYL L A L SDL L T L L L A M PV L A L ADA L A T S - T LPF L A L ADA L V T T - T MPF L A L ADA L A T S - T LPF L A L ADT L V L L - T LPF L A V ADL F M V L G G FTS L S L ADV L V T I T C LPA L A L ADF L Y V L - T LPA T V I SDL L M I L - T FPF L A T ADV L F V S - V LPF L A L SDL L A G V A Y TAN

LYT V Y - S T V G A F Y FPT L L L IA LYG DFM L F - G S L A A F F TPL A IM I V TYF NYM V Y F N F F A C V L VPL L L M L G VYL AVT F G - T A IA A F Y LPV I IM T V LYW TIT F G - T A IA A F Y MPV T IM T I LYW AYA IA - S S V V S F Y VPL C IM A F VYL AYA IA - S S I V S F Y VPL V IM V F VYS FQT L K - I V I L G L V LPL L V M V ICYS VFQ F Q - H IM V G L I LPG I V I L S CYC DFV I Y - S S V V S F Y LPF G V T V L VYA PAR F S - L S L L L F F LPL A I T A F CYV W FK V M - T A I IN F Y LPT L L M L W FYA SYL V F - W A I F - N L VTF V V M V V LYA VVIQ V - N T F M S F I FPM V V I S V LNT VTK IC - V F L F A F V VP I L I I T V CYG FMK IC - V F I F A F V IPV L I I I V CYT LLK IC - V F I F A F IMPV L I I T V CYG VFA IC - I F L F S F I VPV L V I S V CYS SFV I Y - M F V V H F T IPM I I I F F CYG M YH IC - F F L V T Y M APL C L M V L AYL IYS M C - T T V A M F C VPL V L I L G CYG IVN Y I - C Q V I - F W INF L I V I V CYT YYF S A - F S A V F F F VPL I I S T V CYV HYI L F - C T T V - F T LLL L S I V I LYC

TM2 TM5

2.50 2.60 5.37 5.50 5.58

(45)

Structural alignment of GPCRs X-ray crystal structures unveiled the presence of insertion and deletion events leading to the inclusion of sequence gaps on TM helices MSAs. Figure 11 shows modifications in the TM2 and TM5 MSA based on structural differences observed in AA2AR, CCR5, CXCR4, FFAR1, LPAR1, OPRs, P2RY1, P2Y12, PAR1 and S1PR1 receptors with regard to the rest of Class A receptors. Such changes were implemented in order to fit amino acids on the same spatial location on the 3D structures.

Figure 12. Representation of the ligand-binding site of the co-crystalized ADBR2 with carazolol (PDBid: 2RH1 in cyan) and S1PR1 with a phosphonic acid ligand (PDBid: 3V2Y in orange).

Receptors are superposed and only the TM5 helix is showed. Residues on positions 5.45, 5.46,5.57 and 5.50 are highlighted. Ligands are shown as sticks.

(46)

As example, structural superposition shows a non-correspondence of a serine residue at position 5.46 in ADRB2 with regard the S1PR1 receptor structure (Figure 12). It is important to mention that these changes in the TMs alignment has an impact on the Ballesteros-Weinstein numbering as well as on the MSAs statistics, and highlight specific compositional bias observed in some members of the family.

4.2 Construction of an aminoacid substitution matrix for the Class A GPCRs

Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins [88]. These general-purpose matrices are not optimal to align accurately sequences with marked compositional biases, such as hydrophobic transmembrane regions found in GPCRs [89]. Amino acid substitution matrices are obtained by the application of statistical methods on sequence alignments of evolutionarily related proteins. In this regard, it is known that the evolutionary selective pressure that governs the conservation and relative mutability of amino acids varies among protein families [90]. Therefore, the application of a standard matrix for the alignment of a determinate protein family could give inaccurate results, particularly if the amino acid composition differs from those used for the matrix construction.

Specific substitutions matrices for certain families of proteins are continuously developing [91-93]. These, in many cases have proven to be more effective than the standard matrices in recognizing evolutionary

(47)

relationships between the proteins of interest [94, 95]. To take account of the compositional bias and the modifications in the GPCR MSAs as a consequence of the structural analysis. A curated alignment of more than one thousand membrane spanning sequences of the rhodopsin class from different organisms were used for the construction of an amino acid substitution matrix dedicated to the study of GPCRs. This matrix was built using an approach similar to the one employed for the construction of the BLOSUM series of matrices [77].

4.2.1 Amino acid compositional bias in the class A GPCRs

The average amino acid composition of the TM regions of the class A family was compared with amino acid frequencies derived from other studies (Table 1). As expected, the fraction of hydrophobic residues in the membrane spanning regions of GPCRs is similar to other TM protein matrices (JTTtm and PHDhtm) and is higher than in general proteins (BLOSUM62, and Swiss-Prot). Leucine is the most common occurring residue followed by valine and isoleucine. Nonetheless, there are differences in the amino acid composition of GPCRs. This is the case for charged and polar residues, with the exception of serine and threonine that behave similar in all datasets. The accumulated percentage for the R, K, H, D, E, N, and Q amino acids in the GPCRtm dataset (19.6 %) is in between JTTtm (9.5 %) and PHDhtm (9.9 %) datasets and BLOSUM62 (32.3 %) and Swiss-Prot (33.8 %) datasets. In addition, TM regions of the rhodopsin family are also characterized for a lower frequency of glycine (4.6 %) and a higher frequency of cysteine (3.6 %) residues relative to the other datasets. Given such differences in amino acid composition, we

(48)

presume that general protein matrices such as the BLOSUM series and TM-derived protein matrices may not perform accurately in the alignment of the TM regions of GPCRs.

Table 1. Amino acid composition of substitution matrices and the Swiss-Prot database (%).

Aminoacid GPCRtm JTTtm [96] PHDhtm [97] BLOSUM62 [77] Swiss-Prot [98]

Ala (A) 8.0 10.5 8.8 7.4 8.3

Cys (C) 3.6 2.2 2.6 2.5 1.4

Asp (D) 2.1 0.9 1.4 5.4 5.5

Glu (E) 1.9 1.0 1.0 5.4 6.7

Phe (F) 7.3 7.7 9.3 4.7 3.9

Gly (G) 4.6 7.6 5.7 7.4 7.0

His (H) 2.1 1.7 1.1 2.6 2.3

Ile (I) 8.1 11.9 11.0 6.8 5.9

Lys (K) 3.4 1.1 0.9 5.8 5.8

Leu (L) 14.1 16.3 16.0 9.9 9.7

Met (M) 3.1 3.3 4.1 2.8 2.4

Asn (N) 3.4 1.8 2.2 4.5 4.1

Pro (P) 3.8 2.6 3.2 3.9 4.7

Gln (Q) 2.2 1.4 1.2 3.4 3.9

Arg (R) 4.5 1.6 2.1 5.2 5.5

Ser (S) 6.8 5.7 6.5 5.7 6.6

Thr (T) 5.6 5.2 5.3 5.1 5.3

Val (V) 9.2 11.9 11.0 7.3 6.9

Trp (W) 1.9 2.2 1.9 1.3 1.1

Tr (Y) 4.3 3.2 4.7 3.2 2.9

(49)

4.2.2 Development of the GPCRtm matrix

In Figure 13, the substitution matrix developed for the TM regions of class A GPCRs is shown. Unlike BLOSUM matrices, built from sequence blocks of a variety of biological sources, we employ sequences of only GPCRs that accounts for the compositional bias in this family of receptors.

Inspecting the diagonal elements of the matrix, we can estimate the mutability potential of each residue. Hydrophobic residues (V, L, I, A, F) display the highest level of relative mutability (corresponding to low values on the matrix, ≤ 2), whereas charged and polar residues are in general less mutable. Polar serine and threonine residues are special cases, displaying similar values than hydrophobic residues. These two amino acids, unlike other polar or charged residues, do not destabilize TM helices, as their hydrogen bonding potential can be satisfied by interacting with the carbonyl oxygen in the preceding turn of the same helix [99]. In contrast, N, D, R, W and P amino acids display the lowest level of relative mutability (corresponding to high values on the matrix, ≥ 7). All these residues display a high conservation pattern in at least one of TM helices of class A GPCRs [17, 75]: N in TM 1 (present in 98 % of the sequences), D in TM 2 (93 %), R in TM 3 (95 %), W in TM 4 (96 %) and P in TMs 5 (76

%), 6 (98 %) and 7 (93 %). Significantly, the position of these highly conserved amino acids in each helix is the same in the superimposition of the currently available crystal structures [100]. Positively (K, R, and H) and negatively (D, E) charged residues are easily interchangeable with each other. This could be due to a selection pressure to adapt the binding cavity of the TM bundle to the different chemical features of the ligands that, in many cases, display strong electrostatic properties (discussed below).

(50)

Figure 13. The G protein-coupled receptor transmembrane substitution matrix (GPCRtm).

+&)&)&( ! &# 

 ! ! 

The GPCRtm (relative entropy, H= 0.6540) displays intermediate properties between matrices derived from general TM data sets (JTTtm, H= 0.5599 and PHAT, H= 0.5550) and for water-soluble globular proteins (BLOSUM62, H= 0.6979). A comparison of GPCRtm with other matrices is shown in Figure 14. In GPCRtm, charged and polar amino acids (K, R, H, D, E, N and Q) interchange with higher frequencies than in BLOSUM62 and lower than in JTTtm. In general, there is an intermediate performance of GPCRtm between general TM-derived and globular protein matrices

(51)

with regard to the majority of charged and polar residues, which suggest a distinctive role of these amino acids in GPCRs.

Figure 14. Bubble chart of the difference matrix obtained by subtracting from GPCRtm the JTTtm (lower) and BLOSUM62 (upper) substitution matrices. Positive and negatives values are showed in grey and white circles respectively. Bubbles are scaled according to the absolute value of the difference (numerical values are available in Figure A3 in Appendix).

One of the most important aspects of substitution matrices is amino acid grouping based on their chemical properties. These similarities could be easily visualized through the construction of dendrograms and multi-