Computational tool for in silico screening of biosurfactants in metagenomic libraries


Iván Ricardo López Sandoval¹, María Mercedes Zambrano Eder², Alejandro Reyes Muñoz³, and Andrés Fernando González Barrios¹

¹ Grupo de Diseño de Productos y Procesos (GDPP), Department of Chemical Engineering, Universidad de los Andes. Carrera 1E No. 19A-40, Edificio Mario Laserna, Bogotá D.C. – Colombia
{ir.lopez59, andgonza}@uniandes.edu.co

² Corporación Corpogen, Carrera 5 No. 66A-34, 110231, Bogotá D.C. – Colombia
[email protected]

³ Department of Biological Sciences, Universidad de los Andes, Carrera 1 # 18A-12, Bloque A, Bogotá, D.C. – Colombia

OBJECTIVES

General

To construct a computational tool that carries out an in silico sequence-based screening process to find sequences that, according to their structural features, may have an amphipathic structure and potential biosurfactant activity.

Specific

To develop a computational platform for in silico screening of beta-barrel proteins in metagenomes using non-homology approaches, employing bioinformatic programs, scripts and a cluster analysis approach.

To evaluate the computational platform using three metagenomic sequences from soil samples of Los Nevados National Natural Park.

To evaluate the results of the platform for the metagenomes by means of tertiary structure prediction and Molecular Dynamics simulations.

ABSTRACT

Biosurfactants have emerged as an alternative to chemical surfactants because of advantages such as biodegradability and low toxicity. Some beta-barrel proteins, such as OmpA, have been found to have biosurfactant activity, which has generated increasing interest in their discovery and characterization. Accordingly, a computational tool was constructed to identify genes and protein sequences with potential use as biosurfactants in metagenomic sequences. The tool is a platform of five modules comprising: the bioinformatic programs Glimmer MG and hmmscan, for gene prediction and for selection of sequences with certain domains, respectively; scripts written in Perl to calculate hydropathy and secondary structure profiles and to evaluate numerically the presence of structural parameters based on reference information from Escherichia coli OmpA; a k-means cluster analysis that analyzes the numerical results using reference data from OmpA-like proteins; and a Gibbs free energy of formation calculator based on group contribution methods for biocompounds. Applying the platform to three soil metagenomic sequences from Los Nevados National Natural Park led to the selection of 9 amino acid sequences, which were analyzed using DELTA-BLAST alignments, tertiary structure prediction with I-TASSER, and molecular dynamics simulations with GROMACS. Simulations of a dodecane/water biphasic system were carried out for three protein sequences; one of them, an amino acid sequence from an untreated soil metagenome (BIPV1) that belongs to the maltoporin protein family, remained stable at the dodecane/water interface.

Keywords: Biosurfactants, hmmscan, outer membrane beta-barrel superfamily, hydrophobicity, k-means cluster analysis, metagenome, group contribution method.


1. INTRODUCTION

Surface active agents or surfactants are molecules which have the property of lowering the interfacial

tension at the interfaces between phases due to their amphipathic structure [1,2]. Surfactants are applied

as detergents, wetting agents, emulsifiers, dispersing agents, and foaming agents in several fields such as

medicine, manufacturing of household products and petroleum industry, in which their use is important

in processes like oil extraction and processing, cleaning of residual oil vessels, enhancement of oil

recovery, among others [3]. One of the main features of surfactants is the formation of aggregate

structures in aqueous environments called micelles, in which the hydrophobic tails of the molecules are

protected from contact with the water. These aggregates are known to minimize the free energy of the

solution and are dynamic and dependent on several physical conditions such as temperature [4]. Most of

the surfactants in industry are derived from petroleum or chemically synthesized. In 2008, the worldwide use of surfactants was estimated at 13 million tons per annum [5], consisting mainly of chemical surfactants, principally linear alkylbenzenesulfonates (LASs) and alkyl phenol ethoxylates (APEs), which are produced synthetically. Their production may therefore incur costs related to synthesis and purification. Furthermore, surfactants are known to be highly toxic to the environment [5].

Biosurfactants have been considered as a way to decrease the use of conventional surfactants because of their capacity to reduce surface and interfacial tensions in both aqueous solutions and hydrocarbon mixtures, which makes them potential molecules for processes like microbially enhanced oil recovery (MEOR) [6]. Biosurfactants have several advantages over chemically synthesized surfactants, such as lower toxicity, higher biodegradability, better environmental compatibility, higher foaming, high


synthesized from renewable sources [4]. The term biosurfactant comprises a group of structurally diverse molecules, mainly produced by microorganisms, that are classified by their chemical structure and microbial origin [5]. The structure of biosurfactants includes a hydrophilic moiety consisting of amino acids or peptides; ions; mono-, di-, or polysaccharides; and a hydrophobic moiety consisting of hydrocarbon chains of unsaturated or saturated fatty acids. The best-known biosurfactants are glycolipids, lipopeptides and lipoproteins, phospholipids and fatty acids, polymeric surfactants and particulate surfactants.

The efforts of different disciplines such as biology, chemistry and chemical engineering are focused on

the discovery, study, production and application of these compounds; mainly of low molecular weight

biosurfactants like rhamnolipids, a type of glycolipid [4,7–10]. Several studies are focused on finding

biopolymers or high-molecular-weight molecules with emulsifying properties, and comprise experimental and computational procedures following a trial-and-error methodology. Some of these studies involve porins, which are proteins located in the outer membrane of Gram-negative bacteria and whose main feature is the presence of several β-strands forming a characteristic barrel structure. This barrel encloses a transmembrane pore that allows passive diffusion across the outer membrane. Porins give the outer

membrane a semi-permeability to small solutes below a weight of 400 Da [11]. There are many types of

porins with specific types of channels. One of these types is the Outer Membrane Protein A (OmpA).

OmpA is one of the most abundant structural proteins in the outer membrane of Escherichia coli [12]. It maintains the integrity of the E. coli outer membrane, functions as a mediator in F-dependent conjugation, interacts with solid surfaces and plays an important role as a phage receptor [13]. Like


The N-terminal domain of OmpA consists of 170 residues located within the outer membrane, forming an antiparallel β-barrel whose 8 transmembrane β-strands are connected by three short turns and four large surface-exposed hydrophilic loops that exist in the aqueous environment of the periplasm [14,15]. One of the most important features of OmpA and other porin sequences is the presence of hydrophilic residues between hydrophobic ones, which gives the protein an amphipathic behavior. This plays an important role in biofilm formation in E. coli K12 [13], and in the stabilization of dodecane/water mixtures [16]. The structures of OmpA and LamB are shown in figure 1. Note the presence of the characteristic β-barrel in both proteins.

The protein AlnA, one of the main components of the bioemulsifier Alasan produced by Acinetobacter radioresistens KA53, was found to be an OmpA-like protein, and practically all of the emulsifying activity resides in that protein [17]. In addition, the biosurfactant activity of OmpA from Escherichia coli was studied by means of molecular dynamics (MD) simulations and experimental validation, and this protein was found to stabilize dodecane/water mixtures [16]. These results encourage the search for proteins with potential use as biosurfactants.

This work focuses on the search for proteins whose structures have amphipathic features that could permit their use as surfactants. It is therefore important to find different types of proteins from different microorganisms in order to have a wide selection of proteins with characteristics similar to those of OmpA-like proteins such as E. coli OmpA and AlnA from Alasan. One of the most recognized methodologies for discovering potential products in environmental samples is

metagenomics, which allows the study of microbial communities based on the total DNA present in an


present in an ecosystem, since it does not depend on culture techniques. The main challenge of metagenomics is to determine which types of microorganisms live in a given environment and how many of them there are (taxonomic metagenomics), and what their main functions are and which enzymes and metabolic pathways they use in order to live in that environment (functional metagenomics) [19]. In functional metagenomics, the screening of novel enzymes can be performed by experimental function-based screening procedures, or by sequence-based screening approaches. The latter can be implemented experimentally and computationally. One of the most important advances for the development of in silico sequence-based screening was the improvement of Next Generation Sequencing (NGS) technologies, which can sequence DNA at unprecedented speed compared to the traditional Sanger sequencing technique [20]. These technologies include pyrosequencing (454), Illumina/Solexa, SOLiD, Ion Torrent and Pacific Biosciences, among others. The improvement of sequencing techniques has brought the development of programs and packages that process the raw data produced by the sequencer and generate assemblies of metagenomic sequences with their respective predicted (and translated) genes [21]. The information provided by metagenomic sequences is used here as the main material for the search for proteins of interest.

The aim of this work is to develop a computational platform that searches metagenomic sequences for genes that can express proteins whose sequence and structural properties allow their use as potential biosurfactants. Due to the great variety of proteins in microorganisms, the search in this work is focused on transmembrane proteins which share common structural properties with OmpA, like the β-barrel. The main basis for the screening process is the presence of amino acid sequences which may form


residues and the capacity of those residues to form a given secondary structure. Therefore, the search does not take into account a possible homology to OmpA.

The platform comprises the use of traditional bioinformatic tools for prediction and translation of genes (Glimmer) and for the search of domains and families of interest (HMMER), together with scripts for the calculation of properties (such as the hydrophobicity index and per-residue scale values for the formation of secondary structures) and for analysis aimed at finding regions of interest within a given sequence. Due to the shortage of information on proven biosurfactant proteins, the platform includes an unsupervised k-means cluster analysis in order to select those sequences which share structural similarities with OmpA. It also has a module which estimates the stability and spontaneity of the formation of a protein by calculating standard Gibbs free energy values. This estimation is carried out using a group contribution method. In this approach, the standard Gibbs free energy of formation of a protein is determined by the sum of the contributions of the chemical groups which make up the molecule. The method assigns a unique energy value to each group, which is applied to any molecule containing it [22]. This is a useful approach for calculating energies for proteins, given the size and complexity of these molecules. The group contribution method used in this module was developed by Mavrovouniotis [23,24] and allows calculation of values for biocompounds, including peptides and proteins.

The development of this platform is part of a series of works focused on the search for biocompounds with potential biosurfactant activity. These works involve several methodologies, including computational and experimental characterization of the transmembrane proteins OmpA and OmpN [16, 25], and function-based experimental screening of biosurfactants in


2. METHODOLOGY

The methodology consisted of three main parts: 1) construction of the platform, 2) evaluation of the platform using metagenomic sequences, and 3) analysis of the amino acid sequences selected by the platform. The Perl scripts developed for the platform can be found in the supplementary material.

2.1. Platform construction

The proposed platform was designed to find amino acid sequences that have some degree of similarity to OmpA and OmpN sequences at the level of structural properties related to amphipathy. The platform consists of five modules, as shown in figure 2.

2.1.1. Module 1. Glimmer MG

The first module applies the bioinformatic software Glimmer MG (Gene Locator and Interpolated Markov ModelER – MetaGenomics) [26] to assembled metagenomes that have no gene predictions. Glimmer is a collection of programs for the identification of genes in microbial DNA sequences; it uses Interpolated Markov Models (IMMs) built from a training set of genes in order to identify coding regions in new metagenomic sequences and distinguish them from noncoding DNA. The module consists of two scripts written in Python:

1) glimmer-mg.py: The main script of the software. The script makes a classification of


program uses IMMs to characterize variable-length oligonucleotides typical of a phylogenetic

grouping, and uses these models to classify sequences present in reads. Then, the resulting

sequences are clustered using an unsupervised clustering method called Scimm [28], in order to

make the final predictions within each cluster. The output is a .predict file which contains the

fasta-header line of the contig followed by information about the predicted genes.

2) extract_aa.py: This script is used to extract a multi-fasta file of amino acid sequences from the

gene predictions made by Glimmer MG. It uses the information given in the .predict file in order

to extract the predicted genes from the contig, and make the translation.

The main output of this module is a multi-fasta file of the amino acid sequences present in the metagenome according to the Glimmer MG prediction.

2.1.2. Module 2. HMMSCAN

The second module is the first screening step and focuses the search on proteins in the metagenome which belong to a particular family or clan. This selection is made using the bioinformatic suite HMMER [29] and its tool hmmscan, which selects those proteins that belong to the selected families. Hidden Markov Model files of the selected families were previously downloaded from the online database Pfam (http://pfam.xfam.org/). The selected families belong to the outer-membrane beta-barrel superfamily (Clan: MBB (CL0193)) [30]. According to Pfam, this superfamily has 54 families and 118601 domains. HMMs were downloaded for 27 families which belong to this superfamily, including the OmpA


which do not belong to the MBB. A database was created from the HMM files using the HMMER tool hmmpress, and hmmscan was then run against it. The output file consists of a table with the name of the screened sequence, its corresponding family, and data of each calculation (see table 1). In order to keep only the selected sequences, a Perl script was written which compares the initial amino acid multi-fasta file with the output table. Those sequences that were evaluated and belong to any of the families of interest were printed to a new multi-fasta file, which was then evaluated in the following module of the platform.
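
The filtering step of this module can be sketched as follows. This is a minimal illustration in Python rather than the Perl script used in the platform. It assumes that hmmscan was run with the `--tblout` option, whose table lists the matched family (target) in the first column and the screened sequence (query) in the third. The function names and the example records are hypothetical.

```python
def read_hit_names(tblout_lines):
    """Collect query sequence names from hmmscan --tblout lines.

    Assumes whitespace-separated columns with the target (family) name
    first and the query (sequence) name third; comment lines start with '#'.
    """
    hits = set()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.add(fields[2])
    return hits


def filter_fasta(fasta_lines, hit_names):
    """Return the FASTA records whose identifier is in hit_names, once each."""
    kept, seen, keep = [], set(), False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:].split()[0]
            # A sequence that hit several families is still written only once.
            keep = name in hit_names and name not in seen
            if keep:
                seen.add(name)
        if keep and line:
            kept.append(line)
    return kept


# Hypothetical example data (not taken from the metagenomes).
table = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "OmpA_membrane  -  seq1  -  1e-20  70.1  0.1",
    "OmpA           -  seq1  -  3e-15  52.3  0.0",
    "Porin_1        -  seq3  -  2e-30  99.0  0.2",
]
fasta = [">seq1 contig_7", "MKKTAIAIAV", ">seq2", "MSTNPKPQRK", ">seq3", "MKVKVLSLLV"]

selected = filter_fasta(fasta, read_hit_names(table))
print("\n".join(selected))
```

Collecting the hit names in a set is one simple way to make sure that a sequence matching more than one family appears only once in the filtered file.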

2.1.3. Module 3. Hydropathy analysis

The hydropathy analysis module involves the calculation of structural properties for each sequence and the evaluation of the presence of amphipathic regions based on these properties. The properties calculated are: the hydrophobicity index according to the Kyte-Doolittle scale [31]; amino acid scale values for the prediction of alpha helix, beta sheet, beta turn and coil conformations according to Deleage and Roux [32]; and molar fraction values (%) of amino acids which are accessible to solvent [33] (see table 2). All of these properties and their values per residue are available in the ExPASy tool ProtScale (http://web.expasy.org/protscale/). By combining the presence of given motifs with hydrophobicity indexes and solvent accessibility, and comparing this with the results for E. coli OmpA, a score matrix is generated in order to perform the second screening step. The third module is divided into two parts:

a) Calculation of the properties: This part involves the use of a Perl script. This script generates


calculating an average value of the property to be measured, and assigning the resulting value to the central residue of the window. Each property is processed in a different subroutine, but using the same window principle. Finally, each table is assigned to its corresponding sequence name and printed. An example of the output file is shown in figure 3. The script was evaluated using the sequence of OmpA from Escherichia coli and comparing the data with those provided by ProtScale.

b) Criteria selection: This part involves the use of two Perl scripts. The first script searches each sequence for the presence of five types of regions (parameters) using the output table of module 3(a). The purpose is to quantify the number of occurrences of these regions and establish a series of five numbers which characterize each sequence. The parameters to evaluate are shown in table 3; they are based on a previous analysis of the OmpA sequence of Escherichia coli, in which the table of properties was correlated with its tertiary structure. The script identifies the residues that satisfy each parameter, and then counts only the cases where the parameter holds for three or more consecutive residues. Therefore, each sequence has 5 different values, which may differ between sequences. The script was evaluated using a multi-fasta file with 94 amino acid sequences (see section 2.2.2) which belong to the 28 families selected from Pfam. Histograms showing the number of occurrences of each parameter were created and, according to the distribution, bin sizes were calculated by applying the Freedman-Diaconis rule [34] for the bin size of a histogram:

$$\text{Bin size} = 2\,\frac{\mathrm{IQR}(x)}{\sqrt[3]{n}} \qquad (1)$$

where n is the number of data points, x is the sample and IQR is the interquartile range of the data.

A scoring system was created by assigning a numeric value to each bin, from 0, which means the absence of occurrences of the parameter, to n occurrences. Histograms and bin sizes were calculated using the statistical programming environment R 3.1.2. The final output of the script is a text file with the name of each sequence and its five corresponding scores. An example of the output file can be seen in figure 4. This output is then processed by the second script, which organizes the information in a matrix in which the columns correspond to the names of the sequences and the rows to the parameter scores.
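
The counting and binning steps of module 3(b) can be illustrated with a short Python sketch. It is not the Perl and R code used in the platform: the function names, the example residue flags and the sample counts are assumptions made for illustration. The interquartile range is computed with the "inclusive" quantile method, which is assumed here to correspond to the default used by R.

```python
import statistics


def count_regions(flags, min_length=3):
    """Count runs of at least min_length consecutive True values."""
    regions, run = 0, 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run == min_length:  # count each qualifying run exactly once
            regions += 1
    return regions


def freedman_diaconis_bin_size(counts):
    """Bin size = 2 * IQR(x) / n^(1/3), the Freedman-Diaconis rule."""
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(counts) ** (1 / 3)


# Hypothetical per-residue test: does each residue satisfy one parameter?
flags = [True, True, True, False, True, True, False, True, True, True, True]
print(count_regions(flags))  # 2 regions of three or more residues

# Hypothetical occurrence counts of one parameter across eight sequences.
occurrences = [1, 2, 3, 4, 5, 6, 7, 8]
print(round(freedman_diaconis_bin_size(occurrences), 3))
```

Counting a run at the moment it reaches the minimum length means that a longer stretch of qualifying residues still contributes a single occurrence.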

2.1.4. Module 4. K-means cluster analysis

The results matrix was processed by means of a k-means cluster analysis using the program Cluster 3.0. Clustering analyses were made using uncentered correlation as the similarity metric with 1000 runs, varying the number of centers according to the amount of input data of the metagenomes used in this work. The cluster which contained the data of the eight reference sequences of OmpA from Escherichia coli was used to make the third screening step, using a script which creates a new multi-fasta file from the names of the sequences in that cluster.
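
As an illustration of this step, the following Python sketch clusters score vectors with k-means under an uncentered-correlation distance and keeps the members of the cluster that holds a reference sequence. It is a simplified stand-in for Cluster 3.0, not its implementation: the distance is taken as one minus the uncentered correlation, cluster centers are plain means, and the sequence names and scores are invented for the example.

```python
import math
import random


def uncentered_distance(a, b):
    """One minus the uncentered correlation (cosine) of two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def kmeans(vectors, k, runs=100, seed=0):
    """Run k-means several times and return the labels of the best run."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(runs):
        centers = rng.sample(vectors, k)
        labels = None
        for _ in range(100):  # cap on iterations per run
            new_labels = [
                min(range(k), key=lambda c: uncentered_distance(v, centers[c]))
                for v in vectors
            ]
            if new_labels == labels:
                break
            labels = new_labels
            for c in range(k):
                members = [v for v, lab in zip(vectors, labels) if lab == c]
                if members:
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(uncentered_distance(v, centers[lab]) for v, lab in zip(vectors, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def select_reference_cluster(names, vectors, reference, k, runs=100):
    """Names of all sequences clustered together with the reference."""
    labels = kmeans(vectors, k, runs=runs)
    target = labels[names.index(reference)]
    return [n for n, lab in zip(names, labels) if lab == target]


# Hypothetical five-score vectors; "OmpA_ref" plays the role of a reference.
names = ["OmpA_ref", "seqA", "seqB", "seqC", "seqD"]
scores = [[5, 1, 0, 0, 0], [4, 1, 0, 0, 0], [6, 1, 0, 0, 0],
          [0, 0, 1, 5, 4], [0, 0, 1, 4, 5]]
selected = select_reference_cluster(names, scores, "OmpA_ref", k=2)
print(selected)
```

Repeating the clustering from different random starts and keeping the solution with the lowest total distance mirrors the idea of using many runs, since a single k-means run can stop in a poor local optimum.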

2.1.5. Module 5. Energy Calculator

The fifth module involves the final screening step of the platform. In this module, values of free energy of formation using a group contribution method for amino acids and other biocompounds are


sequences into its residues (groups). A value of energy contribution is assigned to each group of the

sequence and then the sum of all contributions is carried out using the following expression:

$$\Delta G^{\circ}_{f,\mathrm{protein}} = \sum_{i=1}^{L} \Delta G_{aa,i} + \Delta G_{COO^-} + \Delta G_{NH_3^+} + (L-1)\,\Delta G_{pb} \qquad (2)$$

where $\Delta G^{\circ}_{f,\mathrm{protein}}$ is the standard Gibbs free energy of formation of the protein, $\Delta G_{aa,i}$ is the energy contribution of each amino acid without terminal carboxylate and ammonium groups, $\Delta G_{COO^-}$ and $\Delta G_{NH_3^+}$ are the contributions of the carboxylate and ammonium groups, respectively, $\Delta G_{pb}$ is the contribution of the peptide bond, and L is the number of amino acids which constitute the protein. The output is a text file with the name of the sequence and its Gibbs free energy of formation. This calculation was done for all the sequences of the cluster (including references). Those sequences which presented a value similar to that of OmpA were chosen with the same script used in the previous module. Table 4 shows the contribution values used to calculate the Gibbs free energies.
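
The summation performed by this module can be sketched in Python as shown below. The sketch assumes that the total is the sum of the residue contributions, plus one carboxylate and one ammonium terminal contribution, plus one peptide-bond contribution per bond (L - 1 bonds for L residues). The function name is hypothetical and the numeric values are placeholders chosen only to make the arithmetic easy to follow; they are not the contribution values of table 4.

```python
def gibbs_energy_of_formation(sequence, residue_contrib, carboxylate,
                              ammonium, peptide_bond):
    """Sum group contributions for a linear peptide or protein sequence."""
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    residues = sum(residue_contrib[aa] for aa in sequence)
    bonds = (len(sequence) - 1) * peptide_bond  # assumed: L - 1 peptide bonds
    return residues + carboxylate + ammonium + bonds


# Placeholder contributions for illustration only (NOT the values of table 4).
toy_contrib = {"A": -1.0, "G": -2.0}
energy = gibbs_energy_of_formation(
    "AGA", toy_contrib, carboxylate=-3.0, ammonium=-4.0, peptide_bond=0.5
)
print(energy)
```

Passing the contribution tables as arguments keeps the calculation independent of any particular set of group values.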

2.2. Evaluation of the Platform

2.2.1. Metagenomes

The test of the platform was carried out using three assembled metagenomes from soil samples collected

in Los Nevados National Natural Park, in Colombia. The first metagenome was obtained from paramo

soil (Code: BIPV1), the second from potato-growing and livestock zone (Code: BICV2), and the third


seen in table 5. Soil samples of these metagenomes were collected on the same property. The MetBAA metagenome was sequenced using Illumina, and assembly, gene prediction and translation were performed with the Geneious suite, giving an amino acid multi-fasta file of 479015 sequences. It was processed by the platform starting from module 2. Metagenomic screening was carried out in a Sun Grid Engine cluster computing environment, on a machine with 24 cores and 128 GB of RAM, with a total of 75 h of processing time.

2.2.2. Reference sequences.

In order to make a preliminary evaluation of the modules, a transmembrane data set of 94 amino acid sequences was created. Sequences were collected from GenPept (63) and UniProt (31) and belong to the 28 families selected from the MBB clan of Pfam, with 16 reference sequences of OmpA (Family: OmpA_membrane) and 5 of OmpN (Family: Porin_1). These sequences were added to the filtered multi-fasta file from module 2 to serve as references in the following modules. Eight of the 16 OmpA sequences belong to different strains of Escherichia coli and were used in module 4 as references for k-means clustering and in module 5 as references for the value of the Gibbs free energy of formation. Evaluation using the transmembrane data set was carried out on a DELL Inspiron 1464 personal computer with 4 GB of RAM and a processor

2.3. Analyses of selected amino acid sequences.

2.3.1. Prediction of tertiary structure

The sequences selected by the platform were submitted to the I-TASSER [35] server, an online platform for protein structure and function prediction. In this program, three-dimensional models are built based on multiple-threading alignments and iterative template fragment assembly simulations. Visualization and modification of the coordinates of the models were carried out using the molecular modeling program Chimera 1.10 [36].

2.3.2. Molecular Dynamics simulations

The purpose of these simulations is to determine the stability of the candidate proteins in a dodecane/water bilayer system. This stability is one of the main features to consider for surfactant activity. Simulations were carried out using the GROMACS [37] package (version 5.0.1 for simulation box construction, and version 4.6.1 for energy minimization, system equilibration, MD and subsequent analyses) [38], using the united-atom force field GROMOS96 53a6 [37,38], whose parameterization is based on free enthalpies of hydration and non-polar solvation, which play an important role in protein folding [25].

Simulations were carried out in rectangular boxes with box edges at least 1 nm apart from the protein

surface; with molecules of dodecane, whose coordinates and topology were obtained from the

Automatic Topology Builder tool of the Molecular Dynamics group of the University of Queensland

[25], and single point charge (SPC216) water molecules. According to the modified position made in


regions of the protein, respectively. Specific data for the construction of the boxes are given in the Results and Discussion section.

Once the boxes were constructed, energy minimization was performed using the steepest descent algorithm until the maximum force was less than . System equilibration was then performed in order to stabilize the temperature and pressure of the system. This equilibration was performed first under an NVT ensemble for 200 ps with a time step of 2 fs at 300 K, and then under an NPT ensemble for a further 2 ps with the same time step. MD simulations were then performed for 5 ns with a time step of 2 fs. Output trajectories of the MD simulations were used to analyze the stability of the system.

For this, calculations were made of the root mean square deviation (RMSD) of the system and its components in order to measure the average distance between the atoms of a protein, the radius of gyration (RGYR), which gives information on its folding stability, and the solvent accessible surface area (SASA) of the protein during the simulation time. Visualization of the simulation box construction and


3. RESULTS AND DISCUSSION

3.1. Prediction and translation with Glimmer MG

The implementation of Glimmer MG for the BIPV1 metagenome allowed the prediction and translation of 2 227 616 amino acid sequences from 2 056 250 contigs, and for the BICV2 metagenome, of 1 660 101 amino acid sequences from 1 649 203 contigs. Overall, the process gave 1.083 and 1.007 predicted amino acid sequences per contig for the BIPV1 and BICV2 metagenomes, respectively.
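
The reported rates follow directly from the counts given above, as the short Python check below shows. The function and variable names are chosen only for this illustration.

```python
def sequences_per_contig(n_sequences, n_contigs):
    """Average number of predicted amino acid sequences per contig."""
    return n_sequences / n_contigs


# (predicted amino acid sequences, contigs) as reported for each metagenome.
counts = {
    "BIPV1": (2_227_616, 2_056_250),
    "BICV2": (1_660_101, 1_649_203),
}
rates = {name: round(sequences_per_contig(*c), 3) for name, c in counts.items()}
print(rates)  # {'BIPV1': 1.083, 'BICV2': 1.007}
```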

Glimmer MG is characterized by its highly accurate predictions in whole genomes. This accuracy is founded on its two main processes of classification and clustering: Glimmer MG classifies the sequences using a phylogenetic classifier and trains models using the results; subsequently, by means of an unsupervised clustering approach, it allows retraining of the prediction models on the sequences themselves. This novel approach makes Glimmer MG an alternative to prediction programs based on GC content.

3.2. Selection of families with HMMSCAN

The implementation of the hmmscan module included the MetBAA metagenome, previously predicted and translated (see section 2.2.1). The first screening process led to the selection of 7127 sequences (0.324% of total sequences) which belong to the MBB clan from the BIPV1 metagenome, 5055 (0.304%) from the BICV2 metagenome, and 1219 (0.254%) from the MetBAA metagenome. Figure 5 shows the distribution of proteins for each Pfam family selected. In all three metagenomes, it was found that families with


of phenol degradation family. Moreover, the overall distribution of selected families was similar. This distribution of transmembrane proteins in the three ecosystems can be associated with the taxonomy of the metagenomes. Taxonomic analysis of metagenomes BIPV1 and BICV2 showed similar taxonomic profiles for the Bacteria domain, with an important presence of the phyla Proteobacteria (64.59 and 68.57%, respectively), Acidobacteria (17.39 and 8.92%), Actinobacteria (5.71 and 6.41%), and Bacteroidetes (6.17 and 10.95%) [42]. Despite the reduction of metabolic processes due to the effect of human activities on the BICV2 soil sample, these activities do not seem to affect the relative abundance of transmembrane proteins obtained for this metagenome.

According to Smith et al. [43], OmpA is an abundant protein and a predominant antigen in the enterobacterial outer membrane, with a copy number of approximately 100000 proteins per cell. The most important feature of the OmpA domain is the presence of a beta/alpha/beta/alpha/beta structure found in the C-terminal region of outer membrane proteins (like OmpA from Escherichia coli) and MotB proton channels. The N-terminal region is variable and in some cases corresponds to the OmpA-like transmembrane family (β-barrel structure). The OmpA domain accounted for 19.6% in BIPV1 and BICV2, whereas the OmpA-like transmembrane domain (OmpA_membrane) accounted for 4.1% in BIPV1 and 3.7% in BICV2. MetBAA presented 20% OmpA domain and 4% OmpA_membrane, showing that there is no significant effect of farming and altitude on the abundance of proteins with these domains. In the three metagenomes, it was observed that only about one in five proteins that had the OmpA domain also had the OmpA-like transmembrane domain. This agrees with the fact that proteins with the OmpA domain can in some cases have a beta-barrel structure in their N-terminal domain. Likewise, evaluation of 14 OmpA sequences from the transmembrane data set with hmmscan showed that the OmpA sequence from Riemerella anatipestifer (Phylum: Bacteroidetes) only hit the OmpA domain, and did not have any alignment with the OmpA-like transmembrane domain. This may indicate that several amino acid sequences in the


metagenome could have the same hits as OmpA of R. anatipestifer. On the other hand, there were cases in which proteins that matched the OprF family also matched the OmpA domain. OprF-like proteins from Pseudomonas aeruginosa are considered orthologs of OmpA, with significant amino acid similarity in their C-terminal domains [44].

The OmpA/OmpA_membrane/OprF case was one where proteins hit more than one HMM. This situation was also observed in proteins that belong to the general bacterial porin families (Porin_1 and Porin_4). These two families generally describe proteins with porin features, such as OmpN. The main difference between Porin_1 and Porin_4 is the length of the hidden Markov model, 337 for Porin_1 and 311 for Porin_4, which may explain the difference in hits between them. In cases where a protein hit more than one family, the Perl screening script de-replicates these cases so that only one copy of each sequence is printed in the new multi-fasta file. It is important to take into account that the final proteins resulting from the whole screening will not necessarily belong to the most abundant families shown in this module.

3.3. Calculation and analysis of structural properties.

In order to validate the results of the calculation script of module 3(a), the method applied to calculate values per residue was compared to the method used in the ExPASy tool ProtScale [45]. Figure 6 shows the comparison between the profiles given by the two programs, using as input the protein sequence of OmpA from Escherichia coli APEC O1 (Accession: ABJ00366). The profile produced by the Perl script is identical to the one obtained using ProtScale. The established configuration of the ProtScale algorithm consists of a sliding window which calculates the value of the central amino acid using a


100% and the window edge relative weight, which is 100% [45]. The Perl script of module 3(a) applies this configuration to calculate the property values for multiple sequences. The ProtScale server generates a profile and a value table for one property of interest using just one sequence as input, whilst the Perl script generates tables including several properties from a multifasta file. This is an important advantage, since it allows all the properties of interest to be calculated for the amino acid sequences in the metagenome. Different window sizes were tested. It was found that a sliding window of 7 residues is a suitable size for large proteins because it can characterize regions of 5 to 15 residues without the noise characteristic of smaller window sizes or the underestimation that can appear with larger window sizes.
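As an illustration of the sliding-window calculation, the following Python sketch averages a per-residue scale over a window of 7 and assigns the result to the central residue. It assumes that an edge weight of 100% means all positions in the window are weighted equally. The function name `window_profile` and the short test sequence are hypothetical; the scale is the Kyte-Doolittle hydropathy index written with one-letter residue codes, and this is not the Perl script of module 3(a).

```python
# Kyte-Doolittle hydropathy index per residue (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(sequence, scale, window=7):
    """Return (position, value) pairs for every full window.

    The value is the unweighted mean of the scale over the window
    (assumption: edge weight 100% = uniform weights), reported at the
    1-based position of the central residue.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    for center in range(half, len(sequence) - half):
        residues = sequence[center - half:center + half + 1]
        mean_value = sum(scale[res] for res in residues) / window
        profile.append((center + 1, mean_value))
    return profile


# Hypothetical 9-residue sequence: yields values for positions 4, 5 and 6.
profile = window_profile("AAAIAAAGG", KYTE_DOOLITTLE)
```

Residues closer than three positions to either end have no full window and are therefore not reported, which is one common convention for this kind of profile.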

The purpose of module 3(b) was to observe the correlation between the secondary structures present in the sequences and the hydrophobicity of the structure, with emphasis on the presence of hydrophobic β-strands in the β-barrel structure and hydrophilic, solvent-accessible regions (such as β-turns), which are structural characteristics of OmpA-like proteins. Additionally, the presence of other regions that may appear in the sequences was evaluated, such as hydrophobic alpha helices and coils, which are features of some transmembrane proteins [46]. The selection ranges for the regions (see table 3) were chosen based on a previous analysis of the tertiary structure of OmpA, with data calculated using a sliding window of 7. An initial data analyzer counted the number of occurrences of each parameter. Figure 7 shows the histograms of the five parameters, using the transmembrane data set as a test input. Due to the variability in the number of occurrences of parameters such as solvent-accessible hydrophilic residues and hydrophilic beta turns, and to the shape of the data distribution, a scoring system by intervals was designed, based on bin sizes calculated for each histogram. Applying the Freedman-Diaconis


calculations and the scoring system are shown in table 6. Three parameters showed a bin size of approximately one; therefore, their score was almost the same as the number of occurrences of the corresponding regions. The purpose of this scoring system was to homogenize the parameter values, which play an important role in the k-means cluster analysis.
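The Freedman-Diaconis rule sets the histogram bin width as h = 2·IQR·n^(-1/3), where IQR is the interquartile range and n the number of observations. The Python sketch below computes this width and converts an occurrence count into an interval score. The mapping from count to score (the index of the bin containing the count) is an assumption made for illustration, as are the function names and the example counts; the actual intervals are those of table 6.

```python
import statistics


def freedman_diaconis_bin_width(values):
    """Bin width h = 2 * IQR / n^(1/3) (Freedman-Diaconis rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / len(values) ** (1.0 / 3.0)


def interval_score(count, bin_width):
    """Hypothetical scoring: index of the bin that contains the count."""
    return int(count // bin_width)


# Hypothetical occurrence counts of one parameter across eight sequences.
occurrences = [1, 2, 3, 4, 5, 6, 7, 8]
width = freedman_diaconis_bin_width(occurrences)
scores = [interval_score(count, width) for count in occurrences]
```

Under this assumed mapping, a bin width of one gives a score equal to the raw count, which is consistent with the observation that parameters with a bin size of approximately one keep almost the same values.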

3.4. K-means cluster analysis

Calculations and scores were obtained for the metagenomic sequences screened in the hmmscan module, together with the transmembrane data set, in order to carry out the grouping. The purpose was to select those sequences that were clustered with OmpA sequences. Taking into account the homogenized continuous values obtained from module 3(b), uncentered correlation was chosen as the similarity metric. It is a measure based on the Pearson correlation, with the difference that uncentered correlation assumes the mean of each series to be 0, even if it is not. Therefore, two vectors with identical shape, but offset from each other by a fixed value, will have a Pearson coefficient of 1 but an uncentered correlation coefficient different from 1 [47].
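The difference between the two coefficients can be checked numerically. The short Python sketch below implements both from their standard definitions; the vectors are hypothetical and chosen so that one is the other shifted by a constant.

```python
import math


def pearson(x, y):
    """Pearson correlation: deviations are taken from each vector's mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den


def uncentered(x, y):
    """Uncentered correlation: same formula with both means fixed at 0."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [value + 10.0 for value in x]   # same shape, offset by a fixed value

r_pearson = pearson(x, y)
r_uncentered = uncentered(x, y)
```

For these vectors the Pearson coefficient equals 1 (up to rounding), while the uncentered coefficient is 205/√(55·855), approximately 0.945, so the offset is penalized only by the uncentered measure.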

The difference in size between the metagenomes determined the number of amino acid sequences obtained in modules 1 and 2. Because of this, the number of k centers could not be the same in the three cases. The number of k centers was therefore chosen with the aim of obtaining clusters of at most 100 data points, including data from the reference sequences. The BIPV1 k-means cluster analysis was carried out with 100 centers, while the BICV2 and MetBAA clusterings were made with 75 and 20 centers, respectively. Table 7 shows the results of the k-means cluster analysis for the three metagenomes. All data from the OmpA reference sequences were grouped in the same cluster.
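A minimal sketch of k-means with an uncentered-correlation distance is given below in Python. It is not the implementation used in module 4: the distance is taken as 1 minus the uncentered correlation, centroids are arithmetic means of cluster members, the initial centers are supplied by the caller, and all vectors are assumed to be non-zero. The function names and the two-dimensional example points are hypothetical.

```python
import math


def uncentered_distance(x, y):
    """Distance = 1 - uncentered correlation (vectors must be non-zero)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - num / den


def kmeans(points, initial_centers, max_iter=100):
    """Assign each point to its nearest center, then update centers.

    Stops when the assignment no longer changes or after max_iter rounds.
    """
    centers = [list(center) for center in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda k: uncentered_distance(point, centers[k]))
            for point in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical score vectors: two point along the first axis, two along the second.
points = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
labels, centers = kmeans(points, [points[0], points[2]])
```

Sequences would then be selected by taking every member of the cluster that contains the OmpA reference vectors.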


In the BIPV1 case, module 4 retained 92 of the 7127 sequences selected in the hmmscan module, which represents 1.3% of the multifasta file obtained from module 2. These sequences were clustered with the OmpA references and one reference from the OprD family, which corresponds to a chitoporin, a sugar-specific channel of Escherichia coli, to form a cluster of 101 data points. In the BICV2 metagenome, the cluster analysis retained 73 sequences from 5055 (1.44%), which were clustered with the same reference sequences as in the BIPV1 case, giving a cluster of 82 data points. The MetBAA results were different: 56 sequences were retained from 1219 (4.59%), a higher percentage than in BIPV1 and BICV2. Likewise, these sequences were clustered with the same nine references as in BIPV1 and BICV2, plus 10 more reference sequences, including two OmpA sequences from Cronobacter sakazakii and Pantoea ananatis (Family: Enterobacteriaceae).
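The percentages and cluster sizes quoted above follow directly from the counts, as the short Python check below shows. The number of references for MetBAA (19) is derived here from "nine references plus 10 more"; the resulting MetBAA cluster size is therefore a derived figure, not one reported in the text.

```python
# (sequences retained by module 4, sequences from module 2, references in cluster)
results = {
    "BIPV1": (92, 7127, 9),
    "BICV2": (73, 5055, 9),
    "MetBAA": (56, 1219, 19),  # 9 shared references + 10 more (derived)
}


def summarize(retained, total, references):
    """Return the retained percentage and the total size of the OmpA cluster."""
    percent = 100.0 * retained / total
    cluster_size = retained + references
    return percent, cluster_size


summary = {name: summarize(*values) for name, values in results.items()}
```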

3.5. Gibbs free energies calculation

The fifth module, and third screening step, calculated the Gibbs free energy of formation of the proteins selected in module 4. The selection criterion was a value similar to or more negative than that of OmpA from Escherichia coli K12, which is -89161.88 kJ/mol. An important relation was found between these values and the size of the protein; thus, the module selected sequences with a size similar to that of OmpA-like proteins. OmpA sequences from the transmembrane data set were included in the multifasta file from module 4 in order to make direct comparisons. According to the free energy value, 5 proteins were selected for the BIPV1 metagenome, 3 for the BICV2 metagenome, and 1 for the MetBAA metagenome. Screened


corresponding metagenomes. The values are shown in table 9, together with the DELTA-BLAST results [48]. The Gibbs free energy of formation describes the energetic change that accompanies the formation of these proteins. The importance of this module in the platform lies in the possibility of choosing those proteins whose energetic change is similar to or more negative than that of OmpA.
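The selection criterion can be written as a simple filter. In the Python sketch below, the reference value is the one quoted for OmpA from Escherichia coli K12; how close a value must be to count as "similar" is not quantified in the text, so the `tolerance` parameter (a fraction of the reference) is an assumption, and the candidate names and energies are hypothetical.

```python
OMPA_K12_DG = -89161.88  # kJ/mol, Gibbs free energy of formation quoted for OmpA


def select_by_free_energy(candidates, reference=OMPA_K12_DG, tolerance=0.05):
    """Keep candidates with a value similar to or more negative than the reference.

    `tolerance` is an assumed definition of "similar": values up to this
    fraction less negative than the reference are still accepted.
    """
    threshold = reference * (1.0 - tolerance)  # slightly less negative than reference
    return [name for name, dg in candidates.items() if dg <= threshold]


# Hypothetical candidates (kJ/mol), for illustration only.
candidates = {"protA": -95000.0, "protB": -86000.0, "protC": -40000.0}
selected = select_by_free_energy(candidates)
```

Because the free energy of formation scales with protein size, a filter of this kind also acts indirectly as a size filter, in line with the relation noted in this section.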

Knowledge of the thermodynamic behavior of amino acids in aqueous and organic phases is important to understand the behavior of a protein in biphasic systems. Although Gibbs free energies of formation give important information about the stability and formation of a protein at standard conditions, they give no information about its stability in a liquid/liquid biphasic system such as dodecane/water at given conditions. Because reported data on the solvation free energy of amino acids are limited [49], it is important to consider methodologies such as molecular dynamics to determine the energies of amino acids and peptides at an interface and thus estimate the energetic contributions of their chemical groups.

3.6. Post-screening analyses

3.6.1. DELTA-BLAST results

DELTA-BLAST alignment of the nine selected sequences against non-redundant databases shows relatively high percentages of coverage but low percentages of identity, which can presumably be associated with the quality of the predicted sequences (cut or interrupted genes in the assembly process) or with the presence of


were related to the Pfam family associated with the query sequence. In the case of the OmpA domain, all the best hits were related to MotB-like proteins.

3.6.2. Tertiary structure prediction

It is important to evaluate the sequences selected in module 5 in order to confirm or discard the results given by the platform. Figure 8 shows the nine three-dimensional structures of the amino acid sequences obtained with I-TASSER and their corresponding hydrophobic surfaces visualized with Chimera 1.10. According to the predicted structure, the proteins were classified into three types: 1) proteins with solely a β-barrel structure, 2) proteins with a β-barrel structure and other domains, and 3) proteins without a β-barrel structure. Type 1 consisted of proteins BIPV12, BIPV13 and BIPV14, which belonged to the lipid A 3-O-deacylase, maltoporin and putative MetA-pathway phenol degradation protein families, respectively. These proteins are characterized by a high number of β-strands in their barrel structure and long loops which contribute to the transport of substances. Type 2 was formed by proteins BICV21, BICV22, BICV23 and MetBAA1, which belonged to the surface antigen, FadL outer membrane transport and putative MetA-pathway phenol degradation protein families, and the autotransporter domain, respectively. Type 3 corresponded to proteins BIPV11 and BIPV15, which have the Pfam OmpA domain but do not belong to the OmpA-like transmembrane protein family. According to the DELTA-BLAST results, the amino acid sequences of these proteins have a certain homology with a flagellar motor protein (MotB) of Geobacter metallireducens, a bacterial species from the phylum Proteobacteria which has the ability to oxidize organic compounds, metals and radioactive elements


channels like MotB. The flagellar motor protein MotB contains 308 residues and consists of a short N-terminal cytoplasmic domain, a single membrane-spanning helix, and a large periplasmic domain. In complex with the MotA protein, it forms transmembrane channels for proton transport across the membrane, contributing to the rotation of flagella in bacteria [51]. Finally, one protein of each type was selected for MD simulations, according to the hydrophobic surface shown in the visualization with Chimera 1.10: BIPV13 (maltoporin), BICV23 (MetA-pathway phenol degradation protein) and BIPV15 (flagellar motor protein).

3.6.3. Molecular Dynamics simulations

Prior to the MD simulations, the PDB files of the three proteins were modified: in order to perform a layered solvation, the coordinates of the protein structures were changed so that the hydrophobic and hydrophilic zones could be differentiated. This differentiation in turn allowed the dimensions of the boxes to be established. In all simulation boxes, the hydrophobic zone was solvated by a layer of dodecane molecules at the top, and the hydrophilic zone was solvated by water molecules at the bottom. Table 10 shows the dimensions of each box and its layers. Figure 9 shows the simulation boxes visualized using VMD.

• BIPV13 simulation

The simulation of the BIPV13 protein in the dodecane/water bilayer system was carried out with 468 dodecane molecules and 6374 water molecules. During the 5-ns simulation, BIPV13 maintained its position in


barrel structure, without any impact on the position of the protein (see figure 10). Figure 11 shows the variation of the RMSD of the system and its components during the simulation time. The system reached stability after 1 ns, with an RMSD of 4.09 ± 0.01 nm for the last 4 ns. Dodecane and water made important contributions to the RMSD of the system, with values of 4.19 ± 0.01 and 4.64 ± 0.01 nm, respectively. On the other hand, the RMSD of the BIPV13 protein was low, 0.428 ± 0.003 nm, and had no effect on the RMSD of the system. The radius of gyration (RGYR) is a measure of the compactness of the protein in the system [25]. The BIPV13 protein had a value of 2.35 ± 0.04 nm. Figure 12 shows the variation of the radius of gyration during the 5-ns simulation. The trend line (in blue) shows a steady, but not significant, increase during the first 4 ns of the simulation. After the fourth nanosecond there is an increase that can be associated with the opening of the β-barrel. Figure 13 shows the variation of the hydrophobic and hydrophilic areas, which remain almost constant throughout the simulation, with values of 145.06 ± 0.09 and 85.09 ± 0.06 nm², respectively. The results show that the BIPV13 protein is highly stable at the interface and can be considered a biosurfactant candidate, although its capacity to form agglomerates still needs to be assessed.
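For reference, the two structural measures used in this section can be written from their standard definitions. The Python sketch below computes the RMSD between two sets of coordinates and the mass-weighted radius of gyration. It assumes that the two structures have already been superimposed and that atoms are listed in the same order; it is a definition-level illustration with hypothetical coordinates, not the analysis tool used for the reported values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned coordinate sets (nm)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def radius_of_gyration(coords, masses):
    """Mass-weighted RMS distance of the atoms from the center of mass (nm)."""
    total_mass = sum(masses)
    center = [
        sum(m * c[i] for m, c in zip(masses, coords)) / total_mass
        for i in range(3)
    ]
    squared = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(squared / total_mass)


# Hypothetical two-atom examples with coordinates in nm.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]      # every atom displaced 1 nm in z
deviation = rmsd(reference, shifted)
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [12.0, 12.0])
```

A larger radius of gyration at constant mass indicates a less compact structure, which is why an increase in this value is read above as a possible opening of the β-barrel.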

• BICV23 simulation

The simulation of the BICV23 protein in the dodecane/water bilayer system was carried out with 319 dodecane molecules and 5819 water molecules. Figure 14 shows the behavior of the protein during the 5-ns simulation. The most important feature of this visualization is the gradual change of position that BICV23


dodecane molecules, making the structure turn sideways and descend towards the hydrophilic layer. At 5 ns, only the loops of the upper region are solvated by dodecane molecules.

Figure 15 shows the RMSD values for the system and its components. The system reaches stability after 1.5 ns, with a value of 3.752 ± 0.009 nm over the following 3.5 ns and larger contributions from dodecane and water, with values of 3.65 ± 0.01 and 4.30 ± 0.01 nm, respectively. The BICV23 protein has an RMSD of 0.364 ± 0.003 nm, which is low despite the behavior of the protein in the bilayer system. The radius of gyration of the protein did not vary much during the simulation, with an average value of 2.0924 ± 0.0004 nm (see figure 16). This value indicates that BICV23 remains compact despite its turn. Figure 17 shows the results of the SASA calculation; there is a slight increase in the hydrophilic area after 1.6 ns, which can be related to the movement of BICV23 towards the aqueous layer. The mean values of the hydrophobic and hydrophilic areas were 110.15 ± 0.06 and 72.12 ± 0.07 nm², respectively. These results show that BICV23 does not remain stable at the interface and tends to migrate to the aqueous phase. Thus, the presence of an amphipathic structure according to the platform does not necessarily imply biosurfactant activity.

• BIPV15 simulation

The simulation of the BIPV15 protein in the dodecane/water bilayer system was carried out with 476 dodecane molecules and 7791 water molecules. Figure 18 shows the 5-ns simulation of the protein in the bilayer system. In this case, the BIPV15 protein tends to turn moderately sideways. The upper region remains solvated by dodecane during the whole simulation time. The bottom of the protein, which is in the aqueous phase,


Figure 19 shows the RMSD values of the system and its components. The system stabilized after 1.5 ns, with a value of 4.14 ± 0.01 nm for the last 3.5 ns. As in the previous cases, the contributions of the solvents were important, with values of 4.03 ± 0.01 nm for dodecane and 4.70 ± 0.01 nm for water. BIPV15 had an RMSD of 0.492 ± 0.005 nm. The radius of gyration of BIPV15 decreased moderately during the first 3 ns and then increased until the fifth nanosecond, as shown in figure 20; this can be associated with the aforementioned movements of both regions of the protein. The average radius of gyration for BIPV15 was 2.351 ± 0.001 nm. Figure 21 shows the variations of the hydrophobic and hydrophilic areas in BIPV15; there was a decrease in the accessible areas during the first half of the simulation, associated with the movements of the protein in both phases. The calculated areas for BIPV15 were 129.95 ± 0.09 nm² for the hydrophobic area and 89.84 ± 0.06 nm² for the hydrophilic area. The results show that BIPV15, despite the movement it presents, manages to remain at the interface. However, it is necessary to observe the behavior of the protein over longer simulation times in order to corroborate this stability.
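Values such as "4.14 ± 0.01 nm for the last 3.5 ns" are averages over the part of the trajectory that follows stabilization. The Python sketch below shows one way to compute such a summary from a time series. Whether the reported uncertainty is a standard error or a standard deviation is not stated in the text, so the use of the standard error of the mean here is an assumption, and the function name and time series are hypothetical.

```python
import math
import statistics


def plateau_statistics(times_ns, values, t_equil_ns):
    """Mean and standard error of `values` for frames at or after t_equil_ns.

    Assumption: the uncertainty is the standard error of the mean,
    treating frames as independent samples.
    """
    plateau = [v for t, v in zip(times_ns, values) if t >= t_equil_ns]
    mean = statistics.mean(plateau)
    sem = statistics.stdev(plateau) / math.sqrt(len(plateau))
    return mean, sem, len(plateau)


# Hypothetical RMSD series (nm) sampled once per nanosecond.
times_ns = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
rmsd_nm = [0.5, 3.0, 4.0, 4.2, 3.8, 4.0]
mean, sem, n_frames = plateau_statistics(times_ns, rmsd_nm, t_equil_ns=1.5)
```

Consecutive MD frames are correlated, so a standard error computed this way tends to underestimate the true uncertainty; block averaging is a common remedy.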

3.7. Context of the platform in search of biosurfactants

As mentioned previously, this platform is part of a series of works focused on the search for biosurfactants using different approaches. It is important to place the results given by the platform on an experimental basis. A high-throughput function-based screening of a fosmid library of MetBAA was carried out in the research group. Using a method based on optical distortions in 96-well microtitre plates [52], 18 positive clones were found among the 18432 clones of the MetBAA library.


characterization of them. The goal is to find proteins or peptides with biosurfactant activity in these positive clones, which can be related to the results obtained using the platform.

4. CONCLUSIONS AND PERSPECTIVES

A computational tool was developed to find, in metagenomic sequences, amino acid sequences with structural features that may imply amphipathic behavior and potential surfactant activity. The platform included bioinformatic programs for gene prediction and translation and for protein family search; Perl scripts to calculate structural and hydropathy properties in order to establish a scoring system based on the structure of OmpA from Escherichia coli; a k-means cluster analysis that selected those sequences sharing a scoring pattern similar to that of OmpA; and a calculator of Gibbs free energy of formation. Applying this platform to find transmembrane-like proteins in three soil metagenomes from Los Nevados National Natural Park resulted in the selection of 9 potential sequences: 5 from the paramo soil metagenome (BIPV1), 3 from the agriculturally treated soil metagenome (BICV2) and 1 from the high Andean forest soil metagenome (MetBAA). Molecular dynamics simulations were carried out for three of these sequences, BIPV13, BICV23 and BIPV15, named according to their corresponding metagenome. The results showed that protein BIPV13, a maltoporin, could stabilize itself at a dodecane/water interface despite some displacements within its structure during the 5-ns simulation.

The relative abundance of transmembrane proteins in the three metagenomes showed almost the same distribution, suggesting a non-significant effect of the ecosystem on the relative abundance of the proteins,


The scoring matrix and k-means cluster analysis were established as the main approaches to find transmembrane proteins, complemented by the group contribution method, which gave information about the formation of these proteins. The next step is the development of a group contribution method that can estimate the thermodynamic stability of a protein in a dodecane/water system. Such a method would give a strong criterion to the screening process in general and would be a key tool to extend the application of this platform to all types of proteins. New biosurfactant proteins with different domains and motifs could potentially be used as references in the platform to find even more proteins with suitable structural characteristics.

This work also contemplates experimental validation, which consists of function-based screening methods and the search for genes of interest using molecular biology procedures such as PCR, cloning, transformation, heterologous expression and characterization of the proteins of interest. Likewise, sequence-based screening approaches such as primers and probes can be used to obtain the in silico selected proteins directly from DNA.

REFERENCES

[1] K. K. Sekhon, S. Khanna, and S. S. Cameotra, “Enhanced biosurfactant production through cloning of three genes and role of esterase in biosurfactant release.,” Microb. Cell Fact., vol. 10, no. 1, pp. 1–49, Jan. 2011.

[2] R. Marchant and I. M. Banat, “Microbial biosurfactants: challenges and opportunities for future exploitation.,” Trends Biotechnol., vol. 30, no. 11, pp. 558–65, Nov. 2012.

[3] D. K. F. Santos, R. D. Rufino, J. M. Luna, V. a. Santos, A. a. Salgueiro, and L. a. Sarubbo, “Synthesis and evaluation of biosurfactant produced by Candida lipolytica using animal fat and corn steep liquor,” J. Pet. Sci. Eng., vol. 105, pp. 43–50, May 2013.


[4] J. D. Desai and I. M. Banat, “Microbial production of surfactants and their commercial potential.,” Microbiol. Mol. Biol. Rev., vol. 61, no. 1, pp. 47–64, Mar. 1997.

[5] I. M. Banat, A. Franzetti, I. Gandolfi, G. Bestetti, M. G. Martinotti, L. Fracchia, T. J. Smyth, and R. Marchant, “Microbial biosurfactants production, applications and future potential.,” Appl. Microbiol. Biotechnol., vol. 87, no. 2, pp. 427–44, Jun. 2010.

[6] A. K. Sarkar, J. C. Goursaud, M. M. Sharma, and G. Georgiou, “A Critical Evaluation of MEOR Processes,” In Situ, vol. 13, no. 4, pp. 207–238, 1989.

[7] C. N. Mulligan, “Environmental applications for biosurfactants.,” Environ. Pollut., vol. 133, no. 2, pp. 183–98, Jan. 2005.

[8] T. T. Nguyen, N. H. Youssef, M. J. McInerney, and D. a Sabatini, “Rhamnolipid biosurfactant mixtures for environmental remediation.,” Water Res., vol. 42, no. 6–7, pp. 1735–43, Mar. 2008.

[9] L.-M. Whang, P.-W. G. Liu, C.-C. Ma, and S.-S. Cheng, “Application of biosurfactants, rhamnolipid, and surfactin, for enhanced biodegradation of diesel-contaminated water and soil.,” J. Hazard. Mater., vol. 151, no. 1, pp. 155–63, Feb. 2008.

[10] M. Aguirre-Ramírez, G. Medina, A. González-Valdez, V. Grosso-Becerra, and G. Soberón-Chávez, “The Pseudomonas aeruginosa rmlBDAC operon, encoding dTDP-L-rhamnose biosynthetic enzymes, is regulated by the quorum-sensing transcriptional regulator RhlR and the alternative sigma factor σS.,” Microbiology, vol. 158, no. Pt 4, pp. 908–16, Apr. 2012.

[11] M. Winterhalter, C. Hilty, S. M. Bezrukov, C. Nardin, and W. Meier, “Controlling membrane permeability with bacterial porins : application to encapsulated enzymes,” vol. 55, pp. 965–971, 2001.

[12] H. Wang, K. K. Andersen, B. S. Vad, and D. E. Otzen, “OmpA can form folded and unfolded oligomers.,” Biochim. Biophys. Acta, vol. 1834, no. 1, pp. 127–36, Jan. 2013.

[13] A. González, R. Zuo, D. Ren, and T. K. Wood, “Hha , YbaJ , and OmpA Regulate Escherichia coli K12 Biofilm Formation and Conjugation Plasmids Abolish Motility,” Biotechnol. Bioeng., vol. 1, pp. 188–200, 2005.

[14] E. Sugawara and H. Nikaido, “Pore-forming activity of OmpA protein of Escherichia coli.,” J. Biol. Chem., vol. 300, 1992.

[15] R. Koebnik, “Structural and functional roles of the surface-exposed loops of the beta-barrel membrane protein OmpA from Escherichia coli.,” J. Bacteriol., vol. 181, no. 12, pp. 3688–94, Jun. 1999.

[16] S. Aguilera, A. P. Macías, D. C. Pinto, L. Vargas, M. J. Vives-florez, H. Enrique, C. Barrera, O. A. Álvarez, and A. González, “Escherichia coli ´ s OmpA as Biosurfactant for Cosmetic Industry : Stability Analysis and Experimental Validation Based on Molecular Simulations,” Adv. Comput. Biol., vol. 232, pp. 265–271, 2014.

[17] A. Toren, E. Orr, Y. Paitan, E. Z. Ron, and E. Rosenberg, “The Active Component of the Bioemulsifier Alasan from Acinetobacter radioresistens KA53 Is an OmpA-Like Protein,” vol. 184, no. 1, pp. 165–170, 2002.


[18] J. Handelsman, “Metagenomics: Application of Genomics to Uncultured Microorganisms,” vol. 68, no. 4, 2004.

[19] T. Uchiyama and K. Miyazaki, “Functional metagenomics for enzyme discovery: challenges to efficient screening.,” Curr. Opin. Biotechnol., vol. 20, no. 6, pp. 616–22, Dec. 2009.

[20] J. Zhang, R. Chiodini, A. Badr, and G. Zhang, “The impact of next-generation sequencing on genomics.,” J. Genet. Genomics, vol. 38, no. 3, pp. 95–109, Mar. 2011.

[21] R. Logares, T. H. a Haverkamp, S. Kumar, A. Lanzén, A. J. Nederbragt, C. Quince, and H. Kauserud, “Environmental microbiology through the lens of high-throughput DNA sequencing: synopsis of current platforms and bioinformatics approaches.,” J. Microbiol. Methods, vol. 91, no. 1, pp. 106–13, Oct. 2012.

[22] A. C. Oliveira, L. F. Moura, and D. Cardoso, “Method of contribution of groups to estimate thermodynamic properties of components of biodiesel formation in liquid phase,” Fluid Phase Equilib., vol. 317, pp. 59–64, Mar. 2012.

[23] M. L. Mavrovouniotis, “Estimation of standard Gibbs energy changes of biotransformations.,” J. Biol. Chem., vol. 266, no. 22, pp. 14440–5, Aug. 1991.

[24] M. D. Jankowski, C. S. Henry, L. J. Broadbelt, and V. Hatzimanikatis, “Group contribution method for thermodynamic analysis of complex metabolic networks.,” Biophys. J., vol. 95, no. 3, pp. 1487–99, Aug. 2008.

[25] S. Aguilera, L. Achenie, and A. F. González, “Platform Implementation for Evaluation of Proteins as Biosurfactants via Molecular Dynamics Free Energy Calculations,” Universidad de los Andes, 2013.

[26] D. R. Kelley, B. Liu, A. L. Delcher, M. Pop, and S. L. Salzberg, “Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering.,” Nucleic Acids Res., vol. 40, no. 1, p. e9, Jan. 2012.

[27] A. Brady and S. Salzberg, “Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models,” Nat. Methods, vol. 6, no. 9, pp. 673–676, 2009.

[28] D. R. Kelley and S. L. Salzberg, “Clustering metagenomic sequences with interpolated Markov models.,” BMC Bioinformatics, vol. 11, no. 1, p. 544, Jan. 2010.

[29] R. D. Finn, J. Clements, and S. R. Eddy, “HMMER web server: interactive sequence similarity searching.,” Nucleic Acids Res., vol. 39, no. Web Server issue, pp. W29–37, Jul. 2011.

[30] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer, S. R. Eddy, A. Bateman, and R. D. Finn, “The Pfam protein families database.,” Nucleic Acids Res., vol. 40, no. Database issue, pp. D290–301, Jan. 2012.

[31] J. Kyte and R. F. Doolittle, “A simple method for displaying the hydropathic character of a protein.,” J. Mol. Biol., vol. 157, no. 1, pp. 105–132, May 1982.


[32] G. Deléage and B. Roux, “An algorithm for protein secondary structure prediction based on class prediction,” Protein Eng. Des. Sel., vol. 1, no. 4, pp. 289–294, 1987.

[33] J. Janin, “Surface and inside volumes in globular proteins,” Nature, no. 277, pp. 491–492, 1979.

[34] D. Freedman and P. Diaconis, “On the histogram as a density estimator: L 2 theory,” Probab. theory Relat. fields, vol. 476, pp. 453–476, 1981.

[35] Y. Zhang, “I-TASSER server for protein 3D structure prediction.,” BMC Bioinformatics, vol. 9, p. 40, Jan. 2008.

[36] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera--a visualization system for exploratory research and analysis.,” J. Comput. Chem., vol. 25, no. 13, pp. 1605–12, Oct. 2004.

[37] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl, “GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit.,” Bioinformatics, vol. 29, no. 7, pp. 845–54, Apr. 2013.

[38] J. Kerrigan, GROMACS Introductory Tutorial: Gromacs Version 4.6. New Brunswick, NJ, 2012, pp. 1–20.

[39] C. Oostenbrink, A. Villa, A. E. Mark, and W. F. van Gunsteren, “A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6.,” J. Comput. Chem., vol. 25, no. 13, pp. 1656–76, Oct. 2004.

[40] C. Oostenbrink, T. A. Soares, N. F. A. van der Vegt, and W. F. van Gunsteren, “Validation of the 53A6 GROMOS force field.,” Eur. Biophys. J., vol. 34, no. 4, pp. 273–84, Jun. 2005.

[41] W. Humphrey, A. Dalke, and K. Schulten, “VMD: visual molecular dynamics,” J. Mol. Graph., vol. 14, no. 1, pp. 33–38, 1996.

[42] M. C. Álvarez, M. M. Zambrano, S. Restrepo, J. Husserl, J. M. Gómez, and A. González, “Estudio del efecto de la compartimentalización de redes metabólicas en la predicción del comportamiento de comunidades microbianas usando FBA,” Universidad de los Andes, 2014.

[43] S. G. J. Smith, V. Mahon, M. a Lambert, and R. P. Fagan, “A molecular Swiss army knife: OmpA structure, function and expression.,” FEMS Microbiol. Lett., vol. 273, no. 1, pp. 1–11, Aug. 2007.

[44] S. Krishnan and N. Prasadarao, “Outer membrane protein A and OprF: versatile roles in Gram‐negative bacterial infections,” FEBS J., vol. 279, no. 6, pp. 919–931, 2012.

[45] M. R. Wilkins, E. Gasteiger, A. Bairoch, J. C. Sanchez, K. L. Williams, R. D. Appel, and D. F. Hochstrasser, “Protein identification and analysis tools in the ExPASy server.,” Methods Mol. Biol., vol. 112, pp. 531–52, Jan. 1999.

[46] J. M. Cuthbertson, D. a Doyle, and M. S. P. Sansom, “Transmembrane helix prediction: a comparative evaluation and analysis.,” Protein Eng. Des. Sel., vol. 18, no. 6, pp. 295–308, Jun. 2005.


[47] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci., vol. 95, no. 25, pp. 14863–14868, Dec. 1998.

[48] G. M. Boratyn, A. A. Schäffer, R. Agarwala, S. F. Altschul, D. J. Lipman, and T. L. Madden, “Domain enhanced lookup time accelerated BLAST.,” Biol. Direct, vol. 7, p. 12, Jan. 2012.

[49] J. Chang, A. Lenhoff, and S. Sandler, “Solvation free energy of amino acids and side-chain analogues,” J. Phys. …, pp. 2098–2106, 2007.

[50] R. T. Anderson, H. A. Vrionis, I. Ortiz-Bernad, C. T. Resch, P. E. Long, R. Dayvault, K. Karp, S. Marutzky, D. R. Metzler, A. Peacock, D. C. White, M. Lowe, and D. R. Lovley, “Stimulating the in situ activity of Geobacter species to remove uranium from the groundwater of a uranium-contaminated aquifer.,” Appl. Environ. Microbiol., vol. 69, no. 10, pp. 5884–91, Oct. 2003.

[51] E. R. Hosking, C. Vogt, E. P. Bakker, and M. D. Manson, “The Escherichia coli MotAB proton channel unplugged.,” J. Mol. Biol., vol. 364, no. 5, pp. 921–37, Dec. 2006.

[52] C.-Y. Chen, S. C. Baker, and R. C. Darton, “The application of a high throughput analysis method for the screening of potential biosurfactants from natural sources.,” J. Microbiol. Methods, vol. 70, no. 3, pp. 503–10, Sep. 2007.


Table 1. Pfam families selected from outer-membrane beta barrel superfamily. Clan MBB (CL0193)

No.  Family           Code     Description
1    Ail_lom          PF06316  Virulence-related outer membrane protein family
2    Autotransporter  PF03797  Autotransporter beta-domain
3    Bac_Surface_Ag   PF01103  Surface antigen family
4    Channel_Tsx      PF03502  Nucleoside-specific channel-forming protein family
5    CopB             PF05275  Copper resistance protein B family
6    KdgM             PF06178  Oligogalacturonate-specific porin protein family
7    LamB             PF02264  Maltoporins family
8    MipA             PF06629  MltA-interacting protein family
9    OmpA             PF00691  OmpA domain
10   OmpA_membrane    PF01389  OmpA-like transmembrane domain
11   Omptin           PF01278  Outer membrane protease A family
12   OmpW             PF03922  OmpW-like protein W family
13   OpcA             PF07239  Outer membrane adhesin family
14   OprB             PF04966  Carbohydrate-selective porin family
15   OprF             PF05736  OprF membrane domain
16   OprD             PF03573  Outer membrane serine type peptidase family
17   OstA_C           PF04453  Organic solvent tolerance protein family
18   PagL             PF09411  Lipid A 3-O-deacylase family
19   Phenol_MetA_deg  PF13557  Putative MetA-pathway of phenol degradation family
20   Porin_O_P        PF07396  Phosphate-selective porin O and P family
21   Porin_1          PF00267  General bacterial porin family
22   Porin_2          PF02530  Alpha subdivision of Proteobacteria porin family
23   Porin_4          PF13609  General bacterial porin family
24   Porin_OmpG       PF09381  Outer membrane porin G family
25   ShlB             PF03865  Haemolysin secretion/activation protein family
26   Toluene_X        PF03349  FadL outer membrane protein transport family
27   TraF_2           PF13729  F plasmid transfer operon protein family
28   Usher            PF00577  Fimbrial Usher protein family

Source: Pfam 27.0 (March 2013; http://pfam.xfam.org/)


Table 2. Values used for the construction of hydropathy calculator script.

Note: HI: Hydrophobicity index. SA: Solvent accessibility. AH: Alpha helix. BS: Beta strand. BT: Beta turn. C: Coil.

Residue      HI      SA     AH     BS     BT      C
Ala:      1.800   6.600  1.489  0.709  0.788  0.824
Arg:     -4.500   4.500  1.224  0.920  0.912  0.893
Asn:     -3.500   6.700  0.772  0.604  1.572  1.167
Asp:     -3.500   7.700  0.924  0.541  1.197  1.197
Cys:      2.500   0.900  0.966  1.191  0.965  0.953
Gln:     -3.500   5.200  1.164  0.840  0.997  0.947
Glu:     -3.500   5.700  1.504  0.567  1.149  0.761
Gly:     -0.400   6.700  0.510  0.657  1.860  1.251
His:     -3.200   2.500  1.003  0.863  0.970  1.068
Ile:      4.500   2.800  1.003  1.799  0.240  0.886
Leu:      3.800   4.800  1.236  1.261  0.670  0.810
Lys:     -3.900  10.300  1.172  0.721  1.302  0.897
Met:      1.900   1.000  1.363  1.210  0.436  0.810
Phe:      2.800   2.400  1.195  1.393  0.624  0.797
Pro:     -1.600   4.800  0.492  0.354  1.415  1.540
Ser:     -0.800   9.400  0.739  0.928  1.316  1.130
Thr:     -0.700   7.000  0.785  1.221  0.739  1.148
Trp:     -0.900   1.400  1.090  1.306  0.546  0.941
Tyr:     -1.300   5.100  0.787  1.266  0.795  1.109