Computational tool for
in silico
screening of biosurfactants in
metagenomic libraries
Iván Ricardo López Sandoval1, María Mercedes Zambrano Eder2, Alejandro Reyes Muñoz3, and Andrés Fernando González Barrios1
1 Grupo de Diseño de Productos y Procesos (GDPP), Department of Chemical Engineering
Universidad de los Andes. Carrera 1E No. 19A-40, Edificio Mario Laserna, Bogotá D.C. – Colombia
{ir.lopez59, andgonza}@uniandes.edu.co
2 Corporación Corpogen
Carrera 5 No. 66A-34, 110231, Bogotá D.C. - Colombia
3 Department of Biological Sciences
Universidad de los Andes, Carrera 1 # 18A-12, Bloque A, Bogotá, D.C. – Colombia
OBJECTIVES
General
To construct a computational tool for in silico sequence-based screening that finds sequences which, according to their structural features, may have an amphipathic structure and
potential biosurfactant activity.
Specific
To develop a computational platform for in silico screening of beta-barrel proteins in metagenomes using non-homology approaches, employing bioinformatic programs, scripts and a
cluster analysis approach.
To evaluate the computational platform using three metagenomic sequences from soil samples of Los Nevados National Natural Park.
To evaluate the results of the platform for the metagenomes by means of tertiary structure prediction and Molecular Dynamics simulations.
ABSTRACT
Biosurfactants have emerged as an alternative to chemical surfactants because of advantages such as
biodegradability and low toxicity. Some beta-barrel proteins, such as
OmpA, have been found to have biosurfactant activity, which has generated increasing interest in their discovery and
characterization. Accordingly, a computational tool was constructed which
identifies genes and protein sequences with potential use as biosurfactants in metagenomic sequences.
This tool is a platform of five modules comprising: the bioinformatic programs Glimmer
MG and hmmscan for gene prediction and for selection of sequences with certain domains, respectively; scripts
developed in the Perl programming language to calculate hydropathy profiles and secondary structure, and
to evaluate numerically the presence of structural parameters based on reference information of OmpA
from Escherichia coli; a k-means cluster analysis approach to analyze the numerical results using reference data from OmpA-like proteins; and a Gibbs free energy of formation calculator based on
group contribution methods for biocompounds. The application of this platform to three soil
metagenomic sequences from Los Nevados National Natural Park allowed the selection of 9 amino acid
sequences from the three metagenomes, which were analyzed using DELTA-Blast alignments, tertiary
structure prediction using I-TASSER and molecular dynamics simulations using GROMACS.
Simulations of a dodecane/water biphasic system were carried out for three protein sequences. One
of them, an amino acid sequence from an untreated soil metagenome (BIPV1) which belongs to
the maltoporin protein family, remained stable at the dodecane/water interface.
Keywords: Biosurfactants, hmmscan, outer membrane beta-barrel superfamily, hydrophobicity, k-means cluster analysis, metagenome, group contribution method.
1. INTRODUCTION
Surface active agents or surfactants are molecules which have the property of lowering the interfacial
tension at the interfaces between phases due to their amphipathic structure [1,2]. Surfactants are applied
as detergents, wetting agents, emulsifiers, dispersing agents, and foaming agents in several fields such as
medicine, manufacturing of household products and the petroleum industry, where they are important
in processes such as oil extraction and processing, cleaning of residual oil vessels, and enhanced oil
recovery [3]. One of the main features of surfactants is the formation of aggregate
structures in aqueous environments called micelles, in which the hydrophobic tails of the molecules are
protected from contact with the water. These aggregates are known to minimize the free energy of the
solution and are dynamic and dependent on several physical conditions such as temperature [4]. Most of
the surfactants in industry are derived from petroleum or chemically synthesized. In 2008, the
worldwide use of surfactants was estimated to be 13 million tons per annum [5], consisting mainly
of chemical surfactants, principally linear alkylbenzenesulfonates (LASs) and alkyl phenol
ethoxylates (APEs), which are produced synthetically. Their production therefore entails costs
related to synthesis and purification. Furthermore, these surfactants are known to be highly toxic to
the environment [5].
Biosurfactants have been considered as a way to decrease the use of conventional surfactants because of
their capacity to reduce surface and interfacial tensions in both aqueous solutions and hydrocarbon
mixtures, which makes them potential molecules for processes like microbially enhanced oil recovery
(MEOR) [6]. Biosurfactants have several advantages over chemically synthesized surfactants, such
as lower toxicity, higher biodegradability, better environmental compatibility, higher foaming, high
synthesized from renewable sources [4]. The term biosurfactant comprises a group of structurally
diverse molecules, mainly produced by microorganisms, which are classified by their chemical
structure and microbial origin [5]. Their structure includes a hydrophilic moiety consisting of
amino acids, peptides or ions; mono-, di-, or polysaccharides; and a hydrophobic moiety consisting of
hydrocarbon chains of unsaturated or saturated fatty acids. The best-known biosurfactants are
glycolipids, lipopeptides and lipoproteins, phospholipids and fatty acids, polymeric surfactants and
particulate surfactants.
The efforts of different disciplines such as biology, chemistry and chemical engineering are focused on
the discovery, study, production and application of these compounds; mainly of low molecular weight
biosurfactants like rhamnolipids, a type of glycolipid [4,7–10]. Several studies are focused on finding
biopolymers or high-molecular-weight molecules with emulsifying properties, and comprise experimental and
computational procedures following a trial-and-error methodology. Some of these studies involve porins,
which are proteins located in the outer membrane of Gram-negative bacteria and whose main feature is
the presence of several β-strands forming a distinctive barrel structure. This barrel encompasses a
transmembrane pore that allows passive diffusion across the outer membrane. Porins make the outer
membrane semi-permeable to small solutes with a molecular weight below 400 Da [11]. There are many types of
porins with specific types of channels. One of these types is the Outer Membrane Protein A (OmpA).
OmpA is one of the most abundant structural proteins in the outer membrane of Escherichia coli [12]. It maintains the integrity of the E. coli outer membrane, functions as a mediator in F-dependent conjugation, interacts with solid surfaces and plays an important role as a phage receptor [13]. Like
The N-terminal domain of OmpA consists of 170 residues located within the outer membrane, forming
an antiparallel β-barrel whose 8 transmembrane β-strands are connected by three short turns and four
large surface-exposed hydrophilic loops that exist in the aqueous environment of periplasm [14,15]. One
of the most important features of OmpA and other porin sequences is the presence of hydrophilic
residues between hydrophobic members, which gives the protein an amphipathic behavior. This
plays an important role in biofilm formation in E. coli K12 [13], and in stabilization of dodecane/water mixtures [16]. The structures of OmpA and LamB are shown in figure 1. It is important to note the
presence of the characteristic β-barrel in both proteins.
Protein AlnA, one of the main components of the bioemulsifier Alasan produced by
Acinetobacter radioresistens KA53, was found to be an OmpA-like protein, and practically all of the emulsifying activity is present in that protein [17]. In addition, a study of the biosurfactant
activity of OmpA from Escherichia coli by means of molecular dynamics (MD) simulations and experimental validation found that this protein can stabilize dodecane/water mixtures [16].
These results encourage the search for proteins with potential use as biosurfactants.
This work focuses on the search for proteins whose structures have amphipathic features that may
permit their use as surfactants. It is therefore important to find different types of proteins from
different microorganisms in order to have a wide range of proteins with
characteristics similar to those of OmpA-like proteins such as E. coli OmpA and AlnA from Alasan. One of the most recognized methodologies for discovering potential products in environmental samples is
metagenomics, which allows the study of microbial communities based on the total DNA present in an
present in an ecosystem, since it does not depend on culture techniques. The main challenge of
metagenomics is to determine which types of microorganisms, and how many of them, live in a given
environment (taxonomic metagenomics), and what functions they have and which enzymes
and metabolic pathways they use in order to live in that environment (functional metagenomics) [19]. In
functional metagenomics, the screening of novel enzymes can be performed by experimental
function-based screening procedures, or by sequence-based screening approaches. The latter can be
implemented experimentally and computationally. One of the most important drivers for the development
of in silico sequence-based screening was the improvement of Next Generation Sequencing (NGS) technologies, which can sequence DNA at unprecedented speed compared to the traditional
Sanger sequencing technique [20]. These technologies include pyrosequencing (454), Illumina/Solexa,
SOLiD, Ion Torrent and Pacific Biosciences. The improvement of sequencing techniques has brought the
development of programs and packages that process the raw data produced by the sequencer and
generate assemblies of metagenomic sequences with their respective predicted (and translated) genes
[21]. The information provided by metagenomic sequences is used as the main material for the
search for proteins of interest.
The aim of this work is to develop a computational platform that searches metagenomic
sequences for genes that can express proteins whose sequence and structural properties allow their use as potential
biosurfactants. Due to the great variety of proteins in microorganisms, the search in this work is focused
on transmembrane proteins which share common structural properties with OmpA, like the β-barrel.
The main basis for the screening process is the presence of amino acid sequences which may form
residues and the capacity of those residues to form a given secondary structure. Therefore, the search
does not take into account a possible homology to OmpA.
The platform comprises the use of traditional bioinformatic tools for prediction and translation of genes
(Glimmer), and for the search of domains and families of interest (HMMER); and scripts for the calculation
of properties (such as hydrophobicity index, and scale values per residue for the formation of secondary
structures) and for analysis in order to find regions of interest within a given sequence. Due to the
shortage of information on proven biosurfactant proteins, the platform includes an unsupervised k-means cluster analysis approach in order to select those sequences which share structural similarities with OmpA.
It also has a module which estimates the stability and spontaneity of the formation of a protein by
calculating standard Gibbs free energy values. This estimation is carried out using a group contribution method. In this approach, the standard Gibbs free energy of formation of a protein is determined by the sum of the contributions of the chemical groups which make up the molecule. The
method assigns a unique energy value to each group, which is applied to any molecule containing it [22]. This is
an important approach to calculate energies for proteins, due to the size and complexity of these
molecules. The group contribution method used in this module was developed by Mavrovouniotis
[23,24] and allows calculation of values for biocompounds, including peptides and proteins.
The development of this platform is part of a series of works focused on the search for
biocompounds with potential biosurfactant activity. These works involve the use of several
methodologies including computational and experimental characterization of transmembrane proteins
OmpA and OmpN [16, 25], and function-based experimental screening of biosurfactants in
2. METHODOLOGY
The methodology consisted of three main parts: 1) construction of the platform, 2) evaluation
of the platform using metagenomic sequences, and 3) analysis of the amino acid sequences selected by the
platform. The Perl scripts developed for the platform can be found in the supplementary material.
2.1. Platform construction
The proposed platform was designed to find amino acid sequences that have some degree
of similarity to OmpA and OmpN sequences at the level of structural properties related to amphipathy.
The platform consists of five modules, as shown in figure 2.
2.1.1. Module 1. Glimmer MG
The first module is the application of the bioinformatic software Glimmer MG (Gene Locator and
Interpolated Markov ModelER – MetaGenomics) [26] to assembled metagenomes that lack gene predictions.
Glimmer is a collection of programs for the identification of genes in microbial DNA sequences, which
uses Interpolated Markov Models (IMMs) built from a training set of genes in order to identify coding
regions in new metagenomic sequences and distinguish them from noncoding DNA. The module
consists of two scripts in the Python programming language:
1) glimmer-mg.py: This is the main script of the software. The script makes a classification of
program uses IMMs to characterize variable-length oligonucleotides typical of a phylogenetic
grouping, and uses these models to classify sequences present in reads. Then, the resulting
sequences are clustered using an unsupervised clustering method called Scimm [28], in order to
make the final predictions within each cluster. The output is a .predict file which contains the
fasta-header line of the contig followed by information about the predicted genes.
2) extract_aa.py: This script is used to extract a multi-fasta file of amino acid sequences from the
gene predictions made by Glimmer MG. It uses the information given in the .predict file in order
to extract the predicted genes from the contig, and make the translation.
The main output of this module is a multi-fasta file of the amino acid sequences present in the metagenome
according to the Glimmer MG prediction.
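The extraction and translation step described for extract_aa.py can be illustrated with a minimal Python sketch. This is not the Glimmer MG code: the function names, the 1-based inclusive coordinates and the strand argument are assumptions made for illustration, and the standard genetic code is assumed.

```python
# Illustrative sketch only; not the actual extract_aa.py shipped with Glimmer MG.
BASES = "TCAG"
# Amino acids of the standard genetic code, ordered by codon with bases in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def extract_protein(contig, start, end, strand):
    """Cut a predicted gene out of a contig and translate it.

    Assumption: start/end are 1-based, inclusive, with start <= end,
    and strand is '+' or '-'.
    """
    gene = contig.upper()[start - 1:end]
    if strand == "-":
        gene = reverse_complement(gene)
    return translate(gene).rstrip("*")  # drop the trailing stop codon
```

For a gene predicted on the reverse strand, the sketch takes the reverse complement before translating, so both strands yield proteins written from N-terminus to C-terminus.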
2.1.2. Module 2. HMMSCAN
The second module is the first screening step and focuses on the search for proteins in the metagenome
which belong to a particular family or clan. This selection is made using the bioinformatic suite HMMER
[29] and its tool hmmscan, which selects those proteins that belong to the selected families. Hidden
Markov Model files of the selected families were previously downloaded from the online database Pfam
(http://pfam.xfam.org/). The families selected belong to the outer-membrane beta-barrel superfamily
(Clan: MBB (CL0193)) [30]. According to Pfam, this superfamily has 54 families and 118601 domains.
HMMs of 27 families which belong to this superfamily were downloaded, including the OmpA
which do not belong to the MBB. A database was created from the HMM files using the HMMER tool
hmmpress, and then hmmscan was run. The output file consists of a table with the name of the
screened sequence, its corresponding family, and data of each calculation (see table 1). In order to keep
only the selected sequences, a Perl script was written which compares the initial amino acid
multi-fasta file with the output table. Those sequences that were evaluated and belong to any of the
families of interest were printed to a new multi-fasta file, which was then evaluated in the following
module of the platform.
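The filtering step described above could be sketched as follows. The original script was written in Perl and is not reproduced here; this Python version is only an illustration. It assumes that hmmscan was run with its --tblout option, in which the first whitespace-separated column is the name of the matched profile (family) and the third column is the name of the query sequence. The function names are hypothetical.

```python
def sequences_with_hits(tblout_lines, families):
    """Collect the names of query sequences that hit any family of interest."""
    hits = set()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        family, query = fields[0], fields[2]
        if family in families:
            hits.add(query)
    return hits


def filter_fasta(fasta_lines, keep):
    """Return the lines of the records whose name is in `keep`.

    Each record is emitted once, even if the sequence hit several families.
    """
    selected, printed, writing = [], set(), False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            writing = name in keep and name not in printed
            if writing:
                printed.add(name)
        if writing:
            selected.append(line)
    return selected
```

Because the hits are stored in a set, a sequence that matches more than one family still appears only once in the filtered multi-fasta output.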
2.1.3. Module 3. Hydropathy analysis
The hydropathy analysis module involves the calculation of structural properties for each sequence and
the evaluation of the presence of amphipathic regions based on these properties. The properties
calculated are: hydrophobicity index according to the Kyte-Doolittle scale [31]; amino acid scale values
for prediction of alpha helix, beta sheet, beta turn and coil conformations according to Deleage and
Roux [32]; and molar fraction values (%) of amino acids which are accessible to solvent [33] (see table
2). All of these properties and their values per residue can be found in the ExPASy tool Protscale
(http://web.expasy.org/protscale/). Combining the presence of given motifs with hydrophobicity
indexes and solvent accessibility, and comparing this with the results for E. coli OmpA, a score matrix is generated in order to perform the second screening step. The third module is divided into two
parts:
a) Calculation of the properties: This part involves the use of a script in Perl. This script generates
calculating an average value of the property to be measured, and assigning the resulting value to
the central residue of the window. Each property is processed in a different
subroutine, but using the same window principle. Finally, each table is assigned to its
corresponding sequence name and printed. An example of the output file is shown in figure 3.
The script was evaluated using the sequence of OmpA from Escherichia coli
and comparing the data with those provided by Protscale.
b) Criteria selection: This part involves the use of two Perl scripts. The first script searches
each sequence for five types of regions (parameters) using the output table of module
3(a). The purpose is to quantify the number of occurrences of these regions and establish a series
of five numbers which characterize each sequence. The parameters to evaluate are based on a
previous analysis of the OmpA sequence of Escherichia coli, using the table of properties and correlating it with its tertiary structure, and are shown in table 3. The script recognizes the
residues that satisfy each parameter, and then only counts the cases where the parameter is valid for three
or more consecutive residues. Therefore, each sequence has 5 different values which may differ
between sequences. The script was evaluated using a multi-fasta file with 94 amino
acid sequences (see section 2.2.2) which belong to the 28 families selected from Pfam. Histograms
showing the number of occurrences of each parameter were created and, according to the
distribution, bin sizes were calculated applying the Freedman-Diaconis rule [34] for
determination of the bin size of a histogram:
$$ \text{Bin size} = 2\,\frac{\mathrm{IQR}(x)}{\sqrt[3]{n}} \qquad (1) $$
where n is the number of data points, x is the sample and IQR is the interquartile range of the data.
A scoring system was created assigning a numeric value to each bin, from 0, which means the
absence of occurrences of the parameter, to n occurrences. Histograms and calculation of the bin
sizes were done using the statistical programming environment R 3.1.2. The final output of the
script is a text file with the name of the sequence and its five corresponding scores. An
example of the output file can be seen in figure 4. This output is then processed by the second
script, which organizes the information in a matrix in which the columns correspond to the names
of the sequences and the rows to the parameter scores.
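The counting rule used in the criteria selection (a parameter is only counted where it holds for three or more consecutive residues) can be sketched in Python as follows. The original scripts are in Perl; the function names, the dictionary layout of the per-residue table and the numeric ranges in the usage example are assumptions for illustration only, and are not the ranges of table 3. The sketch also assumes that each qualifying stretch counts as one occurrence.

```python
def flag_residues(rows, criteria):
    """Mark residues whose properties all fall inside the given ranges.

    rows: one dict per residue mapping property name -> windowed value.
    criteria: dict mapping property name -> (low, high), both inclusive.
    """
    return [
        all(low <= row[prop] <= high for prop, (low, high) in criteria.items())
        for row in rows
    ]


def count_regions(flags, min_run=3):
    """Count stretches of at least `min_run` consecutive flagged residues."""
    regions, run = 0, 0
    for ok in flags:
        if ok:
            run += 1
            if run == min_run:  # count each qualifying stretch exactly once
                regions += 1
        else:
            run = 0
    return regions


# Hypothetical usage: the ranges below are placeholders, not the values of table 3.
example_rows = [{"hydropathy": 2.0, "beta_sheet": 1.2}] * 4 + [
    {"hydropathy": -1.5, "beta_sheet": 0.8}
]
example_criteria = {"hydropathy": (1.0, 4.5), "beta_sheet": (1.0, 2.0)}
n_regions = count_regions(flag_residues(example_rows, example_criteria))
```

Running the same two functions with five different criteria dictionaries would give the five numbers that characterize each sequence.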
2.1.4. Module 4. K-means cluster analysis
The results matrix was processed by means of a k-means cluster analysis using the program Cluster 3.0.
Clustering analyses were made using uncentered correlation as the similarity metric with 1000 runs,
varying the number of centers according to the amount of input data of the metagenomes used in this
work. The cluster which contained the data of eight reference sequences of OmpA of Escherichia coli was used to make the third screening step, using a script which creates a new multi-fasta file from the names of the
sequences of the cluster.
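The clustering step can be illustrated with a small pure-Python sketch. It is not the Cluster 3.0 implementation: it is a plain k-means with random restarts in which the distance is taken as one minus the uncentered correlation (equivalent to one minus the cosine similarity), and the cluster holding most of the reference sequences is then selected. All function names, the seed and the default number of runs are assumptions.

```python
import math
import random


def uncentered_distance(a, b):
    """1 - uncentered correlation between two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def kmeans(rows, k, runs=100, seed=0, max_iter=100):
    """Return the cluster label of each row from the best of `runs` restarts."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(runs):
        centers = [list(r) for r in rng.sample(rows, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda c: uncentered_distance(r, centers[c]))
                for r in rows
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            for c in range(k):
                members = [r for r, lab in zip(rows, labels) if lab == c]
                if members:  # an empty cluster keeps its previous center
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(uncentered_distance(r, centers[lab]) for r, lab in zip(rows, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def reference_cluster(names, labels, references):
    """Names of all sequences in the cluster that holds most references."""
    counts = {}
    for name, lab in zip(names, labels):
        if name in references:
            counts[lab] = counts.get(lab, 0) + 1
    target = max(counts, key=counts.get)
    return [n for n, lab in zip(names, labels) if lab == target]
```

Because the uncentered correlation depends only on the direction of the score vectors, two sequences whose five scores are proportional to each other are treated as identical by this metric.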
2.1.5. Module 5. Energy Calculator
The fifth module involves the third and last screening step of the platform. In this module, values of free
energy of formation using a group contribution method for amino acids and other biocompounds are
sequences into its residues (groups). An energy contribution value is assigned to each group of the
sequence and then all contributions are summed using the following expression:
$$ \Delta G_f^{\circ} = \sum_{i=1}^{L} \Delta G_{aa,i} + \Delta G_{COO^-} + \Delta G_{NH_3^+} + (L-1)\,\Delta G_{pep} \qquad (2) $$
where $\Delta G_f^{\circ}$ is the standard Gibbs free energy of formation of the protein, $\Delta G_{aa,i}$ is the energy contribution of each amino
acid without terminal carboxylate and ammonium groups, $\Delta G_{COO^-}$ and $\Delta G_{NH_3^+}$ are the contributions of the
carboxylate and ammonium groups, respectively, $\Delta G_{pep}$ is the contribution of the peptide bond, and L is the
number of amino acids which constitute the protein. The output is a text file with the name of the
sequence and its free Gibbs Energy of formation (in ). This calculation was done for all the
sequences of the cluster (including references). Those sequences which presented a similar value to
OmpA were chosen with the same script used in the previous module. Table 4 shows the contribution
values used to calculate the Gibbs free energies.
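The group contribution sum described in this module can be sketched in Python as follows. The sketch assumes a linear chain of L residues with one terminal carboxylate group, one terminal ammonium group and L - 1 peptide bonds. The function and argument names are hypothetical, and the numbers in the usage example are arbitrary placeholders, not the contribution values of table 4.

```python
def gibbs_energy_of_formation(sequence, residue_dg, dg_carboxylate,
                              dg_ammonium, dg_peptide_bond):
    """Sum group contributions for a linear peptide or protein.

    residue_dg maps each one-letter amino acid code to the contribution of
    that residue without its terminal carboxylate and ammonium groups.
    """
    if not sequence:
        raise ValueError("empty sequence")
    total = sum(residue_dg[aa] for aa in sequence)       # residue groups
    total += dg_carboxylate + dg_ammonium                # the two chain termini
    total += (len(sequence) - 1) * dg_peptide_bond       # L - 1 peptide bonds
    return total


# Placeholder contributions for illustration only (NOT the values of table 4).
placeholder_residues = {"A": -1.0, "G": -2.0}
example_dg = gibbs_energy_of_formation(
    "AGA", placeholder_residues,
    dg_carboxylate=-10.0, dg_ammonium=-5.0, dg_peptide_bond=-3.0,
)
```

The result is expressed in whatever units the contribution table uses, so the reference OmpA sequences and the candidates must be evaluated with the same table for the comparison to be meaningful.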
2.2. Evaluation of the Platform
2.2.1. Metagenomes
The platform was tested using three assembled metagenomes from soil samples collected
in Los Nevados National Natural Park, in Colombia. The first metagenome was obtained from paramo
soil (Code: BIPV1), the second from a potato-growing and livestock zone (Code: BICV2), and the third
seen in table 5. Soil samples of these metagenomes were collected on the same property. The MetBAA
metagenome was sequenced using Illumina, and the assembly, prediction and translation were
performed using the suite Geneious, giving an amino acid multi-fasta file of 479015 sequences. It was
processed by the platform starting from module 2. Metagenomic screening was carried out in a Sun Grid Engine
cluster computing environment, in particular employing a machine with 24 cores and 128 GB of RAM,
with a total of 75 h of processing time.
2.2.2. Reference sequences.
In order to make a preliminary evaluation of the modules, a transmembrane data set including 94 amino
acid sequences was created. Sequences were collected from GenPept (63) and UniProt (31) and belong
to the 28 families selected from the MBB clan of Pfam, with 16 reference sequences of OmpA (Family:
OmpA_membrane) and 5 of OmpN (Family: Porin_1). These sequences were added to the filtered
multi-fasta file from module 2 to serve as references in the next modules. Eight of the 16 OmpA sequences belong
to different strains of Escherichia coli and were used in module 4 as references for k-means clustering and in module 5 as a reference for the value of the Gibbs free energy of formation. Evaluation using the transmembrane data set was
carried out on a DELL Inspiron 1464 personal computer with 4 GB of RAM and a processor
2.3. Analyses of selected amino acid sequences.
2.3.1. Prediction of tertiary structure
Sequences selected by the platform were submitted to the I-TASSER server [35], an online platform for
protein structure and function predictions. In this program, three-dimensional models are built based on
multiple-threading alignments and iterative template fragment assembly simulations. Visualization and
modification of the coordinates of the models were carried out using the molecular modeling program
Chimera 1.10 [36].
2.3.2. Molecular Dynamics simulations
The purpose of this simulation is to determine the stability of the candidate proteins in a dodecane/water
bilayer system. This stability is one of the main features to consider in surfactant activity. Simulations
were carried out using the GROMACS [37] package (version 5.0.1 for construction of the simulation boxes,
and version 4.6.1 for energy minimization, system equilibration, MD and subsequent analyses) [38],
using the united-atom force field GROMOS96 53a6 [37,38], whose parameterization is based on free
enthalpies of hydration and non-polar solvation, which play an important role in protein folding [25].
Simulations were carried out in rectangular boxes with box edges at least 1 nm from the protein
surface, with molecules of dodecane, whose coordinates and topology were obtained from the
Automatic Topology Builder tool of the Molecular Dynamics group of the University of Queensland
[25], and single point charge (SPC216) water molecules. According to the modified position made in
regions of the protein, respectively. Specific data for the construction of the boxes can be found in the results
and discussion section.
Once the boxes were constructed, energy minimization was performed using the steepest descent algorithm
until the maximum force was less than . System equilibration was performed in order to
stabilize the temperature and pressure of the system. This equilibration was performed first under an
NVT ensemble for 200 ps with a time step of 2 fs at 300 K, and then under an NPT ensemble for another 2 ps
with the same time step. MD simulations were then performed for 5 ns with a time step of 2 fs. Output
trajectories of the MD simulation were used to analyze the stability of the system.
For this, calculations were made of the root mean square deviation (RMSD) of the system and its
components in order to measure the average distance between the atoms of a protein, the radius of
gyration (RGYR) which gives information of its folding stability, and the solvent accessible surface area
(SASA) of the protein during the simulation time. Visualization of the simulation box construction and
3. RESULTS AND DISCUSSION
3.1. Prediction and translation with Glimmer MG
The implementation of Glimmer MG for the BIPV1 metagenome allowed the prediction and translation of
2 227 616 amino acid sequences from 2 056 250 contigs and, for the BICV2 metagenome, 1 660 101 amino acid sequences from 1 649 203 contigs. The overall process gave a ratio of predicted amino acid sequences per contig of 1.083 and 1.007 for the BIPV1 and BICV2 metagenomes, respectively.
Glimmer MG is characterized by its highly accurate predictions in whole genomes. This accuracy is
founded on its two main processes of classification and clustering: Glimmer MG classifies the sequences using a
phylogenetic classifier and trains models using the results; subsequently, by means of an unsupervised
clustering approach, it allows retraining of prediction models on the sequences themselves. This novel
approach makes Glimmer MG an alternative to prediction programs based on GC content.
3.2. Selection of families with HMMSCAN
The implementation of the hmmscan module included the MetBAA metagenome, previously predicted and
translated (see section 2.2.1). The first screening process led to the selection of 7127 sequences (0.324% of total sequences) which belong to the MBB clan from the BIPV1 metagenome, 5055 (0.304%) from the BICV2 metagenome, and 1219 (0.254%) from the MetBAA metagenome. Figure 5 shows the distribution of proteins for each Pfam family selected. In all three metagenomes, it was found that families with
of phenol degradation family. Moreover, the overall distribution of selected families was similar. This
distribution of transmembrane proteins in the three ecosystems can be associated with the taxonomy of the
metagenomes. Taxonomic analysis performed for metagenomes BIPV1 and BICV2 showed similar
taxonomic profiles for the Bacteria domain, with an important presence of the phyla Proteobacteria (64.59 and
68.57%, respectively), Acidobacteria (17.39 and 8.92%), Actinobacteria (5.71 and 6.41%), and
Bacteroidetes (6.17 and 10.95%) [42]. Despite the reduction of metabolic processes due to
the effect of human activities on the BICV2 soil sample, these activities do not seem to affect the relative
abundance of transmembrane proteins obtained for this metagenome.
According to Smith et al. [43], OmpA is an abundant protein and a predominant antigen in
the enterobacterial outer membrane, with a copy number of approximately 100000 proteins per cell. The
most important feature of this domain is the presence of a beta/alpha/beta/alpha/beta structure found in
the C-terminal region of outer membrane proteins (like OmpA from Escherichia coli) and MotB proton channels. The N-terminal is variable and in some cases it corresponds to the OmpA-like transmembrane
family (β-barrel structure). The presence of the OmpA domain was 19.6% in BIPV1 and BICV2, whereas
the presence of the OmpA-like transmembrane domain (OmpA_membrane) was 4.1% for BIPV1 and 3.7%
for BICV2. MetBAA presented 20% OmpA domain and 4% OmpA_membrane, showing that
there is no significant effect of farming and altitude on the abundance of proteins with these domains. It
was observed in the three metagenomes that only one in five proteins that had the OmpA domain also had the
OmpA-like transmembrane domain. This is consistent with the fact that proteins with the OmpA domain can
have, in some cases, a beta-barrel structure in their N-terminal domain. Likewise, evaluation of 14 OmpA
sequences from the transmembrane data set with hmmscan showed that the OmpA sequence from Riemerella anatipestifer (Phylum: Bacteroidetes) only hit the OmpA domain and did not have any alignment with the OmpA-like transmembrane domain. This may indicate that several amino acid sequences in the
metagenome could have the same hits as OmpA of R. anatipestifer. On the other hand, there were cases in which proteins that matched the OprF family also matched the OmpA domain. OprF-like proteins from
Pseudomonas aeruginosa are considered orthologs of OmpA with significant amino acid similarity in their C-terminal domains [44].
The OmpA/OmpA_membrane/OprF case was one where proteins hit more than one HMM. This situation
was also observed in proteins that belong to the general bacterial porin families (Porin_1 and Porin_4).
These two families generally describe proteins with porin features such as OmpN. The main difference
between Porin_1 and Porin_4 is the hidden Markov model length, 337 for Porin_1 and 311 for Porin_4,
which may explain the difference in hits between them. In cases where a protein hit more than one
family, the Perl screening script de-replicates these cases in order to have only one printed
copy of each sequence in the new multi-fasta file. It is important to take into account that the final
resulting proteins of the whole screening will not necessarily belong to the most abundant families
shown in this module.
3.3. Calculation and analysis of structural properties.
In order to validate the results of the calculation script of module 3(a), the method applied to calculate
values per residue was compared to the method used in the ExPASy tool Protscale [45]. Figure 6 shows the
comparison between the trends given by the two programs, using as input the protein sequence of
OmpA from Escherichia coli APEC O1 (Accession: ABJ00366). The profile produced by the Perl script is identical to the one obtained using Protscale. The established configuration of the Protscale
algorithm consists of a sliding window which calculates the value of the central amino acid using a
100% and the window edge relative weight, which is 100% [45]. The Perl script of module 3(a) applies this
configuration to calculate the property values for multiple sequences. The Protscale server
generates a profile and a value table for one property of interest using just one sequence as input, whereas
the Perl script generates tables including several properties from a multifasta file, which is an important
advantage when calculating all the properties of interest for the amino acid sequences in a
metagenome. Different window sizes were tested. A sliding window of 7 residues was found to be a
suitable size for large proteins because it can characterize regions of 5 to 15 residues without the
noise characteristic of smaller windows or the underestimation that can appear with
larger windows.
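As an illustration of the per-residue calculation, the following Python sketch computes a sliding-window profile. It assumes that a 100% edge weight means every residue in the window contributes equally, so the value assigned to the central residue is the plain window average. The scale is the Kyte-Doolittle hydropathy index [31], which matches the HI column of Table 2; the function name and the example sequence are hypothetical and this is not the Perl script of module 3(a).

```python
# Kyte-Doolittle hydropathy index [31], keyed by one-letter residue code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(sequence, scale=KYTE_DOOLITTLE, window=7):
    """Average the scale over an odd-sized window centred on each residue.

    Residues closer than window // 2 to either end have no full window
    and are skipped, so the profile has len(sequence) - window + 1 values.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        profile.append(sum(scale[residue] for residue in segment) / window)
    return profile


# Hypothetical 21-residue input; a window of 7 yields 15 profile values.
example_profile = sliding_window_profile("MKKTAIAIAVALAGFATVAQA", window=7)
```

Passing a different dictionary as `scale` gives the profile for another property, which is how one script can tabulate several properties for the same sequence.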
The purpose of module 3(b) was to observe the correlation between the secondary structures
present in the sequences and the hydrophobicity of the structure, with emphasis on the presence of
hydrophobic β-strands in the β-barrel structure and of hydrophilic regions (such as β-turns) accessible to
solvents, which are structural characteristics of OmpA-like proteins. Additionally, the presence of other
regions that may appear in the sequences was evaluated, such as hydrophobic alpha helices and coils,
which are features of some transmembrane proteins [46]. The selection ranges for the regions (see table 3)
were chosen based on a previous analysis of the tertiary structure of OmpA, with data obtained using a sliding
window of 7. An initial data analyzer counted the number of occurrences of each parameter. Figure 7
shows the histograms of the five parameters, using the transmembrane data set as a test input. Due to
the variability in the number of occurrences of parameters such as hydrophilic residues accessible to
solvent and hydrophilic beta turns, and to the shape of the data distribution, a punctuation (scoring)
system per interval was devised, based on bin sizes calculated for each histogram. Applying the Freedman-Diaconis
calculations and the punctuation system are shown in table 6. Three parameters showed a bin size of
approximately one; therefore, their punctuation was almost the same as the number of occurrences of the
corresponding regions. The purpose of this punctuation system was to homogenize the parameter values, which play an
important role in k-means cluster analysis.
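The bin-size calculation can be sketched as follows, assuming the standard Freedman-Diaconis rule [34], in which the bin width is twice the interquartile range divided by the cube root of the number of observations. The mapping from a count to a score (`interval_score`) is a hypothetical rule chosen only so that a bin size of one returns the count itself; the actual intervals used by the platform are those given in table 6.

```python
import statistics


def freedman_diaconis_bin_width(counts):
    """Bin width h = 2 * IQR / n**(1/3) for a list of observed counts."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    q1, _median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / n ** (1.0 / 3.0)


def interval_score(count, bin_width):
    """Hypothetical score: number of whole bins covered by the count."""
    if bin_width <= 0:
        raise ValueError("bin width must be positive")
    return int(count // bin_width)


# Hypothetical occurrence counts of one parameter over eight sequences.
example_counts = [1, 2, 3, 4, 5, 6, 7, 8]
example_width = freedman_diaconis_bin_width(example_counts)
example_scores = [interval_score(c, example_width) for c in example_counts]
```

Other quartile conventions give slightly different interquartile ranges for small samples, so the width should be read as approximate.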
3.4. K-means cluster analysis
Calculations and punctuations were carried out for the metagenomic sequences screened in the hmmscan
module together with the transmembrane data set, in order to carry out the grouping. The purpose was to screen
those sequences that clustered with OmpA sequences. Taking into account the homogenized
continuous values obtained from module 3(b), uncentered correlation was chosen as the similarity metric.
It is a distance measure based on the Pearson correlation, with the difference that uncentered
correlation assumes that the mean value of each series is 0, even if it is not. Therefore, two vectors with
identical shape, but offset from each other by a fixed value, will have a Pearson coefficient of 1 but
an uncentered correlation coefficient different from 1 [47].
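The difference between the two coefficients can be checked numerically. The sketch below implements both with their standard definitions; the function names and the example vectors are illustrative only.

```python
import math


def pearson_correlation(x, y):
    """Standard Pearson coefficient: both vectors are centred on their means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def uncentered_correlation(x, y):
    """Same formula with the means taken as 0 (cosine of the angle)."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator


# Two vectors with identical shape, offset from each other by a fixed value of 10.
base = [1.0, 2.0, 3.0]
shifted = [11.0, 12.0, 13.0]
r_pearson = pearson_correlation(base, shifted)        # equals 1 up to rounding
r_uncentered = uncentered_correlation(base, shifted)  # about 0.949, below 1
```

The uncentered coefficient therefore separates profiles that differ in magnitude as well as in shape, which is relevant when the inputs are counts or scores rather than centred signals.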
The difference between the sizes of the metagenomes determined the number of amino acid sequences in
modules 1 and 2. Because of this, the number of k centers could not be equal in the three cases. Therefore, the number of k centers was chosen so as to obtain clusters with a maximum of 100 data points, including data from reference sequences. The BIPV1 k-means cluster analysis was carried out with 100 centers, while
BICV2 and MetBAA clustering was performed with 75 and 20 centers, respectively. Table 7 shows the
results of the k-means cluster analysis for the three metagenomes. All data from OmpA reference sequences were grouped in the same cluster.
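A minimal k-means sketch that uses one minus the uncentered correlation as the distance is given below. It is not the clustering program used in this work: the deterministic initialization (the first k points serve as initial centers), the iteration limit and the example score vectors are assumptions made to keep the example small and reproducible.

```python
import math


def uncentered_distance(x, y):
    """1 - uncentered correlation; vectors must not be all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return 1.0 - dot / norm


def kmeans(points, k, distance=uncentered_distance, iterations=100):
    """Return one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points seed the centers
    labels = [None] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical score vectors: rows 0 and 2 share one pattern, rows 1 and 3 another.
example_points = [[1, 2, 3], [10, 1, 1], [2, 4, 6], [9, 1, 2]]
example_labels = kmeans(example_points, k=2)
```

In a screening setting, the sequences of interest would be those that receive the same label as the reference sequences.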
In the BIPV1 case, module 4 screened 92 sequences out of the 7127 selected in the hmmscan module, which
represents 1.3% of the multifasta file obtained from module 2. These sequences were clustered with the OmpA
references and one reference from the OprD family, which corresponds to a chitoporin, a sugar-specific
channel of Escherichia coli, to form a cluster of 101 data points. Cluster analysis screened 73 sequences out of 5055 (1.44%) in the BICV2 metagenome; these were clustered with the same reference sequences as in the BIPV1
case, giving a cluster of 82 data points. The MetBAA metagenome results were different: 56 sequences were screened
out of 1219 (4.59%), a higher percentage than in BIPV1 and BICV2. Likewise, these
sequences were clustered with the same nine references as in BIPV1 and BICV2, plus 10 more reference
sequences, including two OmpA sequences from Cronobacter sakazakii and Pantoea ananatis
(family Enterobacteriaceae).
3.5. Gibbs free energies calculation
The fifth module, which is the third screening step, calculated the Gibbs free energy of formation of the
proteins screened in module 4. The selection criterion was a value similar to or
more negative than that of OmpA from Escherichia coli K12, which is -89161.88
kJ/mol. An important relation was found between these values and the size of the protein. Thus, the
module screened sequences that have a size similar to that of OmpA-like proteins. OmpA sequences of the
transmembrane data set were included in the multifasta file from module 4 in order to make direct comparisons. In the case of the BIPV1 metagenome, 5 proteins were selected according to the free
energy value, 3 proteins for the BICV2 metagenome, and 1 for the MetBAA metagenome. Screened
corresponding metagenomes. values are shown in table 9 with results of DELTA-Blast [48]. The Gibbs
free energy of formation expresses the energetic change that accompanies the formation of these
proteins. The importance of this module in the platform lies in the possibility of choosing those proteins
whose energetic change is similar to or more negative than that of OmpA.
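The selection criterion of this module can be expressed as a simple filter. In the sketch below the reference value is the one reported above for OmpA from Escherichia coli K12; the `tolerance` parameter is hypothetical, since the text does not quantify how close a value must be to count as similar, and the candidate names and energies are invented for illustration.

```python
OMPA_K12_DELTA_G = -89161.88  # kJ/mol, reference value reported in the text


def screen_by_free_energy(candidates, reference=OMPA_K12_DELTA_G, tolerance=0.05):
    """Keep sequences whose free energy of formation is similar to or
    more negative than the reference.

    candidates -- dict mapping a sequence name to its free energy in kJ/mol
    tolerance  -- hypothetical fraction of the reference accepted as "similar"
    """
    # The reference is negative, so this threshold is slightly less negative.
    threshold = reference * (1.0 - tolerance)
    return [name for name, delta_g in candidates.items() if delta_g <= threshold]


# Invented values, used only to exercise the filter.
example_candidates = {"seq_a": -95000.0, "seq_b": -86000.0, "seq_c": -40000.0}
selected = screen_by_free_energy(example_candidates)
```

Because the free energy of formation scales with the number of residues, a filter of this kind also acts as a filter on protein size, in line with the relation noted above.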
Knowledge of the thermodynamic behavior of amino acids in aqueous and organic phases is important
for understanding the behavior of a protein in biphasic systems. Although Gibbs free energies of formation
can give important information about a protein's stability and formation at standard conditions, they cannot give
information about the stability of the protein in a liquid/liquid biphasic system such as
dodecane/water at given conditions. Because few data have been reported for the solvation free energy of amino
acids [49], it is important to consider methodologies such as molecular dynamics to
determine the energies of amino acids and peptides at an interface and thus estimate the energetic
contributions of their chemical groups.
3.6. Post-screening analyses
3.6.1. DELTA-blast results
DELTA-blast alignment of the nine selected sequences against non-redundant databases shows relatively
high percentages of coverage but low percentages of identity, which can presumably be associated with the quality
of the predicted sequences (cut or interrupted genes in the assembly process) or with the presence of
were related to the Pfam family associated with the query sequence. In the case of the OmpA domain, all the
best hits were related to MotB-like proteins.
3.6.2. Tertiary structure prediction
It is important to evaluate the sequences selected in module 5 in order to confirm or discard the results
given by the platform. Figure 8 shows the nine three-dimensional structures of the amino acid sequences
obtained with I-TASSER and their corresponding hydrophobic surfaces visualized with Chimera 1.10.
According to the predicted structure, the proteins were classified into three types: 1) proteins with solely a
β-barrel structure, 2) proteins with a β-barrel structure and other domains, and 3) proteins without a
β-barrel structure. Type 1 consisted of proteins BIPV12, BIPV13 and BIPV14, which belonged to the lipid A
3-O-deacylase, maltoporin and putative MetA-pathway phenol degradation protein families,
respectively. These proteins are characterized by a high number of β-strands in their barrel
structure and long loops that contribute to the transport of substances. Type 2 was formed by proteins
BICV21, BICV22, BICV23 and MetBAA1, and includes the surface antigen, the FadL outer membrane
transport and the putative MetA-pathway of phenol degradation protein families, and the autotransporter
domain, respectively. Type 3 corresponded to proteins BIPV11 and BIPV15, which have the Pfam OmpA
domain but do not belong to the OmpA-like transmembrane protein family.
According to DELTA-blast results, the amino acid sequences of these proteins have a certain homology
with a flagellar motor protein (MotB) of Geobacter metallireducens, a bacterial species from the phylum Proteobacteria which has the ability to oxidize organic compounds, metals and radioactive elements
channels like MotB. The flagellar motor protein MotB contains 308 residues and consists of a short
N-terminal cytoplasmic domain, a single membrane-spanning helix, and a large periplasmic domain. In
complex with the MotA protein, it forms transmembrane channels for proton transport
across the membrane, contributing to the rotation of flagella in bacteria [51]. Finally, one protein of
each type was selected for MD simulations, according to the hydrophobic surface shown in the
visualization with Chimera 1.10. They were BIPV13 (maltoporin), BICV23 (MetA-pathway phenol
degradation protein) and BIPV15 (flagellar motor protein).
3.6.3. Molecular Dynamics simulations
Prior to the MD simulations, the PDB files of the three proteins were modified: in order to perform a
layered solvation, the coordinates of the protein structures were changed, allowing differentiation between the
hydrophobic and hydrophilic zones. This differentiation in turn allowed the dimensions of
the boxes to be established. All simulation boxes had the hydrophobic layer, with dodecane molecules solvating the
hydrophobic zone, at the top, and the hydrophilic zone, solvated by water molecules, at the bottom.
Table 10 shows the dimensions of each box and its layers. Figure 9 shows the simulation boxes
visualized using VMD.
BIPV13 simulation
The simulation of the BIPV13 protein in the dodecane/water bilayer system was carried out with 468 dodecane
molecules and 6374 water molecules. During the 5-ns simulation, BIPV13 maintained its position in
barrel structure, without making any impact on the position of the protein (see figure 10). Figure 11
shows the variation of the RMSD of the system and its components during the simulation time.
Stability of the system was reached after 1 ns, with an RMSD of 4.09 ± 0.01 nm for the last 4 ns.
Dodecane and water made important contributions to the RMSD of the system, with values of 4.19 ±
0.01 and 4.64 ± 0.01 nm, respectively. On the other hand, the RMSD of the BIPV13 protein was low,
0.428 ± 0.003 nm, and had no effect on the RMSD of the system. The radius of gyration (RGYR) is a
measure of the compactness of the protein in the system [25]. The BIPV13 protein had a value of 2.35 ± 0.04
nm. Figure 12 shows the variation of the radius of gyration during the 5-ns simulation. The trend (in
blue) presents a steady (but not significant) increase during the first 4 ns of the simulation. After the
fourth nanosecond there is an increase that can be associated with the opening of the β-barrel. Figure 13
shows the variation of the hydrophobic and hydrophilic areas, which remain almost constant throughout
the simulation, with values of 145.06 ± 0.09 and 85.09 ± 0.06 nm², respectively. The results show that
the BIPV13 protein is highly stable at the interface and can be considered a candidate
biosurfactant, although its capacity to form agglomerates remains to be examined.
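For reference, the radius of gyration of a single frame can be computed from atomic coordinates and masses with the usual mass-weighted definition, as in the sketch below. The values reported above were presumably obtained with the analysis tools of the simulation package; the function name and the two-atom example are illustrative assumptions.

```python
import math


def radius_of_gyration(coordinates, masses):
    """Mass-weighted radius of gyration.

    coordinates -- list of (x, y, z) tuples in nm
    masses      -- list of atomic masses, same length as coordinates
    """
    total_mass = sum(masses)
    # Centre of mass of the structure.
    center = [
        sum(m * xyz[axis] for m, xyz in zip(masses, coordinates)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted mean squared distance from the centre of mass.
    weighted_sq = sum(
        m * sum((xyz[axis] - center[axis]) ** 2 for axis in range(3))
        for m, xyz in zip(masses, coordinates)
    )
    return math.sqrt(weighted_sq / total_mass)


# Two equal masses 2 nm apart lie 1 nm from their centre of mass.
example_rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0])
```

A rising value along the trajectory indicates that mass is moving away from the centre, which is consistent with reading the late increase as an opening of the barrel.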
BICV23 simulation
The simulation of the BICV23 protein in the dodecane/water bilayer system was carried out with 319 dodecane
molecules and 5819 water molecules. Figure 14 shows the behavior of the protein during the 5-ns
simulation. The most important feature of this visualization is the gradual change of position that BICV23
dodecane molecules, making the structure turn sideways and descend towards the hydrophilic
layer. At 5 ns, only the loops of the upper region are solvated by dodecane molecules.
Figure 15 shows the RMSD values for the system and its components. The system achieves stability after
1.5 ns and reaches a value of 3.752 ± 0.009 nm in the following 3.5 ns, with larger contributions from
dodecane and water, with values of 3.65 ± 0.01 and 4.30 ± 0.01 nm, respectively. The BICV23 protein has
an RMSD value of 0.364 ± 0.003 nm, which is low despite the behavior of the protein in the bilayer system.
The radius of gyration of the protein did not vary much during the simulation, with an average
value of 2.0924 ± 0.0004 nm (see figure 16). This value suggests that BICV23 remains compact despite its
turn. Figure 17 shows the results of the SASA calculation; there is a slight increase of the hydrophilic
area after 1.6 ns, which can be related to the movement of BICV23 towards the aqueous layer. Mean
values of the hydrophobic and hydrophilic areas were 110.15 ± 0.06 and 72.12 ± 0.07 nm², respectively.
These results show that BICV23 does not remain stable at the interface and tends to migrate to the
aqueous phase. Thus, the presence of an amphipathic structure according to the platform does not
necessarily imply biosurfactant activity.
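Averages such as those quoted above are taken over the part of the trajectory that follows equilibration. The sketch below shows one way to compute such an average from a time series. It assumes that the reported uncertainty is a standard error of the mean, which the text does not state, and the function name and example series are hypothetical.

```python
import math


def equilibrated_mean(times_ns, values, start_ns):
    """Mean and standard error of the values recorded at or after start_ns."""
    tail = [v for t, v in zip(times_ns, values) if t >= start_ns]
    n = len(tail)
    if n < 2:
        raise ValueError("need at least two frames after the start time")
    mean = sum(tail) / n
    variance = sum((v - mean) ** 2 for v in tail) / (n - 1)
    return mean, math.sqrt(variance / n)


# Invented RMSD series (nm) that levels off after 1.5 ns.
example_times = [0.0, 1.0, 2.0, 3.0, 4.0]
example_rmsd = [0.0, 3.0, 3.9, 4.1, 4.0]
mean_rmsd, error_rmsd = equilibrated_mean(example_times, example_rmsd, start_ns=1.5)
```

Consecutive frames of a trajectory are correlated, so a plain standard error of this kind tends to underestimate the true uncertainty.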
BIPV15 simulation
The simulation of the BIPV15 protein in the dodecane/water bilayer system was carried out with 476 dodecane
molecules and 7791 water molecules. Figure 18 shows the 5-ns simulation of the protein in the bilayer system. In
this case, the BIPV15 protein tends to turn moderately sideways. The upper region remains solvated with
dodecane during the whole simulation. The bottom of the protein, which is in the aqueous phase,
Figure 19 shows the RMSD values of the system and its components. The system stabilized after
1.5 ns and had a value of 4.14 ± 0.01 nm for the last 3.5 ns. As in the previous cases, the contributions of
the solvents were important, with values of 4.03 ± 0.01 nm for dodecane and 4.70 ± 0.01 nm for water.
BIPV15 had an RMSD value of 0.492 ± 0.005 nm. The radius of gyration of BIPV15 decreased moderately
during the first 3 ns but then increased until the fifth nanosecond, as shown in figure
20; this can be associated with the aforementioned movements of both regions of the protein. The
average radius of gyration of BIPV15 was 2.351 ± 0.001 nm. Figure 21 shows the variations of the
hydrophobic and hydrophilic areas in BIPV15; there was a decrease in the accessible areas during the
first half of the simulation, associated with the movements of the protein in both phases. The calculated
areas for BIPV15 were 129.95 ± 0.09 nm² for the hydrophobic area and 89.84 ± 0.06 nm² for the
hydrophilic area. The results show that BIPV15, despite the movement it presents, manages to remain
at the interface. However, it is necessary to observe the behavior of the protein over longer
simulation times in order to corroborate this stability.
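The RMSD values discussed for the three simulations measure how far the atoms of a frame have moved from a reference structure. A minimal sketch of the definition follows. It assumes equal weights and no least-squares superposition of the frame onto the reference; simulation packages usually offer fitting and mass weighting as options, so this is not necessarily how the reported values were obtained.

```python
import math


def rmsd(reference, frame):
    """Root mean square deviation between two sets of atomic coordinates.

    reference, frame -- lists of (x, y, z) tuples in nm, in the same atom order
    """
    if len(reference) != len(frame) or not reference:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for ref_atom, frame_atom in zip(reference, frame)
        for a, b in zip(ref_atom, frame_atom)
    )
    return math.sqrt(squared / len(reference))


# Both atoms are displaced by 1 nm along z, so the RMSD is 1 nm.
example_reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
example_frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
example_rmsd_value = rmsd(example_reference, example_frame)
```

Without superposition, rigid translation and diffusion contribute to the result, which is one reason why freely diffusing solvent molecules show much larger values than a protein that keeps its fold.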
3.7. Context of the platform in search of biosurfactants
As mentioned previously, this platform is part of a series of works focused on the search for
biosurfactants using different approaches. It is important to place the results given by the platform on an
experimental basis. A high-throughput function-based screening of a fosmid library of
MetBAA was carried out in the research group. Using a method based on optical distortions in 96-well
microtitre plates [52], 18 positive clones were found among the 18432 clones of the MetBAA library.
characterization of them. The goal is to find proteins or peptides with biosurfactant activity in these positive clones, which can be related to the results obtained using the platform.
4. CONCLUSIONS AND PERSPECTIVES
A computational tool was developed to find, in metagenomic sequences, amino acid sequences with structural
features that may imply an amphipathic behavior and a potential surfactant activity. The
platform included bioinformatic programs for the prediction and translation of genes and for protein family
search; Perl scripts used to calculate structural and hydropathy properties
in order to establish a punctuation system based on the structure of OmpA from Escherichia coli; a k-means cluster analysis that selected those sequences sharing a punctuation pattern similar to that of OmpA; and a
calculator of Gibbs free energy of formation. Application of this platform to find
transmembrane-like proteins in three soil metagenomes from Los Nevados National Natural Park resulted
in the selection of 9 potential sequences: 5 from the paramo soil metagenome (BIPV1), 3 from the
agriculturally-treated soil metagenome (BICV2) and 1 from the high Andean forest soil
metagenome (MetBAA). Molecular dynamics simulations were carried out on three of these sequences,
BIPV13, BICV23 and BIPV15, named according to their corresponding metagenome. Results showed
that protein BIPV13, a maltoporin, could stabilize itself at a dodecane/water interface despite some
displacements within its structure during the 5 ns simulation.
The relative abundance of transmembrane proteins in the three metagenomes showed almost the same
distribution, suggesting a non-significant effect of the ecosystem on the relative abundance of the proteins,
Punctuation matrix and k-means cluster analysis were established as the main approaches to find
transmembrane proteins, complemented by the group contribution method, which gave information
about the formation of these proteins. The next step is the development of a group
contribution method that can estimate the thermodynamic stability of a protein in a dodecane/water
system. Such a method would provide a strong criterion for the screening
process in general and would be a key tool to extend the application of this platform to all types of
proteins. New biosurfactant proteins with different domains and motifs could potentially be used as
references in the platform to find even more proteins with suitable structural characteristics.
This work also envisages experimental validation, which consists of function-based
screening methods and the search for genes of interest using molecular biology procedures such as PCR,
cloning, transformation, heterologous expression and characterization of the proteins of interest.
Likewise, sequence-based screening approaches such as primers and probes can be used to
obtain the in silico selected proteins directly from DNA.
REFERENCES
[1] K. K. Sekhon, S. Khanna, and S. S. Cameotra, “Enhanced biosurfactant production through cloning of three genes and role of esterase in biosurfactant release.,” Microb. Cell Fact., vol. 10, no. 1, pp. 1–49, Jan. 2011.
[2] R. Marchant and I. M. Banat, “Microbial biosurfactants: challenges and opportunities for future exploitation.,” Trends Biotechnol., vol. 30, no. 11, pp. 558–65, Nov. 2012.
[3] D. K. F. Santos, R. D. Rufino, J. M. Luna, V. a. Santos, A. a. Salgueiro, and L. a. Sarubbo, “Synthesis and evaluation of biosurfactant produced by Candida lipolytica using animal fat and corn steep liquor,” J. Pet. Sci. Eng., vol. 105, pp. 43–50, May 2013.
[4] J. D. Desai and I. M. Banat, “Microbial production of surfactants and their commercial potential.,” Microbiol. Mol. Biol. Rev., vol. 61, no. 1, pp. 47–64, Mar. 1997.
[5] I. M. Banat, A. Franzetti, I. Gandolfi, G. Bestetti, M. G. Martinotti, L. Fracchia, T. J. Smyth, and R. Marchant, “Microbial biosurfactants production, applications and future potential.,” Appl. Microbiol. Biotechnol., vol. 87, no. 2, pp. 427–44, Jun. 2010.
[6] A. K. Sarkar, J. C. Goursaud, M. M. Sharma, and G. Georgiou, “A Critical Evaluation of MEOR Processes,” In Situ, vol. 13, no. 4, pp. 207–238, 1989.
[7] C. N. Mulligan, “Environmental applications for biosurfactants.,” Environ. Pollut., vol. 133, no. 2, pp. 183–98, Jan. 2005.
[8] T. T. Nguyen, N. H. Youssef, M. J. McInerney, and D. a Sabatini, “Rhamnolipid biosurfactant mixtures for environmental remediation.,” Water Res., vol. 42, no. 6–7, pp. 1735–43, Mar. 2008.
[9] L.-M. Whang, P.-W. G. Liu, C.-C. Ma, and S.-S. Cheng, “Application of biosurfactants, rhamnolipid, and surfactin, for enhanced biodegradation of diesel-contaminated water and soil.,” J. Hazard. Mater., vol. 151, no. 1, pp. 155–63, Feb. 2008.
[10] M. Aguirre-Ramírez, G. Medina, A. González-Valdez, V. Grosso-Becerra, and G. Soberón-Chávez, “The Pseudomonas aeruginosa rmlBDAC operon, encoding dTDP-L-rhamnose biosynthetic enzymes, is regulated by the quorum-sensing transcriptional regulator RhlR and the alternative sigma factor σS.,” Microbiology, vol. 158, no. Pt 4, pp. 908–16, Apr. 2012.
[11] M. Winterhalter, C. Hilty, S. M. Bezrukov, C. Nardin, and W. Meier, “Controlling membrane permeability with bacterial porins : application to encapsulated enzymes,” vol. 55, pp. 965–971, 2001.
[12] H. Wang, K. K. Andersen, B. S. Vad, and D. E. Otzen, “OmpA can form folded and unfolded oligomers.,” Biochim. Biophys. Acta, vol. 1834, no. 1, pp. 127–36, Jan. 2013.
[13] A. González, R. Zuo, D. Ren, and T. K. Wood, “Hha , YbaJ , and OmpA Regulate Escherichia coli K12 Biofilm Formation and Conjugation Plasmids Abolish Motility,” Biotechnol. Bioeng., vol. 1, pp. 188–200, 2005.
[14] E. Sugawara and H. Nikaido, “Pore-forming activity of OmpA protein of Escherichia coli.,” J. Biol. Chem., vol. 300, 1992.
[15] R. Koebnik, “Structural and functional roles of the surface-exposed loops of the beta-barrel membrane protein OmpA from Escherichia coli.,” J. Bacteriol., vol. 181, no. 12, pp. 3688–94, Jun. 1999.
[16] S. Aguilera, A. P. Macías, D. C. Pinto, L. Vargas, M. J. Vives-florez, H. Enrique, C. Barrera, O. A. Álvarez, and A. González, “Escherichia coli ´ s OmpA as Biosurfactant for Cosmetic Industry : Stability Analysis and Experimental Validation Based on Molecular Simulations,” Adv. Comput. Biol., vol. 232, pp. 265–271, 2014.
[17] A. Toren, E. Orr, Y. Paitan, E. Z. Ron, and E. Rosenberg, “The Active Component of the Bioemulsifier Alasan from Acinetobacter radioresistens KA53 Is an OmpA-Like Protein,” vol. 184, no. 1, pp. 165–170, 2002.
[18] J. Handelsman, “Metagenomics : Application of Genomics to Uncultured Microorganisms,” vol. 68, no. 4, 2004.
[19] T. Uchiyama and K. Miyazaki, “Functional metagenomics for enzyme discovery: challenges to efficient screening.,” Curr. Opin. Biotechnol., vol. 20, no. 6, pp. 616–22, Dec. 2009.
[20] J. Zhang, R. Chiodini, A. Badr, and G. Zhang, “The impact of next-generation sequencing on genomics.,” J. Genet. Genomics, vol. 38, no. 3, pp. 95–109, Mar. 2011.
[21] R. Logares, T. H. a Haverkamp, S. Kumar, A. Lanzén, A. J. Nederbragt, C. Quince, and H. Kauserud, “Environmental microbiology through the lens of high-throughput DNA sequencing: synopsis of current platforms and bioinformatics approaches.,” J. Microbiol. Methods, vol. 91, no. 1, pp. 106–13, Oct. 2012.
[22] A. C. Oliveira, L. F. Moura, and D. Cardoso, “Method of contribution of groups to estimate thermodynamic properties of components of biodiesel formation in liquid phase,” Fluid Phase Equilib., vol. 317, pp. 59–64, Mar. 2012.
[23] M. L. Mavrovouniotis, “Estimation of standard Gibbs energy changes of biotransformations.,” J. Biol. Chem., vol. 266, no. 22, pp. 14440–5, Aug. 1991.
[24] M. D. Jankowski, C. S. Henry, L. J. Broadbelt, and V. Hatzimanikatis, “Group contribution method for thermodynamic analysis of complex metabolic networks.,” Biophys. J., vol. 95, no. 3, pp. 1487–99, Aug. 2008.
[25] S. Aguilera, L. Achenie, and A. F. González, “Platform Implementation for Evaluation of Proteins as Biosurfactants via Molecular Dynamics Free Energy Calculations,” Universidad de los Andes, 2013.
[26] D. R. Kelley, B. Liu, A. L. Delcher, M. Pop, and S. L. Salzberg, “Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering.,” Nucleic Acids Res., vol. 40, no. 1, p. e9, Jan. 2012.
[27] A. Brady and S. Salzberg, “Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models,” Nat. Methods, vol. 6, no. 9, pp. 673–676, 2009.
[28] D. R. Kelley and S. L. Salzberg, “Clustering metagenomic sequences with interpolated Markov models.,” BMC Bioinformatics, vol. 11, no. 1, p. 544, Jan. 2010.
[29] R. D. Finn, J. Clements, and S. R. Eddy, “HMMER web server: interactive sequence similarity searching.,” Nucleic Acids Res., vol. 39, no. Web Server issue, pp. W29–37, Jul. 2011.
[30] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. L. Sonnhammer, S. R. Eddy, A. Bateman, and R. D. Finn, “The Pfam protein families database.,” Nucleic Acids Res., vol. 40, no. Database issue, pp. D290–301, Jan. 2012.
[31] J. Kyte and R. F. Doolittle, “A simple method for displaying the hydropathic character of a protein.,” J. Mol. Biol., vol. 157, no. 1, pp. 105–132, May 1982.
[32] G. Deléage and B. Roux, “An algorithm for protein secondary structure prediction based on class prediction,” Protein Eng. Des. Sel., vol. 1, no. 4, pp. 289–294, 1987.
[33] J. Janin, “Surface and inside volumes in globular proteins,” Nature, no. 277, pp. 491–492, 1979.
[34] D. Freedman and P. Diaconis, “On the histogram as a density estimator: L 2 theory,” Probab. theory Relat. fields, vol. 476, pp. 453–476, 1981.
[35] Y. Zhang, “I-TASSER server for protein 3D structure prediction.,” BMC Bioinformatics, vol. 9, p. 40, Jan. 2008.
[36] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. Meng, and T. E. Ferrin, “UCSF Chimera--a visualization system for exploratory research and analysis.,” J. Comput. Chem., vol. 25, no. 13, pp. 1605–12, Oct. 2004.
[37] S. Pronk, S. Páll, R. Schulz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl, “GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit.,” Bioinformatics, vol. 29, no. 7, pp. 845–54, Apr. 2013.
[38] J. Kerrigan, GROMACS Introductory Tutorial: Gromacs Version 4.6. New Brunswick, NJ, 2012, pp. 1–20.
[39] C. Oostenbrink, A. Villa, A. E. Mark, and W. F. van Gunsteren, “A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6.,” J. Comput. Chem., vol. 25, no. 13, pp. 1656–76, Oct. 2004.
[40] C. Oostenbrink, T. A. Soares, N. F. A. van der Vegt, and W. F. van Gunsteren, “Validation of the 53A6 GROMOS force field.,” Eur. Biophys. J., vol. 34, no. 4, pp. 273–84, Jun. 2005.
[41] W. Humphrey, A. Dalke, and K. Schulten, “VMD: visual molecular dynamics,” J. Mol. Graph., vol. 14, no. 1, pp. 33–38, 1996.
[42] M. C. Álvarez, M. M. Zambrano, S. Restrepo, J. Husserl, J. M. Gómez, and A. González, “Estudio del efecto de la compartimentalización de redes metabólicas en la predicción del comportamiento de comunidades microbianas usando FBA,” Universidad de los Andes, 2014.
[43] S. G. J. Smith, V. Mahon, M. a Lambert, and R. P. Fagan, “A molecular Swiss army knife: OmpA structure, function and expression.,” FEMS Microbiol. Lett., vol. 273, no. 1, pp. 1–11, Aug. 2007.
[44] S. Krishnan and N. Prasadarao, “Outer membrane protein A and OprF: versatile roles in Gram‐negative bacterial infections,” FEBS J., vol. 279, no. 6, pp. 919–931, 2012.
[45] M. R. Wilkins, E. Gasteiger, a Bairoch, J. C. Sanchez, K. L. Williams, R. D. Appel, and D. F. Hochstrasser, “Protein identification and analysis tools in the ExPASy server.,” Methods Mol. Biol., vol. 112, pp. 531–52, Jan. 1999.
[46] J. M. Cuthbertson, D. a Doyle, and M. S. P. Sansom, “Transmembrane helix prediction: a comparative evaluation and analysis.,” Protein Eng. Des. Sel., vol. 18, no. 6, pp. 295–308, Jun. 2005.
[47] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci., vol. 95, no. 25, pp. 14863–14868, Dec. 1998.
[48] G. M. Boratyn, A. A. Schäffer, R. Agarwala, S. F. Altschul, D. J. Lipman, and T. L. Madden, “Domain enhanced lookup time accelerated BLAST.,” Biol. Direct, vol. 7, p. 12, Jan. 2012.
[49] J. Chang, A. Lenhoff, and S. Sandler, “Solvation free energy of amino acids and side-chain analogues,” J. Phys. …, pp. 2098–2106, 2007.
[50] R. T. Anderson, H. A. Vrionis, I. Ortiz-Bernad, C. T. Resch, P. E. Long, R. Dayvault, K. Karp, S. Marutzky, D. R. Metzler, A. Peacock, D. C. White, M. Lowe, and D. R. Lovley, “Stimulating the in situ activity of Geobacter species to remove uranium from the groundwater of a uranium-contaminated aquifer.,” Appl. Environ. Microbiol., vol. 69, no. 10, pp. 5884–91, Oct. 2003.
[51] E. R. Hosking, C. Vogt, E. P. Bakker, and M. D. Manson, “The Escherichia coli MotAB proton channel unplugged.,” J. Mol. Biol., vol. 364, no. 5, pp. 921–37, Dec. 2006.
[52] C.-Y. Chen, S. C. Baker, and R. C. Darton, “The application of a high throughput analysis method for the screening of potential biosurfactants from natural sources.,” J. Microbiol. Methods, vol. 70, no. 3, pp. 503– 10, Sep. 2007.
Table 1. Pfam families selected from outer-membrane beta barrel superfamily. Clan MBB (CL0193)
No.  Family           Code     Description
1    Ail_lom          PF06316  Virulence-related outer membrane protein family
2    Autotransporter  PF03797  Autotransporter beta-domain
3    Bac_Surface_Ag   PF01103  Surface antigen family
4    Channel_Tsx      PF03502  Nucleoside-specific Channel forming protein family
5    CopB             PF05275  Copper resistance protein B protein family
6    KdgM             PF06178  Oligogalacturonate-specific porin protein family
7    LamB             PF02264  Maltoporins family
8    MipA             PF06629  MltA-interacting Protein family
9    OmpA             PF00691  OmpA domain
10   OmpA_membrane    PF01389  OmpA-like transmembrane domain
11   Omptin           PF01278  Outer membrane protease A family
12   OmpW             PF03922  OmpW-like protein W family
13   OpcA             PF07239  Outer membrane adhesin family
14   OprB             PF04966  Carbohydrate-selective porin family
15   OprF             PF05736  OprF membrane domain
16   OprD             PF03573  Outer membrane serine type peptidase family
17   OstA_C           PF04453  Organic solvent tolerance protein family
18   PagL             PF09411  Lipid A 3-O-deacylase family
19   Phenol_MetA_deg  PF13557  Putative MetA-Pathway of phenol degradation family
20   Porin_O_P        PF07396  Phosphate-selective porin O and P family
21   Porin_1          PF00267  General bacterial porin family
22   Porin_2          PF02530  Alpha subdivision of Proteobacteria porin family
23   Porin_4          PF13609  General bacterial porin family
24   Porin_OmpG       PF09381  Outer membrane porin G family
25   ShlB             PF03865  Haemolysin secretion/activation protein family
26   Toluene_X        PF03349  FadL outer membrane protein transport family
27   TraF_2           PF13729  F plasmid transfer Operon protein family
28   Usher            PF00577  Fimbrial Usher protein family
Source: Pfam 27.0 (March 2013; http://pfam.xfam.org/)
Table 2. Values used for the construction of hydropathy calculator script.
Note: HI: Hydrophobicity index. BS: Beta strand. AH: Alpha helix
SA: Solvent accessibility. BT: Beta turn. C: Coil.
Residue HI SA AH BS BT C
Ala: 1.800 6.600 1.489 0.709 0.788 0.824
Arg: -4.500 4.500 1.224 0.920 0.912 0.893
Asn: -3.500 6.700 0.772 0.604 1.572 1.167
Asp: -3.500 7.700 0.924 0.541 1.197 1.197
Cys: 2.500 0.900 0.966 1.191 0.965 0.953
Gln: -3.500 5.200 1.164 0.840 0.997 0.947
Glu: -3.500 5.700 1.504 0.567 1.149 0.761
Gly: -0.400 6.700 0.510 0.657 1.860 1.251
His: -3.200 2.500 1.003 0.863 0.970 1.068
Ile: 4.500 2.800 1.003 1.799 0.240 0.886
Leu: 3.800 4.800 1.236 1.261 0.670 0.810
Lys: -3.900 10.300 1.172 0.721 1.302 0.897
Met: 1.900 1.000 1.363 1.210 0.436 0.810
Phe: 2.800 2.400 1.195 1.393 0.624 0.797
Pro: -1.600 4.800 0.492 0.354 1.415 1.540
Ser: -0.800 9.400 0.739 0.928 1.316 1.130
Thr: -0.700 7.000 0.785 1.221 0.739 1.148
Trp: -0.900 1.400 1.090 1.306 0.546 0.941
Tyr: -1.300 5.100 0.787 1.266 0.795 1.109