Biology and technology

Unlike classical microbial ecology studies, which depend on the isolation and cultivation
of organisms, metagenomics studies microbial organisms directly in their environment,
i.e., without the need for laboratory cultivation of individual species.

In very general terms, metagenomics is based on the analysis of all genomic material
present in a sample taken from a sampling location (environmental, human, etc.).
The exact definition of metagenomics is still under debate in the scientific community.

We will use the term in a strict sense: the identification of all known microorganisms in
a sample. We particularly focus on the sequencing of the 16S ribosomal RNA (16S rRNA)
gene of prokaryotes, which are single-cell organisms to which bacteria belong. Ribosomes
are complex molecular structures in cells that play an important role in the translation
of mRNA into a polypeptide chain. A ribosome is composed of several subunits, which
consist of proteins and a few RNA molecules referred to as ribosomal RNA (rRNA). In
contrast to mRNA, the rRNA molecules themselves are not translated into proteins; they
play a role in the functionality of the ribosomes. However, just like mRNA, rRNA is a
transcription product of genomic DNA and it can therefore be sequenced using sequencing
technologies (see further). The 16S rRNA gene is very well suited for the identification
of bacterial species, because it contains regions that vary between species but are
highly conserved within species. Several reference databases, which connect 16S rRNA
gene sequences to bacterial species, are available. Another advantage of 16S rRNA is
that these informative regions can be easily amplified (a necessary preprocessing step in
most sequencing technologies), because many universal PCR primers are available for the
highly conserved regions. A final advantage is that the method identifies species based
on only a very specific genomic DNA region, so that many samples can be sequenced
simultaneously and the cost is strongly reduced compared to, e.g., whole genome shotgun
sequencing methods for species identification. The latter basically consists of breaking
the whole genome into small fragments that are subsequently amplified and sequenced.
Shotgun sequencing allows the whole genome to be sequenced and has the advantage of
giving fewer species identification errors, but it is more expensive, and the genome
assembly step is time consuming and also error prone. In this thesis we use two datasets
obtained through 16S rRNA sequencing.

The workflow starts with sample collection. The collected samples are first purified and
DNA is extracted. However, the different DNA fragments are mixed together and they need
to be separated for sequencing. Separating the sequences is part of the process called
library construction. The target region of the 16S rRNA gene is amplified through
Polymerase Chain Reactions (PCR). The amplified 16S rRNA fragments are then sequenced
using a massively parallel sequencing technique. Several sequencing platforms exist; we
refer to ? for a comparison of some of the major platforms. The output of the sequencing
device is a large set of reads. Each read is a sequence of nucleotides originating from a
DNA fragment. Because the sequencing starts from PCR-amplified DNA fragments, each
original DNA fragment may be sequenced multiple times. The average number of reads that
cover a nucleotide is referred to as the sequencing depth or coverage. It is expected
that the larger the coverage, the less error prone the subsequent statistical analysis.
The total number of reads produced by the sequencing experiment of one sample is known
as the library size of the sample. Once the reads are available, the wet-lab handling is
over and the species identification process continues with data-processing steps
(bioinformatics).
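
The definitions of library size and coverage given above can be illustrated with a
minimal sketch. The sample names and reads are hypothetical toy data, and the coverage
calculation assumes that every read lies entirely within a target region of known length
(`target_length`, a parameter introduced here for illustration only).

```python
from typing import Dict, List


def library_size(reads: List[str]) -> int:
    """Library size: total number of reads obtained for one sample."""
    return len(reads)


def mean_coverage(reads: List[str], target_length: int) -> float:
    """Average number of reads covering each nucleotide of the target region.

    Assumption: all reads fall inside a target of `target_length` nucleotides,
    so the mean coverage is the total number of sequenced bases divided by
    the target length.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    total_bases = sum(len(read) for read in reads)
    return total_bases / target_length


# Hypothetical toy data: two samples with made-up reads.
samples: Dict[str, List[str]] = {
    "sample_A": ["ACGTACGT", "ACGTAC", "GTACGT"],
    "sample_B": ["ACGTACGT", "TACGTACG"],
}

for name, reads in samples.items():
    print(name, library_size(reads), mean_coverage(reads, target_length=10))
```

In real experiments the reads number in the thousands or millions per sample, but the
two quantities are computed in the same way.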

Next, the sequenced reads are clustered into groups of closely related sequences; this
step is called binning. A binning method can be based on the similarities among the
sequences or on the similarity of a sequence to known references [?]. The reads can be
binned at different levels of similarity. A cut-off of 97% similarity is often applied to
obtain the Operational Taxonomic Unit (OTU) level, which is a pragmatic proxy for the
microbial “species” taxonomic level. Reference databases are available for OTU (or
species) identification. The sequencing technology does not [...] The number of reads
mapped to an OTU is also considered a proxy for the abundance of the OTU in the sample.
Hence, the data can be represented as an abundance matrix as shown in Figure 1.1.
Starting from the OTU classification, data can also be represented at higher taxonomic
ranks, using, e.g., the bacterial phyla classification.
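
To make the binning step and the resulting abundance matrix concrete, the following is a
rough sketch of similarity-based binning with a 97% cut-off. It is not the procedure of
any particular bioinformatics tool: the greedy clustering, the position-wise identity
measure for equal-length reads, and all function names and toy sequences are assumptions
made for illustration. Real pipelines rely on sequence alignment and on quality filtering.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy measure, equal-length reads only)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_binning(reads: List[str], threshold: float = 0.97) -> Tuple[List[str], List[int]]:
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid becomes the centroid of a new OTU.
    """
    centroids: List[str] = []
    assignment: List[int] = []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                assignment.append(idx)
                break
        else:
            centroids.append(read)
            assignment.append(len(centroids) - 1)
    return centroids, assignment


def abundance_matrix(samples: Dict[str, List[str]], threshold: float = 0.97):
    """Pool the reads of all samples, bin them, and count reads per sample and OTU."""
    pooled = [(name, read) for name, reads in samples.items() for read in reads]
    centroids, assignment = greedy_binning([read for _, read in pooled], threshold)
    counts = {name: [0] * len(centroids) for name in samples}
    for (name, _), otu in zip(pooled, assignment):
        counts[name][otu] += 1
    return centroids, counts


# Hypothetical toy reads of length 40: one mismatch gives 97.5% identity.
base = "ACGT" * 10
variant = "T" + base[1:]   # differs from `base` at one position
other = "G" * 40           # unrelated sequence

toy_samples = {"s1": [base, variant, other], "s2": [base, other, other]}
centroids, counts = abundance_matrix(toy_samples)
for name, row in counts.items():
    print(name, row)
```

Each row of `counts` corresponds to a sample and each column to an OTU, which is the
layout of an abundance matrix.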

Two important characteristics of the abundance data:

• Rare species are prone to remain undetected in the community due to sequencing
error or insufficient sequencing depth or coverage [?].

• The total number of sequences or reads (library size) varies from sample to sample
due to differences in wet-lab handling and in PCR efficiency during the
amplification step. This variation cannot be controlled, nor is it associated with
true abundance [?]. As a consequence, more rare OTUs will be observed in the
samples with larger library size.

The probability of observing a zero abundance therefore also depends on the library
size, and an insufficient library size may lead to zero inflation [?].
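
The dependence of zero counts on the library size can be illustrated under a simplifying
assumption that is ours and not part of the text above: if each read independently
belongs to a given OTU with probability equal to its true relative abundance p, then the
probability of observing no reads of that OTU in a library of N reads is (1 - p)^N. The
function name and the numerical values below are hypothetical.

```python
def prob_zero_count(relative_abundance: float, library_size: int) -> float:
    """P(zero reads of an OTU) when reads are sampled independently.

    Assumption: each of the `library_size` reads belongs to the OTU with
    probability `relative_abundance`, independently of the other reads.
    """
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must lie in [0, 1]")
    if library_size < 0:
        raise ValueError("library_size must be non-negative")
    return (1.0 - relative_abundance) ** library_size


# A rare OTU with a hypothetical true relative abundance of 0.1%.
for n_reads in (100, 1_000, 10_000):
    print(n_reads, prob_zero_count(0.001, n_reads))
```

Under this assumption the probability of a zero count decreases monotonically with the
library size, so samples with small libraries are expected to show more zeros for the
same underlying community.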

Examples of metagenomics studies

With the advancement of genomic sequencing methods and the drop in cost, the number of
metagenomic projects is increasing rapidly. More and more significant discoveries are
made through metagenomics studies. For example, the Earth Microbiome Project attempts to
characterise global microbial taxonomic and functional diversity for the benefit of the
planet and of humankind. Another example is the Human Microbiome Project, which studies
the relationship between the microbial community found at multiple human body sites and
human health [?].

The downstream statistical analysis includes exploratory data analysis, e.g., Principal
Component Analysis (PCA) for revealing the relationships between taxa, and CCA for
studying taxa-environment relationships.

Many human microbiome studies aim at the detection of bacterial species that show
different (relative) abundances between groups of people, e.g., between healthy and
obese subjects. In environmental metagenomics studies, the question may, e.g., relate to
the differential abundance of species between two regions with different climate
conditions. These questions are typically answered by means of large-scale statistical
hypothesis testing. This usually involves testing for differential abundance at the
individual species level, and subsequently correcting for multiple testing so as to
control the false discovery rate at a desired level. Instead of performing a statistical
analysis for each species separately, the research may also focus on the community
structure as summarised in a biodiversity index. See ?, ? and ? for examples.
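
As an illustration of the multiple testing correction mentioned above, the sketch below
implements the Benjamini-Hochberg procedure, one common way of controlling the false
discovery rate. The text does not state which procedure is used, so this choice is an
assumption, and the p-values are hypothetical numbers standing in for the results of one
differential abundance test per species.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values are monotone non-decreasing in the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def discoveries(p_values: List[float], fdr: float = 0.05) -> List[int]:
    """Indices of the tests declared significant at the given FDR level."""
    return [i for i, q in enumerate(benjamini_hochberg(p_values)) if q <= fdr]


# Hypothetical p-values, one per species (e.g. healthy versus obese subjects).
p_vals = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(p_vals))
print(discoveries(p_vals, fdr=0.05))
```

A species is reported as differentially abundant when its adjusted p-value does not
exceed the chosen false discovery rate level.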
