Biology and technology
Unlike classical microbial ecology studies, which depend on isolation and cultivation,
metagenomics studies microbial organisms directly in their environment,
i.e., without the need for laboratory cultivation of individual species.
In very general terms, metagenomics is based on the analysis of all genomic material
present in a sample taken from a sampling location (environmental, human, etc.).
The exact definition of metagenomics is still under debate in the scientific community.
We will use the term in a very strict sense: identification of all known microorganisms in
a sample, and we particularly focus on the sequencing of the 16S ribosomal RNA (16S
rRNA) gene of prokaryotes, which are single-cell organisms to which bacteria belong. Ribosomes are
complex molecular structures in cells that play an important role in the translation of
mRNA into a polypeptide chain. The structure is composed of several subunits, which
consist of proteins and of a few RNA molecules, which are referred to as ribosomal
RNA (rRNA). In contrast to mRNA, the rRNA molecules themselves are not translated
into proteins; they play a role in the functionality of the ribosomes. However, just like
mRNA, the rRNA is a transcript product of genomic DNA, and it can therefore be
sequenced using sequencing technologies (see further). The 16S rRNA gene is
very well suited for the identification of bacterial species, because it contains regions
that vary between species but are highly conserved within species. Several reference
databases, which connect the 16S rRNA gene sequences to bacterial species,
are available. Another advantage in the use of 16S rRNA is that these interesting
regions can be easily amplified (a necessary preprocessing step in most sequencing
technologies) because many universal PCR primers are available for the highly conserved
regions. A final advantage is that the method allows for the identification of
species based on only a very specific genomic DNA region so that many samples can
be sequenced simultaneously, and the cost is strongly reduced compared to, e.g., whole genome
shotgun sequencing methods for species identification. The latter basically consists
of fragmenting the whole genome into small fragments that are subsequently
amplified and sequenced. This method allows the whole genome to be sequenced,
and it has the advantage of giving fewer species identification errors, but it is more
expensive, and the genome assembly step is time-consuming and also error-prone. In
this thesis we use two datasets obtained through 16S rRNA sequencing.
The workflow starts with sample collection. The collected samples are first purified and DNA is
extracted. However, the different DNA fragments are mixed together and they need
to be separated for sequencing. The step of separating the sequences is part of the
process called library construction. The target region of the 16S rRNA gene is amplified through
the Polymerase Chain Reaction (PCR). The amplified 16S rRNA fragments are then
sequenced using a massively parallel sequencing technique. Several sequencing
platforms exist; we refer to ? for a comparison of some of the major platforms. The
output of the sequencing device is a large set of reads. Each read is a sequence of
nucleotides originating from a DNA fragment. Because the sequencing starts from
PCR-amplified DNA fragments, each original DNA fragment may be sequenced multiple
times. The average number of reads that cover a nucleotide is referred to as the
sequencing depth or coverage. It is expected that the larger the coverage, the less
error-prone the subsequent statistical analysis. The total number of reads produced by
the sequencing experiment of one sample is known as the library size of the sample.
Once the reads are available, the wet-lab handling is over and the species identification
process continues with data-processing steps (bioinformatics).
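The definitions of library size and coverage given above can be illustrated with a minimal sketch. It assumes that every read lies entirely within the target region, so that the sequenced bases can simply be summed; the function names and the toy reads are hypothetical and only serve as illustration.

```python
def library_size(reads):
    """Library size: the total number of reads produced for one sample."""
    return len(reads)


def mean_coverage(reads, region_length):
    """Average number of reads covering a nucleotide of the target region.

    Assumes every read lies entirely within the target region, so the
    sequenced bases can be summed and divided by the region length.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    total_bases = sum(len(read) for read in reads)
    return total_bases / region_length


# Hypothetical toy sample: four short reads from a 10-nucleotide region.
toy_reads = ["ACGTACGTAC", "ACGTA", "CGTAC", "ACGTACGTAC"]
print(library_size(toy_reads))       # number of reads in the sample
print(mean_coverage(toy_reads, 10))  # sequenced bases / region length
```

In practice, reads are aligned to a reference before coverage is computed; the sketch only mirrors the definition used in the text.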
Next, the sequenced reads are clustered into groups of closely related
sequences; this step is called binning. A binning method can be based on the similarities
among the sequences or on the similarity of a sequence to known references [?]. The
reads can be binned at different levels of similarity. A cut-off of 97%
similarity is often applied to obtain the Operational Taxonomic Unit (OTU) level, which is
a pragmatic proxy for the microbial “species” taxonomic level. Reference databases
are available for OTU (or species) identification. The sequencing technology does not
reads mapped to an OTU is also considered as a proxy for the abundance of the OTU
in the sample. Hence, the data can be represented as an abundance matrix as shown
in Figure 1.1. Starting from the OTU classification, data can also be represented at
higher taxonomic ranks, using, e.g., the bacterial phyla classification.
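The path from reads to an abundance matrix can be sketched as follows. This is a simplified illustration and not the binning method used for the datasets in this thesis: it assumes equal-length reads, measures similarity as the fraction of identical positions, and uses a greedy rule in which each read joins the first OTU whose seed it matches at the 97% cut-off. All function names, toy sequences and phylum labels are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def bin_reads(reads, cutoff=0.97):
    """Greedy binning: a read joins the first OTU whose seed it matches."""
    seeds, labels = [], []
    for read in reads:
        for otu, seed in enumerate(seeds):
            if identity(read, seed) >= cutoff:
                labels.append(otu)
                break
        else:
            seeds.append(read)             # the read founds a new OTU
            labels.append(len(seeds) - 1)
    return seeds, labels


def abundance_matrix(samples, cutoff=0.97):
    """Rows are samples, columns are OTUs, entries are read counts."""
    names = list(samples)
    all_reads = [r for name in names for r in samples[name]]
    owners = [name for name in names for _ in samples[name]]
    seeds, labels = bin_reads(all_reads, cutoff)
    counts = {name: Counter() for name in names}
    for name, otu in zip(owners, labels):
        counts[name][otu] += 1
    matrix = [[counts[n][otu] for otu in range(len(seeds))] for n in names]
    return names, seeds, matrix


def collapse(matrix, groups):
    """Sum OTU columns that share a higher-rank label (e.g. a phylum)."""
    ranks = sorted(set(groups))
    rows = [[sum(c for c, g in zip(row, groups) if g == rank)
             for rank in ranks] for row in matrix]
    return ranks, rows


# Hypothetical toy reads of 100 nucleotides each.
base = "ACGT" * 25
near = "TT" + base[2:]   # 2 mismatches with base, i.e. 98% identity
far = "G" * 100          # 25% identity with base
samples = {"s1": [base, near, far], "s2": [far, far]}

names, seeds, matrix = abundance_matrix(samples)
print(names, matrix)
print(collapse(matrix, ["phylum_A", "phylum_A"]))
```

The greedy rule makes the result depend on the order of the reads; it is chosen here only because it keeps the example short.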
Two important characteristics of the abundance data:
• Rare species are prone to go undetected in the community due to sequencing
errors or insufficient sequencing depth (coverage) [?].
• The total number of sequences or reads (library size) varies from sample to sample
due to differences in wet-lab handling and in PCR efficiency during the
amplification step. This variation cannot be controlled, nor is it associated with
true abundance [?]. As a consequence, more rare OTUs will be observed in the
samples with a larger library size.
Consequently, the probability of observing zero abundance also depends on the
library size, and an insufficient library size may lead to zero inflation [?].
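The dependence of zero counts on the library size can be made concrete with a small sketch. It assumes, purely for illustration, that the reads of a sample are drawn independently and that each read belongs to a given OTU with a probability equal to the relative abundance of that OTU (multinomial sampling). This sampling model is an assumption of the sketch and is not taken from the text; the function name is hypothetical.

```python
def prob_zero(relative_abundance, library_size):
    """Probability that an OTU receives no reads at all.

    Under the assumed multinomial sampling, each of the library_size reads
    misses the OTU with probability (1 - relative_abundance).
    """
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must lie in [0, 1]")
    return (1.0 - relative_abundance) ** library_size


# A rare OTU (relative abundance 0.1%) at three hypothetical library sizes.
for n in (100, 1000, 10000):
    print(n, prob_zero(0.001, n))
```

Under this assumption the zero probability decreases monotonically with the library size, so the same rare OTU is more likely to show a zero count in a sample with few reads.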
Examples of metagenomics studies
With the advancement of genomic sequencing methods and the drop in cost, the
number of metagenomic projects is increasing rapidly. More and more significant
discoveries are made through metagenomics studies. For example, the Earth Microbiome
Project aims to characterise global taxonomic and functional diversity
for the benefit of the planet and humankind. Another example is the Human
Microbiome Project, which studies the relationship between the microbial community found at multiple human body sites and human health [?].
The downstream statistical analysis includes exploratory data analysis, e.g., Principal
Component Analysis for revealing the relationships between taxa, and CCA for studying
taxa-environment relationships.
Many human microbiome studies aim at the detection of bacterial species that show
different (relative) abundances between groups of people, e.g., between healthy and
obese subjects. In environmental metagenomics studies, the question may, e.g.,
relate to differential abundance of species between two regions with different climate
conditions. These questions are typically answered by means of large-scale statistical
hypothesis testing. This usually involves testing for differential abundance at the individual
species level, and subsequently correcting for multiple testing so as to control the
false discovery rate at a desired level. Instead of performing a statistical analysis for
each species separately, the research may also focus on the community structure as
summarised into a biodiversity index. See ?, ? and ? for examples.
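The two analysis strategies mentioned above can be sketched briefly. The text does not name a specific multiple testing procedure or a specific biodiversity index; the sketch assumes the Benjamini-Hochberg procedure for false discovery rate control and the Shannon index as one possible biodiversity index. The p-values and counts are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over the OTUs with at least one read."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical per-species p-values from differential abundance tests.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
discoveries = [a <= 0.05 for a in adjusted]  # control the FDR at 5%
print(adjusted, discoveries)

# Hypothetical OTU counts of one sample, summarised into a single index.
print(shannon_index([5, 5]))
```

A species is declared differentially abundant when its adjusted p-value does not exceed the desired false discovery rate level, whereas the biodiversity index reduces each sample to a single number that can be compared between groups.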