An overview of the RNA-seq in silico analysis, from analysis of the raw sequence
Figure 2.4. Overview of the workflow for RNA-seq in silico analysis. Software and
Illumina sequencing
High-quality total RNA was sent to BGI for sequencing on the Illumina HiSeq 2000 platform. Messenger RNA (mRNA) was enriched from total RNA using the Ribo-Zero™ rRNA Removal Kit for bacteria (Epicentre, USA), and prokaryotic strand-specific transcriptome libraries were prepared as described above. Libraries with 200 bp inserts were constructed, paired-end sequencing was performed on each library, and 90 bp reads were generated. BGI provided data that had been filtered to remove reads containing ≥ 10% unreadable bases, ≥ 20% low-quality (≤ Q20) bases, or adaptor contamination. For each sample, two FASTQ files were provided, representing the forward and reverse reads. FastQC (Andrews, 2010) was used to assess the quality of the FASTQ files. DynamicTrim, a wrapper script, was used for quality checks, adaptor trimming, and trimming of paired-end and single-end data (Cox et al., 2010). By default, the first 10 bp of all reads and any reads with a quality score ≤ Q28 were trimmed using DynamicTrim. For this analysis, the default error probability setting of p = 0.05 (equivalent to a quality score of Q ≈ 13) was applied.
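The correspondence between p = 0.05 and Q ≈ 13 follows from the standard Phred relationship Q = -10 log10(p). The short sketch below only illustrates that conversion; the function names are hypothetical and are not part of DynamicTrim or FastQC.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert a base-calling error probability p into a Phred quality score Q."""
    return -10 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q back into an error probability p."""
    return 10 ** (-q / 10)


# The p = 0.05 setting quoted above corresponds to Q of roughly 13.
print(round(error_prob_to_phred(0.05), 2))
# Q20, the low-quality threshold quoted for the BGI filter, is a 1% error rate.
print(phred_to_error_prob(20))
```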
Sequence read alignment and transcript assembly
Rockhopper is a system specifically designed for computational analysis of
bacterial RNA-seq data (McClure et al., 2013; Tjaden, 2015), and supports reference-
based assembly of bacterial transcriptomes (http://cs.wellesley.edu/~btjaden/Rockhopper). Reference-based assembly from short read sequences involves aligning sequencing reads to a sequenced reference genome (Flicek and Birney, 2009; Martin and Wang, 2011). Reference-based assembly was preferred because it is a fast and relatively precise approach, but mainly because high-quality genome sequences were available for
both B. hungatei MB2003 and B. proteoclasticus B316. Rockhopper was run on each set
of FASTQ reads representing mono- and co-culture growth of B. hungatei MB2003 and
B. proteoclasticus B316 on xylan and pectin conditions separately (Figure 2.4). The
system utilises Bowtie 2 (Langmead and Salzberg, 2012), with default parameters, for the alignment of sequence reads in order to make an estimate of gene expression directly from alignment results. Illumina sequence datasets were used, and tables containing the coordinates of the tRNA and rRNA genes as well as the protein coding genes within both
B. hungatei MB2003 and B. proteoclasticus B316 genomes were created. The tRNA
genes and non-coding RNAs (ncRNA) were manually removed from the datasets and reads that did not align to rRNA, tRNA and ncRNA sequences were saved to a FASTQ
file for further analysis. For the reference-genome-based assembly of the transcriptomes, Rockhopper was run on each set of FASTQ reads using default parameters, with the allowed mismatches and minimum seed length parameters set to 0.02 and 0.33, respectively. To make the read counts comparable among the different samples analysed (or replicates of samples), the counts for each gene were normalised by upper quartile normalisation.
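As an illustration of upper quartile normalisation, the sketch below scales each sample's gene counts by the 75th percentile of that sample's counts. It assumes that the upper quartile is taken over genes with non-zero counts, which is a common convention but is not confirmed here as Rockhopper's behaviour. The sample names and counts are invented for the example.

```python
import statistics


def upper_quartile(counts):
    """Return the 75th percentile of the non-zero counts in one sample.

    Assumption: genes with zero counts are excluded before taking the quartile.
    """
    nonzero = [c for c in counts if c > 0]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalise(samples):
    """Divide every count in each sample by that sample's upper quartile."""
    normalised = {}
    for name, counts in samples.items():
        uq = upper_quartile(counts)
        normalised[name] = [c / uq for c in counts]
    return normalised


# Hypothetical counts: the second sample was sequenced twice as deeply.
samples = {
    "mono_culture": [0, 10, 20, 30, 40],
    "co_culture": [0, 20, 40, 60, 80],
}
normalised = upper_quartile_normalise(samples)
print(normalised["mono_culture"])
print(normalised["co_culture"])
```

After scaling, the two hypothetical samples have the same normalised profile, because they differ only in sequencing depth.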
Differential gene expression
The Rockhopper output contains the total number of aligned reads and the reads per kilobase of gene per million reads mapped (RPKM), a common measure for quantifying gene expression. RPKM sums the number of reads for a gene and divides by the gene's length and the total number of reads. Rockhopper reports the expression level of each transcript using RPKM, except that instead of dividing by the total number of reads, it divides by the upper quartile of gene expression. Rockhopper then tests for differential expression by first obtaining, via local regression, a smoothed estimate of the variance for each expressed gene. A null hypothesis test was then performed based on a negative binomial distribution model (Robinson and Smyth, 2008; Anders and Huber, 2010). The resulting p-value is the probability of observing a transcript's expression levels in the different conditions (in this case mono- and co-culture) by chance. Because multiple tests are performed to determine differential gene expression, Q-values (adjusted p-values) that control the false discovery rate (FDR) using the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) were also reported. Because Rockhopper was run for each set of FASTQ files separately, the output data were grouped and compared based on the treatments (xylan or pectin) relative to the growth conditions (mono- versus co-culture gene expression) for B. hungatei MB2003 and B. proteoclasticus B316 separately. A gene was considered statistically differentially expressed (DEG) in each bacterium between the mono- and co-culture environments at a Q-value ≤ 0.05.
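The two calculations described above can be sketched in a few lines. The first function is the conventional RPKM formula. Passing an upper-quartile value in place of the total read count would approximate the variant described for Rockhopper, although its exact scaling is not given here. The second function is a plain implementation of the Benjamini-Hochberg adjustment, not Rockhopper's own code. All input values are invented for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million reads mapped (conventional form)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (Q-values) in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone: q(rank) = min(q(rank + 1), p * m / rank).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical example: 500 reads on a 1 kb gene in a library of 1 million reads.
print(rpkm(500, 1000, 1_000_000))

# Hypothetical p-values for four genes; genes with Q <= 0.05 are called DEGs.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
deg_flags = [q <= 0.05 for q in q_values]
print(q_values, deg_flags)
```

In this toy example, three genes have p ≤ 0.05, but only one remains at Q ≤ 0.05 after adjustment, which shows why the cut-off is applied to Q-values rather than raw p-values.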
Statistical analysis of RNA-seq data
Two forms of multidimensional scaling (MDS) (Clarke, 1993) were used to analyse the RNA-seq data: non-metric MDS (NMDS) (Faith et al., 1987; Minchin, 1987) and metric scaling in the form of Principal Coordinate Analysis (PCoA) (Gower, 1966). For the NMDS analysis, stress (standardised residual sum of squares) was used to measure the agreement between the original distance matrix of the sample replicates and the distances in the low-dimensional space (Gower and Legendre, 1986). Bray-Curtis distances (Anderson, 2001) of the normalised counts were used to measure pairwise dissimilarity, using the vegdist function of the vegan package in R (R Core Team, 2015). The relative power of NMDS depends on the resemblance method (dissimilarity measure) used as the basis of the analysis. Because the dissimilarity matrix was constructed using Bray-Curtis distances, the analysis is more focused on compositional changes in sample identities than on differences in the relative abundance of transcripts per sample (Petersen et al., 2011). The pcoa function in the ape R package and the metaMDS function of the vegan package were used to create the configuration of all points in two-dimensional space.
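For reference, the Bray-Curtis dissimilarity between two samples is the sum of absolute differences in counts divided by the sum of all counts in both samples. The sketch below mirrors that formula in plain Python. It is not vegan's vegdist implementation, and the sample names and counts are invented.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    # Two all-zero samples are treated as identical in this sketch.
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities as a nested dictionary."""
    return {
        a: {b: bray_curtis(xa, xb) for b, xb in samples.items()}
        for a, xa in samples.items()
    }


# Hypothetical normalised counts for three genes in three samples.
samples = {
    "mono_rep1": [10, 0, 5],
    "mono_rep2": [5, 5, 5],
    "co_rep1": [0, 20, 0],
}
for name, row in dissimilarity_matrix(samples).items():
    print(name, row)
```

A value of 0 means identical profiles and a value of 1 means that the two samples share no expressed genes.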
Permutational multivariate analysis of variance uses distance matrices to define sources of variation among samples (Zapala and Schork, 2006). This resembles the ‘homogeneity of variances’ assumption required when doing univariate (one gene at a time) or nonparametric multivariate ANOVA (or “one-way MANOVA”). A large p-value therefore indicates that the ‘homogeneity of variances’ assumption is satisfied (Warton et al., 2012). Correspondence analysis (CA) (Greenacre, 1993) was also utilised
to explore and visualise the association between groupings of samples based on the culture condition (mono- (C1) and co-culture (C2)), treatment (xylan (X) and pectin (P))
and associated genes for B316 and MB2003. The R package FactoMineR was used to
explore CA, while clustering, data manipulations and additional analysis of high-
dimensional data sets were considered using packages such as doBy and mixOmics. Genes
with zero or near-zero variances were first removed from the analyses using the
nearZeroVar function of the mixOmics R package.
A Chi-square test of independence was carried out on the raw transcript data to assess whether paired observations on two variables were independent of each other; in other words, whether the genes deemed differentially expressed were independent of culture condition (mono- and co-culture) and treatment (xylan and pectin). The goal was to obtain a global view of the gene expression data that is useful for interpretation. The
CA procedure utilised here can be summarised as follows. The frequencies for any row-by-column combination of categories are related to all other combinations through the marginal frequencies, which are intimately connected with the Chi-square (χ²) test (statistic) of independence. This yields a conditional expectation, very similar to an expected Chi-square value. Once obtained, these values are normalised, and a process much like principal component analysis (PCA) defines the lower-dimensional solutions. The maximum number of new dimensions is equal to min(number of rows, number of columns) - 1. Whereas PCA decomposes the ‘total variation’ among the original variables, CA decomposes the “total inertia”. The “total inertia” is a measure of the total association between the rows and columns of the given contingency table, and equals the χ²-statistic divided by the grand total. The low dimensions then simultaneously relate the rows and columns in a single graph, and the axes are called “principal axes”. Note that the graph should be thought of as two different overlaid plots, one for rows and the other for columns.
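The quantities named above can be computed directly from a contingency table. The following sketch derives expected frequencies from the marginal totals, then the χ² statistic, the total inertia (χ² divided by the grand total) and the maximum number of CA dimensions. It assumes that every row and column total is non-zero. The table values are invented, and this is not the FactoMineR implementation.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, grand total) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence, from the marginal totals.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, grand_total


def total_inertia(table):
    """Total inertia of the table: chi-square statistic / grand total."""
    chi2, grand_total = chi_square_statistic(table)
    return chi2 / grand_total


def max_dimensions(table):
    """Maximum number of CA dimensions: min(rows, columns) - 1."""
    return min(len(table), len(table[0])) - 1


# Hypothetical table: rows are two condition categories, columns are two genes.
table = [
    [10, 20],
    [20, 10],
]
print(chi_square_statistic(table))
print(total_inertia(table))
print(max_dimensions(table))
```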
To interpret the CA plot, each point should be observed relative to the origin or centroid, where each gene is plotted according to its association with each culture condition/treatment category (PC1, PC2, XC1 and XC2). Points that lie in similar directions are positively associated, genes on opposite sides of the origin are negatively associated, and genes furthest from the origin exhibit the strongest associations. Thus, for a positive association such as up-regulation of a gene in a particular condition, the two points will lie on the same side of the centroid, and the larger the distance from the centroid, the stronger the association. A negative association such as down-regulation will cause the column point and the row point to lie on opposite sides of the centroid. CA also identified the relative associations between genes, in addition to which genes were most or least up- or down-regulated overall.
When interest is focussed on relationships or associations between a row point/category (PC1, PC2, XC1 and XC2) and all column categories (genes), a rather
more complicated method that first draws a line (‘direction vector’) on the plot (of all row
and column points in 2 or higher dimensions) through the origin and the point corresponding to the row point in question may be considered. Perpendiculars to this direction vector are then dropped from each column point on the plot. The relative relationships of the column categories to the row category of interest can be read off by traversing the direction vector through the row point and noting where the perpendiculars from the column points intersect it. A column point whose intersection with the direction vector lies on the same side (say, the +ve side) of the origin as the row point of interest occurs more often in the sample than the average over the row categories overall, whereas one on the other side (say, the -ve side) of the origin occurs less frequently than the average. In addition, the further from the origin on the +ve side such an intersection occurs, the greater the frequency of the column point in the sample. Conversely, the further out on the -ve side an intersection falls, the less frequent the column point in the sample. A matrix of similarity between each of the row categories and column categories can be constructed from the distances (+ve or -ve) of the positions of column points along the above direction vectors in the multidimensional CA configuration. These similarities or measures of association can
then be used to create ‘Clustered Image Maps’ (CIM) and ‘Network graphs’ (González
et al., 2012) to highlight the relationship between the rows and columns.
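Geometrically, dropping a perpendicular from a column point onto the direction vector is a scalar projection: the signed distance from the origin is the dot product of the column point with the row point, divided by the length of the row point. The sketch below builds the resulting association matrix under that reading. The coordinates, category labels and gene names are invented, and this is not the mixOmics code used to draw the CIMs.

```python
import math


def signed_projection(row_point, column_point):
    """Signed distance along the direction vector (origin -> row_point) of the
    foot of the perpendicular dropped from column_point."""
    length = math.sqrt(sum(r * r for r in row_point))
    if length == 0:
        raise ValueError("row point coincides with the origin")
    return sum(r * c for r, c in zip(row_point, column_point)) / length


def association_matrix(row_points, column_points):
    """Signed projections of every column point onto every row direction."""
    return {
        row_name: {
            col_name: signed_projection(rp, cp)
            for col_name, cp in column_points.items()
        }
        for row_name, rp in row_points.items()
    }


# Hypothetical 2-D CA coordinates for two row categories and three genes.
row_points = {"XC1": (3.0, 4.0), "XC2": (-3.0, -4.0)}
column_points = {
    "gene_a": (3.0, 4.0),    # same direction as XC1: positive association
    "gene_b": (-3.0, -4.0),  # opposite direction: negative association
    "gene_c": (4.0, -3.0),   # perpendicular: no association along this axis
}
for row_name, row in association_matrix(row_points, column_points).items():
    print(row_name, row)
```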
The Kruskal-Wallis (KW) analysis of variance by ranks of the p-values (Hollander
et al., 2013) was then performed on the normalised counts independently using the kruskal.test function in the R stats package. This analysis incorporated the ANOVA t-test and
the KW pair-wise analysis of variance by ranks, which is a non-parametric test investigating whether samples originated from the same distribution (Conover, 1980). Assumptions were made when using this statistical test; however, the KW test was selected because it applies to situations where two or more independent samples of varying sizes are compared.
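To make the rank-based nature of the test concrete, the sketch below computes the Kruskal-Wallis H statistic with midranks and the usual tie correction. It stops at H and does not compute a p-value, which in practice comes from a chi-square distribution with (number of groups - 1) degrees of freedom, as reported by kruskal.test. The group values are invented.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (tie-corrected) for two or more groups."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign midranks: tied values share the average of the ranks they span.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Hypothetical normalised counts for one gene; group sizes may differ.
print(kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9]))
print(kruskal_wallis_h([1, 2], [3, 4, 5]))
```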