Rosita 42 F Single Technologist Unemployed Surgical Subsidised
5. CONCLUSIONS AND RECOMMENDATIONS
5.1 Experience of living with HIV/AIDS
This first application case concerns a genetic study of transcription initiation, profiled using Cap Analysis of Gene Expression (CAGE) in Drosophila melanogaster at three stages of development. The analysis is part of a project in collaboration with Eileen Furlong's group at EMBL Heidelberg (Germany) and was published in Schor et al. (2017). The aim of this part of the analysis was to map QTLs for transcription initiation in a joint analysis across the three developmental stages.
The analysis of these data is challenging, as transcription initiation is a high-dimensional molecular trait consisting of hundreds of univariate measurements. Moreover, development adds another phenotypic dimension to the data. To address these challenges, we first performed a conventional dimensionality reduction using PCA and then considered PC-based phenotypes for joint QTL mapping across developmental stages. The flexibility of the fixed-effect testing module implemented in LIMIX (see Section 5.2.2) was essential to define statistical tests for the specific study design (see below). The statistical pipeline for QTL mapping presented herein was designed and validated by me in collaboration with Jacob Degner.
Background. A transcription start site (TSS) is a genomic location where transcription is initiated (Zvelebil and Baum, 2007). CAGE is a molecular profiling assay that enables the characterisation of transcription initiation on a genome-wide scale (Shiraki et al., 2003). CAGE isolates the sequence fragment at the 5' end of RNA molecules, which is the first part of the gene being transcribed. Mapping these short sequences to a reference genome enables the characterisation of the TSS distribution. Recent studies have shown that while many genes have a unique and well-defined TSS, for others the distribution of TSSs can span regions of up to thousands of bases (Lenhard et al., 2012; Carninci et al., 2006; Ni et al., 2010). While genetic effects on RNA expression levels have been widely studied, the extent to which genetic variation affects transcription initiation remains unknown. To investigate this, Eileen Furlong's group profiled transcription initiation in 81 genotyped lines of Drosophila melanogaster at 2-4h, 6-8h and 10-12h after egg laying using CAGE.
Phenotype definition. Transcription initiation regions (TIRs) were defined as the 1 kb regions centred on the highest CAGE peaks. In total, 13,508 TIRs were identified. Denoting with $N$ (= 81) the number of individuals and with $D$ (= 3) the number of developmental stages, each TIR corresponds to $N \times D \times 1{,}000$ count measures (where 1,000 is the number of base pairs in a TIR). Even in a single-stage analysis, joint modelling of the count data in a TIR would entail joint QTL mapping of 1,000 univariate traits. To reduce the dimensionality of the problem, we projected the TSS distribution onto the three leading principal components, i.e. we performed dimensionality reduction in the base pair space. Specifically, we performed PCA on square-rooted counts across all lines and developmental stages. For each TIR, we also defined a mean-based phenotype as the sum of the read counts in the TIR. To adjust for batch effects and other hidden covariates, we applied PEER (Stegle et al., 2012) independently for each TIR and developmental stage to each of the $K_{\mathrm{PC}} = 3$ PC-based phenotypes and to the mean-based phenotype, considering 10 unknown factors. The residuals from PEER were quantile-normalised to a normal distribution to ensure that model assumptions were fulfilled.
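The phenotype construction described above can be sketched for a single TIR as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the counts are simulated, the rows are assumed to stack all lines and stages (N × D = 243), and the PEER adjustment step is omitted entirely.

```python
import numpy as np
from statistics import NormalDist

def pc_phenotypes(counts, n_pcs=3):
    """Scores on the leading PCs of square-rooted counts.

    counts: array of shape (n_samples, n_bp) for one TIR, where rows are
    assumed to stack all lines and developmental stages.
    """
    x = np.sqrt(counts)                    # square-root transform of counts
    x = x - x.mean(axis=0)                 # centre each base-pair position
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]        # sample scores, one column per PC

def mean_phenotype(counts):
    """Mean-based phenotype: total read count in the TIR per sample."""
    return counts.sum(axis=1)

def quantile_normalise(y):
    """Map values to standard normal quantiles by rank."""
    y = np.asarray(y, dtype=float)
    ranks = np.argsort(np.argsort(y)) + 1  # ranks 1..n
    nd = NormalDist()
    return np.array([nd.inv_cdf(r / (len(y) + 1)) for r in ranks])

# Simulated example: 81 lines x 3 stages, 1,000 base pairs (assumed layout).
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(81 * 3, 1000))
pcs = pc_phenotypes(counts)                # shape (243, 3)
total = mean_phenotype(counts)             # shape (243,)
z = quantile_normalise(pcs[:, 0])          # normalised first PC phenotype
```

In the study, the quantile normalisation is applied to PEER residuals rather than directly to the PC scores, so the last line only illustrates the transform itself.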
Molecular QTL mapping. For each TIR, we considered all bi-allelic variants with
MAF > 5% that are within 100 kb from the centres of TIR regions and considered
three different analyses:
1. Single-stage analysis of mean expression. For each TIR and developmental stage, we considered the univariate linear mixed model described in Section 2.29.
We used the RRM to model genetic relatedness between lines.
2. Multi-stage analysis of mean expression. For each TIR, we considered the multi-trait linear mixed model in Eq (5.23) jointly modelling mean expression
levels across the three developmental stages. We considered both a common
effect test across all developmental stages and a specific effect test for each stage (see Section 5.2.2).
3. Multi-stage analysis of PC-based phenotypes. For each TIR, we considered the multi-trait linear mixed model in Eq (5.23) jointly modelling the $K_{\mathrm{PC}} = 3$ PC-based phenotypes across the $D = 3$ developmental stages, resulting in a total of 9 phenotypes. Denoting with $Y_i \in \mathbb{R}^{N \times D}$ the matrix of PC$_i$ phenotypes, where rows are samples and columns are developmental stages, the total phenotype matrix is $Y = [Y_1, Y_2, Y_3] \in \mathbb{R}^{N \times KD}$. Tests for common and specific effects across stages require the increased flexibility made available by the LIMIX framework. In this setting, a common effect is an effect that is heterogeneous across the $K_{\mathrm{PC}}$ PCs but constant (for each PC) across the $D$ developmental stages. In the same vein, a specific effect at stage $d$ is an effect that is different at stage $d$ (for each PC) with respect to the other developmental stages. These two tests correspond to
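\[
\text{common effect test:} \quad H_1: A = I_K \otimes \mathbf{1}_{D \times 1} \quad \text{vs} \quad H_0: A = 0
\]
\[
\text{specific effect test:} \quad H_1: A = I_K \otimes \begin{bmatrix} I_d^{\top} \\ I_{\sim d}^{\top} \end{bmatrix} \quad \text{vs} \quad H_0: A = I_K \otimes \mathbf{1}_{D \times 1}
\]
where $I_d$ and $I_{\sim d}$ are $D$-dimensional indicator vectors such that $(I_d)_i = \delta_{id}$ and $I_{\sim d} = \mathbf{1}_D - I_d$.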
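The Kronecker structure of the two trait designs can be made concrete with a small NumPy sketch. The function names are hypothetical and this is not LIMIX code. It assumes that a design matrix is oriented as (number of effects × number of phenotypes), so the all-ones block is used as a row vector, and that phenotype columns are ordered PC by PC with stages nested within each PC, as in $Y = [Y_1, Y_2, Y_3]$.

```python
import numpy as np

def common_design(K, D):
    """One effect per PC, shared across all D stages: shape (K, K*D)."""
    return np.kron(np.eye(K), np.ones((1, D)))

def specific_design(K, D, d):
    """Per PC, one effect for stage d and one for the other stages:
    shape (2*K, K*D). The stage index d is 0-based."""
    I_d = np.zeros(D)
    I_d[d] = 1.0                     # indicator of stage d
    I_not_d = 1.0 - I_d              # indicator of the remaining stages
    return np.kron(np.eye(K), np.vstack([I_d, I_not_d]))

K, D = 3, 3                          # three PCs, three developmental stages
A_common = common_design(K, D)       # alternative of the common effect test
A_specific = specific_design(K, D, d=0)  # alternative of the specific test
```

Summing the two rows that belong to each PC in `A_specific` recovers `A_common`, which shows that the common-effect design is nested within the specific-effect design, as required for the specific effect test.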
To account for multiple testing, we used a two-step procedure (see discussion in Section 2.2.2). First, we calculated a TIR-level P value for each TIR considering 10,000 permutations of the genotype data across lines. Second, the TIR-level P values were corrected for multiple testing across TIRs using the Benjamini-Hochberg (BH) procedure. For the single-stage analysis, the BH correction was performed across both TIRs and developmental stages.
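The two-step procedure can be sketched as below. The text does not specify which statistic is permuted, so this sketch assumes that the TIR-level statistic is the minimum variant-level P value in the cis window and that the empirical P value uses the common add-one estimator. Both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def empirical_pvalue(p_obs_min, p_perm_min):
    """TIR-level P value from permutations.

    p_obs_min: minimum variant-level P value observed for the TIR.
    p_perm_min: minimum P values obtained after permuting genotypes
    across lines (one value per permutation).
    """
    p_perm_min = np.asarray(p_perm_min, dtype=float)
    return (1 + np.sum(p_perm_min <= p_obs_min)) / (1 + p_perm_min.size)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

A TIR would then be called significant at FDR < 1% when its adjusted value from `benjamini_hochberg` falls below 0.01.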
Results. Joint modelling of mean-based phenotypes across multiple stages increased power compared to the single-stage analysis (Fig. 5.2A-B). Additionally, the multi-stage PC-based analysis almost doubled the number of significant TIRs with respect to the mean-based analysis, identifying 4,526 TIRs with a significant QTL (FDR < 1%, Fig. 5.2A-B). Note that this set of QTLs includes variants that only affect the expression level (see Fig. 5.2C), variants that only affect the shape of the TSS distribution (see Fig. 5.2D) and variants that affect both. Interestingly, the model did not retrieve any TIR with significant stage-specific effects.
5.3.2 Dissecting the genetic and the epigenetic component of gene