CATEGORÍA CENTRAL
7. ANÁLISIS DE LA INFORMACIÓN
=
=
Amxn Bpxq (A Kronecker Product = B)mpxnq(a) Kronecker Product
vec(Y)NT
YNxT
(b) vec operator
Figure A.2: Graphical description of the Kronecker product and the vec operator. Left: The Kronecker product is a matrix product that multiplies each element of A with the complete matrix B. Right: The vec operator reshapes a matrix into a vector by concatenating the columns of the matrix.
A.3
Kronecker product
Let A be a m⇥ n matrix, and B be a p ⇥ q matrix. The Kronecker product A ⌦ B is a mp⇥ nq matrix and defined as follows:
A⌦ B = 0 B @ A11B . . . A1nB ... . .. ... Am1B . . . AmnB 1 C A (A.47)
The following equalities hold (Bernstein, 2009):
(A⌦ B)(C ⌦ D) = AC ⌦ BD (A.48)
(A⌦ B)> = A>⌦ B> (A.49)
(A⌦ B) 1 = A 1⌦ B 1 (A.50)
|A ⌦ B| = |A|p· |B|n (A.51)
(A⌦ B) vec(Y ) = vec BY A> , (A.52) where vec(Y ) is obtained by vertical concatenation of the columns of Y . A graphical description of the Kronecker product and the vec operator is given in A.2.
Let UASAUA> be the eigenvalue decomposition of A and UBSBUB>the eigenvalue
decomposition of B, then
(UA⌦ UB) SA⌦ SB+ 2I UA>⌦ UB> (A.53)
List of figures
2.1 A geometric interpretation of least squares. Let the num- ber of data points be N = 3 and the number of features be M = 2. The two feature vectors (x11, x21, x31), (x12, x22, x32) span a two-
dimensional subspace in R3. The orthogonal projection of y 2 R3
onto the subspace is the least squares estimator of y. Based on Hastie et al. (2009). . . 12 2.2 Bias-Variance decomposition. The training error (blue line) and
test error (green line) are shown as a function of the regularization parameter . Standard errors are computed over 30 repetitions. The test error (green line) decomposes into the squared bias (yellow), the variance (red) and noise.
For each repetition, we draw N = 200 random points as training set. The target is determined by the function y = Xw + ✏ with signal-to- noise ratio of 0.8 and M = 300. The weight vector and the test set (200 samples) are fixed over all repetitions. . . 14 2.3 Contours of the error and regularization function. We show the
contour plots of the error function for Lasso (left) and Ridge regression (right) in pink. Points along one contour line have the same function value. We restrict our attention to the weight vectors for which the regularization function is smaller than a certain threshold ↵ (yellow- shaded area). The optimal solution is found when the contours first hit the constraint region. For the Lasso, the solution is sparse (it lies on the axis), while for the Ridge it is not. Adopted from Tibshirani (1994). . . 19
2.4 Di↵erence between parametric and nonparametric models. Graphical presentation of a parametric model (left) and of a nonpara- metric model (right). Given the parameters, predictions are indepen- dent of the training data in parametric methods. In nonparametric methods, the dependencies cannot be resolved. Adopted from Barber (2012). . . 22 2.5 Samples drawn from a Gaussian process with di↵erent co-
variance functions. From left to right: We used a linear ( 2 = 1), polynomial ( 2 = 1, c = 0, d = 2) and squared exponential covariance
( 2 = 1, l2 = 1) function. To demonstrate the wiggling e↵ect of l2, we
also show the squared exponential covariance with l2 = 0.5 (dashed
lines). The mean function was set to zero in all three experiments. . . 23 2.6 Drawing functions from the posterior. We used a Gaussian pro-
cess with mean function m(x) = x and squared exponential covariance function ( 2 = 1, and l2 = 1). In the first plot from the left, we draw
samples from the prior. In the other two plots, we draw samples from the posterior after having made one observation (middle) and three observations (right plot). Observations are marked as red dots. The black line depicts the mean predictions. The yellow area contains all predictions within two standard derivations from the mean. Adopted from Rasmussen and Williams (2005). . . 26 3.1 Whitening the data. The covariance matrix K + I is used to
decorrelate the markers from the phenotype by projecting them along the principal components and rescaling them to unit variance. . . 35 3.2 Realized relationship matrix from the 1196 plants of Ara-
bidopsis thaliana available from (Horton et al., 2012). The relatedness between the individuals is complex and strong as the ma- trix is deeply structured. . . 37 3.3 Evaluation of alternative methods on semi-empirical GWAS
datasets, mimicking population structure as found in Ara- bidopsis thaliana. (a) Precision-Recall Curve for recovering sim- ulated causal SNPs using alternative methods. Shown is precision (TP/(TP+FP)) as a function of the recall (TP/(TP+FN)). (b) Al- ternative evaluation of each method on the identical dataset using Receiver operating characteristics (ROC). Shown is the True Positive Rate (TPR) as a function of the False Positive Rate (FPR). . . 41
List of figures 111 3.4 Characteristics of alternative methods on semi-empirical GWAS
dataset. (a) Area under the precision recall curve as a function of the total e↵ect size of all causal SNPs. (b) Average negative log- likelihood of each selected SNPs under the multivariate normal distri- bution N (0, K) as a function of the number of SNPs that are active in the model. The smaller the log likelihood is, the more the SNPs are correlated with the population structure. For the LMM-Lasso and the Lasso active SNPs have been selected by following the regularization path. For linear mixed model (LMM) and linear model (LM), the set of active SNPs have been obtained in ascending order of the p-value obtained. In the beginning, Lasso and the linear model choose SNPs that heavily reflect the population structure, while the mixed model approaches do not. In both figures, the number of causal SNPs was 100. . . 42
3.5 Evaluation of alternative methods on the semi-empirical GWAS dataset for di↵erent simulation settings. Area under precision recall curve for finding the true simulated associations. Alternative simulation parameters have been varied in a chosen range. (a) Evalu- ation for di↵erent relative strength of population structure 2
pop. (b)
Evaluation for true simulated genetic models with increasing complex- ity (more causal SNPs). (c) Evaluation for variable signal to noise ratio 2
sig. . . 49
3.6 Di↵erentiation between multiple causal loci from spurious correlation due to linkage on simulated data. The upper two plots show a single SNP with a strong e↵ect in an LD block. The lower two plots show the same LD block, but with an additional SNP e↵ect with weaker e↵ect size in the opposite direction. While both methods detect the SNP with large e↵ect size, the second one is only uniquely recovered by the LMM-Lasso. The red lines indicate the causal SNPs, the blue dots the assigned score. . . 50
3.7 Predictive power and sparsity of the fitted genetic models for Lasso and LMM-Lasso applied to quantitative traits in model systems. Considered were flowering phenotypes in Ara- bidopsis thaliana and bio-chemical and physiological phenotypes with relevance for human health profiled in mouse. Comparative evalua- tions include the fraction of the phenotypic variance predicted and the complexity of the fitted genetic model (number of active SNPs). (a) Explained variance in Arabidopsis. (b) Explained variance in mouse. (c) Complexity of fitted models in Arabidopsis. (d) Complexity of fitted models in mouse. . . 51 3.8 Variance dissection into individual SNP e↵ects and global
genetic background driven by population structure. Shown is the explained variance on an independent test set as a function of the number of active SNPs for the flowering phenotype (10 C) in Arabidopsis thaliana. In blue, the predictive test set variance of the Lasso as a function of the number of SNPs in the model. In green, the total predictive variance of LMM-Lasso for di↵erent sparsity levels. The shaded area indicates the fraction of variance LMM-Lasso explains by means of individual SNP e↵ects (yellow) and population structure (green). LMM-Lasso without additional SNPs in the model corresponds to a genetic random e↵ect model (black star). . . 52 3.9 Evaluation of the Lasso methods for FLC gene expression
in Arabidopsis thaliana. Precision-Recall Curve for recovering SNPs in proximity to known candidate genes using alternative meth- ods. Shown is precision (TP/(TP+FP)) as a a function of the recall (TP/(TP+FN)). Each point in the plot corresponds to a specific se- lection threshold. . . 53 4.1 Demonstration of the coupling of input and output. Red solid
edges represent correlations between the phenotypes, blue solid lines represent relational dependencies between the SNPs. A dashed line represents an association between an SNP and a phenotype. . . 60 4.2 Runtime comparison for varying number of traits. The naive
method, based on a Cholesky factorization, is shown in green (dashed), the proposed method, exploiting the Kronecker structure, is shown in blue (solid). Shown is the averaged runtime (in seconds) and its standard error as a function of the number of traits. . . 63
List of figures 113 4.3 Performance of all algorithms on simulated data. Left: Com-
paring the methods in terms of predictive power. Shown is the Ex- plained Variance (EV) as a function of ↵. The dashed gray line repre- sents the upper bound, i.e. EV obtained when using true co-efficient matrix. Right: Area under Precision recall curve (AUPRC) for recov- ering the simulated associations as a function of the overlap parameter ↵. The error bars represent the standard error. . . 66 4.4 Power comparison for varying input noise. We fixed the overlap
parameter ↵ = 0.3 and varied the number of conflicting edges between 0 and 50. For each setting, we generated 50 datasets. The performance of all methods that are exploiting input structure is dropping as the number of conflicting edges is increasing, while the performance of the other methods stays constant. . . 67 4.5 Hierarchical clustering of the gene expression levels under
glucose treatment (left) and of the di↵erence between gene expression levels under ethanol and glucose treatments (right). Under the glucose condition, genes can be divided into two major clus- ters of co-expressed genes, whereas on the right, the co-expression pat- tern is more complex containing two major co-expression clusters, one of which contains three smaller clusters that are weakly co-expressed with the other small clusters. For the EG experiment, we focused on cluster 2, for the DCS experiment on cluster 2 and 4. . . 68
5.1 Runtime comparison on synthetic data. We compare our effi- cient GP-KS implementation (left) versus its naive counterpart (right). Shown is the runtime in seconds on a logarithmic scale as a function of the sample size and the number of traits. The optimization was stopped prematurely if it did not complete after 104 seconds. . . . . 85
5.2 Evaluation of alternative methods for di↵erent simulation settings. (a) Evaluation for varying signal strength. (b) Evaluation for variable impact of the hidden signal. (c) Evaluation for di↵erent strength of relatedness between the tasks. In each simulation setting, all other parameters were kept constant at default parameters marked with the yellow star symbol. . . 86
5.3 Fitted task covariance matrices for gene expression levels in yeast. (a) Empirical covariance matrix of the gene expression levels. (b) Signal covariance matrix learnt by GP-kronsum. (c) Noise co- variance matrix learnt by GP-kronsum. The ordering of the tasks was determined using hierarchical clustering on the empirical covariance matrix. . . 88 5.4 Correlation between the mean di↵erence of the two condi-
tions and the latent factors on the yeast dataset. Shown is the strength of the latent factor of (a) the genetic signal and (b) the noise trait-trait covariance matrix as a function of the mean di↵erence between the two environmental conditions. Each dot corresponds to one gene expression level. . . 89 A.1 One-dimensional Gaussian distribution. Gaussian probability
density function (pdf), with mean µ = 0 and variance 2 = 1, is
shown in blue. The mean is depicted with red, the standard deviation with green. 68.27% of the mass of the distribution is within one standard deviation of the mean (yellow-shaded area). . . 102 A.2 Graphical description of the Kronecker product and the vec
operator. Left: The Kronecker product is a matrix product that multiplies each element of A with the complete matrix B. Right: The vec operator reshapes a matrix into a vector by concatenating the columns of the matrix. . . 107
List of tables
3.1 Associations close to known candidate genes. We report true positives/positives (TP/P) for LMM-Lasso and Lasso for all pheno- types related to flowering time in Arabidopsis thaliana. P are all acti- vated SNPs and TP are all activated SNPs that are close to candidate genes. . . 46 3.2 Candidate genes containing multiple associations. List of all
candidate genes that have two activated SNPs in close proximity for all phenotype related to flowering time of Arabidopsis thaliana. The last two columns show the log10 transformed p-values for the linear and the linear mixed model. . . 47 4.1 Chosen regularization parameters. For each method, we show
the median of the regularization parameters determined by cross- validation. All regularization parameters lie within the chosen in- tervals or on the left-boundary of the interval. For all methods except SIOL we used the set {4 4, 4 3, 4 2, 4 140, 41, 42, 43}. For SIOL, we
changed the range to {2 8, 2 7, 2 6, 2 5, 2 4, 2 3, 2 2, 2 1, 20} since
the data is internally normalized. . . 65 4.2 Phenotypes used in the yeast experiments. Members of the
co-expression cluster used in the glucose (top) and glucose vs. ethanol (bottom) experiments. The classification to functions were derived from Saccharomyces Genome Database (Cherry et al., 2012). . . 70 4.3 Frequencies of categories of antagonistic and synergistic ef-
fects among interacting pairs of active SNPs. . . 71 4.4 Predictive power analysis. Explained variance (± standard error)
achieved by all algorithms in both datasets. . . 72 115
5.1 Predictive performance of the di↵erent methods on the Ara- bidopsis thaliana dataset. Shown is the squared correlation coefficient and its standard error (measured by repeating 10-fold cross-validation 10 times). . . 90
Bibliography
G. R. Abecasis, A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M. Kang, G. T. Marth, G. A. McVean, D. M. Altshuler, and et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422):56–65, Nov 2012.
M. A. ´Alvarez and N. D. Lawrence. Sparse convolved gaussian processes for multi- output regression. In NIPS, pages 57–64, 2008.
C. Archembeau, S. Guo, and O. Zoeter. Sparse Bayesian Multi-Task Learning, 2011. A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances
in Neural Information Processing Systems 19. MIT Press, 2007.
S. Atwell, Y. S. Huang, B. J. Vilhjalmsson, G. Willems, M. Horton, Y. Li, D. Meng, A. Platt, A. M. Tarone, T. T. Hu, and et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465(7298):627–631, Jun 2010.
C. A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, and K. M. Borgwardt. Efficient network-guided multi-locus association mapping with graph cuts. Bioin- formatics, 29(13):i171–179, Jul 2013.
D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.
S. I. Berndt, S. Gustafsson, R. Magi, A. Ganna, E. Wheeler, M. F. Feitosa, A. E. Justice, K. L. Monda, D. C. Croteau-Chonka, F. R. Day, T. Esko, and et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat. Genet., 45(5):501–512, May 2013.
D. S. Bernstein. Matrix mathematics : theory, facts, and formulas. Princeton Uni- versity Press, Princeton, N.J., Woodstock, 2009. ISBN 978-0-691-14039-1. URL http://opac.inria.fr/record=b1128603. Prcdente ed.: 2005.
J. Bigler, J. Whitton, J. W. Lampe, L. Fosdick, R. M. Bostick, and J. D. Potter. CYP2C9 and UGT1A6 genotypes modulate the protective e↵ect of aspirin on colon adenoma risk. Cancer Res., 61(9):3566–3569, May 2001.
C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.
J. S. Bloom, I. M. Ehrenreich, W. T. Loo, T. L. Lite, and L. Kruglyak. Finding the sources of missing heritability in a yeast cross. Nature, 494(7436):234–237, Feb 2013.
E. V. Bonilla, K. M. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008. URL nips08.pdf.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimiza- tion and statistical learning via the alternating direction method of multipli- ers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011. ISSN 1935-8237. doi: 10.1561/2200000016. URL http://dx.doi.org/10.1561/2200000016.
J. R. Broach. Ras-regulated signaling processes in Saccharomyces cerevisiae. Curr. Opin. Genet. Dev., 1(3):370–377, Oct 1991.
P. Buehlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011. ISBN 9783642201929. URL http://books.google.de/books?id=S6jYXmh988UC. G. Bulmer. Principles of Statistics. Dover Books on Mathemat-
ics Series. Dover Publications, 1979. ISBN 9780486637600. URL http://books.google.de/books?id=dh24EaSrmBkC.
Bibliography 119 P. Carbonetto and M. Stephens. Scalable Variational Inference for Bayesian Variable Selection in Regression, and its Accuracy in Genetic Association Studies. Bayesian Analysis, 7:73–108, 2012. ISSN 1931-6690. doi: 10.1214/12-ba703.
R. Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997. ISSN 0885-6125. doi: 10.1023/A:1007379606734. URL http://dx.doi.org/10.1023/A:1007379606734.
F. P. Casale, B. Rakitsch, C. Lippert, and O. Stegle. A mixed-model approach for association studies of multiple variants to a set of correlated traits, 2013. MLSB 2013: The seventh workshop on Machine Learning in Systems Biology, Berlin, July 19-20, 2013.
A. T. Chan, G. J. Tranah, E. L. Giovannucci, D. J. Hunter, and C. S. Fuchs. Genetic variants in the UGT1A6 enzyme, aspirin use, and the risk of colorectal adenoma. J. Natl. Cancer Inst., 97(6):457–460, Mar 2005.
S. S. Chen, D. L. Donoho, Michael, and A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.
X. Chen et al. A two-graph guided multi-task lasso approach for eqtl mapping. Journal of Machine Learning Research - Proceedings Track, 22:208–217, 2012. J. M. Cherry, E. L. Hong, C. Amundsen, R. Balakrishnan, G. Binkley, E. T. Chan,
K. R. Christie, M. C. Costanzo, S. S. Dwight, S. R. Engel, D. G. Fisk, J. E. Hirschman, B. C. Hitz, K. Karra, C. J. Krieger, S. R. Miyasato, R. S. Nash, J. Park, M. S. Skrzypek, M. Simison, S. Weng, and E. D. Wong. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic acids research, 40(Database issue):D700–5, Jan. 2012.
G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. URL http://igraph.sf.net.
A. Dahl, V. Hore, V. Iotchkova, and J. Marchini. Network inference in matrix-variate Gaussian models with non-independent noise. ArXiv e-prints, Dec. 2013.
A. Davies and Z. Ghahramani. The Random Forest Kernel and other kernels for big data from random partitions. ArXiv e-prints, Feb. 2014.
G. de los Campos, D. Gianola, and D. B. Allison. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat. Rev. Genet., 11(12):880–886, Dec 2010.
I. J. Deary, J. Yang, G. Davies, S. E. Harris, A. Tenesa, D. Liewald, M. Luciano, L. M. Lopez, A. J. Gow, J. Corley, P. Redmond, H. C. Fox, S. J. Rowe, P. Haggarty, G. McNeill, M. E. Goddard, D. J. Porteous, L. J. Whalley, J. M. Starr, and P. M. Visscher. Genetic contributions to stability and change in intelligence from childhood to old age. Nature, 482(7384):212–215, Feb 2012.
J. C. Denny, L. Bastarache, M. D. Ritchie, R. J. Carroll, R. Zink, J. D. Mosley, J. R. Field, J. M. Pulley, A. H. Ramirez, E. Bowton, and et al. Systematic comparison