

Statistical methods that measure the strength of an association between two univariate variables, measured on the same set of objects, are well established. To examine the relationship between two disparate multivariate datasets on the same set of samples, one of dimension p1 and the other of dimension p2, it is natural to examine all p1 × p2 pairwise associations. Standard measures of pairwise association include the correlation coefficient between two quantitative variables, the ANOVA F-statistic between a categorical variable and a quantitative variable, and Pearson's chi-square statistic between two categorical variables. We refer to the analysis of significant pairwise associations across multiple datatypes as association mining (or correlation mining for quantitative data).
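For quantitative data, all p1 × p2 pairwise correlations can be obtained from a single matrix product of the standardized datasets. The following is a minimal sketch, assuming NumPy is available; the function name `cross_correlation` and the simulated matrices `X` and `Y` are hypothetical and serve only as illustration.

```python
import numpy as np

def cross_correlation(X, Y):
    """Return the p1 x p2 matrix of Pearson correlations between columns of X and Y.

    X is n x p1 and Y is n x p2; rows of both are the same n samples.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must have the same number of rows (samples)")
    n = X.shape[0]
    # Standardize each column to mean 0 and unit sample standard deviation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    # One matrix product yields all p1 * p2 correlations at once.
    return Xs.T @ Ys / (n - 1)

# Simulated (hypothetical) data: n = 30 samples, p1 = 4, p2 = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=(30, 3))
Y[:, 0] += 2.0 * X[:, 1]          # plant one true association
R = cross_correlation(X, Y)
print(R.shape)                    # (4, 3)
```

Entry `R[j, k]` is the correlation between variable j of the first dataset and variable k of the second. The sketch assumes that no column is constant, since a constant column has zero standard deviation.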

Association mining across high-dimensional datatypes is complicated by the issue of multiple comparisons. Mining associations between two datatypes of dimension p1 and p2 requires assessing the significance of p1 × p2 variable pairs. Using standard procedures to assess the statistical significance of each pair independently will tend to produce many false discoveries (associations that are declared significant but do not reflect a true relationship). Several adjustments have been developed to address the issue of multiple comparisons, including control of the familywise error rate (Hochberg, 1988) or the false discovery rate (Benjamini and Hochberg, 1995). The familywise error rate is the probability of making at least one false discovery among all pairwise comparisons. The false discovery rate is the expected proportion of false discoveries among the associations declared significant; controlling it is less conservative than controlling the familywise error rate.
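The difference between the two error criteria can be seen on a handful of p-values. The sketch below computes Benjamini-Hochberg adjusted p-values for false discovery rate control. For familywise control it uses the simple Bonferroni correction as a stand-in, not the Hochberg (1988) step-up procedure cited above. The function names and the example p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values (controls the familywise error rate)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]   # hypothetical p-values for four variable pairs
alpha = 0.05
fwer_hits = [p <= alpha for p in bonferroni(pvals)]
fdr_hits = [p <= alpha for p in benjamini_hochberg(pvals)]
print(fwer_hits)   # [True, False, False, True]
print(fdr_hits)    # [True, True, True, True]
```

At the same nominal level, the false discovery rate adjustment retains all four pairs while the familywise adjustment retains only two, which illustrates why the former is described as less conservative.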

The use of association mining to integrate across multiple biological datatypes on the same set of samples is widespread. Bredel et al. (2009) examine pairwise associations between gene expression levels and copy number variation on the same set of glioma tumor samples. Genes with a significant expression-copy number association are determined by a permutation test with a false discovery rate adjustment, and those genes with a significant association are considered candidates for disease-related function. Adourian et al. (2008) investigate pairwise correlations across gene expression, metabolomic, and proteomic datasets available on the same set of rats that have been administered a toxic compound. A network based on significant correlations across these three datatypes is used to examine the effects of drug-induced toxicity.
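A permutation test for a single variable pair can be sketched as follows. This is a generic two-sided permutation test on the Pearson correlation, assuming NumPy; it is not a reconstruction of the specific procedure used by Bredel et al. (2009). The function name, the number of permutations, and the simulated data are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        # Shuffling y breaks any sample-level link between x and y.
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= observed:
            exceed += 1
    # Count the observed statistic itself so the p-value is never exactly 0.
    return (exceed + 1) / (n_perm + 1)

# Simulated (hypothetical) copy number and expression values for 50 samples.
rng = np.random.default_rng(1)
copy_number = rng.normal(size=50)
expression = copy_number + 0.3 * rng.normal(size=50)
p = permutation_pvalue(copy_number, expression)
print(p)
```

In an analysis of many genes, one such p-value would be computed per pair and the collection would then be passed through a false discovery rate adjustment.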

Association mining is often used to examine the relationship between genotype data (e.g., SNPs) and phenotype data (e.g., disease presence, toxicity, or height). In a genome-wide association study (GWAS), several hundred thousand to millions of SNPs along the length of the genome are tested for association with a particular phenotype. A common form of GWAS identifies associations between genotype data and gene expression levels. This allows for the discovery of expression quantitative trait loci (eQTLs), genomic loci that regulate expression levels. Several methods have been developed for efficiently computing pairwise associations and assessing their statistical significance in eQTL analysis (Gilad, Rifkin and Pritchard, 2008; Gatti et al., 2009; Sun and Wright, 2010; Wright, Shabalin and Rusyn, 2012).

As an illustration of GWAS, we consider toxicity data from 81 human lymphoblast cells assembled by the HapMap Consortium (?). Full genotype data are available for each cell, including 1.3 million SNPs along the length of the genome. The toxicity of each cell in response to 240 different chemicals was measured through a collaboration between UNC and the National Institutes of Health (NIH) Chemical Genomics Center (NCGC). PCA is used to elicit the primary modes of variation between cells in the 240-chemical toxicity space. Figure 3.1 illustrates a GWAS to identify associations between SNPs and the first principal component scores in the toxicity data. SNPs with small association p-values (large −log(p)) yield candidate genomic regions where genetic polymorphisms may influence toxic response.

Figure 3.1: GWAS measures the strength of association between SNPs and the first principal component of toxicity on 81 cell lines. The SNPs are ordered by chromosomal location on the horizontal axis, and −log10(P-value) for the significance of the association between each SNP and the first principal component is shown on the vertical axis.
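The kind of analysis summarized in Figure 3.1 can be sketched on simulated data: compute first principal component scores from the toxicity matrix, then test each SNP for association with those scores. The sketch below assumes NumPy. It takes p-values from a normal approximation based on the Fisher z-transform of the correlation, which is an assumption made for illustration and not necessarily the test behind the figure. All names, the number of SNPs, and the planted signal are hypothetical.

```python
import math
import numpy as np

def first_pc_scores(T):
    """Scores of each sample (row of T) on the first principal component."""
    Tc = T - T.mean(axis=0)                        # center each column
    U, s, _ = np.linalg.svd(Tc, full_matrices=False)
    return U[:, 0] * s[0]

def neg_log10_pvalues(G, scores):
    """-log10 p-value for the correlation of each SNP (column of G) with scores.

    Uses the Fisher z-transform normal approximation (an assumption here)
    and requires every SNP to vary across samples.
    """
    n = G.shape[0]
    Gc = G - G.mean(axis=0)
    sc = scores - scores.mean()
    r = (Gc.T @ sc) / (np.linalg.norm(Gc, axis=0) * np.linalg.norm(sc))
    r = np.clip(r, -0.999999, 0.999999)            # keep arctanh finite
    z = np.arctanh(r) * math.sqrt(n - 3)
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    p = np.maximum(p, np.finfo(float).tiny)        # avoid log10(0)
    return -np.log10(p)

# Simulated data: 81 cells, 240 chemicals, 500 SNPs coded 0/1/2 (hypothetical).
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(81, 500)).astype(float)
T = rng.normal(size=(81, 240))
T += np.outer(G[:, 10], np.ones(240))              # SNP 10 shifts all toxicities
nlp = neg_log10_pvalues(G, first_pc_scores(T))
print(int(np.argmax(nlp)))                         # 10
```

Plotting `nlp` against SNP position would give a display of the same type as Figure 3.1, with the planted SNP standing out as the tallest peak.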

Association mining in high-dimensional data often requires the analysis of hundreds, thousands, or millions of pairwise associations. It can therefore be infeasible to process and interpret the association between each pair of variables individually. However, useful visualizations can help to simplify variable-by-variable associations and allow for their interpretation on a global scale. Figure 3.2 depicts significant correlations between gene expression levels and copy number probes, measured for the same set of 234 GBM tumor samples. Both genes and copy number probes are ordered by genomic location, which helps to illustrate clear patterns in copy number-expression associations. Copy number events at chromosomes 7, 10, and 20-22 are highly correlated with the expression levels of hundreds of genes over the length of the genome.

Figure 3.2: Plot of significant correlations between copy number and gene expression data on 234 GBM tumor samples. Both copy number (horizontal axis) and gene expression (vertical axis) are ordered by genomic location. Points are colored red (blue) if the correlation between the corresponding gene-copy number pair is significant and positive (negative).

While the visualizations in Figures 3.1 and 3.2 are useful, restricting attention to pairwise associations between variables can overlook important multivariate associations between datatypes. Furthermore, examining only the significance or strength of variable associations gives no information about the common structure in the samples that drives these associations. The methods we consider in the remaining sections take a more global approach to the integration of disparate datatypes.