The gene expression data in this study were obtained using a custom NanoString codeset

for 406 genes of interest. See Section 4.5 for a full list of genes included in this analysis.

Performance of the nCounter assay was assessed for efficiency and sub-optimal hybridization.

Expression levels below the mean of negative controls were set to the mean background

expression. Then positive control normalization multiplied all counts for a sample by the

ratio of the average geometric mean of positive controls across all samples to the geometric

mean of the sample-specific positive controls. Reference gene normalization was done in

a similar way based on a set of 11 housekeeping genes. Batch effects were corrected by

calibrating each lot using a scaling factor calculated as the ratio of the average geometric mean of

endogenous genes across the three lots to the geometric mean of endogenous genes within

lot. Finally, expression counts were log2 transformed.
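The normalization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the dictionary keys (`endogenous`, `positive`, `negative`, `housekeeping`, `lot`) and all function names are hypothetical, background thresholding is applied to endogenous genes only, and the within-lot geometric mean is taken over all endogenous counts of all samples in the lot.

```python
import math
from statistics import mean


def geometric_mean(values):
    """Geometric mean of strictly positive counts."""
    return math.exp(mean(math.log(v) for v in values))


def floor_to_background(sample):
    """Raise endogenous counts below the mean negative-control count to that mean."""
    background = mean(sample["negative"])
    return [max(count, background) for count in sample["endogenous"]]


def scaling_factors(control_sets):
    """Per-sample factor: average geometric mean across samples / sample's own."""
    geo_means = [geometric_mean(c) for c in control_sets]
    target = mean(geo_means)
    return [target / g for g in geo_means]


def normalize(samples):
    """Return log2 normalized endogenous counts, one row per sample."""
    # Step 1: background thresholding (assumed to apply to endogenous genes only).
    endo = [floor_to_background(s) for s in samples]
    hk = [list(s["housekeeping"]) for s in samples]

    # Step 2: positive control normalization scales all counts of a sample.
    pos_f = scaling_factors([s["positive"] for s in samples])
    endo = [[c * f for c in row] for row, f in zip(endo, pos_f)]
    hk = [[c * f for c in row] for row, f in zip(hk, pos_f)]

    # Step 3: reference gene normalization, same form, using housekeeping genes.
    hk_f = scaling_factors(hk)
    endo = [[c * f for c in row] for row, f in zip(endo, hk_f)]

    # Step 4: lot calibration (assumed: geometric mean over all counts in the lot).
    lots = sorted({s["lot"] for s in samples})
    lot_gm = {
        lot: geometric_mean(
            [c for row, s in zip(endo, samples) if s["lot"] == lot for c in row]
        )
        for lot in lots
    }
    target = mean(lot_gm.values())
    endo = [
        [c * target / lot_gm[s["lot"]] for c in row]
        for row, s in zip(endo, samples)
    ]

    # Step 5: log2 transformation.
    return [[math.log2(c) for c in row] for row in endo]
```

Under these assumptions, two samples whose raw counts differ only by a constant multiplicative factor end up with identical normalized profiles, which is the intended effect of the positive control step.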

Visualizations using 1-way dendrograms and principal components analysis were used

to identify major outliers. A sample was considered a major outlier if, after all of the

pre-processing described in the previous paragraph was complete, the sample demonstrated

extreme expression across all genes. During the quality control process, 126 samples were

identified as major outliers and excluded from analysis (Figure 4.1). In a sensitivity analysis clustering a set of cases that included the

major outliers, when compared to the results from the primary analysis with the major

outliers excluded, between 94% and 99% of cases were classified similarly, suggesting that

these major outliers did not comprise a separate etiologically distinct class. Gene expression

values were standardized within sample by subtracting the mean gene expression for that

sample and dividing by the sample standard deviation. Finally, each gene’s expression

was median centered. Twenty cases from phases 1 and 2 and 20 cases from phase 3 were

randomly selected for removal from the case group to test for differences between the various

phases of the study, which were conducted at different times, without compromising the

overall type I error of the primary results. The overall gene expression distributions between

the different phases were compared using histograms and a Wilcoxon rank-sum test.
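The within-sample standardization and per-gene median centering described in this paragraph can be illustrated with a short sketch. The function names are hypothetical, and the expression matrix is assumed to be stored as a list of rows with one row per sample and one column per gene.

```python
from statistics import mean, median, stdev


def standardize_within_sample(matrix):
    """Subtract each sample's mean and divide by its sample standard deviation."""
    standardized = []
    for row in matrix:
        m, s = mean(row), stdev(row)  # stdev uses the n - 1 denominator
        standardized.append([(x - m) / s for x in row])
    return standardized


def median_center_genes(matrix):
    """Subtract each gene's median, taken across samples, from that gene's column."""
    medians = [median(col) for col in zip(*matrix)]
    return [[x - med for x, med in zip(row, medians)] for row in matrix]


# Toy data: 3 samples (rows) by 3 genes (columns).
expression = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 5.0, 10.0]]
processed = median_center_genes(standardize_within_sample(expression))
```

After the first step every sample has mean 0 and standard deviation 1 across genes; after the second step every gene has median 0 across samples.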

The final sample sizes for analysis are 83 cases and 739 controls from phase 1, 287 cases

and 716 controls from phase 2, and 467 cases from phase 3 (Figure 4.1).

4.2 Methods

The analysis is conducted in two stages:

1. Cluster discovery stage. The 467 phase 3 cases with available gene expression and risk

factor data are used to determine the optimally etiologically heterogeneous clustering

solution using a case-only analytic setting.

2. Cluster validation stage. The 370 cases with available gene expression and risk factor

data and the 1455 controls with available risk factor data from phases 1 and 2 are

pooled, and the cases are assigned to a class solution based on the discovery results.

Figure 4.1: Study exclusions.

Phase 1: 861 cases, 790 controls. Exclude 756 cases without gene expression data (105 cases, 790 controls). Exclude 6 cases and 51 controls missing risk factor data (99 cases, 739 controls). Exclude 11 cases with outlying gene expression values (88 cases, 739 controls). Hold out 5 cases for independent data checking (83 cases, 739 controls).

Phase 2: 947 cases, 774 controls. Exclude 537 cases without gene expression data (410 cases, 774 controls). Exclude 27 cases and 58 controls missing risk factor data (383 cases, 716 controls). Exclude 81 cases with outlying gene expression values (302 cases, 716 controls). Hold out 15 cases for independent data checking (287 cases, 716 controls).

Phase 3: 2998 cases. Exclude 2458 cases without gene expression data (540 cases). Exclude 19 cases missing risk factor data (521 cases). Exclude 34 cases with outlying gene expression values (487 cases). Hold out 20 cases for independent data checking (467 cases).

factors with heterogeneous effects.
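The text states that validation cases are assigned to a class solution based on the discovery results but does not specify the assignment rule here. One plausible rule, shown purely as an assumption, is nearest-centroid assignment: each validation case is given the label of the discovery class whose mean expression profile is closest in Euclidean distance. All names below are hypothetical.

```python
import math


def centroids(rows, labels):
    """Mean expression profile of each discovery class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }


def assign(rows, class_centroids):
    """Assign each case to the class with the closest centroid (assumed rule)."""
    return [
        min(class_centroids, key=lambda k: math.dist(row, class_centroids[k]))
        for row in rows
    ]


# Toy example: two discovery classes, two validation cases.
discovery = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
discovery_labels = [0, 0, 1, 1]
validation = [[1.0, 1.0], [9.0, 9.0]]
validation_labels = assign(validation, centroids(discovery, discovery_labels))
```

Because the rule uses only the discovery centroids, the risk factor data of the validation cases and controls play no role in class assignment, which is what keeps the validation-stage heterogeneity tests independent of subtype selection.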

The goal of conducting the analysis in two stages, with discovery followed by independent

validation, is to ultimately be able to obtain valid odds ratio estimates and p-values testing

for heterogeneity across the subtypes. If the subtypes were discovered using the same data

in which testing for heterogeneity was then conducted, the resulting p-values would be

over-optimistic, since the risk factor distributions are pivotal in selection of the optimal

subtype solution. The data were split into discovery and validation stages based on the

original CBCS study design, which in phase 3 collected data only on cases with no matched

controls, and in phases 1 and 2 collected data on cases with frequency matched controls.

This approach of using the phase 3 data for discovery and the phases 1 and 2 data for

validation is consistent with the original design of the study, which collected these data

validation additionally allows for calculation of standard case-control odds ratios.

4.2.1 Clustering methods

In the cluster discovery stage, a novel clustering method that uses unsupervised k-means

clustering of the gene expression data in combination with calculation of a scalar measure of

etiologic heterogeneity based on all available risk factors is applied to identify the optimally

etiologically heterogeneous subtype solution, as detailed in Section 3.1 of Chapter 3. In

the setting of a case-control study, the scalar measure of etiologic heterogeneity, denoted

D, is calculated according to Equation 3.1. An approximation of this measure, denoted

D∗, can be applied in the case-only setting, and details of this approach can be found in

Begg et al. (2013). Briefly, whereas the variance and covariance terms in Equation 3.1 are

averaged over the controls in a case-control setting, in a case-only setting they are averaged

over the cases, which represent a risk-biased sample from the population. The goal of an

analysis of this type is not to interpret the magnitude of D, but rather to use D to rank

different subtyping schemes and identify the one that maximizes etiologic heterogeneity,

and rankings based on D and D∗ are expected to be broadly similar in practice.

K-means clustering is performed with 1000 random starts on the gene expression data

in the discovery cases, to obtain a variety of class solutions. For each candidate solution

identified by k-means clustering, D∗ is calculated and the solution that maximizes D∗ is

selected as the optimal solution. To avoid solutions with subtypes with very small sample

sizes, clustering solutions in which any class had fewer than 20 cases were not considered.

Additionally, because the true number of subtypes is unknown, the optimal 2-class, 3-class,

4-class, and 5-class solutions were identified and the ideal number of classes was later se-

sample size limitations and in order to avoid overfitting.
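The search over candidate solutions can be sketched as below. This is an illustration under stated assumptions, not the implementation used in the analysis: a plain Lloyd's algorithm stands in for the k-means routine, and `score` is a hypothetical user-supplied function standing in for D∗, which is not reproduced here because Equation 3.1 lies outside this section. All function and parameter names are illustrative.

```python
import math
import random


def kmeans(rows, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random start; returns a label per row."""
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centers[j])) for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(col) / len(col) for col in zip(*members)]
    return labels


def canonical(labels):
    """Relabel classes by order of first appearance so permuted labellings match."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)


def best_solution(rows, k, score, n_starts=1000, min_class_size=20, seed=0):
    """Return (labels, score) for the eligible candidate that maximizes `score`.

    Returns (None, -inf) if no candidate has every class at or above
    `min_class_size`.
    """
    rng = random.Random(seed)
    seen, best, best_score = set(), None, -math.inf
    for _ in range(n_starts):
        labels = kmeans(rows, k, rng)
        key = canonical(labels)
        sizes = [labels.count(j) for j in range(k)]
        if key in seen or min(sizes) < min_class_size:
            continue  # duplicate solution, or a class is too small
        seen.add(key)
        value = score(labels)
        if value > best_score:
            best, best_score = labels, value
    return best, best_score
```

Running `best_solution` once for each k in 2 through 5 would give the best eligible solution for each number of classes; choosing among those values of k is a separate step that this sketch does not cover.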