The gene expression data in this study were obtained using a custom NanoString codeset
for 406 genes of interest. See Section 4.5 for a full list of genes included in this analysis.
Performance of the nCounter assay was assessed for efficiency and sub-optimal hybridization.
Expression levels below the mean of negative controls were set to the mean background
expression. Then positive control normalization multiplied all counts for a sample by the
ratio of the average geometric mean of positive controls across all samples to the geometric
mean of the sample-specific positive controls. Reference gene normalization was done in
a similar way based on a set of 11 housekeeping genes. Batch effects were corrected by
calibrating each lot with a scaling factor calculated as the ratio of the average geometric
mean of endogenous genes across the three lots to the geometric mean of endogenous genes
within that lot. Finally, expression counts were log2 transformed.
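The normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the data layout (a dict per sample with hypothetical keys `endo`, `neg`, `pos`, `hk`, and `lot`), the function names, the choice to apply the background floor to endogenous and housekeeping genes only, and the pooling of all endogenous counts within a lot for the lot-level geometric mean are all assumptions made for the example.

```python
import copy
import math
from statistics import geometric_mean, mean


def scale_to_reference(samples, key):
    """Multiply each sample's counts by (mean of per-sample GMs) / (that sample's GM)."""
    gms = {sid: geometric_mean(s[key]) for sid, s in samples.items()}
    target = mean(list(gms.values()))
    for sid, s in samples.items():
        factor = target / gms[sid]
        for k in ("endo", "pos", "hk"):
            s[k] = [x * factor for x in s[k]]


def preprocess(samples):
    """Return {sample id: log2 normalized endogenous counts}.

    samples: {sample id: {"endo": [...], "neg": [...], "pos": [...],
                          "hk": [...], "lot": lot id}}  (hypothetical layout)
    """
    out = copy.deepcopy(samples)

    # 1. Background: counts below the mean of the negative controls are
    #    raised to that mean (assumed to be positive).
    for s in out.values():
        floor = mean(s["neg"])
        s["endo"] = [max(x, floor) for x in s["endo"]]
        s["hk"] = [max(x, floor) for x in s["hk"]]

    # 2. Positive control normalization, then 3. reference gene normalization.
    scale_to_reference(out, "pos")
    scale_to_reference(out, "hk")

    # 4. Batch correction: scale each lot by the ratio of the average lot-level
    #    geometric mean to that lot's own geometric mean.
    by_lot = {}
    for s in out.values():
        by_lot.setdefault(s["lot"], []).extend(s["endo"])
    lot_gm = {lot: geometric_mean(v) for lot, v in by_lot.items()}
    target = mean(list(lot_gm.values()))

    # 5. log2 transform.
    return {
        sid: [math.log2(x * target / lot_gm[s["lot"]]) for x in s["endo"]]
        for sid, s in out.items()
    }
```

In this sketch, two samples that differ only by a constant multiplicative factor in all of their counts end up with identical normalized values, which is the behavior the positive control scaling is meant to produce.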
Visualizations using 1-way dendrograms and principal components analysis were used
to identify major outliers. A sample was considered a major outlier if, after all of the
pre-processing described in the previous paragraph was complete, the sample demonstrated
extreme expression across all genes. During the quality control process, 126 samples were
identified as major outliers and excluded from analysis (Figure 4.1). In a sensitivity analysis clustering a set of cases that included the
major outliers, when compared to the results from the primary analysis with the major
outliers excluded, between 94% and 99% of cases were classified similarly, suggesting that
these major outliers did not comprise a separate etiologically distinct class. Gene expression
values were standardized within sample by subtracting the mean gene expression for that
sample and dividing by the sample standard deviation. Finally, each gene’s expression
was median centered. Twenty cases from phases 1 and 2 and 20 cases from phase 3 were
randomly selected for removal from the case group to test for differences between the various
phases of the study, which were conducted at different times, without compromising the
overall type I error of the primary results. The overall gene expression distributions between
the different phases were compared using histograms and a Wilcoxon rank-sum test.
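The within-sample standardization and per-gene median centering described in this paragraph can be illustrated as follows. This is a sketch under assumptions rather than the code used for the analysis: the function name and the list-of-lists layout (one row per sample, one column per gene) are hypothetical, and the "sample standard deviation" is taken to be the n-1 standard deviation of that sample's values across genes.

```python
from statistics import mean, median, stdev


def standardize_and_center(expr):
    """expr: list of samples, each a list of log2 expression values in the same gene order."""
    # Standardize within sample: subtract the sample's mean across genes and
    # divide by its standard deviation (n-1 denominator, an assumption).
    z = []
    for row in expr:
        m, sd = mean(row), stdev(row)
        z.append([(x - m) / sd for x in row])

    # Median-center each gene across samples.
    n_genes = len(z[0])
    med = [median(r[g] for r in z) for g in range(n_genes)]
    return [[r[g] - med[g] for g in range(n_genes)] for r in z]
```

After this step every gene has median zero across samples, so downstream clustering is not dominated by genes with high overall expression.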
The final sample sizes for analysis are 83 cases and 739 controls from phase 1, 287 cases
and 716 controls from phase 2, and 467 cases from phase 3 (Figure 4.1).
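As a consistency check, the final sample sizes can be reproduced from the counts reported in Figure 4.1. The short script below is only an arithmetic sketch; the variable names and the data layout are assumptions made for the example.

```python
# For each phase: (cases, controls) at the start, followed by the
# (cases, controls) removed at each step reported in Figure 4.1:
# no gene expression data, missing risk factor data, outlying expression, held out.
flow = {
    "phase 1": [(861, 790), (756, 0), (6, 51), (11, 0), (5, 0)],
    "phase 2": [(947, 774), (537, 0), (27, 58), (81, 0), (15, 0)],
    "phase 3": [(2998, 0), (2458, 0), (19, 0), (34, 0), (20, 0)],
}

final = {}
for phase, (start, *steps) in flow.items():
    cases, controls = start
    for d_cases, d_controls in steps:
        cases -= d_cases
        controls -= d_controls
    final[phase] = (cases, controls)

# Index 3 of each list is the outlier exclusion step.
outliers = sum(steps[3][0] for steps in flow.values())
validation_cases = final["phase 1"][0] + final["phase 2"][0]
validation_controls = final["phase 1"][1] + final["phase 2"][1]

print(final, outliers, validation_cases, validation_controls)
```

With these inputs the script gives 83 cases and 739 controls for phase 1, 287 cases and 716 controls for phase 2, and 467 cases for phase 3, with 126 outliers excluded in total and 370 cases and 1455 controls pooled across phases 1 and 2.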
4.2 Methods
The analysis is conducted in two stages:
1. Cluster discovery stage. The 467 phase 3 cases with available gene expression and risk
factor data are used to determine the optimally etiologically heterogeneous clustering
solution using a case-only analytic setting.
2. Cluster validation stage. The 370 cases with available gene expression and risk factor
data and the 1455 controls with available risk factor data from phases 1 and 2 are
pooled, and the cases are assigned to a class solution based on the discovery results.
Figure 4.1: Study exclusions.
Phase 1: 861 cases, 790 controls. Exclude 756 cases without gene expression data (105 cases, 790 controls remain); exclude 6 cases and 51 controls missing risk factor data (99 cases, 739 controls); exclude 11 cases with outlying gene expression values (88 cases, 739 controls); hold out 5 cases for independent data checking (83 cases, 739 controls).
Phase 2: 947 cases, 774 controls. Exclude 537 cases without gene expression data (410 cases, 774 controls remain); exclude 27 cases and 58 controls missing risk factor data (383 cases, 716 controls); exclude 81 cases with outlying gene expression values (302 cases, 716 controls); hold out 15 cases for independent data checking (287 cases, 716 controls).
Phase 3: 2998 cases. Exclude 2458 cases without gene expression data (540 cases remain); exclude 19 cases missing risk factor data (521 cases); exclude 34 cases with outlying gene expression values (487 cases); hold out 20 cases for independent data checking (467 cases).
factors with heterogeneous effects.
The goal of conducting the analysis in two stages, with discovery followed by independent
validation, is to ultimately be able to obtain valid odds ratio estimates and p-values testing
for heterogeneity across the subtypes. If the subtypes were discovered using the same data
in which testing for heterogeneity was then conducted, the resulting p-values would be
over-optimistic, since the risk factor distributions are pivotal in selection of the optimal
subtype solution. The data were split into discovery and validation stages based on the
original CBCS study design, which in phase 3 collected data only on cases with no matched
controls, and in phases 1 and 2 collected data on cases with frequency matched controls.
This approach of using the phase 3 data for discovery and the phases 1 and 2 data for
validation is consistent with the original design of the study, which collected these data
in separate phases; using the phases 1 and 2 case-control data for
validation additionally allows for calculation of standard case-control odds ratios.
4.2.1 Clustering methods
In the cluster discovery stage, a novel clustering method that uses unsupervised k-means
clustering of the gene expression data in combination with calculation of a scalar measure of
etiologic heterogeneity based on all available risk factors is applied to identify the optimally
etiologically heterogeneous subtype solution, as detailed in Section 3.1 of Chapter 3. In
the setting of a case-control study, the scalar measure of etiologic heterogeneity, denoted
D, is calculated according to Equation 3.1. An approximation of this measure, denoted
D∗, can be applied in the case-only setting, and details of this approach can be found in
Begg et al. (2013). Briefly, whereas the variance and covariance terms in Equation 3.1 are
averaged over the controls in a case-control setting, in a case-only setting they are averaged
over the cases, which represent a risk-biased sample from the population. The goal of an
analysis of this type is not to interpret the magnitude of D, but rather to use D to rank
different subtyping schemes and identify the one that maximizes etiologic heterogeneity,
and rankings based on D and D∗ are expected to be broadly similar in practice.
K-means clustering is performed with 1000 random starts on the gene expression data
in the discovery cases, to obtain a variety of class solutions. For each candidate solution
identified by k-means clustering, D∗ is calculated and the solution that maximizes D∗ is
selected as the optimal solution. To avoid solutions with subtypes with very small sample
sizes, clustering solutions where a class had fewer than 20 cases were not considered.
Additionally, because the true number of subtypes is unknown, the optimal 2-class, 3-class,
4-class, and 5-class solutions were identified and the ideal number of classes was later
selected. Solutions with more than five classes were not considered because of
sample size limitations and in order to avoid overfitting.
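The search over candidate solutions can be sketched as below. This is an illustrative outline, not the implementation used in this work: it uses a plain Lloyd's-algorithm k-means written with the standard library, and the heterogeneity measure is passed in as a hypothetical callable `score(labels)` that stands in for D∗. Equation 3.1 is not reproduced here, so the sketch makes no attempt to compute D∗ itself. The names `kmeans_once`, `canonical`, and `best_solution` are assumptions made for the example.

```python
import random


def kmeans_once(X, k, rng, n_iter=50):
    """One run of Lloyd's algorithm from a random start; returns one label per row of X."""
    centers = rng.sample(X, k)
    labels = None
    for _ in range(n_iter):
        new = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            for x in X
        ]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def canonical(labels):
    """Relabel classes by order of first appearance so equal partitions compare equal."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)


def best_solution(X, k, score, n_starts=1000, min_size=20, seed=0):
    """Return (labels, score) of the admissible k-class solution with the largest score."""
    rng = random.Random(seed)
    seen, best, best_score = set(), None, float("-inf")
    for _ in range(n_starts):
        labels = kmeans_once(X, k, rng)
        key = canonical(labels)
        sizes = [labels.count(j) for j in range(k)]
        # Skip solutions with a small class and partitions already scored.
        if min(sizes) < min_size or key in seen:
            continue
        seen.add(key)
        s = score(labels)  # stand-in for the case-only measure D*
        if s > best_score:
            best, best_score = labels, s
    return best, best_score
```

In use, `best_solution` would be called once for each of k = 2, 3, 4, and 5 on the discovery cases, with `score` computing the case-only heterogeneity measure from the risk factor data for a given class assignment.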