In the next chapter, I continue the description of Sparse Partitioning by explaining how the method is designed to cope with issues that arise when analysing non-idealised datasets. However, at this point, I have provided sufficient details to create a working version of the method, so I take the opportunity to demonstrate Sparse Partitioning’s potential using a simple set of simulated datasets.
In total, I have performed ten simulation studies, with the aim of thoroughly testing Sparse Partitioning across a full range of scenarios. The complete results from all studies are provided in Chapter 4. For Study One, I considered datasets containing 100 samples, each typed
for 1000 binary predictors. I examined three different underlying relationships, each involving three causal predictors. This study used perfect data; for example, there were no missing values, all causal predictors were observed and the predictors were uncorrelated. It formed the template for all subsequent studies, each of which then tested the effects of deviations from this idealised set-up. While Study One is far from realistic, during the development of Sparse Partitioning, I frequently used testing of this type as a sounding board to gauge whether progress was in the desired direction.
I picked the three underlying relationships in order to examine three contrasting models: one additive, one with a multiplicative interaction and one with a general interaction. These models are outlined in the following table:
Model   Underlying Relationship
I       Y = X1 + 1.5 X2 − 2 X3
II      Y = 1.5 X1 × X2 + X3
III     Y = f(X1, X2) + X3,
        where f(0, 0) = 0, f(1, 0) = 1, f(0, 1) = 2, f(1, 1) = −1
Figure 2.4 compares the performance of Sparse Partitioning to seven of the existing methods outlined in the introduction: Single and Pairs (my implementations of basic one- and two-predictors-at-a-time analyses), as well as CART (Classification and Regression Trees), RF (Random Forests), SSS (Shotgun Stochastic Search), Logic (Logic Regression) and MARS (Multivariate Adaptive Regression Splines). I have found that the performance of the different methods is greatly influenced by the “causal predictor frequency” which, for a binary predictor, is the (sample-specific) proportion of samples in which it takes the value 1. When the predictors are SNPs, this term corresponds to each SNP’s minor allele frequency. Therefore, as well as varying the underlying relationship, I also considered five different causal predictor frequencies: 0.05, 0.1, 0.2, 0.4 and ‘?’. The last case corresponds to drawing each predictor’s frequency from U(0.05, 0.95), a uniform distribution on the interval [0.05, 0.95].
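The set-up of Study One can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the study. Several details are assumptions that the text does not specify: the noise is taken to be standard normal, the first three columns are taken to be the causal predictors, every predictor (causal or not) is given the same frequency rule, and values are drawn as independent Bernoulli trials rather than fixed to an exact sample frequency. The names `simulate_dataset`, `MODELS` and `f_general` are hypothetical.

```python
import random

def f_general(x1, x2):
    """General interaction f from Model III, stored as a lookup table."""
    table = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): -1.0}
    return table[(x1, x2)]

# The three underlying relationships (noise-free part only).
MODELS = {
    "I": lambda x1, x2, x3: x1 + 1.5 * x2 - 2.0 * x3,
    "II": lambda x1, x2, x3: 1.5 * x1 * x2 + x3,
    "III": lambda x1, x2, x3: f_general(x1, x2) + x3,
}

def simulate_dataset(model, freq=None, n_samples=100, n_predictors=1000,
                     noise_sd=1.0, rng=None):
    """Simulate one dataset; freq=None plays the role of the '?' scenario."""
    rng = rng or random.Random()
    # '?' scenario: each predictor's frequency is drawn from U(0.05, 0.95).
    freqs = [freq if freq is not None else rng.uniform(0.05, 0.95)
             for _ in range(n_predictors)]
    # Uncorrelated binary predictors with no missing values.
    X = [[1 if rng.random() < p else 0 for p in freqs]
         for _ in range(n_samples)]
    causal = (0, 1, 2)  # assumption: the first three columns are causal
    signal = MODELS[model]
    # Assumption: additive Gaussian noise with standard deviation noise_sd.
    y = [signal(row[0], row[1], row[2]) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y, causal
```

Calling `simulate_dataset("III", freq=0.2)` would then give one dataset for the Model III scenario with causal predictor frequency 0.2.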
For each of the 15 scenarios, 100 datasets were created and each method was asked to declare its top three associations. I discuss why I took this decision more fully later on. In brief, I considered it a fairer comparison as it avoided the need to pick a declaration threshold or to plot the false discovery rate. Each line in Figure 2.4 plots the average number of causal predictors correctly declared by a particular method.

[Figure 2.4: three panels (Model I, Model II, Model III). Horizontal axis: Causal Predictor Frequency (0.05, 0.1, 0.2, 0.4, ?). Vertical axis: Average Detection # (0.0 to 3.0). Lines: Single, Pairs, CART, RF, SSS, Logic, MARS, Sparse Partitioning.]
Figure 2.4: Partial results of Simulation Study One. Each plot considers a different underlying relationship, which here are Models I, II and III (described in the main text). Within each plot, the lines report, for different causal predictor frequencies, the average number of causal predictors correctly detected by each method. The final frequency (‘?’) indicates that each causal predictor’s frequency was drawn uniformly at random from the interval [0.05, 0.95].
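The detection measure plotted in Figure 2.4 can be illustrated with a short Python sketch. It assumes that each method returns one score per predictor, with larger scores indicating stronger evidence of association; that interface, and the function names `count_detected` and `average_detection`, are hypothetical and not taken from any of the methods compared.

```python
def count_detected(scores, causal, top_k=3):
    """Count how many causal predictors are among the top_k scores."""
    # Rank predictor indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return len(set(ranked[:top_k]) & set(causal))

def average_detection(score_lists, causal_lists, top_k=3):
    """Average the detection count over a collection of datasets."""
    counts = [count_detected(s, c, top_k)
              for s, c in zip(score_lists, causal_lists)]
    return sum(counts) / len(counts)
```

With three causal predictors and three declarations per dataset, the average lies between 0 and 3, which matches the vertical axis of the figure.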
Sparse Partitioning, represented by the black line, is the best performing method under Model III. This is almost inevitable, as the general interaction contained in the simulated underlying relationship violates the assumptions of all other methods. However, it is reassuring that this success does not appear to come at the expense of performance under simpler models, as we see that Sparse Partitioning has also performed well in the other two scenarios. For Model I, the additive relationship, the method’s line tracks very closely that of SSS, even though the latter method’s underlying assumptions consider only additive models. Likewise, for the multiplicative relationship of Model II, Sparse Partitioning has matched the performance of Logic, whose underlying assumptions are tailored for relationships of this type.
These plots provided strong encouragement, and it was due to results of this nature that I formed, and then cemented, the view that it is better to risk overfitting the true model by being too general than to risk underfitting it by being too restrictive.
Chapter 3
Additional Features
The previous chapter describes core details of Sparse Partitioning’s methodology, providing sufficient information with which to implement a working version. In this chapter, I explain
additional features intended to cope with non-idealised datasets. I also consider issues of
convergence and straightforward extensions of the method.
3.1 Basic Preprocessing of Data
Having read in the data files, Sparse Partitioning performs some basic preprocessing steps designed to remove redundancies and standardise values. Firstly, the method searches for predictors where either all values are missing or all observed values are the same. In either case, these predictors are unable to offer evidence for an association, so are removed from the dataset and assigned posterior estimates of zero. If desired, the user can increase the level of filtering and, for example, require that each predictor has no more than 25% missing values and no fewer than 5 occurrences of the least commonly observed state. In a similar manner, Sparse Partitioning also checks that the response is not trivial, nor has too many missing values. When the response is continuous, its observed values are standardised to have mean 0 and variance 1.
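The filtering and standardisation steps described above might look as follows in Python. This is a minimal sketch under stated assumptions rather than Sparse Partitioning's actual code: missing values are represented by `None`, each predictor is stored as a list of values, and the response is standardised using the population variance (the text does not say whether the population or sample variance is used). The function names and parameters are hypothetical.

```python
from collections import Counter
from statistics import fmean, pstdev

def is_informative(column, max_missing_frac=1.0, min_minor_count=1):
    """Return True if a predictor can offer evidence for an association."""
    observed = [v for v in column if v is not None]
    if not observed:
        return False  # every value is missing
    counts = Counter(observed)
    if len(counts) < 2:
        return False  # all observed values are the same
    missing_frac = 1 - len(observed) / len(column)
    if missing_frac > max_missing_frac:
        return False  # too many missing values (optional stricter filter)
    # The least commonly observed state must occur often enough.
    return min(counts.values()) >= min_minor_count

def filter_predictors(columns, max_missing_frac=1.0, min_minor_count=1):
    """Split predictor indices into those kept and those removed."""
    kept, removed = [], []
    for j, col in enumerate(columns):
        ok = is_informative(col, max_missing_frac, min_minor_count)
        (kept if ok else removed).append(j)
    return kept, removed

def standardise(y):
    """Scale observed responses to mean 0 and (population) variance 1."""
    observed = [v for v in y if v is not None]
    mean, sd = fmean(observed), pstdev(observed)
    return [None if v is None else (v - mean) / sd for v in y]
```

The stricter filtering mentioned in the text would correspond to calling `filter_predictors(columns, max_missing_frac=0.25, min_minor_count=5)`. Removed predictors would then be assigned posterior estimates of zero.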
By default, Sparse Partitioning scans the predictor set for obvious duplications, comparing each predictor with its 100 neighbours on either side. The similarity of two vectors is measured by calculating r², the square of their correlation. If missing values are encountered when comparing two predictors, the samples these correspond to are ignored for the purpose of calculating their similarity. If any pair of predictors is found to be identical (r² = 1), only
the predictor with fewest missing values is retained and assigned the higher of the two prior probabilities of association. At the end of the analysis, each predictor removed through this pruning is assigned the same posterior estimates as its retained duplicate (i.e. given the same marginal posterior probability of association, the same posterior probability of interaction,
and so on).
If Sparse Partitioning were applied to a dataset containing two identical predictors, this would lead to unnecessary computation and could have an undesirable effect on the MCMC sampling. The desire for parsimony will generally prevent more than one of the duplicates featuring in the current model, so any evidence that these predictors are associated will be divided between the two sets of posterior estimates. Sparse Partitioning allows the user to vary the number of neighbours considered and reduce the r² threshold, in which case highly correlated predictors will also be considered duplicates. This comes in useful later on when analysing association study datasets exhibiting strong levels of linkage disequilibrium (LD).
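The duplicate scan can be sketched as below. Again this is an illustrative sketch, not the method's own implementation: missing values are assumed to be `None`, ties in the number of missing values are broken in favour of the earlier predictor, and a small numerical tolerance is applied when comparing r² with the threshold. The names `r_squared` and `prune_duplicates` are hypothetical.

```python
def r_squared(a, b):
    """Squared correlation, ignoring samples where either value is missing."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_a) ** 2 for x, _ in pairs)
    syy = sum((y - mean_b) ** 2 for _, y in pairs)
    sxy = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector has no defined correlation
    return sxy * sxy / (sxx * syy)

def prune_duplicates(columns, priors, window=100, threshold=1.0, tol=1e-12):
    """Map each removed predictor to its retained duplicate; update priors."""
    n_missing = [sum(v is None for v in col) for col in columns]
    priors = list(priors)
    retained_by = {}  # removed index -> index it was merged into
    for i in range(len(columns)):
        if i in retained_by:
            continue
        # Looking only ahead covers neighbours on both sides by symmetry.
        for j in range(i + 1, min(i + window, len(columns) - 1) + 1):
            if j in retained_by:
                continue
            if r_squared(columns[i], columns[j]) >= threshold - tol:
                # Keep the predictor with fewest missing values.
                keep, drop = (i, j) if n_missing[i] <= n_missing[j] else (j, i)
                # The survivor takes the higher of the two prior probabilities.
                priors[keep] = max(priors[i], priors[j])
                retained_by[drop] = keep
                if drop == i:
                    break

    def root(k):
        # Follow chains so every removed predictor points at a survivor.
        while k in retained_by:
            k = retained_by[k]
        return k

    return {k: root(k) for k in retained_by}, priors
```

At the end of an analysis, the returned mapping could be used to copy each retained predictor's posterior estimates to its removed duplicates. Lowering `threshold` below 1 would treat highly correlated predictors as duplicates too.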
After preprocessing the data, Sparse Partitioning calls the method Single, which performs one-predictor-at-a-time tests within both a frequentist and Bayesian framework, outputting p-values and posterior probabilities for each non-trivial predictor. (Single is explained in more detail in Chapter 4.)