The set update procedure in the second step of Differential Correlation Mining readily identifies variables that are significantly differentially correlated relative to a given variable set A, and is most effective when the initial set of variables exhibits at least low levels of differential correlation. (When applied to a randomly chosen set of variables, the set update procedure typically returns an empty set.) The core search procedure could be run exhaustively, beginning with every variable set A ⊂ [d], but this is not computationally feasible for data sets of moderate or high dimension. As an alternative, we identify initial variable sets exhibiting a moderate degree of differential correlation using a greedy search procedure. We then pass this initial skeleton clique to the set update procedure to be fleshed out into a final estimated DC clique.
The initialization procedure seeks a local maximum of the score function
\[ S(A) = \sum_{j,k \in A} \left\{ (n_1 - 3)^{1/2}\, \phi(\hat{R}_1) - (n_2 - 3)^{1/2}\, \phi(\hat{R}_2) \right\}_{jk} \tag{3.2} \]
where ϕ is the element-wise Fisher transformation of sample correlations, namely
\[ \phi(r) = \frac{1}{2} \log \frac{1 + r}{1 - r}. \tag{3.3} \]
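As a concrete reading of (3.2) and (3.3), the score can be sketched in NumPy as follows. The function names `fisher` and `dc_score` are hypothetical, the data are assumed to be stored as samples-by-variables arrays, and the diagonal terms j = k are dropped as an assumption of this sketch, since the Fisher transformation is infinite at r = 1.

```python
import numpy as np

def fisher(r):
    """Element-wise Fisher transformation (3.3); identical to atanh(r)."""
    return np.arctanh(r)

def dc_score(X1, X2, A):
    """Score S(A) of (3.2) for an index set A.

    X1, X2 : arrays of shape (n1, d) and (n2, d), one per condition.
    A      : iterable of column indices.
    Assumption of this sketch: diagonal terms (j == k) are excluded.
    """
    A = np.asarray(A)
    n1, n2 = X1.shape[0], X2.shape[0]
    # Sample correlation matrices restricted to the variables in A.
    R1 = np.corrcoef(X1[:, A], rowvar=False)
    R2 = np.corrcoef(X2[:, A], rowvar=False)
    # Boolean mask selecting the off-diagonal entries only.
    off = ~np.eye(len(A), dtype=bool)
    terms = (np.sqrt(n1 - 3) * fisher(R1[off])
             - np.sqrt(n2 - 3) * fisher(R2[off]))
    return float(terms.sum())
```

A set of variables that is correlated in Condition 1 but not in Condition 2 receives a large positive score, while a set with no differential correlation scores near zero.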
To find a local maximizer of S(·), we begin with a random seed set A. We consider only pairwise swaps, in which we replace an element of A with one from $A^c$. At each step, A is updated by making the swap that produces the largest increase in the score, and the search stops when no swap increases the score. Since exactly one element is added and one removed at each stage, the size of the variable set remains constant. Because of the random seeding, the algorithm is not deterministic. However, in practice the same local maximum is reached from most seeds.
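The swap search described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the supplemental pseudocode: the names `weighted_fisher_diff` and `greedy_swap_init` are hypothetical, diagonal terms are set to zero, and the gain of each candidate swap is computed from row sums of a precomputed pairwise matrix rather than by re-evaluating the score.

```python
import numpy as np

def weighted_fisher_diff(X1, X2):
    """Matrix D whose (j, k) entry is the summand of (3.2); diagonal set to 0."""
    n1, n2 = X1.shape[0], X2.shape[0]
    R1 = np.corrcoef(X1, rowvar=False)
    R2 = np.corrcoef(X2, rowvar=False)
    # Zero the diagonal so that atanh stays finite there (assumption).
    np.fill_diagonal(R1, 0.0)
    np.fill_diagonal(R2, 0.0)
    return np.sqrt(n1 - 3) * np.arctanh(R1) - np.sqrt(n2 - 3) * np.arctanh(R2)

def greedy_swap_init(D, m=50, seed=None, max_iter=10_000):
    """Greedy pairwise-swap search for a local maximum of S(A) = sum of D over A x A.

    D is assumed symmetric with zero diagonal; m is the fixed set size.
    """
    d = D.shape[0]
    if not 0 < m < d:
        raise ValueError("m must satisfy 0 < m < d")
    rng = np.random.default_rng(seed)
    in_A = np.zeros(d, dtype=bool)
    in_A[rng.choice(d, size=m, replace=False)] = True  # random seed set
    for _ in range(max_iter):
        A, Ac = np.flatnonzero(in_A), np.flatnonzero(~in_A)
        s = D[:, A].sum(axis=1)  # s[v] = sum of D[v, a] over a in A
        # Change in score when A[i] is swapped out and Ac[j] is swapped in.
        gain = 2.0 * (s[Ac][None, :] - D[np.ix_(A, Ac)] - s[A][:, None])
        i, j = np.unravel_index(np.argmax(gain), gain.shape)
        if gain[i, j] <= 1e-12:  # no improving swap: local maximum reached
            break
        in_A[A[i]] = False
        in_A[Ac[j]] = True
    return np.flatnonzero(in_A)
```

A typical call would be `A0 = greedy_swap_init(weighted_fisher_diff(X1, X2), m=50, seed=0)`. Precomputing D costs one pass over all variable pairs, after which each iteration evaluates every candidate swap with array operations only.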
(a) Sample correlation, Condition 1 (b) Sample correlation, Condition 2 Figure 3.1: Sample correlation of simulated data.
We use the variance-stabilizing Fisher transformation in the initialization procedure as a way of roughly capturing the significance of differential correlation, instead of simply maximizing over absolute differences $\hat{R}_1 - \hat{R}_2$. The transformation, and subsequent weighting by degrees of freedom, ensures that the first and second terms in the sum are approximately standardized. As such, sets maximizing S(·) are good ballpark guesses for true DC cliques. In the core set update procedure (Section 3.4), we employ a precise testing approach to measure significance of average differential correlation, so the initial sets need not be perfect. It is simply computationally more efficient to “warm-start” the algorithm with a reasonable set than to apply the core refinement procedure from random starting points.
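The standardization claim can be made explicit with the classical large-sample approximation for the Fisher transformation. The sketch below assumes bivariate Gaussian data and writes $\rho_{c,jk}$ for the population correlation between variables j and k in condition c, a notation introduced here only for illustration.

```latex
\phi\bigl(\hat{R}_{c,jk}\bigr) \;\overset{\text{approx.}}{\sim}\;
\mathcal{N}\!\left( \phi(\rho_{c,jk}),\; \frac{1}{n_c - 3} \right),
\quad c = 1, 2
\qquad\Longrightarrow\qquad
\operatorname{Var}\!\left[ (n_c - 3)^{1/2}\, \phi\bigl(\hat{R}_{c,jk}\bigr) \right] \approx 1 .
```

Each weighted term therefore has roughly unit variance whatever the sample size or underlying correlation, which puts the two conditions on a comparable scale before they are differenced.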
Importantly, the cardinality of A is user-specified (with a default of 50). Because the subsequent set update procedure adaptively chooses the size of the final output set $A^*$, we need not be completely confident in our initial choice of cardinality. We can also generally expect results of the initialization procedure to be similar for similar cardinalities |A| = m. As an illustration of this phenomenon, we demonstrate the behavior of the initializing algorithm on artificial data. We generate 101 samples of a Gaussian random vector of 2,000 variables for each of two conditions. In Condition 2, the data are fully uncorrelated. In Condition 1, we include five correlated blocks with different correlation strengths. Figure 3.1 shows the sample correlations for this simulated dataset.
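A data set of this shape could be generated along the following lines. The block size, the correlation strengths in `rhos`, the placement of the blocks, and the one-factor construction are all assumptions of this sketch; the text specifies only the sample size, the dimension, and the presence of five blocks of differing strength.

```python
import numpy as np

def simulate_two_conditions(n=101, d=2000, block_size=50,
                            rhos=(0.8, 0.65, 0.5, 0.35, 0.2), seed=0):
    """Simulate two conditions; only Condition 1 has correlated blocks.

    block_size and rhos are hypothetical values chosen for illustration.
    Returns X1, X2 of shape (n, d) and the list of block index arrays.
    """
    if len(rhos) * block_size > d:
        raise ValueError("blocks do not fit in d variables")
    rng = np.random.default_rng(seed)
    X1 = rng.standard_normal((n, d))
    X2 = rng.standard_normal((n, d))  # Condition 2: fully uncorrelated
    blocks = []
    for b, rho in enumerate(rhos):
        idx = np.arange(b * block_size, (b + 1) * block_size)
        # A shared latent factor gives pairwise correlation rho within the block
        # while keeping each variable at unit variance.
        z = rng.standard_normal((n, 1))
        X1[:, idx] = np.sqrt(rho) * z + np.sqrt(1.0 - rho) * X1[:, idx]
        blocks.append(idx)
    return X1, X2, blocks
```

Ordering `rhos` from strongest to weakest mirrors the decreasing signal size of the five cliques.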
It is clear that five distinct DC cliques are present, with decreasing signal size. A good initializing search procedure would have two properties: first, that when true DC cliques are present, selected sets of the correct size usually approximate them well; and second, that if the chosen cardinality m of the search procedure is too small or too large, selected sets will be subsets or supersets of the true DC cliques. We find that our initializing method indeed exhibits these properties, as illustrated by Figures 3.2 and 3.3 for the artificial dataset.
(a) 500 initial sets, without removal (b) First 10 initial sets, with removal Figure 3.2: Overlap between initialized sets and DC cliques.
Figure 3.2(a) shows the percentage of runs, out of 500 separate runs with different random seeds, in which the initializing algorithm with m = 50 selected each of the DC cliques with less than 5% error. The algorithm selects one of the first three DC cliques nearly perfectly a high percentage of the time. Figure 3.2(b) shows 10 runs of the initializing algorithm, this time with the selected set removed from consideration in future seeds after each run. This figure shows that all five DC cliques are discovered to some degree in the first five runs of the initializing procedure. Although DC cliques 4 and 5 were never found in the 500 runs of Figure 3.2(a), Figure 3.2(b) makes it clear that these lesser cliques are discoverable once the overshadowing signal of the stronger cliques is ignored.
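The run-with-removal scheme behind Figure 3.2(b) might be organized as in the sketch below. It adopts one reading of "removed from consideration", namely that previously selected variables are excluded from the search entirely. The names `sequential_init` and `init_fn` are hypothetical, and `top_row_sums` is a toy stand-in for the initializer used only to make the example runnable.

```python
import numpy as np

def sequential_init(D, init_fn, m, n_runs):
    """Run an initializer repeatedly, excluding variables selected earlier.

    D       : (d, d) symmetric matrix of pairwise score terms.
    init_fn : callable (D_sub, m, seed) -> indices into D_sub (hypothetical).
    Returns a list of index arrays in the original variable numbering.
    """
    available = np.arange(D.shape[0])
    found = []
    for run in range(n_runs):
        if available.size <= m:  # too few variables left to search
            break
        sub = D[np.ix_(available, available)]
        local = np.asarray(init_fn(sub, m, run))
        found.append(available[local])  # map back to original indices
        available = np.delete(available, local)
    return found

def top_row_sums(D_sub, m, seed):
    """Toy initializer: the m variables with the largest row sums of D_sub."""
    return np.argsort(D_sub.sum(axis=1))[-m:]
```

Because each run sees only the remaining variables, a weaker clique no longer competes with a stronger one that has already been selected.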
In Figure 3.3, five distinct variable sets were selected for each value of m, and these are plotted according to their difference in average sample correlation. Colored points indicate that the set had at least 90% overlap with one of the true DC cliques in Figure 3.1. It is clear that even for misspecified m, the initializing procedure mostly selects sets that either contain or are contained in true DC cliques.
Pseudocode for the implementation of the initializing algorithm is provided as supplemental material. A closely related method is implemented in Section 3.6 for comparison with Differential Correlation Mining.
Figure 3.3: Initial sets at various sizes, colored by overlap with true DC cliques