A common assumption when finding DE genes is normality of the data. If we choose to cluster after finding DE genes, we typically assume some mixture distribution. So, depending on our methodology, when we analyze our data and find clusters we may be assuming a mixture of normals. Further, a mixture of normals is a very common assumption for the clustering algorithms themselves. For this reason, we focus on a mixture of normals as our parametric assumption. This falls within the framework of the Unified Method. We will both show how to use our SOM methodology to simplify the computations needed for this method, and use the Unified Method to interpret the results of our simulation study.
2.3.3.1 Initializing EM With SOM
The SOM can be used as described above. However, we can also use it as a first step, in conjunction with our EM algorithm. We can use the SOM first to identify how many clusters there are within the data, and then to estimate the locations, standard deviations, and proportions of the mixands. If this can be done, it can be invaluable to anyone using an EM algorithm. Typically, the EM must be run for each number of clusters in a feasible set. Afterwards, some measure such as BIC is compared across the numbers of clusters in order to find an optimal one. This approach has been criticized on two grounds. First, the choice of the value being optimized can be arbitrary (two such values often disagree on the optimal number of clusters). Second, these values rely on convergence of the EM, yet when the incorrect number of clusters is specified, the EM need not converge. It can also be difficult to set an upper bound on the number of iterations to run the EM, and the convergence rate is greatly affected by the starting points. While there are tricks for setting up the EM without any analysis of the data, an initial step that supplies data-driven starting values can greatly reduce the number of iterations needed for convergence.
In order to use our SOM as a step 0 of the process, we run the grow step, the smoothing step via the batch algorithm, and SOMWard clustering. We then choose the number of clusters based on the SOMWard output. The proportion of genes mapped to each cluster estimates its mixand proportion, and the within-cluster sample means and standard deviations serve as the initial mean and standard deviation estimates for the EM algorithm. At this point, we switch to the EM algorithm, and need to run it only once, for the chosen number of clusters. We expect this to run much more quickly than it would without data-driven starting values. Should the data analyst be worried about the effectiveness of our SOM method for identifying the number of clusters, it would still be advantageous to look at the SOMWard clusterings ranked by our cluster identification criteria. Taking the top 10 clusterings (or however many the analyst desires) and using the SOMWard output to estimate the initial points of the EM for those cluster numbers would allow the analyst to prioritize likely numbers of clusters, and would give more quickly converging results thanks to the data-driven initial points.
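The step from hard cluster assignments to EM starting values can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes one-dimensional expression values, and the function name `init_from_labels` and the toy data are hypothetical. The labels stand in for whatever cluster assignment the SOMWard step produces.

```python
import math
from collections import defaultdict

def init_from_labels(values, labels):
    """Turn hard cluster labels into starting values for a 1-D normal mixture.

    Hypothetical helper: `labels` plays the role of the SOMWard assignment.
    """
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)

    n = len(values)
    params = {}
    for lab, xs in sorted(groups.items()):
        mean = sum(xs) / len(xs)
        # Sample variance; a singleton cluster gets 0.0 and would need
        # a floor before being passed to an EM routine.
        if len(xs) > 1:
            var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
        else:
            var = 0.0
        params[lab] = {
            "proportion": len(xs) / n,  # share of genes mapped to the cluster
            "mean": mean,
            "sd": math.sqrt(var),
        }
    return params

# Toy data: four genes near 0 (cluster 1) and two near 3 (cluster 2).
values = [0.1, -0.2, 0.0, 0.1, 2.9, 3.1]
labels = [1, 1, 1, 1, 2, 2]
start = init_from_labels(values, labels)
```

The resulting dictionary holds one proportion, mean, and standard deviation per cluster, which is the set of quantities the paragraph above proposes to hand to the EM algorithm.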
2.3.3.2 Expectation Maximization
We will focus on a mixture of normals for the remainder of this paper, which makes it simpler to summarize the parameters of interest, but the technique can be generalized to other distributions. We will need to have already identified estimates of the mixture proportions, $p = (p_1, \ldots, p_K)$, the means $(\mu_1, \ldots, \mu_K)$, and the covariance structures $(\Sigma_1, \ldots, \Sigma_K)$. These estimates are useful, but mainly in application. In practice, we are most concerned with estimating $P(Y_i = k \mid X_{i,1} = x_{i,1}, \ldots, X_{i,n_0} = x_{i,n_0})$. We assume that there is a parametric form for this, and simply use our estimates of the parameters listed above to find

$$\hat{f}_{y|x}(k \mid x) = \frac{\hat{p}_k \, f(x \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{p}_j \, f(x \mid \hat{\mu}_j, \hat{\Sigma}_j)}$$
As mentioned earlier, an EM algorithm is typically used to estimate the parameters. Once we have them, the expression above gives an estimate of the probability that any gene belongs to cluster k given the expression data.
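As a minimal sketch of this posterior calculation, the following assumes univariate normal components, so each covariance structure reduces to a single standard deviation. The function names and the numbers in the example calls are illustrative only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_membership(x, proportions, means, sds):
    """Estimated probability that observation x belongs to each component.

    Each numerator is p_k * f(x | mu_k, sd_k); the denominator is the
    sum of those terms over all components.
    """
    weighted = [p * normal_pdf(x, m, s)
                for p, m, s in zip(proportions, means, sds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Two identical components: the posterior equals the mixing proportions.
post_equal = posterior_membership(0.0, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])

# An observation at 3.0 favours the component centred at 3.0,
# even though that component has the smaller mixing proportion.
post_shifted = posterior_membership(3.0, [0.7, 0.3], [0.0, 3.0], [1.0, 1.0])
```

For observations far from every component the densities can underflow to zero, so a working implementation would usually carry out this calculation on the log scale.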
2.3.3.3 Rejection Rule
So far we have assumed that $K$ true groupings of the data exist. We fix the first group as being EE; in most cases this means that $\mu_1 = 0$. In order to perform any classification, we consider the following estimate:

$$\hat{\tau}_k = \hat{f}_{y|x}(k \mid x) = \frac{\hat{p}_k \, f(x \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{p}_j \, f(x \mid \hat{\mu}_j, \hat{\Sigma}_j)} \tag{7}$$

In practice, we use $\hat{\tau}_{ik} = \hat{P}(Y_i = k \mid X)$. Since we wish to find genes that are likely not from the null group, we treat $\hat{\tau}_{i1}$ as our test statistic in a sense. We first choose a level $q \in (0, 1)$ at which we wish to control the FDR. We then order the $\hat{\tau}_{i1}$ from smallest to largest, denoted by $\hat{\tau}_{(i)1}$. The next step is to find the value $s^{(*)}$ satisfying

$$\frac{1}{s^{(*)}} \sum_{i=1}^{s^{(*)}} \hat{\tau}_{(i)1} \le q \quad \text{and} \quad \frac{1}{s^{(*)}+1} \sum_{i=1}^{s^{(*)}+1} \hat{\tau}_{(i)1} > q.$$

We call gene $i$ DE so long as $\hat{\tau}_{i1} \le \hat{\tau}_{(s^{(*)})1}$, that is, so long as it is among the $s^{(*)}$ genes with the smallest estimated null probabilities. This same ranking and call-DE rule can be applied to the semi-parametric approach we developed within the SOM setting.
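The ranking rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in this work: the input is taken to be each gene's estimated posterior probability of belonging to the null (EE) group, and the function name `call_de` and the example values are hypothetical.

```python
def call_de(null_posteriors, q):
    """Return the indices of genes called DE at target FDR level q.

    Genes are ranked by estimated null probability, smallest first, and
    the largest leading block whose average null probability stays at
    or below q is called DE.
    """
    order = sorted(range(len(null_posteriors)),
                   key=lambda i: null_posteriors[i])
    running_sum = 0.0
    s_star = 0
    for rank, i in enumerate(order, start=1):
        running_sum += null_posteriors[i]
        if running_sum / rank <= q:
            s_star = rank
        else:
            # The running mean of ascending values never decreases,
            # so no later rank can satisfy the bound.
            break
    return sorted(order[:s_star])

# Five genes with illustrative null probabilities, target FDR of 0.05.
tau_null = [0.01, 0.9, 0.03, 0.5, 0.08]
de_genes = call_de(tau_null, q=0.05)
```

In this example the three smallest null probabilities average 0.04, while adding the fourth raises the average above 0.05, so genes 0, 2, and 4 are called DE.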