
1.6 Final considerations

Analyzing the simulated data set makes it possible to assess for how many transcripts the model delivers a correct classification to either the positive component or one of the other components. Of the 2 000 out of 21 236 transcripts in the simulated data set that display equally directed differences in both data types, the Bayesian mixture model classifies 1 656 to the positive component based on the z_i values. A total of 15 transcripts are falsely classified to the positive component and 344 transcripts are falsely classified to a null or negative component, which corresponds to a sensitivity of 0.828 and a specificity of 0.999. To compare the Bayesian mixture model to a naive separate analysis of both data types, a threshold Υ is chosen for the observed differences x∗_i − a_i of the transcription values. All transcripts with |x∗_i − a_i| ≥ Υ are considered differentially transcribed, and an equal number of transcripts with the largest [...] histone modified. Transcripts that are both differentially transcribed and differentially histone modified and for which sign(x∗_i − a_i) = sign(y∗_i − b∗_i) are classified as potential drivers by the naive analysis.

Subsequently, varying thresholds t_sim are considered and ROC curves are plotted based on these results as well as on the classifications implied by the Z score (see Figure A.30 in the Appendix). At the same level of specificity, the naive approach achieves a sensitivity of 0.683, which is smaller than the 0.828 achieved by the Bayesian mixture model. Table 5.5 shows that, especially for moderate differences, the Bayesian mixture model achieves a gain in sensitivity of about 0.3 on average.

Factor c             Bayesian mixture model        Naive approach
                     Num. detected  Sensitivity    Num. detected  Sensitivity
c ∈ {−1, 1}          200            1.000          194            0.970
c ∈ {−0.9, 0.9}      200            1.000          186            0.930
c ∈ {−0.8, 0.8}      200            1.000          178            0.890
c ∈ {−0.7, 0.7}      196            0.980          181            0.905
c ∈ {−0.6, 0.6}      195            0.975          163            0.815
c ∈ {−0.5, 0.5}      192            0.960          155            0.775
c ∈ {−0.4, 0.4}      182            0.910          118            0.590
c ∈ {−0.3, 0.3}      162            0.810           99            0.495
c ∈ {−0.2, 0.2}      101            0.505           62            0.310
c ∈ {−0.1, 0.1}       28            0.140           29            0.145

Table 5.5: Results from the simulation data set, split by the magnitude of the simulated differences. Data are simulated as described in Section 5.4.1. The factor c indicates the strength of the simulated differences, e.g., c = 0.5 (c = −0.5) means that the gene transcription value and the corresponding ChIP-seq value are increased (decreased) by 50% in one of the two replicate samples. For comparability, the threshold of the naive approach is chosen such that it achieves the same specificity (0.999, 15 false positives out of 19 236 true negative genes) as the Bayesian mixture model. Overall, the Bayesian mixture model achieves a sensitivity of 0.828, whereas the sensitivity of the naive separate analyses is 0.683. Notably, the gain in sensitivity of the integrative approach is most distinct for moderate differences (0.2 ≤ |c| ≤ 0.5).

The acceptance rate for the Metropolis-Hastings steps for α∗ in the MCMC run is 0.3092.
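The sensitivity and specificity figures quoted above follow directly from the reported counts. The short sketch below recomputes them and the per-stratum gain in sensitivity from the numbers in Table 5.5. The variable and function names are illustrative only and do not come from the thesis; the stratum size of 200 is inferred from the table (ten strata, 2 000 positives in total).

```python
# Counts reported in the text for the simulated data set.
N_TRANSCRIPTS = 21236
N_POSITIVES = 2000                           # equally directed differences in both data types
N_NEGATIVES = N_TRANSCRIPTS - N_POSITIVES    # 19236 true negatives
TRUE_POSITIVES = 1656                        # classified to the positive component
FALSE_POSITIVES = 15


def sensitivity(true_positives, positives):
    """Fraction of truly positive transcripts that are detected."""
    return true_positives / positives


def specificity(false_positives, negatives):
    """Fraction of truly negative transcripts that are not detected."""
    return 1.0 - false_positives / negatives


# Detected transcripts per stratum |c| as listed in Table 5.5.
PER_STRATUM = 200  # inferred: 10 strata, 2000 positives in total
bayes = {1.0: 200, 0.9: 200, 0.8: 200, 0.7: 196, 0.6: 195,
         0.5: 192, 0.4: 182, 0.3: 162, 0.2: 101, 0.1: 28}
naive = {1.0: 194, 0.9: 186, 0.8: 178, 0.7: 181, 0.6: 163,
         0.5: 155, 0.4: 118, 0.3: 99, 0.2: 62, 0.1: 29}

# Gain in sensitivity of the mixture model over the naive approach, per stratum.
gain = {c: (bayes[c] - naive[c]) / PER_STRATUM for c in bayes}

if __name__ == "__main__":
    print("sensitivity:", sensitivity(TRUE_POSITIVES, N_POSITIVES))
    print("specificity:", specificity(FALSE_POSITIVES, N_NEGATIVES))
    for c in sorted(gain, reverse=True):
        print(f"|c| = {c:.1f}: gain = {gain[c]:+.3f}")
```

The per-stratum counts sum to the overall totals, so the overall sensitivities can be cross-checked against the table.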

Summary and Discussion

In this PhD thesis, finite Bayesian mixture models with a small fixed number of components have been developed to answer current research questions in two different contexts concerning molecular biophysics (Chapter 4) and molecular biology (Chapter 5).

In Chapter 4, the novel Bayesian approach GAMMICS was introduced. GAMMICS builds on frequentist modeling ideas proposed by Byers and Raftery (1998) for distinguishing large objects from noise and employs them for a Bayesian analysis of small clusters in the presence of singletons. Relying on a two-component gamma mixture model for the squared distances of points to their second nearest neighbors, it classifies proteins as either clustered or non-clustered. It is designed to estimate parameters of spatial nanoclustering exhibited by membrane-bound Ras proteins, particularly the proportion of clustered points, the mean cluster size and the mean cluster radius. GAMMICS combines ideas of both single-linkage hierarchical clustering and density-based clustering in a Bayesian framework. Specifically, it estimates the crucial user-defined parameter (the maximum nearest neighbor distance in density-based clustering, the cutoff in the dendrogram in hierarchical clustering) in a Bayesian mixture model and obtains posterior distributions for the cluster size and cluster radius in an iteration-wise post-processing step. In GAMMICS, this crucial parameter is a cutoff corresponding approximately to the point of intersection between the two gamma distributions fit in the mixture model.
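To make the idea of a cutoff at the intersection of two weighted gamma densities concrete, the following sketch locates that point by bisection. It is not the GAMMICS implementation: the shape/rate parameterization, the parameter values, the search bracket and the function names are assumptions chosen purely for illustration.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution (shape/rate parameterization, assumed here)."""
    if x <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def find_cutoff(p_c, clustered, singleton, lo, hi, tol=1e-10):
    """Bisection for the point in [lo, hi] where the weighted densities cross.

    p_c       : weight of the 'clustered' component (hypothetical value)
    clustered : (shape, rate) of the component for clustered points
    singleton : (shape, rate) of the component for singletons
    """
    def diff(x):
        return (p_c * gamma_pdf(x, *clustered)
                - (1.0 - p_c) * gamma_pdf(x, *singleton))

    f_lo, f_hi = diff(lo), diff(hi)
    if f_lo * f_hi > 0:
        raise ValueError("bracket [lo, hi] does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = diff(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# Hypothetical parameters: small squared distances for clustered points,
# large ones for singletons.
CLUSTERED = (2.0, 2.0)
SINGLETON = (2.0, 0.2)

cutoff_unweighted = find_cutoff(0.5, CLUSTERED, SINGLETON, lo=0.1, hi=20.0)
cutoff_weighted = find_cutoff(0.2, CLUSTERED, SINGLETON, lo=0.1, hi=20.0)

if __name__ == "__main__":
    print("cutoff with p_c = 0.5:", cutoff_unweighted)
    print("cutoff with p_c = 0.2:", cutoff_weighted)
```

With equal shapes the crossing point has a closed form, which makes the sketch easy to check by hand. A bracketing method such as bisection only finds one crossing inside the given interval, so it does not address the case of several or no intersections.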

To compare the performance of the approach to that of other state-of-the-art methods, a comprehensive simulation study was conducted. A point process model was designed for this purpose, the double Matérn cluster process. Unlike standard cluster processes such as the Neyman-Scott process, it allows for non-clustered points and metaclusters, i.e. clusters containing clusters and singletons. In contrast to other recently proposed point process models that have these features (see, e.g., Wiegand et al., 2009), the double Matérn cluster process also allows for clusters outside of metaclusters and permits free choice of both the proportion of clustered points and the cluster size at each clustering level.
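As a point of reference, the sketch below simulates an ordinary Matérn cluster process superimposed with a homogeneous Poisson process of singletons. This is deliberately simpler than the double Matérn cluster process described above: it has no metacluster level and ignores edge effects. All names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Poisson draw via Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def uniform_in_disc(rng, cx, cy, radius):
    """Point drawn uniformly from the disc with centre (cx, cy)."""
    r = radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    return cx + r * math.cos(phi), cy + r * math.sin(phi)


def simulate(rng, mean_parents, mean_cluster_size, radius, mean_singletons, side=1.0):
    """Return (points, parents); each point is (x, y, label), label -1 = singleton."""
    parents = [(rng.uniform(0, side), rng.uniform(0, side))
               for _ in range(poisson(rng, mean_parents))]
    points = []
    for label, (cx, cy) in enumerate(parents):
        # Daughter points may fall outside the window; edge effects are ignored.
        for _ in range(poisson(rng, mean_cluster_size)):
            x, y = uniform_in_disc(rng, cx, cy, radius)
            points.append((x, y, label))
    for _ in range(poisson(rng, mean_singletons)):
        points.append((rng.uniform(0, side), rng.uniform(0, side), -1))
    return points, parents


rng = random.Random(1)
points, parents = simulate(rng, mean_parents=10, mean_cluster_size=5,
                           radius=0.02, mean_singletons=30)

if __name__ == "__main__":
    n_clustered = sum(1 for _, _, label in points if label >= 0)
    print(f"{len(points)} points, {n_clustered} of them in clusters")
```

Keeping the true labels alongside the coordinates is what allows a simulation study to compute misclassification rates afterwards.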

The performance of GAMMICS is generally favorable compared to the common clustering approach DBSCAN, Bayesian model-based clustering employing a Dirichlet process mixture (DPM) model, and a Bayesian version of DBSCAN based on the DPM model. It also outperforms the H-function approach commonly used in the biophysical literature to estimate the cluster radius. While the H-function and the two methods based on the nonparametric DPM model do not show an outstanding performance for any parameter, DBSCAN achieves the best median estimation accuracy when estimating the proportion of clustered points. GAMMICS achieves both the smallest median misclassification rate and the smallest median estimation error when estimating the mean cluster radius. In addition, when using one of the two considered procedures to estimate a cluster partition in the post-processing step, GAMMICS also performs best in estimating the mean cluster size.

The DPM model, although it has frequently been used for clustering, is originally a model for density estimation. This may explain its poor performance. Possible reasons for the poor accuracy of the H-function estimates include a greater susceptibility to factors such as drift of the cells during the measurement process, noise inherent in the measurements, or metaclustering. Also, the peak in the H-function is not always sharp. Thus, in some cases in which the H-function performs particularly poorly, the H-function values attained at a number of radii moderately smaller than the one maximizing the function may be almost as high as the maximum.

While the completely algorithmic DBSCAN performs best for one of the four parameters of interest, it does so only with crucial knowledge about the true clustering parameters. GAMMICS, on the other hand, performs best for the other three without depending on such knowledge and functions with much weaker assumptions regarding distances between points. It also offers a model that quantifies the uncertainty for its estimates, providing more insights and a better interpretability. Contrary to many established cluster algorithms, noise is explicitly considered and quantified.

Given that prior knowledge of Ras clusters cannot yet be taken for granted, the results confirm the advantages of GAMMICS. However, this comes at a cost in terms of computation time: while DBSCAN never takes more than a few seconds to run, GAMMICS typically runs for several hours, and sometimes for many hours, on the data sets in this study. The MCMC routines from the R package DPpackage and Bayesian DBSCAN are implemented in C and are thus faster than GAMMICS. However, they may still take several hours depending on the data set. Bayesian DBSCAN consists of several C and R routines that have to be applied successively and manually to each data set. This makes its application difficult whenever dealing with a large number of data sets.

The relatively weak performance of GAMMICS compared to DBSCAN when estimating the proportion of clustered points may be due to an asymmetric overlap occurring at times between the two posterior gamma distributions. In such cases, the tails of the two gamma distributions that are literally 'cut off' by the calculated GAMMICS cutoff correspond to probability masses of notably different sizes. The estimation of the proportion of clustered points may suffer from a bias in such cases, even if the probability masses corresponding to clustered points and singletons are in principle well separated by the cutoff. Based on typical shapes of the empirical distribution of second nearest neighbor distances, an underestimation of the proportion is more likely than an overestimation.

This potential bias might be attenuated by allocating the proteins to the two components based on whether their second nearest neighbor distance is smaller or larger than the cutoff between the two densities, instead of on the sampling carried out to fit the model. Such an approach, described in Schäfer and Ickstadt (2012), would however make the estimation even more algorithmic. Consequently, the incorporation of prior knowledge would no longer be possible. Also, in cases when clustered proteins and singletons are not particularly well separated in the histogram of the second nearest neighbor distances, the 'fuzzy separation' of the two groups implied by the mixture model's allocations may be of advantage compared to a strict separation. To perform the algorithmic estimation of the mean cluster size and the mean cluster radius in a more fuzzy fashion as well, one might replace the hierarchical clustering employed for the estimation with a fuzzy version of such clustering (see, e.g., Torra, 2005). The subsequent calculation of the mean cluster size and mean cluster radius could then take into account the provided cluster membership probabilities, potentially improving estimation accuracies.
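The difference between a strict separation by cutoff and the 'fuzzy separation' of a mixture model can be illustrated with the conditional membership probability of a two-component mixture. The sketch below assumes fixed, hypothetical parameter values and a shape/rate gamma parameterization; in the actual method, allocations are sampled within the MCMC run rather than computed once.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution (shape/rate parameterization, assumed here)."""
    if x <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)


def prob_clustered(x, p_c, clustered, singleton):
    """Conditional probability that a squared distance x > 0 belongs to the
    'clustered' component, given the mixture parameters."""
    weighted_clustered = p_c * gamma_pdf(x, *clustered)
    weighted_singleton = (1.0 - p_c) * gamma_pdf(x, *singleton)
    return weighted_clustered / (weighted_clustered + weighted_singleton)


def hard_allocation(x, cutoff):
    """Strict separation: clustered if and only if x lies below the cutoff."""
    return x < cutoff


# Hypothetical parameter values for illustration only.
P_C = 0.5
CLUSTERED = (2.0, 2.0)
SINGLETON = (2.0, 0.2)
CUTOFF = math.log(100) / 1.8   # crossing point of the two weighted densities

if __name__ == "__main__":
    for x in (0.5, 2.0, CUTOFF, 3.0, 10.0):
        print(f"x = {x:6.3f}  hard = {hard_allocation(x, CUTOFF)!s:5}  "
              f"P(clustered) = {prob_clustered(x, P_C, CLUSTERED, SINGLETON):.4f}")
```

Near the cutoff the membership probability is close to one half, which is exactly where a strict rule discards the most information.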

In theory, it may happen that the two weighted gamma densities have more than one point of intersection or none at all (not counting 0 and ∞). Also, the two densities might 'switch sides', i.e. the distribution whose priors were chosen to represent the singletons could eventually represent the clustered points, and vice versa. While these cases are improbable in practice, the algorithm calculating the cutoff for the estimation of a cluster partition in GAMMICS is designed to neutralize or at least attenuate a potential bias.

In GAMMICS' mixture model, the two gamma distributions fit to the empirical distributions of distances are weighted by the proportion of clustered [...] partition, these weights are taken into account. Alternatively, the unweighted gamma distributions can be used in this algorithm. This is equivalent to setting p_c = 0.5 in the weighted version and, while less intuitive, leads to smaller errors when estimating the mean cluster size and radius. Probably, it makes the procedure less susceptible to artificially extreme estimates for p_c in the mixture model. It is therefore the preferred option.

The GAMMICS method implies the assumption of independence for the squared distances of points to their second nearest neighbors. While appearing justifiable in the interest of model simplicity, this assumption is problematic because knowledge of the whole point pattern is required to calculate the nearest neighbor distance for any of the proteins. In general, independence of second nearest neighbor distances may be assumed for some point processes depending on the context, e.g., Byth and Ripley (1980) postulate this independence for tree patterns if a random subset of no more than 10% of all trees is considered. However, such an argument depends on the physical characteristics of the considered objects. For trees, a minimum distance between objects can be assumed due to physical restrictions such as branch length and sunlight distribution. While in principle there is a minimum distance between Ras proteins, it is so small in relation to the measurement accuracy that in practice a similar argument cannot be made here. Thus, analyzing subsets of, e.g., only 10% of the proteins does not necessarily help, while it discards most of the information.

Since the cluster radii and sizes are calculated in an algorithmic post-processing step, the inference is not fully model-based. However, due to identifiability issues concerning cluster size and cluster radius, this appears to be the price to pay in exchange for estimating all three parameters without prior knowledge about them. If prior knowledge does exist, it can be used to specify a prior distribution in the usual Bayesian way, but only for the proportion of clustered proteins and not for the cluster radius and the cluster size.

In this thesis, two ways to estimate the radius of a cluster are considered: one half of the maximum distance between two points within the same cluster, and the mean pairwise distance between points within the cluster multiplied by the factor 45π/128 (the factor is based on the theoretical distribution of distances between points spread uniformly in a circle). For each method, the estimator resulting in the better estimation accuracy is chosen. However, other approaches to estimate the cluster radius might be used as well and lead to slightly different results.
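The two radius estimators can be written down in a few lines. The sketch below relies on the known result that the mean distance between two points drawn uniformly from a disc of radius R is 128R/(45π), so multiplying the mean pairwise distance by 45π/128 gives an estimate of R. Function names are illustrative and not taken from the thesis; clusters are assumed to contain at least two points.

```python
import math
from itertools import combinations

# Inverse of 128 / (45 * pi), the mean distance between two uniform points
# in a disc of radius 1.
MEAN_DISTANCE_FACTOR = 45.0 * math.pi / 128.0


def pairwise_distances(points):
    """Euclidean distances between all pairs of points in one cluster."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def radius_half_max(points):
    """One half of the maximum distance between two points of the cluster."""
    return 0.5 * max(pairwise_distances(points))


def radius_mean_pairwise(points):
    """Mean pairwise distance scaled by 45*pi/128."""
    distances = pairwise_distances(points)
    return MEAN_DISTANCE_FACTOR * sum(distances) / len(distances)


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (0.4, 0.3)]
    print("half maximum distance :", radius_half_max(cluster))
    print("scaled mean distance  :", radius_mean_pairwise(cluster))
```

The half-maximum estimator tends to underestimate the radius for small clusters, since the two most distant points rarely lie on opposite ends of a diameter, whereas the scaled mean uses all pairwise distances.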

GAMMICS relies essentially on dividing the empirical distribution of distances between points into two sets, corresponding to clustered and non-clustered points. Thus, it has weaknesses in detecting the extreme cases in which either no points or all points are clustered. In these cases, the model will still tend to estimate a cutoff although there is none in the data. A reasonable strategy to rule out these cases can make use of Monte Carlo tests based on, e.g., the K-function.

Contrary to most other methods for cluster analysis, the estimation of a posterior cluster partition is not a primary goal in this thesis and, thus, in GAMMICS. However, a posterior cluster partition could be obtained via the posterior similarity matrix based on the partitions inferred in each MCMC step. As in the DPM model, a criterion such as Binder’s loss could be utilized for this purpose.
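A minimal sketch of this idea follows. It builds the posterior similarity matrix from a list of sampled partitions and then picks, among the sampled partitions, the one minimizing Binder's loss with equal misclassification costs. The input format (one label per point and iteration, with each singleton carrying its own label) and the function names are assumptions for illustration.

```python
def posterior_similarity(partitions):
    """Fraction of MCMC iterations in which points i and j share a cluster label.

    partitions: list of label lists, one list per iteration, all of equal length.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[count / m for count in row] for row in counts]


def binder_loss(labels, psm):
    """Expected Binder loss (equal costs) of a partition, up to a constant factor."""
    n = len(labels)
    return sum(abs((labels[i] == labels[j]) - psm[i][j])
               for i in range(n) for j in range(i + 1, n))


def best_sampled_partition(partitions, psm):
    """Sampled partition with the smallest Binder loss."""
    return min(partitions, key=lambda labels: binder_loss(labels, psm))


# Toy example: three points, three MCMC iterations.
samples = [[0, 0, 1], [0, 0, 1], [0, 1, 1]]
psm = posterior_similarity(samples)
best = best_sampled_partition(samples, psm)

if __name__ == "__main__":
    for row in psm:
        print(row)
    print("selected partition:", best)
```

Restricting the search to sampled partitions keeps the sketch short; the quadratic cost in the number of points would matter for realistic data sets.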

In order to eliminate the algorithmic parts in GAMMICS, one could combine its gamma mixture model with a DPM model within the same MCMC sampling framework. One of the DPM model's weaknesses is that it tends to group too many singletons into clusters, so GAMMICS's gamma mixture model could furnish the DPM model with the classification into singletons and clustered points. Carried out on the points classified as clustered by the gamma mixture model in each iteration, the DPM model may render a more accurate cluster partition and, potentially, estimates for mean cluster size and mean cluster radius as precise as (or even more precise than) those obtained by the full GAMMICS method. Ji et al. (2009), e.g., present a similar combination of a classification based on two densities representing two classes (although not in the form of a formal mixture model) and a DPM model fitted on a set of points updated based on the classification. If, in addition, another DPM model is run on the points estimated as singletons by the gamma mixture model, the resulting two intensities (one for a cluster process and one for a singleton process) might both be compared across experimental conditions and thus support inference on differences in clustering behavior, e.g., between healthy cells and tumor cells. However, while such approaches would avoid algorithmic elements, they still do not allow cluster radius and size to be modeled directly. To apply the method to experimental data on a large scale, a systematic and automated selection of the regions of interest would be needed, taking into account varying point densities and edge effects.

In this PhD thesis, fixed cells in which the Ras proteins are stationary have been investigated. In future research, it will also be of interest to study living cells with moving proteins and/or cells with Ras expression at an endogenous level (corresponding to cells in a normal, healthy state, as opposed to cancerous cells where Ras is overexpressed). The consideration of living cells over time turns the so far spatial problem into a three-dimensional one, with time as the third dimension. If GAMMICS is combined with an appropriate tracking analysis, providing protein labels over time, it could take into account time as the third dimension when calculating second nearest neighbor distances. In this way, clusters would be defined as groups of proteins close in space and time. This would require the absence of blinking among the proteins, since blinking would hamper the tracking analysis. The distance measure used in the model would have to ensure an adequate influence of both space and time domains, e.g., by standardizing or adequately scaling the variables prior to the analysis, or by employing alternatives to Euclidean distance.
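One simple way to give space and time a comparable influence is to rescale the time axis before computing Euclidean distances. The sketch below computes second nearest neighbor distances for points (x, y, t) under such a scaling. The scaling factor `time_scale` and the function name are hypothetical, and how to choose the factor is left open, as in the text. The values would be squared before entering a mixture model for squared distances.

```python
import math


def second_nn_distances(points, time_scale=1.0):
    """Distance of each point to its second nearest neighbor in scaled space-time.

    points     : list of (x, y, t) tuples, at least three of them
    time_scale : hypothetical factor applied to the time coordinate
    """
    if len(points) < 3:
        raise ValueError("need at least three points")
    scaled = [(x, y, t * time_scale) for x, y, t in points]
    result = []
    for i, p in enumerate(scaled):
        distances = sorted(math.dist(p, q) for j, q in enumerate(scaled) if j != i)
        result.append(distances[1])  # index 0 is the nearest neighbor
    return result


observations = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 2.0), (5.0, 5.0, 5.0)]
unscaled = second_nn_distances(observations, time_scale=1.0)
compressed_time = second_nn_distances(observations, time_scale=0.25)

if __name__ == "__main__":
    print("time_scale = 1.00:", unscaled)
    print("time_scale = 0.25:", compressed_time)
```

In the toy data, shrinking the time axis changes which point counts as the second nearest neighbor of the first observation, which shows how strongly the scaling choice can affect the model input.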

Ras interacts with other signaling proteins that possibly modify its cluster characteristics during the course of signal transduction. Alternatively, Ras may influence these binding partners in their role in the further signal transduction (Tian et al., 2007). Examples of such signaling proteins interacting with Ras include galectins (Hancock, 2006) and Raf kinases, which pass on the signals of active Ras proteins (Wellbrock et al., 2004). It is of interest how the interaction of Ras clusters with such binding partners modifies the subsequent steps in the signaling chain and how this influences relevant biological processes such as cellular growth or tumor progression. To investigate these questions, it is necessary to analyze the joint distribution of Ras and its interacting partners on the nanoscale. In addition, it may be of interest to analyze the joint distribution of different types of Ras proteins that differ in their chemical composition. For these purposes, the experimental setup might be extended to dual color imaging (Bates et al., 2007; Shroff et al., 2007). Since the output of such a setup would technically make it possible to discriminate the differently colored proteins from the start, a simple option to assess a joint distribution might be, e.g., to apply GAMMICS to both protein types separately in a unified sampling framework, assuming a priori independent clustering behaviors. Currently, the applicability of GAMMICS and further methods to a joint analysis of two proteins continues to be investigated in the group of Katja Ickstadt based on simulations (Herrmann et al., 2015).

In Chapter 5, scores inspired by the externally centered correlation coefficient (Schäfer et al., 2009) were designed to measure congruence between two different omics variables. Congruence in this context is understood as the degree to which transcripts present differences between a target sample and a reference sample that are equally directed in both variables. Only one sample per input and condition is required, facilitating analyses for sequencing data featuring extremely small sample sizes. The score values z_i allow genomic loci such as probes or transcripts to be ranked regarding their probability of belonging to a driver
