6.4.1 Change-line classification and regression

This study is in the early stages of development, and there are many possible ways to improve the proposed method. An important topic for future research is to extend and refine the split-line algorithm to feature spaces of dimension higher than two. We applied a probabilistic working model with the EM algorithm introduced in section 6.1.2 to the change-line classification problem with two heterogeneous subgroups determined by three-dimensional feature variables (the results are not presented in this paper). The three direction parameters ω1, ω2, and ω3 can be expressed using two angles θ ∈ (0, π) and ψ ∈ (0, π), where ω1 = cos θ, ω2 = sin θ cos ψ, and ω3 = sin θ sin ψ, so that the condition ∥ω∥ = 1 is satisfied. The estimated model parameters were quite close to the true values, but the direction parameters were not estimated well.
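The angle parameterization above can be checked numerically. The following is a minimal sketch, not the estimation code used in this study; the function names `direction_from_angles` and `norm` are hypothetical and introduced only for illustration.

```python
import math

def direction_from_angles(theta, psi):
    """Map the two angles (theta, psi) to the direction (w1, w2, w3)."""
    w1 = math.cos(theta)
    w2 = math.sin(theta) * math.cos(psi)
    w3 = math.sin(theta) * math.sin(psi)
    return (w1, w2, w3)

def norm(w):
    """Euclidean norm of a vector given as a tuple."""
    return math.sqrt(sum(c * c for c in w))

# Any pair of angles yields a unit-length direction, so an optimizer can
# search over (theta, psi) without an explicit norm constraint.
omega = direction_from_angles(math.pi / 3, math.pi / 4)
print(omega, norm(omega))
```

Because cos²θ + sin²θ(cos²ψ + sin²ψ) = 1, the norm constraint holds for every pair of angles, which reduces the three constrained direction parameters to two unconstrained ones.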

Also, this method seems to work very well when there are two latent subgroups, but it clearly needs to be extended to allow the possibility of more than two subgroups in the population. Several extensions can be made in model (3.1) to allow more than two subgroups, for example, by simply taking two different cut-points such as

Y(θ; X, Z) ∼ 1{ωᵀX ≤ γ1} F(Z; β) + 1{γ1 < ωᵀX ≤ γ2} G(Z; δ) + 1{ωᵀX > γ2} H(Z; α),

where ω ∈ S2 and γ1 < γ2 with γ1, γ2 ∈ [a, b] ⊂ R, under the existing assumptions. To find γ1 and γ2, we can use a grid search instead of a line search, which might be more computationally intensive.
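A grid search over two cut-points can be sketched as follows. This is an illustration under strong simplifying assumptions that are not part of the study: the direction ω is treated as known, so only the projection u = ωᵀX is used, and the three subgroup distributions are replaced by three constant means fitted by least squares. All function names and the simulated data are hypothetical.

```python
import random

def segment_rss(values):
    """Residual sum of squares of values around their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grid_search_two_cutpoints(u, y, grid, min_size=2):
    """Return (gamma1, gamma2, rss) minimizing the three-segment RSS.

    u: projections of the features onto an assumed known direction.
    y: responses. grid: sorted candidate cut-point values.
    """
    best = (None, None, float("inf"))
    for i, g1 in enumerate(grid):
        for g2 in grid[i + 1:]:  # enforce gamma1 < gamma2
            low = [yi for ui, yi in zip(u, y) if ui <= g1]
            mid = [yi for ui, yi in zip(u, y) if g1 < ui <= g2]
            high = [yi for ui, yi in zip(u, y) if ui > g2]
            if min(len(low), len(mid), len(high)) < min_size:
                continue  # skip splits with a nearly empty segment
            rss = segment_rss(low) + segment_rss(mid) + segment_rss(high)
            if rss < best[2]:
                best = (g1, g2, rss)
    return best

# Simulated data with true cut-points at -0.3 and 0.4 (illustrative only).
random.seed(0)
u = [random.uniform(-1.0, 1.0) for _ in range(400)]
y = [(0.0 if ui <= -0.3 else 2.0 if ui <= 0.4 else -1.0)
     + random.gauss(0.0, 0.3) for ui in u]
grid = [round(-0.9 + 0.05 * k, 2) for k in range(37)]
g1_hat, g2_hat, rss = grid_search_two_cutpoints(u, y, grid)
print(g1_hat, g2_hat)
```

The double loop evaluates on the order of K²/2 candidate pairs for a grid of K values, compared with K evaluations for a single cut-point, which is the source of the additional computational cost.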

One of the difficulties is how to handle large data sets such as the chemical toxicity example. To solve this problem, we can consider estimation based on subsampling of observations, which is frequently used in machine learning. For the toxicity data analysis in this preliminary paper, 100 replications of subsampling, each drawing a small portion of the whole sample, were used, and this appeared to work well. Among the advantages of inference based on subsampling is the ability to approximate the correct limiting behavior. Unfortunately, the type of estimation approach used in this study has not been studied extensively. We have significant interest in estimation based on subsampling, and this will be one of our future research topics. Nevertheless, we very much need to develop more computationally efficient approaches to enumerating the relevant hyperplanes, and we plan to work on this problem in the future.
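The subsampling idea can be sketched on a toy problem. The code below is only an illustration: it assumes a single cut-point along a known projection, fits it by least squares on each subsample, and summarizes the replicates by their median. The subsample fraction, the median summary, and all names are assumptions made for this sketch and are not taken from the toxicity analysis.

```python
import random
import statistics

def rss(values):
    """Residual sum of squares of values around their own mean."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values)

def fit_change_point(u, y, grid):
    """Least-squares estimate of one cut-point; None if no valid split."""
    best_g, best_rss = None, float("inf")
    for g in grid:
        left = [yi for ui, yi in zip(u, y) if ui <= g]
        right = [yi for ui, yi in zip(u, y) if ui > g]
        if len(left) < 2 or len(right) < 2:
            continue
        total = rss(left) + rss(right)
        if total < best_rss:
            best_g, best_rss = g, total
    return best_g

def subsample_estimates(u, y, grid, n_reps=100, fraction=0.05, seed=0):
    """Refit the cut-point on n_reps subsamples drawn without replacement."""
    rng = random.Random(seed)
    n = len(u)
    m = max(4, int(fraction * n))
    estimates = []
    for _ in range(n_reps):
        idx = rng.sample(range(n), m)
        est = fit_change_point([u[i] for i in idx], [y[i] for i in idx], grid)
        if est is not None:
            estimates.append(est)
    return estimates

# Simulated data with a true cut-point at 0.6 (illustrative only).
data_rng = random.Random(1)
u = [data_rng.uniform(0.0, 1.0) for _ in range(5000)]
y = [(1.0 if ui > 0.6 else 0.0) + data_rng.gauss(0.0, 0.5) for ui in u]
grid = [k / 20 for k in range(1, 20)]
estimates = subsample_estimates(u, y, grid)
print(statistics.median(estimates))
```

Each replicate works on 250 of the 5000 observations, so the cost per fit is a small fraction of a full-sample fit, and the spread of the replicate estimates gives a rough picture of their sampling variability.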

In addition, we are interested in developing a hypothesis test for the existence of a change-line. In this preliminary study, a graphic examination by Gaussian kernel estimation and local regression was performed to verify whether an abrupt change occurs in the mean and the variance of toxicity activity. There is extensive literature on hypothesis testing for the existence of a change-point based on the weighted bootstrap method (see, for example, Kosorok and Song [2007]) and based on subsampling (see, for example, Lee and Seo [2008]). We expect that a similar approach can be taken with hypothesis testing for the existence of a change-line in the two-dimensional feature space setting, and sup score test statistics and mean score test statistics using the bootstrap technique were examined in section 3.4. Due to computational difficulty, this investigation was carried out in a very limited setting. Therefore, further study to find a more computationally efficient method would be one of the future research topics related to the change-line problem.

We are also interested in additional asymptotic properties of the change-line regression model, including weak convergence of the proposed M-estimators. Graphical investigation of the empirical distribution of the change-line parameters suggested that they might converge to a certain limiting process, so establishing the asymptotic distribution of the estimators could be one of our future research topics. Studying the asymptotic validity of the test procedure would also be helpful. Although further research is needed to overcome its limitations, this preliminary study shows that the proposed method can be an attractive approach to finding latent subgroups in a population.

6.4.2 The interactive decision committee method

The proposed IDC method raises many open problems. First, the current IDC method can be extended to solve multiclass classification problems [Hsu and Lin, 2002] or to predict continuous outcomes [Budka and Gabrys, 2010]. Second, it would be helpful to determine whether the improvement achieved by the IDC method in this paper can be observed in other types of data, such as gene expression data with gene categories. Liu et al. [2004] showed that a combination of feature selection with an ensemble neural network based on individual genes improved a classification task. Since we searched over all second-order interaction terms between feature categories, the current IDC method would be inefficient for a large number of categories. Gene pathways are numerous, so we would need a more efficient way to select second-order interaction terms between gene pathways. When feature categories can be defined in multiple ways, the best choice of feature categories is an open problem.
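The growth in the number of second-order interaction terms can be made concrete with a short sketch. The category names below are placeholders, and the function name `second_order_pairs` is hypothetical; the enumeration only illustrates the count, not the IDC search itself.

```python
from itertools import combinations
from math import comb

def second_order_pairs(categories):
    """Enumerate all unordered pairs of feature categories."""
    return list(combinations(categories, 2))

categories = ["A", "B", "C", "D"]  # placeholder category names
pairs = second_order_pairs(categories)
print(len(pairs))  # 6 pairs for 4 categories

# The number of pairs grows quadratically with the number of categories.
for k in (10, 100, 1000):
    print(k, comb(k, 2))  # 45, 4950, 499500
```

With K categories an exhaustive search evaluates K(K − 1)/2 pairs, so moving from tens of categories to hundreds or thousands of pathways increases the work by several orders of magnitude.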

Finally, further studies to obtain more significant improvements using the IDC method are needed. As several researchers, including Breiman [2001] and Wolpert [1992], have argued, increasing diversity among base classifiers (or minimizing dependency or correlation between base classifiers) and improving the performance of individual base classifiers are key factors in the successful use of the decision committee method. Wang et al. [2009] reported that SVM with bagging or boosting performed better than a single SVM on average. Therefore, it would also be interesting to integrate bootstrap resampling techniques with the IDC method in order to increase diversity, thus potentially achieving better prediction performance, similar to Assareh et al. [2008] and Stefanowski [2005]. One possible alternative is to use the output class probabilities of the base classifiers rather than their predicted class labels, as suggested in Ting et al. [1997]. However, Bauer and Kohavi [1999] argued that combining output class probabilities of the base classifiers can produce slightly worse results than combining classification outputs, so this may not guarantee improved performance of the IDC method.
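The difference between combining predicted class labels and combining output class probabilities can be illustrated with a two-class toy example. This is a generic sketch of majority voting versus probability averaging, not the IDC combination rule; the function names, the 0.5 threshold, and the example probabilities are assumptions made for illustration.

```python
def hard_vote(probs, threshold=0.5):
    """Majority vote over the class labels predicted by each base classifier.

    probs holds each classifier's estimated probability of class 1.
    Ties (possible with an even number of classifiers) resolve to class 0.
    """
    votes = [1 if p >= threshold else 0 for p in probs]
    return 1 if 2 * sum(votes) > len(votes) else 0

def soft_vote(probs, threshold=0.5):
    """Average the class-1 probabilities, then threshold the mean."""
    mean_p = sum(probs) / len(probs)
    return 1 if mean_p >= threshold else 0

# One confident classifier and two mildly opposed ones: the rules disagree.
probs = [0.95, 0.45, 0.40]
print(hard_vote(probs))  # 0, since two of three classifiers predict class 0
print(soft_vote(probs))  # 1, since the mean probability is 0.6
```

The example shows why neither rule dominates: averaging lets one confident classifier override a majority, which helps when its probabilities are well calibrated and hurts when they are not.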

Chapter 7

Appendix