3.1 Reliance on the list of models and the restriction to linear combinations
We agree with Clyde and Zhou that the performance of stacking depends on the choice of model list, as stacking can do no better than the optimal linear combination of the models in the list. Stacking is not strongly sensitive to misspecified models (see Section 4.1 of our paper), but it is sensitive to how good an approximation is possible within the ensemble space.
We discuss the concern about the inflexibility of the linear additive form of density combination in Section 5.2, where we construct the same orthogonal regression example as Clyde, in which stacking cannot approximate a true model that is a convolution of the individual densities. By optimizing the leave-one-out performance of the combined prediction, the stacking framework can be extended to more general combination forms, such as the posterior family used in the BPS literature. Furthermore, simplex constraints become unnecessary once the combination goes beyond a linear combination of densities. We are interested in testing such approaches. Yoo proposes another way to obtain convolutional combinations by stacking in the Fourier domain.
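To make the "optimal linear combination from the model list" concrete, the following is a minimal sketch of log-score stacking, not the implementation used in our paper or in the loo package. It assumes the pointwise leave-one-out log predictive densities are already available, and it uses a simple EM-style fixed-point update for the simplex-constrained weights. The function names, the toy data, and the two fixed candidate densities are all hypothetical and chosen only for illustration.

```python
import math


def stacking_weights(lpd, n_iter=1000, tol=1e-10):
    """Approximate log-score stacking weights.

    lpd[i][k] is the log leave-one-out predictive density of observation i
    under model k. The returned weights lie on the simplex and iteratively
    increase sum_i log(sum_k w_k * exp(lpd[i][k])).
    """
    n, n_models = len(lpd), len(lpd[0])
    # Rescale each row by its largest density; this only shifts the
    # objective by a constant and avoids underflow in exp().
    dens = []
    for row in lpd:
        row_max = max(row)
        dens.append([math.exp(v - row_max) for v in row])
    w = [1.0 / n_models] * n_models
    for _ in range(n_iter):
        new_w = [0.0] * n_models
        for row in dens:
            mix = sum(wk * d for wk, d in zip(w, row))
            for k in range(n_models):
                # EM-style multiplicative update for mixture weights
                new_w[k] += w[k] * row[k] / (mix * n)
        change = max(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w


def normal_logpdf(y, mu, sigma=1.0):
    """Log density of N(mu, sigma^2) at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((y - mu) / sigma) ** 2


# Toy data (made up): six points near -2 and three points near 2.
y_obs = [-2.3, -1.8, -2.1, -1.5, -2.6, -1.9, 2.2, 1.7, 2.4]
# Two fixed candidate densities, N(-2, 1) and N(2, 1). Nothing is fitted
# here, so the leave-one-out density is simply the density itself.
lpd = [[normal_logpdf(y, -2.0), normal_logpdf(y, 2.0)] for y in y_obs]
weights = stacking_weights(lpd)
print(weights)
```

In this toy setting the weights should land roughly at the 6:3 split of the data between the two components. Neither candidate alone describes the data, but their linear mixture does, which is the favorable case for stacking; a truth that is a convolution of the candidates would not be reachable by any choice of weights.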
3.2 Model expansion as an alternative
One setting where stacking can be used, but full model expansion could be more difficult, is when a set of different sorts of models has been fit separately. The same idea is summarized by Pericchi as “careful consideration of all the entertained models and admissible estimators for parameters should be considered prior to the optimization procedures.” We are less concerned about the situation described by Belitser and Nurushev, Shin, and Zhou, in which the number of models is so large that stacking can be both computationally expensive and theoretically inconsistent, because in that setting we would recommend moving to a continuous model space that encompasses the separate models in the list.
Stacking is not designed for model selection, but for model averaging to get good predictions. We do not recommend using it for model selection, although models with zero weights could be discarded from the average. For large p and small n, instead of stacking or other model averaging methods, we recommend using an encompassing model with all variables and prior information about the desired level of sparsity (Piironen and Vehtari, 2017b,c). For example, the regularized horseshoe prior can be considered a continuous extension of the spike-and-slab prior with discrete model averaging over models with different variable combinations (Piironen and Vehtari, 2017c). For high-dimensional variable selection we recommend a projection predictive approach (Piironen and Vehtari, 2016, 2017a), which has a smaller variance in the selection process because it uses the encompassing model as a reference model, and has better predictive performance because inference is made conditional on the selection process and the encompassing model.
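The remark that the regularized horseshoe is a continuous analogue of spike-and-slab can be illustrated with prior draws. The sketch below assumes the usual parameterization of the regularized horseshoe, with half-Cauchy local scales lambda_j, a global scale tau, and a slab width c that caps the prior standard deviation of each coefficient. Here c and tau are held fixed for simplicity, and all function names and numerical values are hypothetical.

```python
import math
import random


def regularized_scale(lam, tau, c):
    """Prior standard deviation tau * lambda_tilde of one coefficient.

    lambda_tilde^2 = c^2 * lam^2 / (c^2 + tau^2 * lam^2), so the scale is
    about tau * lam for small lam (strong shrinkage, the "spike") and
    approaches c for large lam (the "slab").
    """
    lam_tilde_sq = (c ** 2 * lam ** 2) / (c ** 2 + tau ** 2 * lam ** 2)
    return tau * math.sqrt(lam_tilde_sq)


def draw_coefficients(p, tau, c, rng):
    """Draw p coefficients from the prior with fixed tau and c."""
    coefs = []
    for _ in range(p):
        # Half-Cauchy(0, 1) local scale via the inverse CDF tan(pi * u / 2)
        lam = math.tan(0.5 * math.pi * rng.random())
        coefs.append(rng.gauss(0.0, regularized_scale(lam, tau, c)))
    return coefs


rng = random.Random(1)
beta = draw_coefficients(p=1000, tau=0.05, c=2.0, rng=rng)
near_zero = sum(abs(b) < 0.1 for b in beta) / len(beta)
print(f"share of prior draws with |beta| < 0.1: {near_zero:.2f}")
```

Most draws concentrate near zero while a minority spread out on the scale of c, which mimics inclusion and exclusion of variables without enumerating discrete models.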
3.3 Nonparametric approaches
Li and Iacopini and Tonellato suggest the use of nonparametric reference models to eliminate the need for cross-validation. If we are able to build a good nonparametric model, there is probably no need for model averaging. Although model averaging might be used as part of model reduction, instead of using component models p(·|y, M_k) we would prefer to form the component models using a projection predictive approach, which projects the information from the reference model onto the restricted models (Piironen and Vehtari, 2016, 2017a).
Zhou suggests Bayesian nonparametric (BNP) models as an alternative to model averaging. Indeed, the spline models used in the experiments in Section 4.6 of our paper can be considered BNP models. We can also compute fast LOO-CV for Gaussian processes and other Gaussian latent variable models (Vehtari et al., 2016).
3.4 Logarithmic scoring rules
Finally, we emphasize that the choice of scoring rule in stacking depends on the underlying application, and no single rule is likely to be optimal in advance for every situation. As Winkler, Jose, Lichtendahl and Grushka-Cockayne and Grünwald and Heide point out, there is no need to use the log score if the focus is on some other utility. Our proposed stacking framework is applicable to any scoring rule. We are particularly interested in interval stacking that optimizes the interval score, which is likely to provide better interval estimation and posterior uncertainties.
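As a concrete example of a scoring rule other than the log score, the sketch below computes the interval score in its commonly used form for a central (1 - alpha) prediction interval: the interval width plus a penalty of 2/alpha times the distance by which the observation falls outside the interval. Lower values are better. Which exact form an interval stacking procedure would optimize is an assumption on our part here, and the function name and numbers are hypothetical.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score of a central (1 - alpha) interval [lower, upper] at y.

    Negatively oriented: narrow intervals that still cover y score lowest.
    """
    width = upper - lower
    penalty_below = (2.0 / alpha) * max(lower - y, 0.0)  # y under the interval
    penalty_above = (2.0 / alpha) * max(y - upper, 0.0)  # y over the interval
    return width + penalty_below + penalty_above


# A 90% interval [-1, 1] scored at a covered and at a missed observation.
covered = interval_score(-1.0, 1.0, 0.0, alpha=0.1)
missed = interval_score(-1.0, 1.0, 2.0, alpha=0.1)
print(covered, missed)
```

Stacking under this rule would replace the log predictive density of each held-out point with the interval score of the combined predictive interval, and minimize the average over held-out points.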
We thank Franck for numerically verifying that stacking outperforms intrinsic Bayesian model averaging (iBMA) in simulations. This result suggests that the stacking procedure’s prior invariance property is a convenient bonus but not the only reason for its impressive performance.
References
Bernardo, J. M. and Smith, A. F. (1994). Bayesian theory. John Wiley & Sons. MR1274699. doi: https://doi.org/10.1002/9780470316870. 1001
Buerkner, P., Vehtari, A., and Gabry, J. (2018). “PSIS assisted m-step-ahead predictions for time-series models.” Technical report. URL http://mc-stan.org/loo/articles/m-step-ahead-predictions.html 1003, 1004
Dawid, A. P. (1984). “Present position and potential developments: Some personal views: Statistical theory: The prequential approach.” Journal of the Royal Statistical Society. Series A, 278–292. MR0763811. doi: https://doi.org/10.2307/2981683. 1003
Geweke, J. and Amisano, G. (2011). “Optimal prediction pools.” Journal of Econometrics, 164(1): 130–141. MR2821798. doi: https://doi.org/10.1016/j.jeconom.2011.02.017. 1003
Geweke, J. and Amisano, G. (2012). “Prediction with misspecified models.” American Economic Review, 102(3): 482–486. 1003
Kamary, K., Mengersen, K., Robert, C. P., and Rousseau, J. (2014). “Testing hypotheses via a mixture estimation model.” arXiv preprint arXiv:1412.2044. 1004
McAlinn, K., Aastveit, K. A., Nakajima, J., and West, M. (2017). “Multivariate Bayesian Predictive Synthesis in Macroeconomic Forecasting.” arXiv preprint arXiv:1711.01667. 1004
McAlinn, K. and West, M. (2017). “Dynamic Bayesian predictive synthesis in time series forecasting.” arXiv preprint arXiv:1601.07463. MR3664859. 1004
Piironen, J. and Vehtari, A. (2016). “Projection predictive model selection for Gaussian processes.” In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 1–6. 1005, 1006
Piironen, J. and Vehtari, A. (2017a). “Comparison of Bayesian predictive methods for model selection.” Statistics and Computing, 27(3): 711–735. 1005, 1006
Piironen, J. and Vehtari, A. (2017b). “On the hyperprior choice for the global shrinkage parameter in the horseshoe prior.” In Artificial Intelligence and Statistics, 905–913. 1005
Piironen, J. and Vehtari, A. (2017c). “Sparsity information and regularization in the horseshoe and other shrinkage priors.” Electronic Journal of Statistics, 11(2): 5018–5051. 1005
Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J. J., Schröder, B., Thuiller, W., et al. (2017). “Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure.” Ecography, 40(8): 913–929. 1003
Shimodaira, H. (2000). “Improving predictive inference under covariate shift by weighting the log-likelihood function.” Journal of Statistical Planning and Inference, 90(2): 227–244. MR1795598. doi: https://doi.org/10.1016/S0378-3758(00)00115-4. 1002
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research, 8(May): 985–1005. 1002
Sugiyama, M. and Müller, K.-R. (2005). “Input-dependent estimation of generalization error under covariate shift.” Statistics & Decisions, 23(4/2005): 249–279. MR2255627. doi: https://doi.org/10.1524/stnd.2005.23.4.249. 1002
Vehtari, A., Buerkner, P., and Gabry, J. (2018a). “Leave-one-out cross-validation for non-factorizable models.” Technical report. URL http://mc-stan.org/loo/articles/loo2-non-factorizable.html 1003
Vehtari, A., Gabry, J., Yao, Y., and Gelman, A. (2018b). “loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models.” R package version 2.0.0. 1003
Vehtari, A., Gelman, A., and Gabry, J. (2017). “Pareto smoothed importance sampling.” arXiv preprint arXiv:1507.02646. 1003
Vehtari, A., Mononen, T., Tolvanen, V., Sivula, T., and Winther, O. (2016). “Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models.”