Clustering of variables for enhanced interpretability of predictive models.
DOI: https://doi.org/10.31449/inf.v45i4.3283

Abstract

A new strategy is proposed for building easy-to-interpret predictive models from high-dimensional datasets with a large number of highly correlated explanatory variables. The strategy begins with a variable-clustering step using the CLustering of Variables around Latent Variables (CLV) method. The hierarchical clustering dendrogram is then explored in order to select the explanatory variables sequentially, in a group-wise fashion. For model fitting, the dendrogram is used as the base-learner in an L2-boosting procedure. The proposed approach, named lmCLV, is illustrated on a toy simulated example in which the clusters and the predictive equation are known, and on a real case study dealing with the authentication of orange juices based on 1H-NMR spectroscopic analysis. In both illustrative examples, the procedure showed predictive efficiency similar to that of competing methods, with additional interpretability. It is available in the R package "ClustVarLV".
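The abstract combines two ingredients: hierarchical clustering of correlated variables, and component-wise L2-boosting in which each candidate base-learner is a latent component of a variable cluster. The following is a minimal Python sketch of that general idea only, not the lmCLV implementation itself (which is provided in the R package "ClustVarLV"); the simulated data, the use of cluster means as latent components, the fixed cluster count, and the shrinkage rate `nu` are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Simulated data: 3 groups of 5 correlated predictors each,
# with the response driven by the first two latent group signals.
n, p_per_group, n_groups = 100, 5, 3
Z = rng.normal(size=(n, n_groups))                       # latent group signals
X = np.hstack([Z[:, [g]] + 0.3 * rng.normal(size=(n, p_per_group))
               for g in range(n_groups)])
y = 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(scale=0.5, size=n)

# Step 1: hierarchical clustering of the VARIABLES, using
# (1 - correlation) as a dissimilarity between columns of X.
D = 1.0 - np.corrcoef(X, rowvar=False)
tree = linkage(squareform(D, checks=False), method="average")
labels = fcluster(tree, t=n_groups, criterion="maxclust")

# Step 2: one latent component per cluster (here simply the cluster
# mean, standardized), then component-wise L2-boosting: at each step,
# fit every component to the current residuals and update the model
# with a shrunken version of the best-fitting one.
comps = np.column_stack([X[:, labels == k].mean(axis=1)
                         for k in np.unique(labels)])
comps = (comps - comps.mean(axis=0)) / comps.std(axis=0)

nu, n_steps = 0.1, 200                    # shrinkage rate, boosting steps
coef = np.zeros(comps.shape[1])
resid = y - y.mean()                      # start from the centered response
for _ in range(n_steps):
    slopes = comps.T @ resid / n          # OLS slope of each standardized component
    j = np.argmax(np.abs(slopes))         # component that best reduces the residuals
    coef[j] += nu * slopes[j]
    resid -= nu * slopes[j] * comps[:, j]
```

Because the predictors enter the model only through whole-cluster components, the fitted coefficients can be read at the level of variable groups, which is the interpretability gain the abstract refers to.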
Copyright © Slovenian Society Informatika