Static and incremental overlapping clustering algorithms for large collections processing in GPU
Abstract
There are several problems in Pattern Recognition and Data Mining that, by their inherent nature, consider that objects could belong to more than one class; that is, clusters can overlap each other. OClustR and DClustR are overlapping clustering algorithms that have shown, in the task of documents clustering, the better tradeoff between quality of the clusters and efficiency, among the existing overlapping clustering algorithms. Despite the good achievements attained by both aforementioned algorithms, they are O(n2) so they could be less useful in applications dealing with a large number of documents. Moreover, although DClustR can efficiently process changes in an already clustered collection, the mount of memory it uses could make it not suitable for applications dealing with very large document collections. In this paper, two GPU-based parallel algorithms, named CUDA-OClus and CUDA-DClus, are proposed in order to enhance the efficiency of OClustR and DClustR, respectively, in problems dealing with a very large number of documents. The experimental evaluation conducted over several standard document collections showed the correctness of both CUDA-OClus and CUDA-DClus, and also their better performance in terms of efficiency and memory consumption.References
E. Bae, J. Bailey, G. Dong, A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings, Data Mining and Knowledge Discovery 21 (3) (2010) 427–471.
A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM computing surveys (CSUR) 31 (3) (1999) 264–323.
S. Gregory, A fast algorithm to find overlapping communities in networks, in: Machine learning and knowledge discovery in databases, Springer, 2008, pp. 408–423.
J. Aslam, K. Pelekhov, D. Rus, Static and dynamic information organization with star clusters, in: Proceedings of the seventh international conference on Information and knowledge management, ACM, 1998, pp. 208–217.
A. Pons-Porrata, R. Berlanga-Llavori, J. Ruiz-Shulcloper, J. M. Pérez-Martínez, Jerartop: A new topic detection system, in: Progress in Pattern Recognition, Image Analysis and Applications, Springer, 2004, pp. 446–453.
A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, Oclustr: A new graph-based algorithm for overlapping clustering, Neurocomputing 121 (2013) 234–247.
A. Pérez-Suárez, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, J. E. Medina-Pagola, An algorithm based on density and compactness for dynamic overlapping clustering, Pattern Recognition 46 (11) (2013) 3040–3055.
L. J. G. Soler, A. P. Suárez, L. Chang, Efficient overlapping document clustering using gpus and multi-core systems, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - 19th Iberoamerican Congress, CIARP 2014, Puerto Vallarta, Mexico, November 25, 2014. Proceedings, 2014, pp. 264–271.
M. W. Berry, M. Castellanos, Survey of text mining, Computing Reviews 45 (9) (2004) 548.
R. Gil-García, A. Pons-Porrata, Dynamic hierarchical algorithms for document clustering, Pattern Recognition Letters 31 (6) (2010) 469–477.
J. Sanders, E. Kandrot, CUDA by example: an introduction to general-purpose GPU programming, Addison-Wesley Professional, 2010.
D. B. Kirk, W. H. Wen-mei, Programming massively parallel processors: a hands-on approach, Newnes, 2012.
G. Salton, A. Wong, C.-S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (11) (1975) 613– 620.
E. Amigó, J. Gonzalo, J. Artiles, F. Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information retrieval 12 (4) (2009) 461–486.
Downloads
Published
How to Cite
Issue
Section
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika