Ensemble-Based Text Classification for Spam Detection
DOI: https://doi.org/10.31449/inf.v48i6.5246
Abstract
This research proposes an ensemble-based approach for spam detection in digital communication, addressing the escalating challenge posed by unsolicited messages, commonly known as spam. The exponential growth of online platforms has necessitated the development of effective information filtering systems to maintain security and efficiency. The proposed approach involves three main components: feature extraction, classifier selection, and decision fusion. Word embedding is explored as the feature extraction technique to represent text messages effectively. Multiple recurrent neural network (RNN) classifiers, including LSTM and GRU, are evaluated to identify the best performers for spam detection. The ensemble model combines the strengths of the individual classifiers to achieve higher accuracy, precision, and recall. The proposed approach is evaluated with widely accepted metrics on benchmark datasets to assess its generalizability and robustness. The experimental results demonstrate that the ensemble-based approach outperforms individual classifiers, offering an efficient solution for combating spam messages. Integrating this approach into existing spam filtering systems can contribute to improved online communication, a better user experience, and enhanced cybersecurity, mitigating the impact of spam in the digital landscape.
References
Yadav, B. P., Ghate, S., Harshavardhan, A., Jhansi, G., Kumar, K. S., & Sudarshan, E. (2020, December). Text categorization performance examination using machine learning algorithms. In IOP Conference Series: Materials Science and Engineering (Vol. 981, No. 2, p. 022044). IOP Publishing.
Wang, S., Cai, J., Lin, Q., & Guo, W. (2019). An overview of unsupervised deep feature representation for text categorization. IEEE Transactions on Computational Social Systems, 6(3), 504-517.
Belazzoug, M., Touahria, M., Nouioua, F., & Brahimi, M. (2020). An improved sine cosine algorithm to select features for text categorization. Journal of King Saud University-Computer and Information Sciences, 32(4), 454-464.
Almuzaini, H. A., & Azmi, A. M. (2020). Impact of stemming and word embedding on deep learning-based Arabic text categorization. IEEE Access, 8, 127913-127928.
Lee, J., Yu, I., Park, J., & Kim, D. W. (2019). Memetic feature selection for multilabel text categorization using label frequency difference. Information Sciences, 485, 263-280.
Chen, S. W., Chen, Y. W., & Wei, C. P. (2020). Deep learning-based text classification: A comprehensive review. Journal of Computer Science and Technology, 35(1), 143-165.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.
Gupta, B. B., & Soni, D. (2020). Detecting malicious URLs using machine learning algorithms: A comparative study. International Journal of Advanced Computer Science and Applications, 11(9), 185-191.
Maatuk, M. J. A., & Abbass, H. A. (2020). Spam detection in online social networks: A survey. IEEE Access, 8, 189095-189105.
Singh, A. K., & Singh, S. K. (2018). Text classification using ensemble methods: A survey. Procedia Computer Science, 132, 1095-1102.
Zhou, Z., & Wu, H. (2020). Ensemble methods in machine learning: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 50(5), 1774-1792.
Al-Salemi, B., Ayob, M., Kendall, G., & Noah, S. A. M. (2019). Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. Information Processing & Management, 56(1), 212-227.
Berge, G. T., Granmo, O. C., Tveit, T. O., Goodwin, M., Jiao, L., & Matheussen, B. V. (2019). Using the Tsetlin machine to learn human-interpretable rules for high-accuracy text categorization with medical applications. IEEE Access, 7, 115134-115146.
Kilimci, Z. H., & Akyokuş, S. (2019, July). The analysis of text categorization represented with word embeddings using homogeneous classifiers. In 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA) (pp. 1-6). IEEE.
Cherif, W., Madani, A., & Kissi, M. (2021). Text categorization based on a new classification by thresholds. Progress in Artificial Intelligence, 10(4), 433-447.
Ahmed, H., Traore, I., & Saad, S. (2018). Detecting opinion spams and fake news using text classification. Security and Privacy, 1(4), e9.
Martens, D., & Maalej, W. (2019). Towards understanding and detecting fake reviews in app stores. Empirical Software Engineering, 24(6), 3316-3355.
Jindal, N., & Liu, B. (2007, October). Analyzing and detecting review spam. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007) (pp. 547-552). Omaha, NE, USA.
Li, J., Ott, M., Cardie, C., & Hovy, E. (2014, June). Towards a general rule for identifying deceptive opinion spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1566-1576). Baltimore, MD, USA.
Lin, Y., Zhu, T., Wu, H., Zhang, J., Wang, X., & Zhou, A. (2014, August). Towards online anti-opinion spam: Spotting fake reviews from the review sequence. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014) (pp. 261-264). Beijing, China.
Ren, Y., & Ji, D. (2017). Neural networks for deceptive opinion spam detection: An empirical study. Information Sciences, 385-386, 213-224.
Ren, Y., Ji, D., & Zhang, H. (2014, October). Positive unlabeled learning for deceptive reviews detection. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 488-498). Doha, Qatar.
Sharaff, A., & Soni, A. (2018, May). Analyzing sentiments of product reviews based on features. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 710-713). Tirunelveli, India.
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask to retain the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika