Enhancing Phishing Website Detection via Feature Selection in URL-Based Analysis
DOI: https://doi.org/10.31449/inf.v47i9.5177
Abstract
Accurate detection of phishing websites is crucial for keeping online users safe and for maintaining a secure digital environment. This research examines how a new dataset generation method can improve the detection of phishing websites. The method transforms a raw dataset obtained from Mendeley, using regular expressions to extract the important features so that detection can be performed correctly and with high performance. Based on the proposed features, we selected the best machine-learning algorithm.
We performed a rigorous evaluation using three prominent machine learning algorithms: Decision Trees, Support Vector Machines (SVM), and Random Forests, achieving accuracies of 0.96 for Decision Tree, 0.97 for SVM, and 0.98 for Random Forest.
One of the critical contributions of this research is the deliberate selection of features. We have leveraged regular expressions to create a feature set that captures salient aspects of URLs and optimizes the algorithms' detection capabilities.
This research has examined how feature selection affects the performance of each algorithm, highlighting its strengths and uncovering its weaknesses.
Povzetek (Summary): The main contribution of this research is the deliberate selection of features. We used regular expressions to create a feature set that captures important aspects of URLs and optimizes the detection capabilities of the algorithms.
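The regular-expression feature extraction described in the abstract can be illustrated with a minimal sketch. The patterns, the feature names, and the `extract_features` helper below are hypothetical choices made for illustration; they are not the feature set used in the paper, which the abstract does not enumerate.

```python
import re
from urllib.parse import urlsplit

# Hypothetical lexical patterns, assumed for illustration only.
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")   # host written as a dotted IPv4 address
SPECIAL_CHARS = re.compile(r"[@\-_%=&?]")              # characters often counted in URL analysis
DIGITS = re.compile(r"\d")


def extract_features(url: str) -> dict:
    """Map a URL string to a small dictionary of numeric lexical features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "uses_https": int(parts.scheme == "https"),
        "host_is_ipv4": int(bool(IPV4_HOST.match(host))),
        "num_dots_in_host": host.count("."),
        "num_special_chars": len(SPECIAL_CHARS.findall(url)),
        "num_digits": len(DIGITS.findall(url)),
        "has_at_symbol": int("@" in url),
    }


if __name__ == "__main__":
    # Placeholder URLs from reserved example ranges, not taken from the dataset.
    for example in ("https://example.com/login",
                    "http://192.0.2.10/secure-update?id=7"):
        print(example, extract_features(example))
```

Each URL becomes one row of numeric features, which is the form a Decision Tree, SVM, or Random Forest classifier would take as input. Which features actually help each algorithm is the question the paper's feature selection addresses.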
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika