An Automated Python Script for Data Cleaning and Labeling using Machine Learning Technique
DOI: https://doi.org/10.31449/inf.v47i6.4474
Abstract
Anyone in an organization who works with data needs that data to be clean and free of noise. Because data warehouses store and continually update very large volumes of data from many sources, some of those sources are likely to contain inaccurate records. Noise and poor characterization in the large volume of available data, together with the insensitivity and inefficiency of manual data cleaning and labeling, make the data ambiguous to present and difficult to assess. This study identified a gap in the development of better data analysis methods, which motivated the creation of a Python script that cleans and labels data automatically. The work proceeded in four steps. First, a financial dataset was obtained from Kaggle. Second, a machine learning (ML) approach was built in Python to automate cleaning of the financial dataset; it covers ingesting data, handling incomplete data, handling anomalies, one-hot and label encoding, extracting date and time values, and normalizing the data. Third, an unsupervised ML method (k-means) was implemented to automate labeling of the financial dataset; it covers the elbow method, k-means clustering, data modeling of "age" versus "arrival," dimensionality reduction, computer vision, and categorizing the dataset by cluster. Fourth, the cleaned and labeled dataset was assessed empirically by comparing the cleaned dataset before and after PCA was applied. The results show that the developed ML technique not only improved performance on the audit data used in this study but also labeled the data after cleaning it and removing noisy and incomplete records, as shown by the k-means clustering result and the grouping by PCA.
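As a rough illustration of the kind of pipeline the abstract summarizes, the sketch below chains the cleaning steps (imputation, anomaly handling, date extraction, label and one-hot encoding, normalization) with the labeling steps (elbow method, k-means, PCA). It is not the authors' script. The toy data, the column names `income`, `signup_date`, `risk`, and `segment`, and the specific choices of median imputation, IQR clipping, and min-max scaling are all assumptions made for the example; it relies on pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


def clean(df):
    """Impute, clip anomalies, extract date parts, encode, and normalize."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    # Incomplete data: fill numeric gaps with the column median (assumed rule).
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    # Anomalies: clip values lying beyond 1.5 * IQR of each column (assumed rule).
    q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
    iqr = q3 - q1
    df[num_cols] = df[num_cols].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)
    # Date and time values: replace the timestamp with numeric parts.
    ts = pd.to_datetime(df.pop("signup_date"))
    df["signup_year"], df["signup_month"] = ts.dt.year, ts.dt.month
    # Label encoding for an ordinal column, one-hot encoding for a nominal one.
    df["risk"] = df["risk"].map({"low": 0, "medium": 1, "high": 2})
    df = pd.get_dummies(df, columns=["segment"], dtype=float)
    # Normalization: rescale every feature to the range [0, 1].
    return pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)


def elbow_inertias(X, k_values, seed=0):
    """Within-cluster sum of squares for each candidate k (elbow method)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def label(X, k, seed=0):
    """Cluster labels from k-means plus a 2-D PCA projection for inspection."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    coords = PCA(n_components=2).fit_transform(X)
    return model.labels_, coords


# Hypothetical toy data standing in for the financial dataset.
rng = np.random.default_rng(0)
n = 60
raw = pd.DataFrame({
    "age": np.r_[rng.normal(25, 3, n // 2), rng.normal(55, 3, n // 2)],
    "income": np.r_[rng.normal(30_000, 2_000, n // 2),
                    rng.normal(90_000, 2_000, n // 2)],
    "signup_date": pd.date_range("2018-01-01", periods=n, freq="30D"),
    "risk": np.resize(["low", "medium", "high"], n),
    "segment": np.resize(["retail", "corporate"], n),
})
raw.loc[3, "age"] = np.nan          # an incomplete record
raw.loc[7, "income"] = 10_000_000   # an anomalous record

cleaned = clean(raw)
inertias = elbow_inertias(cleaned, range(1, 6))
labels, coords = label(cleaned, k=2)
```

Plotting `inertias` against the candidate values of k and looking for the bend is the usual way to pick the cluster count, and `coords` gives a two-dimensional view in which the resulting groups can be inspected.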
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika