Gender Classification on Twitter Based on Feeds and User Descriptions Using Xlnet-Fasttext
DOI: https://doi.org/10.31449/inf.v48i20.5761
Abstract
Gender falsification on social media is an increasingly troubling problem: users often hide their true gender identity or pose as members of a different gender. This can lead to negative consequences, including the spread of disinformation, discrimination, and online security risks. To address this problem, this research proposes a text-classification-based solution for identifying gender falsification in social media texts. The method extracts linguistic features from the texts, such as word usage, sentence structure, and language patterns, that can provide clues to the author's gender. To this end, the research introduces a new transformer-based approach that uses XLNet modified with an additional FastText embedding. The modification is made in the embedding section and is intended to improve XLNet's understanding of text context when performing text classification. For gender classification based on Twitter feeds, baseline XLNet achieves accuracy, precision, recall, and F1-score of 0.704, 0.770, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.714, 0.770, 0.609, and 0.680. For gender classification based on user account descriptions, baseline XLNet achieves accuracy, precision, recall, and F1-score of 0.705, 0.771, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.724, 0.751, 0.6324, and 0.686.
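The F1-scores quoted in the abstract can be cross-checked against the quoted precision and recall, since F1 is their harmonic mean. The sketch below is only an illustration of that arithmetic and is not the authors' evaluation code; the function names, the dictionary layout, and the 0.002 tolerance (chosen to absorb rounding in the three-decimal inputs) are assumptions made for this example.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) exactly as quoted in the abstract.
# The dictionary keys are labels invented for this example.
REPORTED = {
    "feeds / XLNet": (0.770, 0.598, 0.674),
    "feeds / XLNet-FastText": (0.770, 0.609, 0.680),
    "descriptions / XLNet": (0.771, 0.598, 0.674),
    "descriptions / XLNet-FastText": (0.751, 0.6324, 0.686),
}


def check_reported_f1(reported: dict, tolerance: float = 0.002) -> dict:
    """Map each setting to True if the recomputed F1 is within `tolerance`
    of the reported F1. The tolerance is an assumption for this sketch."""
    return {
        name: abs(f1_from_precision_recall(p, r) - f1) <= tolerance
        for name, (p, r, f1) in reported.items()
    }


if __name__ == "__main__":
    for name, consistent in check_reported_f1(REPORTED).items():
        print(f"{name}: consistent={consistent}")
```

Under these assumptions, all four reported F1-scores agree with the values recomputed from precision and recall to within rounding.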
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika