Automatic Detection of Stop Words for Texts in Uzbek Language
DOI: https://doi.org/10.31449/inf.v47i2.3788
Abstract
Stop words play an important role in information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in Uzbek texts. Because few methods are available for the automatic detection of stop words in Uzbek texts, we analyzed a newly prepared corpus. Uzbek is an agglutinative language. As in all agglutinative languages, detecting stop words in Uzbek texts is more complex than in inflected languages. In inflected languages, auxiliary words, articles, and prepositions can be included in the stop word group; in agglutinative languages, the meanings of such words are hidden in the text. It is therefore not appropriate to apply every known stop word detection method for inflected languages directly to agglutinative languages. In this work, we investigated the “School corpus”, which contains 731156 Uzbek words. We applied the bigram method of analysis to the corpus and proposed a collocation method for automatically detecting stop words in Uzbek texts. The collocation method is shown to be 6 times better than the bigram method.
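The abstract names a bigram analysis and a collocation method but does not describe how either is computed. The following is a minimal Python sketch of one plausible reading, not the authors' procedure: it counts adjacent word pairs and ranks words by how many distinct neighbours they occur with, on the assumption that function-like words combine with many different words. The helper names (`tokenize`, `bigram_counts`, `stop_word_candidates`), the punctuation set, and the toy text are all hypothetical and are not taken from the School corpus.

```python
from collections import Counter, defaultdict

# Punctuation stripped from word edges (an assumption; apostrophes are kept
# because they occur inside Uzbek words).
PUNCT = ".,;:!?\"()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip edge punctuation."""
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]


def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def stop_word_candidates(tokens, top_n=3):
    """Rank words by the number of distinct neighbours they appear next to.

    This neighbour-diversity score is an illustrative stand-in for a
    collocation measure, not the measure used in the paper.
    """
    neighbours = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        neighbours[left].add(right)
        neighbours[right].add(left)
    # Most neighbours first; ties are broken alphabetically for determinism.
    ranked = sorted(neighbours, key=lambda w: (-len(neighbours[w]), w))
    return ranked[:top_n]


if __name__ == "__main__":
    # Toy text: three short "X va Y" ("X and Y") phrases.
    toy_tokens = tokenize("kitob va daftar. ona va ota. olma va nok.")
    print(bigram_counts(toy_tokens).most_common(3))
    print(stop_word_candidates(toy_tokens, top_n=1))
```

On this toy text the conjunction "va" ranks first because it neighbours six different words, while every content word neighbours at most two. A real corpus would also need sentence-boundary handling and a frequency threshold, both of which this sketch leaves out.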
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika