Discriminating Between Closely Related Languages on Twitter

Authors

  • Nikola Ljubešić
  • Denis Kranjčić

Abstract

In this paper we tackle the problem of discriminating Twitter users by the language they tweet in, taking into account very similar South-Slavic languages – Bosnian, Croatian, Montenegrin and Serbian. We apply the supervised machine learning approach by annotating a subset of 500 users from an existing Twitter collection by the language the users primarily tweet in. We show that by using a simple bag-of-words model, univariate feature selection, the 320 strongest features and a standard classifier, we reach a user classification accuracy of 98%. Annotating the whole Twitter collection of 63,160 users with the best performing classifier and visualizing it on a map via tweet geo-information, we produce a Twitter language map which clearly depicts the robustness of the classifier.
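
The pipeline named in the abstract (bag-of-words, univariate feature selection, a fixed number of top features, a standard classifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the chi-square scoring, the multinomial Naive Bayes classifier, the whitespace tokenizer, the function names, and the six toy texts. The paper keeps 320 features; the toy example keeps 2.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def chi2_scores(docs, labels):
    """Score each token by a chi-square test on token presence vs. class."""
    n = len(docs)
    doc_sets = [set(tokenize(d)) for d in docs]
    vocab = set().union(*doc_sets)
    class_sizes = Counter(labels)
    scores = {}
    for tok in vocab:
        present = [tok in s for s in doc_sets]
        n_tok = sum(present)
        best = 0.0
        for cls, size in class_sizes.items():
            # 2x2 table: token present/absent vs. class cls / other classes.
            a = sum(1 for p, y in zip(present, labels) if p and y == cls)
            b = n_tok - a
            c = size - a
            d = n - a - b - c
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[tok] = best
    return scores


def select_top_k(scores, k):
    # Keep the k highest-scoring tokens; ties are broken alphabetically.
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])


def train_nb(docs, labels, features):
    # Multinomial Naive Bayes counts, restricted to the selected features.
    counts = {cls: Counter() for cls in set(labels)}
    for doc, y in zip(docs, labels):
        counts[y].update(t for t in tokenize(doc) if t in features)
    return counts, Counter(labels), features


def predict_nb(model, doc):
    counts, priors, features = model
    total = sum(priors.values())
    best_cls, best_lp = None, -math.inf
    for cls in sorted(counts):
        # Laplace-smoothed log-likelihood over the selected features only.
        denom = sum(counts[cls].values()) + len(features)
        lp = math.log(priors[cls] / total)
        for t in tokenize(doc):
            if t in features:
                lp += math.log((counts[cls][t] + 1) / denom)
        if lp > best_lp:
            best_cls, best_lp = cls, lp
    return best_cls


# Toy data invented for this sketch (not from the paper's collection).
docs = ["što je kruh", "tko je to", "kruh i tjedan",
        "šta je hleb", "ko je to", "hleb i nedelja"]
labels = ["hr", "hr", "hr", "sr", "sr", "sr"]

features = select_top_k(chi2_scores(docs, labels), k=2)
model = train_nb(docs, labels, features)
```

Calling `predict_nb(model, "volim kruh")` classifies a new text using only the selected features. In the setting the abstract describes, each "document" would presumably be the concatenated tweets of one user, which is again an assumption of this sketch.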

How to Cite

Ljubešić, N., & Kranjčić, D. (2015). Discriminating Between Closely Related Languages on Twitter. Informatica, 39(1). Retrieved from https://puffbird.ijs.si/index.php/informatica/article/view/746

Section

Regular papers