Categorization of Event Clusters from Twitter Using Term Weighting Schemes
DOI:
https://doi.org/10.31449/inf.v45i3.3063
Abstract
A real-world event is commonly represented on Twitter as a collection of repetitive and noisy text messages posted by different users. Term weighting is a popular pre-processing step for text classification, especially when the dataset is small. In this paper, we propose a new term weighting scheme and a modification to an existing one, and compare them with many state-of-the-art methods using three popular classifiers. We create a labelled Twitter dataset of events for exhaustive cross-validation experiments and use another Twitter event dataset for cross-corpus tests. The proposed schemes are among the best performers in many experiments, and the proposed modification significantly improves the performance of the original scheme. We also create two majority-voting-based classifiers that further enhance the F1-scores of the best individual schemes.
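The abstract does not spell out the proposed weighting schemes or the voting classifiers, so the following is only a rough sketch of the two general ideas it names. It assumes each event cluster is treated as one tokenized document, uses plain TF-IDF as a stand-in for a term weighting scheme (not the scheme proposed in the paper), and combines hypothetical per-scheme predictions by simple majority vote. The function names `tf_idf` and `majority_vote` and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term of each tokenized document by tf * log(N / df).

    Stand-in weighting only; the paper's own schemes are not given in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Toy event clusters: each list holds the tokens of all tweets in one cluster.
clusters = [
    ["earthquake", "earthquake", "rescue"],
    ["match", "goal", "goal"],
    ["earthquake", "goal"],
]
weights = tf_idf(clusters)

# Hypothetical labels predicted for one cluster by classifiers trained
# on three different weighting schemes.
label = majority_vote(["sports", "disaster", "sports"])
```

Terms that occur in fewer clusters receive a larger weight here, and the vote simply returns the label chosen by most of the individual classifiers.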
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika