Using Semantic Clustering for Detecting Bengali Multiword Expressions

Authors

  • Tanmoy Chakraborty

Abstract

Multiword Expressions (MWEs), a known nuisance for both linguistics and NLP, blur the lines between syntax and semantics. The semantics of an MWE cannot be derived by combining the semantics of its constituents. In this study, we propose a novel approach called "semantic clustering" as an instrument for extracting MWEs, especially for resource-constrained languages like Bengali. First, the approach locates clusters of synonymous noun tokens present in the document. These clusters in turn help measure the similarity between the constituent words of a potential candidate using a vector space model. Finally, the suitability of the phrase as an MWE is judged against a predefined threshold. In this experiment, we apply the semantic clustering approach only to noun-noun bigram MWEs; however, we believe that it can be extended to any type of MWE. We compare our approach with the state-of-the-art statistical approach. The evaluation results show that semantic clustering outperforms all other competing methods. As a byproduct of this experiment, we have started developing a standard lexicon in Bengali that serves as a productive Bengali linguistic thesaurus.
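
The three steps named in the abstract (clustering synonymous nouns, measuring constituent similarity in a vector space, and applying a threshold) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how the vectors are built, which similarity measure is used, or in which direction the threshold is applied. The sketch assumes binary cluster-membership vectors, cosine similarity, and a "score at or above the threshold" rule. The function names, the toy English synonym table, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def build_clusters(nouns, synonyms):
    """Map each noun in the document to itself plus its synonyms that
    also occur in the document. `synonyms` is a hypothetical lookup
    table (noun -> set of synonymous nouns)."""
    present = set(nouns)
    return {n: {n} | (synonyms.get(n, set()) & present) for n in present}


def cluster_vector(word, clusters):
    """Binary vector over the members of the word's cluster (assumption:
    the abstract does not say how the vectors are constructed)."""
    return Counter(clusters.get(word, set()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_sq = sum(x * x for x in u.values()) * sum(x * x for x in v.values())
    return dot / math.sqrt(norm_sq) if norm_sq else 0.0


def constituent_similarity(bigram, clusters):
    """Similarity between the two nouns of a candidate noun-noun bigram."""
    first, second = bigram
    return cosine(cluster_vector(first, clusters),
                  cluster_vector(second, clusters))


def passes_threshold(bigram, clusters, threshold=0.5):
    """Assumption: a candidate is accepted when its score reaches the
    threshold; both the direction and the value 0.5 are illustrative."""
    return constituent_similarity(bigram, clusters) >= threshold


# Toy data standing in for document nouns and a synonym resource.
nouns = ["house", "home", "market", "bazaar", "river"]
synonyms = {
    "house": {"home"},
    "home": {"house"},
    "market": {"bazaar"},
    "bazaar": {"market"},
}
clusters = build_clusters(nouns, synonyms)
```

With these toy clusters, `constituent_similarity(("house", "home"), clusters)` compares two identical membership vectors, while `("house", "river")` compares vectors with no shared members. A real system would replace the synonym table with a lexical resource for Bengali and tune the threshold on annotated data.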

How to Cite

Chakraborty, T. (2014). Using Semantic Clustering for Detecting Bengali Multiword Expressions. Informatica, 38(3). Retrieved from https://puffbird.ijs.si/index.php/informatica/article/view/690

Section

Regular papers