Khmer-Vietnamese Neural Machine Translation Improvement Using Data Augmentation Strategies
DOI: https://doi.org/10.31449/inf.v47i3.4761

Abstract
The development of neural models has greatly improved machine translation performance, but these methods require large-scale parallel data, which is difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it on a small bilingual dataset. In addition, two data-augmentation strategies are proposed to generate new training data: (i) back-translation using a dataset in the source language; (ii) data augmentation via English as a pivot language. The proposed approach is applied to Khmer-Vietnamese machine translation. Experimental results show that it outperforms the Google Translate model by 5.3% in BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.
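The two augmentation strategies can be sketched as follows. This is a minimal illustration of the data flow only: the `translate` stub, the language codes, and the function names are assumptions for the sketch, not the paper's implementation (which fine-tunes a pre-trained multilingual model).

```python
def translate(sentence: str, src: str, tgt: str) -> str:
    # Hypothetical stand-in for a real MT system; a real pipeline would
    # invoke a fine-tuned multilingual model for the (src, tgt) direction.
    return f"<{src}->{tgt}> {sentence}"

def back_translate(km_monolingual: list[str]) -> list[tuple[str, str]]:
    """Strategy (i): machine-translate monolingual source-language (Khmer)
    sentences into Vietnamese to create synthetic (km, vi) training pairs."""
    return [(km, translate(km, "km", "vi")) for km in km_monolingual]

def pivot_augment(km_en: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Strategy (ii): use English as a pivot -- translate the English side
    of Khmer-English pairs into Vietnamese to get synthetic (km, vi) pairs."""
    return [(km, translate(en, "en", "vi")) for km, en in km_en]

# Synthetic pairs are then mixed with the small genuine bilingual corpus.
genuine = [("khmer sentence", "vietnamese sentence")]
augmented = (genuine
             + back_translate(["khmer mono 1", "khmer mono 2"])
             + pivot_augment([("khmer a", "english a")]))
print(len(augmented))  # 4
```

In practice the synthetic pairs would be filtered for quality before being added to the training mix; the sketch omits that step.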
Copyright © Slovenian Society Informatika