Khmer-Vietnamese Neural Machine Translation Improvement Using Data Augmentation Strategies
DOI: https://doi.org/10.31449/inf.v47i3.4761

Abstract
The development of neural models has greatly improved machine translation performance, but these methods require large-scale parallel data, which is difficult to obtain for low-resource language pairs. To address this issue, this research employs a pre-trained multilingual model and fine-tunes it on a small bilingual dataset. In addition, two data-augmentation strategies are proposed to generate new training data: (i) back-translation using a dataset in the source language; (ii) data augmentation via English as a pivot language. The proposed approach is applied to Khmer-Vietnamese machine translation. Experimental results show that it outperforms the Google Translate model by 5.3% in BLEU score on a test set of 2,000 Khmer-Vietnamese sentence pairs.
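The two augmentation strategies can be sketched as follows. This is a minimal illustration of the data flow only: the `translate` stub, the language codes, and the function names are assumptions for the sketch, not the paper's implementation (which fine-tunes a pre-trained multilingual model).

```python
def translate(sentence: str, src: str, tgt: str) -> str:
    # Hypothetical stand-in for a real MT system; a real pipeline would
    # invoke a fine-tuned multilingual model for the (src, tgt) direction.
    return f"<{src}->{tgt}> {sentence}"

def back_translate(km_monolingual: list[str]) -> list[tuple[str, str]]:
    """Strategy (i): machine-translate monolingual source-language (Khmer)
    sentences into Vietnamese to create synthetic (km, vi) training pairs."""
    return [(km, translate(km, "km", "vi")) for km in km_monolingual]

def pivot_augment(km_en: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Strategy (ii): use English as a pivot -- translate the English side
    of Khmer-English pairs into Vietnamese to get synthetic (km, vi) pairs."""
    return [(km, translate(en, "en", "vi")) for km, en in km_en]

# Synthetic pairs are then mixed with the small genuine bilingual corpus.
genuine = [("khmer sentence", "vietnamese sentence")]
augmented = (genuine
             + back_translate(["khmer mono 1", "khmer mono 2"])
             + pivot_augment([("khmer a", "english a")]))
print(len(augmented))  # 4
```

In practice the synthetic pairs would be filtered for quality before being added to the training mix; the sketch omits that step.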
Copyright © Slovenian Society Informatika