
Mongolian-Chinese neural machine translation system based on word segmentation with BERT data enhancement
HE Wuyun,XIU Zhi,BAO Jingjing,CHEN Meilan,WANG Siriguleng*

(College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 011500, China)

Mongolian-Chinese neural machine translation; Transformer neural network; BERT; semantic similarity

DOI: 10.6043/j.issn.0438-0479.202110035


Abstract: Neural machine translation is currently the mainstream research method in the field of machine translation, but the scarcity of Mongolian-Chinese parallel corpora makes it difficult to improve the performance of Mongolian-Chinese neural machine translation. For a Transformer-based Mongolian-Chinese neural machine translation system, this paper studies Mongolian word segmentation with deep learning models, analyzes the influence of Mongolian partial segmentation, BPE sub-word segmentation, and BiLSTM-CNN-CRF neural network segmentation on the Mongolian-Chinese machine translation model, and on this basis uses a data enhancement technique based on BERT (bidirectional encoder representations from Transformers) Chinese semantic similarity computation to expand the Mongolian-Chinese machine translation training data. Comparative experiments on the dataset provided by CCMT2019 show that the data enhancement method improves the BLEU score significantly over the baseline, with the BLEU4 value reaching 75.28%.

Objective: Neural machine translation is currently the mainstream research method in the field of machine translation. To obtain a translation model with good translation quality, a large-scale, high-quality bilingual parallel corpus covering various domains is needed as training data for the neural network model. Aiming at the problem that the scarcity of Mongolian-Chinese parallel corpora makes it difficult to improve translation performance, this paper proposes a method for expanding the Mongolian-Chinese machine translation training corpus that combines Mongolian word segmentation with a BERT (bidirectional encoder representations from Transformers) based data enhancement technique.
Methods: A Transformer-based Mongolian-Chinese neural machine translation system is adopted. Given the rich morphology of Mongolian and the limited vocabulary of neural machine translation, the Mongolian side of the machine translation corpus is preprocessed by word segmentation. Mongolian words are segmented at various granularities using the partial segmentation method, the BPE (byte-pair encoding) sub-word segmentation method, and the BiLSTM-CNN-CRF neural network segmentation method. On this basis, BERT is used to train a Chinese semantic similarity model, which improves the quality of the pseudo-parallel corpus and effectively expands the Mongolian-Chinese machine translation training corpus.
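The BPE sub-word method compared above learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of that merge loop (the toy English corpus and merge count are illustrative only; the experiments would use standard BPE tooling on Mongolian text):

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Toy byte-pair encoding: iteratively merge the most frequent
    adjacent symbol pair across the corpus vocabulary."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for w in corpus_words:
        vocab[tuple(w) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low"], 3)
```

Because merges are chosen purely by frequency, the resulting sub-words need not align with morpheme boundaries, which is the loss of part-of-speech and lexical-meaning information noted in the Results.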
Results: The BERT Chinese semantic similarity model is trained on the LCQMC (large-scale Chinese question matching corpus) dataset. The Mongolian-Chinese machine translation experiments use the training and development sets provided by CCMT2021 and the offline test set provided by CCMT2019. First, comparing partial segmentation, BPE segmentation, and BiLSTM-CNN-CRF neural network segmentation of the Mongolian corpus, partial segmentation achieves a higher BLEU score in the Transformer neural machine translation system than the other segmentation granularities, with a BLEU4 value of 69.87%. A likely reason is that BPE segmentation is driven mainly by word-frequency statistics, causing Mongolian words to lose part-of-speech and lexical-meaning information, while the BiLSTM-CNN-CRF method, although highly accurate at word segmentation, makes Mongolian sentences longer and the segmentation granularity too fine for the translation task. Subsequent experiments are therefore carried out on top of the partial segmentation method, which achieves the highest BLEU score. Then, after the Mongolian corpus is partially segmented and its control characters are filtered, data enhancement with the BERT model significantly improves the quality of Mongolian-Chinese neural machine translation, raising the BLEU4 value to 75.28%.
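The BLEU4 values quoted above combine modified n-gram precisions for n = 1..4 with a brevity penalty. A minimal sentence-level sketch of that computation (the reported scores would come from a standard evaluation toolkit, not this toy):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec += 0.25 * math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 (i.e. 100%), so the paper's 69.87% and 75.28% are this score expressed as a percentage.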
Analysis of the experimental results shows that BERT data enhancement based on word segmentation can effectively improve the generalization ability of neural machine translation for low-resource languages and alleviate the sparseness of parallel bilingual data in Mongolian-Chinese machine translation tasks.
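One way to read the enhancement step described above: score each candidate Chinese sentence against a reference Chinese sentence and keep only high-scoring pairs for the pseudo-parallel corpus. A minimal sketch, using a bag-of-words cosine as a stand-in for the fine-tuned BERT scorer (the triple layout, the stand-in scorer, and the 0.8 threshold are assumptions for illustration, not the paper's exact pipeline):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Toy stand-in scorer: bag-of-words cosine similarity. In the paper
    the score comes from a BERT model fine-tuned on LCQMC; this function
    only mimics its interface (two sentences -> score in [0, 1])."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_pseudo_parallel(triples, scorer=cosine_sim, threshold=0.8):
    """Keep only (mongolian, candidate_zh, reference_zh) triples whose
    Chinese sides the scorer judges sufficiently similar, emitting the
    surviving (mongolian, candidate_zh) pairs as pseudo-parallel data."""
    return [(mn, zh) for mn, zh, ref in triples if scorer(zh, ref) >= threshold]

# Hypothetical toy data: placeholder Mongolian IDs with English stand-ins
# for the Chinese sentences.
triples = [
    ("mn-1", "the cat sat on the mat", "the cat sat on the mat"),
    ("mn-2", "dogs run fast", "rainy weather today"),
]
kept = filter_pseudo_parallel(triples)
```

The threshold trades corpus size against corpus quality: lowering it admits more (noisier) pseudo-parallel pairs, raising it keeps fewer but cleaner ones.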
Conclusions: The experimental results show that treating Mongolian words as indivisible units loses a great deal of grammatical and semantic information; segmenting Mongolian words not only reduces this loss but also mitigates data sparseness. Compared with the word-based method, the Mongolian word segmentation methods perform better on the Transformer-based neural machine translation system. Applying the BERT data enhancement method to segmentation-based Mongolian-Chinese neural machine translation can construct a high-quality Mongolian-Chinese pseudo-parallel corpus, effectively expand the Mongolian-Chinese machine translation training corpus, and thus improve translation quality. The most serious problem in Mongolian-Chinese bilingual machine translation remains the sparseness of bilingual data; future research on this problem may draw on methods such as transfer learning and post-translation processing.