
Mongolian-Chinese neural machine translation system based on word segmentation with BERT data enhancement
HE Wuyun,XIU Zhi,BAO Jingjing,CHEN Meilan,WANG Siriguleng*

(College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 011500, China)

Mongolian-Chinese neural machine translation; Transformer neural network; BERT; semantic similarity

DOI: 10.6043/j.issn.0438-0479.202110035


Abstract: Neural machine translation is currently the mainstream research method in the field of machine translation, but the scarcity of Mongolian-Chinese parallel corpora makes it difficult to improve the performance of Mongolian-Chinese neural machine translation. For a Transformer-based Mongolian-Chinese neural machine translation system, this paper studies Mongolian word segmentation with deep learning models, analyzes the influence of Mongolian partial segmentation, BPE sub-word segmentation, and BiLSTM-CNN-CRF neural network segmentation on the Mongolian-Chinese machine translation model, and on this basis uses a data enhancement technique based on BERT (bidirectional encoder representations from Transformers) Chinese semantic similarity computation to expand the Mongolian-Chinese machine translation training data. Comparative experiments on the dataset provided by CCMT2019 show that the data enhancement method improves the BLEU score significantly over the baseline, with the BLEU4 value reaching 75.28%.

Objective: Neural machine translation is currently the mainstream research method in the field of machine translation. To obtain a translation model with good translation quality, a large-scale, high-quality bilingual parallel corpus covering various domains is needed as training data for the neural network model. Aiming at the problem that the scarcity of Mongolian-Chinese parallel corpora makes it difficult to improve translation performance, this paper proposes a method for expanding the Mongolian-Chinese machine translation training corpus that combines Mongolian word segmentation with a BERT (bidirectional encoder representations from Transformers) based data enhancement technique.
Methods: A Transformer-based Mongolian-Chinese neural machine translation system is adopted. Given the rich morphology of Mongolian and the limited vocabulary of neural machine translation, the Mongolian side of the machine translation corpus is preprocessed by word segmentation. Mongolian words are segmented at various granularities using the partial segmentation method, the BPE (byte-pair encoding) sub-word segmentation method, and the BiLSTM-CNN-CRF neural network segmentation method. On this basis, BERT is used to train a Chinese semantic similarity model, which improves the quality of the pseudo-parallel corpus and effectively expands the Mongolian-Chinese machine translation training corpus.
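The BPE sub-word method compared above learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of that merge loop (the toy English corpus and merge count are illustrative only; the experiments would use standard BPE tooling on Mongolian text):

```python
from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Toy byte-pair encoding: iteratively merge the most frequent
    adjacent symbol pair across the corpus vocabulary."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for w in corpus_words:
        vocab[tuple(w) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low"], 3)
```

Because merges are chosen purely by frequency, the resulting sub-words need not align with morpheme boundaries, which is the loss of part-of-speech and lexical-meaning information noted in the Results.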
Results: The BERT Chinese semantic similarity model is trained on the LCQMC (large-scale Chinese question matching corpus) dataset. The Mongolian-Chinese machine translation experiments use the training and development sets provided by CCMT2021 and the offline test set provided by CCMT2019. First, comparing partial segmentation, BPE segmentation, and BiLSTM-CNN-CRF neural network segmentation of the Mongolian corpus, partial segmentation achieves a higher BLEU score in the Transformer neural machine translation system than the other segmentation granularities, with a BLEU4 value of 69.87%. A likely reason is that BPE segmentation is driven mainly by word-frequency statistics, causing Mongolian words to lose part-of-speech and lexical-meaning information, while the BiLSTM-CNN-CRF method, although highly accurate at word segmentation, makes Mongolian sentences longer and the segmentation granularity too fine for the translation task. Subsequent experiments are therefore carried out on top of the partial segmentation method, which achieves the highest BLEU score. Then, after the Mongolian corpus is partially segmented and its control characters are filtered, data enhancement with the BERT model significantly improves the quality of Mongolian-Chinese neural machine translation, raising the BLEU4 value to 75.28%.
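The BLEU4 values quoted above combine modified n-gram precisions for n = 1..4 with a brevity penalty. A minimal sentence-level sketch of that computation (the reported scores would come from a standard evaluation toolkit, not this toy):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level BLEU-4: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty."""
    log_prec = 0.0
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec += 0.25 * math.log(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0 (i.e. 100%), so the paper's 69.87% and 75.28% are this score expressed as a percentage.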
Analysis of the experimental results shows that BERT data enhancement based on word segmentation can effectively improve the generalization ability of neural machine translation for low-resource languages and alleviate the sparseness of parallel bilingual data in Mongolian-Chinese machine translation tasks.
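One way to read the enhancement step described above: score each candidate Chinese sentence against a reference Chinese sentence and keep only high-scoring pairs for the pseudo-parallel corpus. A minimal sketch, using a bag-of-words cosine as a stand-in for the fine-tuned BERT scorer (the triple layout, the stand-in scorer, and the 0.8 threshold are assumptions for illustration, not the paper's exact pipeline):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Toy stand-in scorer: bag-of-words cosine similarity. In the paper
    the score comes from a BERT model fine-tuned on LCQMC; this function
    only mimics its interface (two sentences -> score in [0, 1])."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_pseudo_parallel(triples, scorer=cosine_sim, threshold=0.8):
    """Keep only (mongolian, candidate_zh, reference_zh) triples whose
    Chinese sides the scorer judges sufficiently similar, emitting the
    surviving (mongolian, candidate_zh) pairs as pseudo-parallel data."""
    return [(mn, zh) for mn, zh, ref in triples if scorer(zh, ref) >= threshold]

# Hypothetical toy data: placeholder Mongolian IDs with English stand-ins
# for the Chinese sentences.
triples = [
    ("mn-1", "the cat sat on the mat", "the cat sat on the mat"),
    ("mn-2", "dogs run fast", "rainy weather today"),
]
kept = filter_pseudo_parallel(triples)
```

The threshold trades corpus size against corpus quality: lowering it admits more (noisier) pseudo-parallel pairs, raising it keeps fewer but cleaner ones.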
Conclusions: The experimental results show that treating Mongolian words as indivisible units loses a great deal of grammatical and semantic information; segmenting Mongolian words not only reduces this loss but also mitigates data sparseness. Compared with the word-based method, the Mongolian word segmentation methods perform better on the Transformer-based neural machine translation system. Applying the BERT data enhancement method to segmentation-based Mongolian-Chinese neural machine translation can construct a high-quality Mongolian-Chinese pseudo-parallel corpus, effectively expand the Mongolian-Chinese machine translation training corpus, and thus improve translation quality. The most serious problem in Mongolian-Chinese bilingual machine translation remains the sparseness of bilingual data; future research on this problem may draw on methods such as transfer learning and post-translation processing.