Uyghur-Chinese Neural Machine Translation Based on Multiple Data Filtering Methods

(Xinjiang Key Laboratory of Multi-Language Information Technology, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)

Uyghur-Chinese translation; self-attention mechanism; low-resource translation

Uyghur-Chinese neural machine translation system based on multiple data filtering
YI Nian,AISHAN Wumaier,MAIHEMUTI Maimaiti,TURGUN Ibrayim

(Xinjiang Laboratory of Multi-Language Information Technology, College of Information Science and Engineering, Xinjiang University, Urumqi 830046)

Uyghur-to-Chinese translation; self-attention mechanism; low-resource translation

DOI: 10.6043/j.issn.0438-0479.202111038


To obtain better translation results, researchers have extensively studied how to generate large amounts of high-quality synthetic data from parallel data. To this end, targeting data augmentation and system training, this paper proposes a method that combines knowledge distillation, data augmentation, and data filtering to obtain high-quality generated data. Specifically, knowledge distillation is used to obtain a more robust Chinese-Uyghur translation model; back-translation with this Chinese-Uyghur model then produces generated data of relatively good quality, and different data filtering methods are applied to further obtain high-quality generated data. Finally, a high-performance Uyghur-Chinese neural machine translation system is trained on the existing parallel data together with the generated data. The impact of the above methods on Uyghur-Chinese translation quality is verified on the CCMT2021 Uyghur-Chinese evaluation task: compared with the baseline system, back-translation, and the other systems in the same task, the system trained with this method achieves better translation results and ranked first in the translation task.

Objective: To improve the performance of the Uyghur-Chinese translation model, we use knowledge distillation to obtain a more robust Chinese-Uyghur translation model. Back-translation based on this Chinese-Uyghur model then produces generated data of better quality, and different data filtering methods are applied to further obtain high-quality generated data. A high-performance Uyghur-Chinese neural machine translation system is then trained on the existing parallel data together with the generated data. The impact of the above methods on Uyghur-Chinese translation quality is verified on the CCMT2021 Uyghur-Chinese evaluation task, where our system ranked first.
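The knowledge distillation step above can be illustrated with a minimal sketch of sequence-level distillation: the teacher model's translations of the source side serve as training targets for a more robust student model. The `teacher_translate` function is a hypothetical stand-in, not part of the paper's actual system.

```python
# Sequence-level knowledge distillation (sketch): build a distilled corpus
# of (source, teacher-output) pairs on which the student model is trained.

def distill_dataset(src_sentences, teacher_translate):
    """Pair each source sentence with the teacher model's translation."""
    return [(s, teacher_translate(s)) for s in src_sentences]

# Toy stand-in teacher that "translates" by upper-casing; a real system
# would decode with a trained Chinese-Uyghur translation model here.
toy_teacher = lambda s: s.upper()
distilled = distill_dataset(["abc", "def"], toy_teacher)
```

In practice the distilled pairs are added to the real parallel data, as described in the Methods section.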
Methods: This paper studies various data filtering methods. Generated data are selected according to the scores assigned by the different filtering methods and mixed into the real data. Models are then trained on the data produced by each filtering method, and these models are ensembled to obtain the best translation system. In addition, to obtain a better back-translation model, data obtained by knowledge distillation are added to the real data to enhance the robustness of the Chinese-Uyghur translation model. Different data filtering methods are used to select the generated data, and a multi-branch model training method is used to obtain the translation model.
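The score-based selection and mixing described above can be sketched as follows. The scores would in practice come from a scoring model (e.g. a language model or translation model); here they are placeholder values, and the function names are illustrative, not the paper's actual implementation.

```python
# Score-based data filtering (sketch): rank generated sentence pairs by a
# quality score, keep the top fraction, and mix them into the real data.

def filter_generated(pairs, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of generated pairs."""
    ranked = sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True)
    n_keep = int(len(ranked) * keep_ratio)
    return [pair for pair, _ in ranked[:n_keep]]

def mix_data(real_pairs, generated_pairs):
    """Concatenate real and filtered generated data for training."""
    return real_pairs + generated_pairs

generated = [("uy1", "zh1"), ("uy2", "zh2"), ("uy3", "zh3"), ("uy4", "zh4")]
scores = [0.9, 0.2, 0.7, 0.4]  # placeholder quality scores
kept = filter_generated(generated, scores, keep_ratio=0.5)
```

Each filtering method yields its own selected subset, and one model is trained per subset before ensembling.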
Results: The experiments use 160,000 Uyghur-Chinese sentence pairs and 6 million Chinese monolingual sentences, obtained by processing the data provided for the Uyghur-Chinese translation task. This paper compares different data filtering methods by applying each to the generated data and running experiments with the same amount of data per method. The results of the different filtering methods are similar to one another and significantly better than those of the baseline model, data distillation, and back-translation. This may be due to the label added when training the translation model on generated data: the label indicates which data are generated and which are real, which makes the results of the different filtering methods converge. We also find that fine-tuning the augmented model on the parallel data alone degrades performance, possibly because only the training set is used for fine-tuning. The model ensemble outperforms any single model, which indicates that a stronger model can be obtained by ensembling the models trained with different data selection methods. In addition, this paper compares model depth and width and the effect of ensembling different model architectures; deeper models do not necessarily yield better results under low-resource conditions, and ensembling different model structures performs better.
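The labeling of generated data mentioned above (often called tagged back-translation) can be sketched as prepending a special token to each synthetic source sentence, so the model can distinguish generated from real inputs. The token name `<BT>` is an assumption for illustration; the paper does not specify the tag used.

```python
# Tagging generated data (sketch): mark synthetic source sentences with a
# special token so the model can tell generated data from real data.

BT_TAG = "<BT>"  # hypothetical tag token

def tag_synthetic(src_sentences):
    """Prepend the tag token to every synthetic source sentence."""
    return [f"{BT_TAG} {s}" for s in src_sentences]

tagged = tag_synthetic(["jümle bir", "jümle ikki"])
```

Real source sentences are left untagged, so at training time the tag alone signals that a pair was produced by back-translation.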
On the CCMT2021 Uyghur-Chinese translation task, the system we trained achieves better translation results than the baseline system, back-translation, and the other systems submitted to the same task, and ranked first.
Conclusions: To improve the performance of the Uyghur-Chinese translation model, we use back-translation, knowledge distillation, fine-tuning, model averaging, and model ensembling. We also compare different data filtering methods and hope to continue studying data augmentation through these comparative experiments. Because interpolation-based fusion is unstable, we hope to combine domain similarity and data quality filtering at the source rather than merely interpolating between the two methods. In future work, we hope to combine the advantages of dynamic convolution and the Transformer to achieve better results on low-resource translation tasks, to try deeper Transformer models, and to ensemble models with different structures.
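The model ensembling used throughout can be sketched as averaging the per-model probability distributions over the vocabulary at each decoding step. The distributions below are stand-in values, not outputs of the paper's actual checkpoints.

```python
# Model ensembling (sketch): average the next-token probability
# distributions of several models and pick the highest-probability token.

def ensemble_step(prob_dists):
    """Element-wise average of per-model probability distributions."""
    n = len(prob_dists)
    return [sum(ps) / n for ps in zip(*prob_dists)]

# Stand-in distributions over a 3-token vocabulary from two models.
p1 = [0.7, 0.2, 0.1]
p2 = [0.5, 0.3, 0.2]
avg = ensemble_step([p1, p2])
best = avg.index(max(avg))  # token index chosen by the ensemble
```

Averaging over models trained with different data selection methods is what lets the ensemble outperform any single model, as reported in the Results.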