Incorporating data enhancement and diverse decoding into neural machine translation

ZHANG Yiming, LIU Junpeng, SONG Dingxin, HUANG Degen*

(College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China)

Keywords: neural machine translation; data augmentation; diverse decoding

DOI: 10.6043/j.issn.0438-0479.202011047

Building on the Transformer neural machine translation model, this paper proposes a method that combines data augmentation with diverse decoding strategies to improve machine translation performance. First, the training corpus is preprocessed and generalized to improve corpus quality and alleviate vocabulary sparsity. Then, monolingual sentences are used to construct pseudo-bilingual data through data augmentation, expanding the bilingual parallel corpus to strengthen the model. Finally, checkpoint averaging, model ensembling, and rescoring are combined in the decoding stage to improve translation quality. Experimental results on the Chinese-English news translation task of the 16th China Conference on Machine Translation (CCMT 2020) show that the proposed method improves the bilingual evaluation understudy (BLEU) score by 4.89 percentage points over the baseline system.
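Of the decoding-stage strategies named in the abstract, checkpoint averaging is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation. It assumes each checkpoint is a plain mapping from a parameter name to a flat list of weights, and that all checkpoints share the same names and shapes. The `Checkpoint` type, the `average_checkpoints` helper, and the toy values are hypothetical and introduced only for illustration.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flattened weight values.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of each parameter across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = set(checkpoints[0])
    # Assumption: every checkpoint stores exactly the same parameter names.
    if any(set(ckpt) != names for ckpt in checkpoints):
        raise ValueError("checkpoints have mismatched parameter names")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th weight of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy values standing in for two saved training checkpoints.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
avg = average_checkpoints(ckpts)
print(avg)
```

In practice the same averaging would be applied to the tensors saved by the training framework, typically over the last few checkpoints of a single run. The averaged model is then used for decoding on its own or as one member of an ensemble.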