Machine translation algorithm for low-resource languages based on semi-supervised learning
LU Wenjie1, TAN Ruxin1, LIU Gongshen1,2*, SUN Huanrong2

(1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Shanghai Jiao Tong University-Shanghai Songheng Information Content Analysis Joint Lab, Shanghai 200240, China)

Keywords: semi-supervised learning; low-resource languages; machine translation

DOI: 10.6043/j.issn.0438-0479.201811015


In recent years, neural machine translation has developed rapidly. However, because it requires large-scale parallel corpora, it often performs poorly on low-resource languages. Building on an analysis of the encoder-decoder framework and the attention mechanism, this paper proposes a semi-supervised neural network model for low-resource language translation based on the idea of dual learning. The model translates low-resource languages using a relatively large monolingual corpus together with a small parallel corpus. Experimental results show that, when the parallel corpus is too small to train an ordinary neural model, the semi-supervised model can still achieve reasonably good results. However, the semi-supervised model places very high demands on the amount of monolingual data, which must reach a certain order of magnitude before the model performs well.