Research on Uyghur word embedding evaluation
WU Hao, WUMAIER Aishan*, WANG Lulu, ABIDEREXITI Kahaerjiang, YIBULAYIN Tuergen

(Xinjiang Laboratory of Multi-language Information Technology, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)

Keywords: word embedding; Uyghur; evaluation task

DOI: 10.6043/j.issn.0438-0479.201811028

Notes

Uyghur word embeddings have rarely been studied, and open problems remain in how to evaluate their performance and how to use them in practice. To address this, the paper constructs Uyghur versions of the wordsim240 and word analogy evaluation data sets; proposes a new word semantic similarity evaluation method, whose validity is verified with named entity recognition as the practical task; and analyzes how well an improved analogical reasoning evaluation method discriminates the semantic quality of word embeddings. Experimental results indicate that both the proposed and the improved method can be applied effectively to the evaluation tasks, and that on a relatively small corpus, lower-dimensional word embeddings (64, 128, and 256 dimensions) perform better on every evaluation task.
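For orientation, the two conventional intrinsic evaluations that data sets of this kind usually support can be sketched as follows. This is a minimal illustration of the standard baselines only (Spearman correlation between cosine similarity and human ratings, and the 3CosAdd rule for analogies); it is not the new similarity method or the improved analogy method proposed in the paper, whose details the abstract does not give. All function names, toy vectors, and ratings below are hypothetical, and English placeholder words stand in for Uyghur vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (0.0 if either input has zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def similarity_score(embeddings, rated_pairs):
    """Correlate human ratings with cosine similarities, skipping out-of-vocabulary pairs."""
    human, model = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            human.append(rating)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(human, model)


def solve_analogy(embeddings, a, b, c):
    """3CosAdd: return the word closest to b - a + c, excluding the three query words."""
    target = [y - x + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical toy embeddings and ratings, for illustration only.
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.5, 0.5],
}
toy_pairs = [
    ("king", "man", 8.0),
    ("king", "queen", 6.0),
    ("man", "woman", 2.0),
    ("king", "unknown", 5.0),  # out of vocabulary, so it is skipped
]

print(similarity_score(toy_vectors, toy_pairs))
print(solve_analogy(toy_vectors, "man", "king", "woman"))
```

In a setup like this, embeddings trained at different dimensions (for example 64, 128, and 256) could each be passed through `similarity_score` and an analogy accuracy loop built on `solve_analogy`, and the resulting scores compared across dimensions.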