Journal of Xiamen University (Natural Science)

Uyghur word embeddings have rarely been studied, and open problems remain in evaluating their quality and applying them in practice. To address this, the paper constructs Uyghur versions of the wordsim240 and word analogy evaluation data sets, proposes a new word semantic similarity evaluation method and verifies its validity on a named entity recognition task, and analyzes how well an improved analogical reasoning evaluation method discriminates the semantics captured by word embeddings. Experimental results show that both the proposed and the improved methods can be applied effectively to evaluation tasks, and that, with a small corpus, lower-dimensional word embeddings (64, 128, and 256 dimensions) perform better on every evaluation task.
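The two intrinsic evaluations named in the abstract can be illustrated with a minimal sketch. It shows only the conventional forms: cosine similarity scored against human ratings with the Spearman rank correlation, and analogy solving by vector offset (often called 3CosAdd). It is not the new or improved method the paper proposes, which this page does not describe. All function names, the English placeholder words, and the toy 3-dimensional vectors are assumptions made for illustration and are not taken from the Uyghur data sets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate_similarity(embeddings, scored_pairs):
    """Use only pairs with both words in the vocabulary; return (rho, n_used)."""
    human, model = [], []
    for w1, w2, score in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            human.append(score)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(human, model), len(human)

def solve_analogy(embeddings, a, b, c):
    """Vector offset: the word closest to b - a + c, excluding a, b and c."""
    target = [y - x + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Hypothetical toy vectors: two "country" axes and one "capital" axis.
toy = {
    "beijing": [1.0, 0.0, 1.0],
    "china":   [1.0, 0.0, 0.0],
    "tokyo":   [0.0, 1.0, 1.0],
    "japan":   [0.0, 1.0, 0.0],
    "river":   [0.5, 0.5, 0.5],
}

# Hypothetical human ratings; the last pair is out of vocabulary and skipped.
pairs = [
    ("beijing", "china", 9.0),
    ("beijing", "tokyo", 6.0),
    ("china", "japan", 2.0),
    ("china", "unknown", 5.0),
]

rho, n_used = evaluate_similarity(toy, pairs)
answer = solve_analogy(toy, "beijing", "china", "tokyo")
```

Here `rho` measures how closely the model's similarity ranking agrees with the human ranking, and `answer` is counted as correct when it matches the expected fourth word of a "capital-country" quadruple. How out-of-vocabulary pairs are treated (skipped here) affects the reported score, so any comparison across embeddings should fix that choice in advance.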

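Introduction
1 Mainstream word embedding models
2 Uyghur word embedding evaluation task data
3 Word embedding evaluation methods and their improvement
4 Experiments
5 Analysis of experimental results
6 Conclusion and discussion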

Fig.1 Structure diagram of CBOW model

Fig.2 Structure diagram of Skip-gram model

Fig.3 Structure diagram of BiLSTM+CRF model

Tab.1 Model training time and file size under different parameter combinations

Tab.2 Spearman rank correlation coefficient of word semantic similarity

Fig.4 Evaluation results of semantic similarity

Fig.5 Evaluation results of "provincial capital-province" data set

Fig.6 Evaluation results of "capital-country" data set

Tab.3 Results of named entity recognition using different word embeddings

