Table of Contents

WU Hao,WUMAIER Aishan*,WANG Lulu,et al.Research on Uyghur word embedding evaluation[J].Journal of Xiamen University(Natural Science),2019,58(02):209-216.[doi:10.6043/j.issn.0438-0479.201811028]


Journal of Xiamen University (Natural Science) [ISSN:0438-0479/CN:35-1070/N]

Volume:
58
Issue:
2019, No. 02
Pages:
209-216
Section:
Minority Language Processing
Publication date:
2019-03-27

Article Info

Title:
Research on Uyghur word embedding evaluation
Article ID:
0438-0479(2019)02-0209-08
Author(s):
WU Hao, WUMAIER Aishan*, WANG Lulu, ABIDEREXITI Kahaerjiang, YIBULAYIN Tuergen
Xinjiang Laboratory of Multi-language Information Technology, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Keywords:
word embedding; Uyghur; evaluation task
CLC number:
TP 391
DOI:
10.6043/j.issn.0438-0479.201811028
Document code:
A
Abstract:
Research on Uyghur word embedding representations has rarely been reported, and problems in their performance evaluation and practical use remain to be solved. This paper therefore constructs Uyghur versions of the wordsim240 and word-analogy embedding evaluation data sets, proposes a new word semantic similarity evaluation method whose validity is verified on a named entity recognition task, and analyzes the ability of an improved analogical reasoning evaluation method to discriminate the semantics captured by word embeddings. Experimental results show that both the proposed and the improved methods can be applied effectively to the evaluation tasks, and that on a relatively small corpus, lower-dimensional (64-, 128- and 256-dimensional) word embeddings perform better across all evaluation tasks.
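
The two intrinsic evaluations named above follow widely used protocols: word similarity is typically scored as the Spearman correlation between the embeddings' cosine similarities and human judgments (the wordsim240 setting), and word analogy as accuracy under the vector-offset (3CosAdd) rule of Mikolov et al.[9]. The Python sketch below illustrates only these generic protocols; the text embedding format, file handling, and function names are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the standard intrinsic evaluation protocols, assuming
# word2vec-style text embeddings ("word v1 v2 ..." per line). Not the
# authors' exact method; all names here are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def load_embeddings(path):
    """Load text-format embeddings into a {word: vector} dict."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:          # skip the optional "count dim" header
                continue
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def eval_similarity(vecs, pairs):
    """pairs: [(w1, w2, human_score)]. Returns Spearman's rho between
    cosine similarities and human judgments (wordsim240-style)."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vecs and w2 in vecs:
            model.append(cosine(vecs[w1], vecs[w2]))
            human.append(score)
    return spearmanr(model, human).correlation

def eval_analogy(vecs, questions):
    """questions: [(a, b, c, d)] meaning a:b :: c:d. Standard 3CosAdd
    rule: d* = argmax cos(v, b - a + c), excluding the question words."""
    words = list(vecs)
    mat = np.stack([vecs[w] for w in words])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)   # unit rows
    index = {w: i for i, w in enumerate(words)}
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in index for w in (a, b, c, d)):
            continue                     # skip out-of-vocabulary questions
        target = mat[index[b]] - mat[index[a]] + mat[index[c]]
        sims = mat @ (target / np.linalg.norm(target))
        for w in (a, b, c):
            sims[index[w]] = -np.inf     # exclude the question words
        correct += words[int(sims.argmax())] == d
        total += 1
    return correct / total if total else 0.0

Under this protocol, comparing embeddings of different dimensions (e.g., 64 vs. 256) on the same data sets is just a matter of running both functions per embedding file and comparing the resulting rho and accuracy values.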

References:

[1] HARRIS Z S.Distributional structure[J].Word,1981,10(2/3):146-162.
[2] BROWN P F,DESOUZA P V,MERCER R L,et al.Class-based n-gram models of natural language[J].Computational Linguistics,1992,18(4):467-479.
[3] LANDAUER T K,FOLTZ P W,LAHAM D.An introduction to latent semantic analysis[J].Discourse Processes,1998,25(2/3):259-284.
[4] FELLBAUM C.An electronic lexical database[J].Library Quarterly Information Community Policy,1998,25(2):292-296.
[5] DONG Z,DONG Q.HowNet:a hybrid language and knowledge resource[C]∥International Conference Natural Language Processing.Beijing:IEEE,2003:820-824.
[6] LIU Q,LI S J.Word similarity computing based on HowNet[J].Computational Linguistics and Chinese Language Processing,2002,7(2):59-76.(in Chinese)
[7] BENGIO Y,DUCHARME R,VINCENT P,et al.A neural probabilistic language model[J].Journal of Machine Learning Research,2003,3(6):1137-1155.
[8] MIKOLOV T,KARAFIAT M,BURGET L,et al.Recurrent neural network based language model[C]∥Interspeech 2010,Conference of the International Speech Communication Association.Chiba:ISCA,2010:1045-1048.
[9] MIKOLOV T,CHEN K,CORRADO G S,et al.Efficient estimation of word representations in vector space[C]∥International Conference on Learning Representations.Scottsdale:arXiv,2013:1301.3781.
[10] MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[J].Advances in Neural Information Processing Systems,2013,26:3111-3119.
[11] PENNINGTON J,SOCHER R,MANNING C.GloVe:global vectors for word representation[C]∥Conference on Empirical Methods in Natural Language Processing.Doha:ACL,2014:1532-1543.
[12] HUANG E H,SOCHER R,MANNING C,et al.Improving word representations via global context and multiple word prototypes[C]∥Meeting of the Association for Computational Linguistics:Long Papers.Jeju Island:ACL,2012:873-882.
[13] CHEN X,XU L,LIU Z,et al.Joint learning of character and word embeddings[C]∥International Conference on Artificial Intelligence.Buenos Aires:AAAI Press,2015:1236-1242.
[14] LAI S W,LIU K,HE S,et al.How to generate a good word embedding?[J].IEEE Intelligent Systems,2016,31(6):5-14.
[15] LAI S W.Word and document embeddings based on neural network approaches[D].Beijing:University of Chinese Academy of Sciences,2016:1-143.(in Chinese)
[16] HU B T.Deep neural network based text representation and its applications[D].Harbin:Harbin Institute of Technology,2016:1-114.(in Chinese)
[17] ABUDUKELIMU H,CHENG Y,LIU Y,et al.Uyghur morphological segmentation using bidirectional gated recurrent unit neural networks[J].Journal of Tsinghua University(Science and Technology),2017(1):1-6.(in Chinese)
[18] MAIMAITI M,WUMAIER A,ABIDEREXITI K,et al.Bidirectional long short-term memory network with a conditional random field layer for Uyghur part-of-speech tagging[J].Information,2017,8(4):157.
[19] WANG W,BAO F,GAO G.Mongolian named entity recognition with bidirectional recurrent neural networks[C]∥IEEE,International Conference on Tools with Artificial Intelligence.Boston:IEEE,2017:495-500.
[20] WANG X,JIA Y,ZHOU B,et al.Computing semantic relatedness using the link structure and category system of Chinese Wikipedia[J].Journal of Chinese Computer Systems,2011,32(11):2237-2242.(in Chinese)
[21] REI M,CRICHTON G K O,PYYSALO S.Attending to characters in neural sequence labeling models[EB/OL].[2018-11-10].https://arxiv.org/pdf/1611.04361
[22] LAMPLE G,BALLESTEROS M,SUBRAMANIAN S,et al.Neural architectures for named entity recognition[C]∥The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.San Diego:ACL,2016:260-270.

Memo:
Received: 2018-11-14; Accepted: 2018-12-11
Foundation items: National Key Research and Development Program of China (2017YFB1002103); National Natural Science Foundation of China (61331011, 61662077, 61462083)
*Corresponding author: hasan1479@xju.edu.cn