
[1] DUAN Yuguang, LIU Yang*, YU Shiwen. An Embedded Representation for "Tongyici Cilin" and Its Evaluation on Tasks[J]. Journal of Xiamen University (Natural Science), 2018, 57(06): 867-875. [doi:10.6043/j.issn.0438-0479.201805013]


Journal of Xiamen University (Natural Science) [ISSN: 0438-0479 / CN: 35-1070/N]

Volume:
57
Issue:
2018, No. 06
Pages:
867-875
Column:
Natural Language Processing
Publication date:
2018-11-28

Article Info

Title:
An Embedded Representation for "Tongyici Cilin" and Its Evaluation on Tasks
Article ID:
0438-0479(2018)06-0867-09
Author(s):
DUAN Yuguang 1,2; LIU Yang 1,3,*; YU Shiwen 1,3
1. Key Laboratory of Computational Linguistics (Ministry of Education), Peking University; 2. Yuanpei College, Peking University; 3. Institute of Computational Linguistics, Peking University, Beijing 100871, China
Keywords:
"Tongyici Cilin"; embedded representation; semantic compositionality; analogical reasoning; similarity
CLC number:
TP 391
DOI:
10.6043/j.issn.0438-0479.201805013
Document code:
A
Abstract:
In natural language processing (NLP), learning embedded representations is an effective way to capture semantics from language resources. So far, however, this approach has largely been limited to large-scale corpora, with little attention paid to extracting rational knowledge from knowledge bases. Taking "Tongyici Cilin", a well-known Chinese thesaurus, as an example, this paper proposes a pseudo-sentence construction method for training embedded representations on a knowledge base and evaluates it on several NLP tasks. Following the hierarchical structure reflected in the sense codes of "Tongyici Cilin", the codes are expanded through multiple templates into pseudo-sentences, from which different pseudo-corpora are generated. The word2vec model is then trained on these pseudo-corpora to obtain sememe vectors and word vectors, yielding the CiLin2Vec resource, which is applied to semantic compositionality, analogical reasoning, and word similarity measurement. Accuracy exceeds 90% on both the semantic compositionality and analogical reasoning tasks, surpassing earlier results obtained from corpus-trained embeddings. These results indicate that the method effectively injects the rational knowledge of a knowledge base into embedded representations, and they suggest considerable application potential for the CiLin2Vec resource.
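The pseudo-sentence idea described in the abstract can be illustrated with a minimal sketch. It assumes the five-level code layout used by the extended edition of "Tongyici Cilin" (for example `Aa01A01=`), which the abstract itself does not specify, and a single hypothetical template that places the code prefixes, from coarse to fine, in front of each word. The paper's actual templates are not given on this page, and the function names and sample entries below are illustrative only.

```python
# Hypothetical sketch: turn Cilin-style hierarchical codes into pseudo-sentences.
# Assumed code layout (not stated in the abstract): 1 + 1 + 2 + 1 + 2 characters
# for five levels, followed by a one-character marker such as '='.
LEVEL_ENDS = (1, 2, 4, 5, 7)


def code_prefixes(code):
    """Return the cumulative prefixes of a code, from coarsest to finest level."""
    return [code[:end] for end in LEVEL_ENDS if end <= len(code)]


def pseudo_sentences(entries):
    """Build one token list per word: the level prefixes followed by the word."""
    sentences = []
    for code, words in entries:
        path = code_prefixes(code)
        for word in words:
            sentences.append(path + [word])
    return sentences


# Illustrative entries only; not taken from the paper.
entries = [
    ("Aa01A01=", ["人", "人物"]),
    ("Aa01A02=", ["人类"]),
]

corpus = pseudo_sentences(entries)
for sentence in corpus:
    print(" ".join(sentence))
```

Token lists of this shape are what word2vec implementations typically take as training input, so a pseudo-corpus built this way could stand in for a natural-language corpus. Words that share more code prefixes then share more context tokens, which is one plausible way the hierarchy could end up reflected in the trained vectors.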


Memo:
Received: 2018-05-10; Accepted: 2018-08-06
Funding: National Key Basic Research Program of China (973 Program) (2014CB340504); Major Program of the National Social Science Foundation of China (12&ZD119); National Social Science Foundation of China (16BYY137)
*Corresponding author: liuyang@pku.edu.cn
Citation: DUAN Y G, LIU Y, YU S W. An embedded representation for "Tongyici Cilin" and its evaluation on tasks[J]. J Xiamen Univ Nat Sci, 2018, 57(6): 867-875. (in Chinese)