An Embedded Representation for "Tongyici Cilin" and Its Evaluation on Tasks

DUAN Yuguang1,2, LIU Yang1,3*, YU Shiwen1,3

(1. Key Laboratory of Computational Linguistics (Ministry of Education), Peking University; 2. Yuanpei College, Peking University; 3. Institute of Computational Linguistics, Peking University, Beijing 100871, China)

"Tongyici Cilin"; embedded representation; semantic compositionality; analogical reasoning; similarity

DOI: 10.6043/j.issn.0438-0479.201805013


In natural language processing (NLP), learning embedded representations is an effective way of capturing semantics from language resources. At present, however, this approach has largely been limited to large-scale corpora, with little attention paid to extracting rational knowledge from knowledge bases. Taking "Tongyici Cilin" (《同义词词林》), a well-known Chinese thesaurus, as an example, we propose a pseudo-sentence construction method for training embedded representations from a knowledge base, and evaluate it on several NLP tasks. Following the hierarchical structure reflected in the thesaurus's encodings of morphemic and lexical meanings, we expand these encodings into multiple pseudo-sentence templates and generate a different pseudo-corpus from each. We then train sememe vectors and word vectors on these pseudo-corpora with the word2vec model, obtaining the CiLin2Vec resource, and apply it to semantic compositionality, analogical reasoning, and word similarity measurement. Accuracy exceeds 90% on both semantic compositionality and analogical reasoning, surpassing earlier results obtained by training on corpora. These results indicate that the method can effectively inject the rational knowledge of a knowledge base into embedded representations, and they point to the considerable application potential of the CiLin2Vec resource.
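The idea of turning hierarchical encodings into pseudo-sentences can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the templates, so the single template here (the hierarchy path followed by the word) is hypothetical. The code layout is also an assumption, namely an eight-character code such as `Aa01A01=`, with five levels and a trailing marker. The function names and the toy entry are introduced purely for illustration.

```python
from typing import Dict, List

# ASSUMPTION: a code such as "Aa01A01=" has five levels
# (1 letter, 1 letter, 2 digits, 1 letter, 2 digits) plus a 1-character marker.
# These offsets mark where each cumulative level prefix ends.
LEVEL_ENDS = (1, 2, 4, 5, 7)


def code_to_levels(code: str) -> List[str]:
    """Return the cumulative prefixes of a code, from coarsest to finest level."""
    if len(code) != 8:
        raise ValueError(f"expected an 8-character code, got {code!r}")
    return [code[:end] for end in LEVEL_ENDS]


def pseudo_sentences(code: str, words: List[str]) -> List[List[str]]:
    """HYPOTHETICAL template: one token sequence per word, path first, word last."""
    path = code_to_levels(code)
    return [path + [word] for word in words]


def build_corpus(entries: Dict[str, List[str]]) -> List[List[str]]:
    """Concatenate the pseudo-sentences of every (code, words) entry."""
    corpus: List[List[str]] = []
    for code, words in entries.items():
        corpus.extend(pseudo_sentences(code, words))
    return corpus


if __name__ == "__main__":
    # Toy entry used only to show the output format.
    toy_entries = {"Aa01A01=": ["人", "人物"]}
    for sentence in build_corpus(toy_entries):
        print(" ".join(sentence))
```

Each token sequence produced this way could then be passed to a word2vec implementation as one training sentence. Vectors learned for the prefix tokens would then play the role of category-level representations, and vectors for the final tokens the role of word representations. Other templates, for example ones that reorder or repeat levels, would yield different pseudo-corpora.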