Unsupervised word sense disambiguation for Uyghur based on the LDA topic model
YUAN Yang1,2,3, LI Xiao1,2,3*, YANG Yating1,2,3

(1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China)

Keywords: Uyghur; unsupervised word sense disambiguation; topic model; semantic similarity; synonyms

DOI: 10.6043/j.issn.0438-0479.201908044

Uyghur is a typical resource-scarce language. Because annotated corpora for word sense disambiguation (WSD) and semantic analysis tools are lacking, traditional supervised methods are difficult to apply. To address this problem, we treat document-level WSD as analogous to text topic classification and propose an unsupervised Uyghur WSD model built on the latent Dirichlet allocation (LDA) topic model. To strengthen the topic model's ability to separate the senses of an ambiguous word, we add three data preprocessing steps: removing stop words, filtering for effective words, and boosting the term-frequency weight of synonyms. Experimental results show that the model reaches a WSD accuracy of 65.08% on 63 randomly selected test sample sets and 61.2% on the document-level sampled-word task.
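The three preprocessing steps named in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list, the synonym table, the `min_len` rule used to decide which words count as "effective", and the `synonym_boost` factor are all hypothetical, and English tokens stand in for Uyghur text. The output is a weighted bag of words of the kind an LDA topic model takes as input.

```python
from collections import Counter

# Hypothetical resources: placeholders, not the paper's actual lists.
STOP_WORDS = {"the", "a", "of", "and", "in"}
SYNONYMS = {"bank": {"shore", "riverside"}}  # cue word -> its synonyms


def preprocess(tokens, stop_words=STOP_WORDS, synonyms=SYNONYMS,
               min_len=2, synonym_boost=2):
    """Return weighted term counts for one tokenized document."""
    # Step 1: remove stop words.
    kept = [t for t in tokens if t not in stop_words]
    # Step 2: keep "effective" words (assumed here: alphabetic tokens
    # with at least min_len characters).
    kept = [t for t in kept if t.isalpha() and len(t) >= min_len]
    counts = Counter(kept)
    # Step 3: boost the frequency weight of every word that belongs to
    # a synonym group (assumed here: multiply its count by a constant).
    boosted = set(synonyms).union(*synonyms.values())
    for word in counts:
        if word in boosted:
            counts[word] *= synonym_boost
    return counts


doc = "the river bank and the shore of a lake 42".split()
print(preprocess(doc))
```

The weighted counts can then be passed to any LDA implementation as the document-term representation; how the inferred topics are mapped to word senses is left out of this sketch.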