Journal of Xiamen University (Natural Science)

Uyghur is a typical resource-scarce language. Because sense-annotated corpora and semantic analysis tools for it are lacking, traditional supervised methods for word sense disambiguation (WSD) are difficult to apply. To address this problem, we treat document-level WSD as analogous to text topic classification and propose an unsupervised Uyghur WSD model based on the latent Dirichlet allocation (LDA) topic model. To improve the topic model's ability to classify the senses of ambiguous words, we add three data preprocessing steps: removing stop words, filtering effective words, and strengthening the term-frequency weight of synonyms. Experimental results show that the model achieves a WSD accuracy of 65.08% on 63 randomly selected test sample sets and 61.2% on the document-level sampling-word task.
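The three preprocessing steps named in the abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the stop-word set, the synonym sets, the length-based rule for an "effective" word, and the boost factor are all hypothetical placeholders, and English toy tokens stand in for Uyghur text.

```python
from collections import Counter

# Hypothetical stand-ins; the paper's Uyghur stop-word list and
# synonym sets are not reproduced here.
STOP_WORDS = {"the", "of", "a", "in"}
SYNONYM_SETS = [{"river", "stream"}, {"money", "cash"}]
MIN_LENGTH = 3      # assumed rule for what counts as an "effective" word
SYNONYM_BOOST = 2   # assumed multiplier for synonym term frequency


def remove_stop_words(tokens):
    """Step 1: drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def filter_effective_words(tokens):
    """Step 2 (assumed criterion): keep alphabetic tokens of sufficient length."""
    return [t for t in tokens if t.isalpha() and len(t) >= MIN_LENGTH]


def boost_synonym_weights(counts):
    """Step 3 (assumed scheme): multiply the count of every word in a synonym set."""
    boosted = Counter(counts)
    for syn_set in SYNONYM_SETS:
        for word in syn_set:
            if word in boosted:
                boosted[word] *= SYNONYM_BOOST
    return boosted


def preprocess(tokens):
    """Run the three steps and return weighted term counts for one document."""
    tokens = remove_stop_words(tokens)
    tokens = filter_effective_words(tokens)
    return boost_synonym_weights(Counter(tokens))


doc = "the bank of the river is by a stream in 2019".split()
weights = preprocess(doc)
print(dict(weights))
```

The weighted counts would then serve as the bag-of-words input to an LDA topic model. How the paper maps inferred topics to word senses is not stated in the abstract, so that step is left out of the sketch.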

Introduction
1 Related Work
2 Topic-Classification-Based Word Sense Disambiguation Model
3 Experiments
4 Results and Analysis
5 Conclusion

Fig.1 LDA probability model diagram

Fig.2 Word sense disambiguation model based on LDA

Tab.1 Stop words in Uyghur

Fig.3 Synonym extraction for Uyghur

Tab.2 Synonym set for Uyghur

Fig.4 Construction of semantic labeling test set of ambiguous words

Tab.3 Statistics of Uyghur text resources

Tab.4 Ambiguous Uyghur words for sampling task

Tab.5 Word sense disambiguation results for Uyghur on semantic annotation test set

Tab.6 Word sense disambiguation results for Uyghur in document-level sampling-word task

