基于词对向量的中文新闻话题检测方法
张文博1,2,3,米成刚1,3,杨雅婷1,2,3*

(1.中国科学院新疆理化技术研究所,新疆 乌鲁木齐 830011; 2.中国科学院大学计算机科学与技术学院,北京 100049; 3.新疆民族语音语言信息处理实验室,新疆 乌鲁木齐 830011)

Keywords: topic detection; word pair model; dimensionality reduction; similarity

Chinese news topic detection method based on word pair vector
ZHANG Wenbo1,2,3, MI Chenggang1,3, YANG Yating1,2,3*

(1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China)

DOI: 10.6043/j.issn.0438-0479.201811013


针对传统话题检测方法得到的结果和实际话题个数相差较大的缺点,根据话题所包含的文本数对话题之间的相似度进行衰减,进而优先合并粒度较小类,并根据文档话题频率和权重对较大的话题向量进行降维,通过这两方面对传统的层次聚类方法进行改进.同时为了更好地表达话题的语义信息,使用在句子中共现的词对向量来取代传统的向量空间模型.实验结果表明,使用词对模型和改进的方法可以取得更好的效果,而且得到的聚类结果和实际话题个数相近.

Traditional topic detection methods often produce a number of topics that differs greatly from the actual number. To address this shortcoming, this paper improves the traditional hierarchical clustering method in two ways. First, the similarity between topics is attenuated according to the number of texts each topic contains, so that smaller-granularity clusters are merged first. Second, the dimensionality of larger topic vectors is reduced according to the document frequency within the topic and the term weights. In addition, to better express the semantic information of a topic, vectors of word pairs that co-occur within a sentence replace the traditional vector space model. Experimental results show that the word pair model combined with the improved clustering method achieves better results, and that the number of clusters obtained is close to the actual number of topics.
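The two ideas summarized in the abstract, sentence-level word pair vectors and size-based attenuation of topic similarity, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names (`word_pairs`, `cosine`, `decayed_similarity`), the logarithmic decay form, and the parameter `alpha` are all assumptions made here for illustration. The dimensionality reduction step is not covered.

```python
import math
from collections import Counter
from itertools import combinations


def word_pairs(sentences):
    """Count unordered word pairs that co-occur within the same sentence."""
    counts = Counter()
    for tokens in sentences:
        # Sort the distinct tokens so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def decayed_similarity(u, v, n_u, n_v, alpha=0.1):
    """Cosine similarity attenuated by cluster sizes.

    ASSUMPTION: the decay 1 / (1 + alpha * log(n_u * n_v)) is a hypothetical
    choice, not taken from the paper. Two single-document clusters are not
    attenuated at all, and larger clusters are attenuated more.
    """
    return cosine(u, v) / (1.0 + alpha * math.log(n_u * n_v))
```

In an agglomerative loop, one would merge the pair of clusters with the highest `decayed_similarity` at each step, so that under this assumed decay small clusters tend to be merged before large ones.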