Journal of Xiamen University (Natural Science)

Traditional topic detection methods often produce a number of topics that differs greatly from the actual number. To address this shortcoming, this paper improves the traditional hierarchical clustering method in two respects. First, the similarity between topics is decayed according to the number of texts each topic contains, so that smaller-granularity clusters are merged first. Second, the dimensionality of larger topic vectors is reduced according to the document frequency and weight of their terms within the topic. In addition, to better express the semantic information of a topic, vectors of word pairs that co-occur within sentences replace the traditional vector space model. Experimental results show that the improved method with the word pair model achieves better results, and that the number of clusters it produces is close to the actual number of topics.
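
The two ideas in the abstract, word pairs that co-occur within a sentence and a similarity that shrinks as topics grow, can be illustrated with a minimal Python sketch. The abstract does not give the decay formula, so the logarithmic decay, the `alpha` parameter, and all function names below are assumptions for illustration only, not the paper's method.

```python
from collections import Counter
from itertools import combinations
import math


def word_pair_vector(sentences):
    """Count unordered word pairs that co-occur within the same sentence.

    `sentences` is a list of token lists (already segmented and filtered).
    """
    pairs = Counter()
    for tokens in sentences:
        # sorted(set(...)) makes each pair unordered and counted once per sentence
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def decayed_similarity(u, v, size_u, size_v, alpha=0.1):
    """Cosine similarity damped by the number of texts in each topic.

    ASSUMPTION: the logarithmic decay and `alpha` are hypothetical choices;
    the paper's actual decay function is not stated in the abstract.
    """
    return cosine(u, v) / (1.0 + alpha * math.log(size_u * size_v))


# Two tiny "documents", each given as a list of tokenized sentences.
doc_a = word_pair_vector([["topic", "detection", "news"], ["news", "cluster"]])
doc_b = word_pair_vector([["topic", "detection", "cluster"]])

small = decayed_similarity(doc_a, doc_b, size_u=1, size_v=1)
large = decayed_similarity(doc_a, doc_b, size_u=5, size_v=5)
print(small, large)
```

With any decay of this shape, two single-document topics keep their plain cosine similarity, while the same pair of vectors scores lower once the topics contain more texts, so an agglomerative loop that always merges the most similar pair would tend to merge small topics first.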

Introduction
1 Topic detection model
2 Experimental results and analysis
3 Conclusion

Tab.1 Experimental results of traditional hierarchical clustering on the training set

Tab.2 Experimental results of improved hierarchical clustering based on word pairs on the training set

Tab.3 Experimental results of Single-Pass clustering on the training set

Tab.4 Experimental results of adaptive K-means clustering on the training set

Fig.1 Variation of the F1 values of different models on the training set with the number of clustered topics

Tab.5 Experimental results of the four models on the test set

Tab.6 Topic representations of the clustering results

