
[1] ZHANG Wenbo, MI Chenggang, YANG Yating*. Chinese news topic detection method based on word pair vector[J]. Journal of Xiamen University (Natural Science), 2019, 58(02): 231-236. [doi:10.6043/j.issn.0438-0479.201811013]

Chinese news topic detection method based on word pair vector

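Journal of Xiamen University (Natural Science) [ISSN: 0438-0479 / CN: 35-1070/N]

Volume:
58
Issue:
2019, No. 02
Pages:
231-236
Column:
Computational Methods for Natural Language Processing
Publication date:
2019-03-27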

Article Info

Title:
Chinese news topic detection method based on word pair vector
Article ID:
0438-0479(2019)02-0231-06
Author(s):
ZHANG Wenbo(1,2,3), MI Chenggang(1,3), YANG Yating(1,2,3)*
1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
Keywords:
topic detection; word pair model; dimension reduction; similarity
CLC number:
TP 391
DOI:
10.6043/j.issn.0438-0479.201811013
Document code:
A
Abstract:
Traditional topic detection methods often produce a number of topics that differs greatly from the actual number. To address this shortcoming, this paper improves traditional hierarchical clustering in two ways. First, the similarity between two topics is decayed according to the number of texts they contain, so that smaller-granularity clusters are merged first. Second, the dimensionality of larger topic vectors is reduced according to each term's weight and its document frequency within the topic. In addition, to better express the semantic information of a topic, vectors of word pairs that co-occur within a sentence replace the traditional vector space model. Experimental results show that the word pair model combined with the improved clustering method achieves better results, and that the number of clusters obtained is close to the actual number of topics.
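
The approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the function names, the logarithmic decay term, the `alpha`, `threshold`, and `max_dims` parameters, and the weight-only pruning rule are all assumptions chosen for illustration. Documents are assumed to be already segmented into sentences and tokens.

```python
from collections import Counter
from itertools import combinations
import math


def word_pair_vector(sentences):
    """Count unordered word pairs that co-occur within the same sentence.

    `sentences` is a list of token lists for one document.
    """
    vec = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            vec[(a, b)] += 1
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def decayed_similarity(u, v, size_u, size_v, alpha=0.1):
    """Assumed decay: larger topics get a lower similarity score,
    so pairs of small topics tend to be merged first."""
    return cosine(u, v) / (1.0 + alpha * math.log(size_u * size_v))


def prune(vec, max_dims):
    """Stand-in for the dimension reduction step: keep only the
    `max_dims` heaviest word pairs of a topic vector."""
    if len(vec) <= max_dims:
        return vec
    return Counter(dict(vec.most_common(max_dims)))


def cluster(docs, threshold=0.2, alpha=0.1, max_dims=1000):
    """Agglomerative clustering that stops when no pair of topics
    has a decayed similarity above `threshold`."""
    topics = [(word_pair_vector(d), [i]) for i, d in enumerate(docs)]
    while len(topics) > 1:
        best, pair = threshold, None
        for i, j in combinations(range(len(topics)), 2):
            s = decayed_similarity(topics[i][0], topics[j][0],
                                   len(topics[i][1]), len(topics[j][1]), alpha)
            if s > best:
                best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        merged = (prune(topics[i][0] + topics[j][0], max_dims),
                  topics[i][1] + topics[j][1])
        topics = [t for k, t in enumerate(topics) if k not in pair] + [merged]
    return [sorted(members) for _, members in topics]


# Toy input: four one-sentence documents on two subjects.
docs = [
    [["earthquake", "rescue", "team"]],
    [["earthquake", "rescue", "aid"]],
    [["football", "match", "goal"]],
    [["football", "goal", "league"]],
]
print(cluster(docs))
```

Because the stopping rule is a similarity threshold rather than a fixed cluster count, the number of topics is an output of the procedure, which is the quantity the abstract compares against the actual number of topics. A faithful reproduction would replace the decay and pruning functions with the definitions given in the full paper.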


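Memo:
Received: 2018-11-11; Accepted: 2019-01-12
Funding: National Natural Science Foundation of China (U1703133); Major Science and Technology Project of Xinjiang Autonomous Region (2016A03007-3); "West Light" Talent Training and Introduction Program of the Chinese Academy of Sciences (2017-XBQNXZ-A-005)
*Corresponding author: yangyt@ms.xjb.ac.cn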