[Purpose] To detect topics through large-scale text clustering and to select high-quality topics automatically. [Methods] Diet-related posts on Sina Weibo serve as the data source, and topic detection combines text clustering with deep learning. Posts are divided into four seasonal subsets by matching their month of publication; a vector space model and text clustering are applied to each season's posts to obtain candidate topics; drawing on deep learning, a topic coverage measure is proposed to evaluate topic quality automatically and to remove low-quality topics. [Results] Topic screening based on topic coverage matches the expectations of manual selection, and high-quality topics with topic coverage above 0.5 are extracted. [Limitations] The quality of topic detection is evaluated mainly by qualitative means. [Conclusions] Selecting high-quality topics automatically by computing topic coverage is efficient and general. The resulting topics are easy to interpret and give a good picture of how diet-related Weibo topics are distributed across the four seasons.
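The seasonal split and the threshold-based selection described in the Methods can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the month-to-season mapping (March to May as spring, and so on), the hypothetical helper names `split_by_season` and `select_topics`, and the treatment of topic coverage as a score that has already been computed, since the abstract does not define how it is calculated. Only the 0.5 cut-off comes from the abstract.

```python
from collections import defaultdict

# Assumed season boundaries; the abstract does not say which months
# belong to which season.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def split_by_season(posts):
    """Group (month, text) pairs into lists of texts keyed by season."""
    groups = defaultdict(list)
    for month, text in posts:
        groups[SEASON_BY_MONTH[month]].append(text)
    return dict(groups)


def select_topics(candidates, threshold=0.5):
    """Keep the names of candidate topics whose coverage exceeds the threshold.

    `candidates` is a list of (topic_name, coverage) pairs. The coverage
    values are taken as given; how they are computed is not specified here.
    """
    return [name for name, coverage in candidates if coverage > threshold]
```

In such a pipeline, each seasonal group would be vectorized and clustered separately to produce the candidate topics, and `select_topics` would then be applied to each season's candidates.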