Analyzing and processing massive stream data is a basic problem facing the information society. Data gathered from sensors is stream data; text messages are stream data from the viewpoint of a mobile operator's data center; microblog posts are stream data for Sina or Tencent; and the data that a search engine's web crawling subsystem hands to back-end processing can also be viewed as stream data. Although the application backgrounds differ, these cases share one feature: there is an aggregation node in the network, and from that node's point of view the data arrives continuously. Typically, the data is buffered in some specific format to await processing by a specific downstream system. The question that motivates this work is the following: such data often has value in many respects, some of which nobody has thought of yet, so it is worthwhile to also open a stream data interface that new applications appearing in the future can call. The interface should have the following properties: (1) it outputs the raw stream data; (2) it allows other applications, possibly several, to join and leave dynamically; (3) the behavior of the attached applications does not affect data collection or the function of the originally designed downstream system. Taking the Tianwang search engine and the China Web Museum (WebInfomall), which have run continuously for more than 10 years, as an example, this paper discusses how their web page collection subsystem was modified to meet these requirements, with IP multicast as the underlying technique. After describing the design ideas and the key implementation points, we also give a real example of a "new application". Implementing such an interface opens a window for a variety of applications that analyze web page stream information, and the design idea can also be applied to other systems that aggregate stream data.
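The three interface properties listed in the abstract map naturally onto IP multicast, and a minimal Python sketch may make that concrete. It is not the implementation described in the paper: the group address, port, record framing, and all function names below are assumptions chosen for illustration. The sender writes UDP datagrams to a multicast group without tracking receivers (property 3), each datagram carries the raw page (property 1), and a consumer joins or leaves the group with a socket option (property 2).

```python
import socket
import struct
from typing import Tuple

# Hypothetical group address and port, chosen only for illustration.
GROUP = "239.1.2.3"
PORT = 5007


def encode_record(url: str, body: bytes) -> bytes:
    """Frame one crawled page as: 2-byte URL length, URL bytes, raw body.

    This framing is an assumption; the paper's wire format is not given here.
    """
    raw_url = url.encode("utf-8")
    return struct.pack("!H", len(raw_url)) + raw_url + body


def decode_record(datagram: bytes) -> Tuple[str, bytes]:
    """Invert encode_record on the consumer side."""
    (url_len,) = struct.unpack("!H", datagram[:2])
    url = datagram[2:2 + url_len].decode("utf-8")
    return url, datagram[2 + url_len:]


def membership_request(group: str) -> bytes:
    """Build an ip_mreq value: group address plus the default interface."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))


def open_sender(ttl: int = 1) -> socket.socket:
    """Socket for the collection side; sending never waits on receivers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def publish(sock: socket.socket, url: str, body: bytes) -> None:
    """Send one raw page to whoever is currently subscribed, if anyone."""
    sock.sendto(encode_record(url, body), (GROUP, PORT))


def open_receiver() -> socket.socket:
    """A new application attaches by joining the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(GROUP))
    return sock


def leave(sock: socket.socket) -> None:
    """Detach by dropping membership; the sender is not notified."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                    membership_request(GROUP))
    sock.close()
```

A consumer would call `open_receiver()`, loop over `decode_record(sock.recv(65535))`, and call `leave(sock)` when done. Two limits of this sketch are worth stating: a UDP datagram holds at most about 64 KB, so real web pages would need to be split across datagrams, and multicast delivery is unreliable, so a slow consumer simply misses data instead of slowing down the crawler.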