Big data computing faces challenges that traditional IT technologies cannot handle: extremely large data volumes, high-throughput service requests, and heterogeneous data types. Thanks to production deployments and open-source contributions from major Internet companies at home and abroad, Apache Hadoop has become a mature technology and the de facto standard for PB-scale big data processing, and a software ecosystem serving different types of big data processing needs has been established around it. This paper introduces three pieces of research on storage, indexing, and hardware-accelerated compression/decompression in big data computing systems, namely RCFile, CCIndex, and SwiftFS, which effectively address the storage-space and query-performance problems of such systems. These results have been developed into key technologies and integrated into the Tianji (DNT) big data engine software stack, where they directly support several production applications at Taobao and Tencent.