TextGen: a realistic text data content generation method for modern storage system benchmarks

Source: Frontiers of Information Technology & Electronic Engineering (English edition)
Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared due to privacy issues, a benchmark needs to generate data synthetically. For the result to be accurate, the generated content must be based on a characterization of the real-world data properties that influence storage system performance during the execution of a benchmark. The existing approach, SDGen, cannot guarantee an accurate benchmark result on storage systems with built-in word-based compressors, because it characterizes the properties that influence compression performance only at the byte level and characterizes none at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds a word corpus by segmenting real-world text datasets and creates a word-frequency distribution by counting each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation. Synthetic data is then generated with a Monte Carlo approach. The running time of TextGen generation depends only on the expected data size n, so its time complexity is O(n). To evaluate TextGen, we ran an experiment on four real-world datasets. The experimental results show that, when end-tagged dense code (a representative word-based compressor) is evaluated, the compression performance and compression ratio of the datasets generated by TextGen deviate less from those of the real-world datasets than do those of the datasets generated by SDGen.
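The pipeline described in the abstract (segment, count, fit a lognormal by maximum likelihood, then sample) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace segmentation, and in particular the way the fitted parameters are used (one lognormal weight is drawn per vocabulary word, and words are then sampled in proportion to those weights) are assumptions made for illustration only. The lognormal fit uses the standard closed-form maximum likelihood estimates, namely the mean and standard deviation of the log-counts.

```python
import math
import random
from collections import Counter
from itertools import accumulate


def word_frequencies(text):
    """Segment text on whitespace (an assumption) and count each word."""
    return Counter(text.split())


def fit_lognormal(counts):
    """Closed-form MLE of lognormal parameters from positive counts."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    # The MLE of the variance divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def generate(vocab, mu, sigma, target_bytes, seed=0):
    """Monte Carlo generation of space-separated words.

    Hypothetical scheme: each word receives a weight drawn from the
    fitted lognormal, and words are then sampled in proportion to
    those weights until about target_bytes bytes have been produced.
    """
    rng = random.Random(seed)
    weights = [rng.lognormvariate(mu, sigma) for _ in vocab]
    cum_weights = list(accumulate(weights))
    words, size = [], 0
    while size < target_bytes:
        word = rng.choices(vocab, cum_weights=cum_weights)[0]
        words.append(word)
        size += len(word.encode("utf-8")) + 1  # +1 for the separator
    return " ".join(words)


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat ran"
    freq = word_frequencies(sample)
    mu, sigma = fit_lognormal(list(freq.values()))
    print(generate(list(freq), mu, sigma, target_bytes=64))
```

For a fixed vocabulary, each draw costs a bounded amount of work, so the loop's running time grows with the requested output size, which is in line with the dependence on expected data size stated in the abstract.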