Article Details

【Title】 Optimization for data de-duplication algorithm based on file content
【Journal】 Frontiers of Optoelectronics in China
【Journal abbreviation】 Front. Optoelectron. China
【ISSN】 1674-4128
【EISSN】 1674-4594
【DOI】 10.1007/s12200-010-0103-z
【Publisher】 Higher Education Press and Springer-Verlag Berlin Heidelberg
【Year】 2010
【Volume/Issue】 Vol. 3, No. 3
【Pages】 308-316 (9 pages)
【Authors】 Xuejun NIE; Leihua QIN; Jingli ZHOU; Ke LIU; Jianfeng ZHU; Yu WANG
【Keywords】 data de-duplication; content defined chunking (CDC); file content; candidate anchor histogram (CAH)

【Abstract】
Content defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in archival storage systems. Current research on CDC does not consider the distinct content characteristics of different file types; it determines chunk boundaries in a random way and applies a single strategy to all file types. It has been proven that such a method cannot achieve optimal performance for compound archival data. We analyze the content characteristics of different file types and propose the candidate anchor histogram (CAH) to capture them. We propose an improved strategy for determining chunk boundaries based on CAH and tune some key parameters of CDC based on the data layout of the underlying data de-duplication file system (TriDFS), which can efficiently store variable-sized chunks on fixed-sized physical blocks. These strategies are evaluated with representative archival data, and the results indicate that they increase the compression ratio by 16.3% and the write throughput by 13.7% on average, while decreasing the read throughput by only 2.5%.
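
For readers unfamiliar with CDC, the following is a minimal Python sketch of generic content defined chunking with a polynomial rolling hash, followed by hash-based de-duplication of the resulting chunks. It is not the CAH-based strategy of the paper, whose details the abstract does not give. The window size, boundary mask, minimum and maximum chunk sizes, and the function names `chunk_boundaries` and `deduplicate` are all illustrative assumptions.

```python
import hashlib

# All constants below are assumed values chosen for illustration only.
WINDOW = 16            # number of bytes covered by the rolling hash
MASK = (1 << 11) - 1   # a boundary is declared when the low 11 bits are zero
MIN_SIZE = 512         # never cut a chunk shorter than this
MAX_SIZE = 8192        # always cut once a chunk reaches this length
BASE = 257             # polynomial base of the rolling hash
MOD = (1 << 61) - 1    # modulus of the rolling hash


def chunk_boundaries(data: bytes) -> list[int]:
    """Return the end offset (exclusive) of every chunk in `data`."""
    boundaries = []
    start = 0                            # offset where the current chunk begins
    h = 0                                # hash of the last WINDOW bytes
    top = pow(BASE, WINDOW - 1, MOD)     # weight of the oldest byte in the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        content_cut = length >= MIN_SIZE and (h & MASK) == 0
        if content_cut or length >= MAX_SIZE:
            boundaries.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        boundaries.append(len(data))     # trailing chunk, may be short
    return boundaries


def deduplicate(data: bytes):
    """Split `data` into chunks and store each distinct chunk only once.

    Returns (recipe, store): `recipe` lists chunk digests in order and
    `store` maps each digest to the chunk bytes.
    """
    recipe, store = [], {}
    prev = 0
    for end in chunk_boundaries(data):
        chunk = data[prev:end]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep the first copy only
        recipe.append(digest)
        prev = end
    return recipe, store
```

Because a cut depends only on the last `WINDOW` bytes once `MIN_SIZE` is reached, inserting or deleting bytes in a file tends to change only nearby chunks, which is what lets unchanged regions de-duplicate against earlier versions. The original data can be rebuilt with `b"".join(store[d] for d in recipe)`.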