SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems Daehee Kim, Sejun Song, Baek-Young Choi University of Missouri–Kansas City, MO {daehee.kim, sjsong, choiby}@umkc.edu

Abstract—Cloud-based storages have become considerably popular in recent years, as they enable data access from anywhere and any device at any time. Many leading cloud-based storage services including Dropbox, JustCloud, and Mozy use data deduplication techniques at a source to save network bandwidth from a user to cloud servers as well as storage space, which in turn expedites the speed of data upload. Although traditional variable-size block-level deduplication techniques tend to achieve a high data reduction rate, they require a high processing overhead due to data chunking, index processing, and data fragmentation. However, a user's device can be limited in processing capability and memory space to perform an effective client-side deduplication. While a simple file-level or a large fixed-size block-level deduplication may be able to cope with the limited source device capacity, it cannot produce a high data reduction rate. In this paper, we propose a novel Structure-Aware File and Email deduplication (SAFE) scheme that achieves both fast and effective data reduction for cloud-based storage services. SAFE efficiently deduplicates redundant objects in structured files as well as emails by exploiting object-level components based on their structures. Our evaluation using real data sets of structured files and emails shows that SAFE accomplishes storage savings as good as those of a variable-size block-level deduplication, while being as fast as a file-level or a large fixed-size block-level deduplication.

I. INTRODUCTION
Recently, cloud-based storage services such as Dropbox [5], Google Drive [4], Apple iCloud [13], JustCloud [16], Mozy [22], and Microsoft SkyDrive [27] competitively offer easy-to-access, secure, reliable, and low-cost remote storage spaces for file-sharing, document suites, and online-backup services for their users. As they enable easy data access from anywhere and any device at any time, cloud-based storages have become considerably popular for both private and enterprise customers. The main quality characteristics of such services are how efficiently they can handle the large network bandwidth requirements from a user to cloud storages and how effectively they can reduce storage space usage. Thus, leading cloud-based storage services use some form of data deduplication at the source or client side in order to save network bandwidth as well as data storage space, which in turn expedites the speed of data upload. Although traditional variable-size block-level deduplication techniques that are used for data backup systems [23], [24], [29] tend to achieve a high data reduction rate, they incur high overheads for data chunking (for example, the use of Rabin fingerprint matching [25]), for maintaining and tracking a large index, and for data fragmentation, which makes them hard to use for in-line

or cloud storage systems. A client device of a cloud-based storage is especially often limited in its processing capability and memory space to perform an effective traditional data deduplication. Many cloud-based storage services such as JustCloud [16] and Mozy [22] employ a single-instance storage using a simple file-level deduplication [2] that requires less indexing overhead and no data chunking overhead. While it may be able to cope with the limited client device capacity, it cannot achieve a high data reduction rate. In addition to a single file instance storage, Dropbox [5] uses a large fixed-size (4 MB chunk size) block-level deduplication because of its lower data chunking overhead. However, its data deduplication rate is still far below that of a variable-size deduplication, due to the large granularity of chunks and the potential chunk boundary-shifting problem [9].
In this paper, we propose a novel Structure-Aware File and Email deduplication (SAFE) scheme that achieves both efficient and effective data deduplication at the source site for cloud-based storage services. SAFE efficiently deduplicates redundant objects in structured files such as MS docx, pptx, and pdf as well as emails, by exploiting object-level components based on their structure, which results in less data chunking overhead as well as fewer indexes than a block-level deduplication. SAFE is effective in that its chunks are content-oriented objects and it does not have a boundary-shifting problem, thus achieving a higher data deduplication ratio than a file-level deduplication.
The contributions of this paper are as follows. (1) Although the approach of SAFE, a structure-aware data deduplication, is similar to ADMAD [18], a metadata-based deduplication, we further enhance it and exploit the structures of objects without being limited to a specific file format. As an example, for an email with attached document files, SAFE parses the email and then each attached object. (2) We have designed and implemented a client-based SAFE deduplication module that achieves the benefits of both a file-level and a block-level deduplication scheme for cloud-based storage services. We have further applied the SAFE module to a Dropbox client. (3) We have extensively evaluated the performance and overhead of SAFE with real data sets of structured files and emails and a real cloud system. Our evaluation results validate that SAFE achieves storage savings as good as those of a variable-size block-level deduplication, while being as fast as a file-level or a large (4 MB, as in Dropbox) fixed-size block-level deduplication.

The rest of the paper is organized as follows. Section II discusses the related work. We describe the SAFE deduplication scheme and its implementation in Section III. We validate our approach in Section IV. We conclude the paper in Section V.
II. RELATED WORK
As most cloud-based storage systems are commercially motivated, technical details of the systems are not clearly open to the public. However, there are a few recent empirical performance studies of cloud-based storage systems. [12] evaluated four popular commercial cloud-based storage services, Dropbox, Mozy, Carbonite, and CrashPlan, in terms of backup and restore performance, security, and reliability. One of the results shows that the amount of data sent over the network by Carbonite is higher than the others due to its lack of data deduplication support. [3] passively measured Dropbox traffic data and analyzed it to infer the internal operations of Dropbox. Based on the analysis, it also suggested a couple of Dropbox protocol improvements, such as a delayed acknowledgement and a new chunk building scheme. However, to the best of our knowledge, there is no previous work on deduplication mechanisms to improve cloud-based storage systems.
As the performance of data deduplication typically depends on the granularity of the object, which is closely related to the data chunking and index processing overhead, various data chunking methods have been proposed. Microsoft's Single Instance Storage (SIS) [2] and EMC's Centera [8] use a file-level deduplication. As it performs a simple chunking (a chunk is a file) that requires less index processing, it has been used for many applications with time and space limitations. For example, a data deduplication for in-line processing applications [19] uses it to cope with the costs of processing time and memory overhead. Many cloud-based storage services such as JustCloud [16] and Mozy [22] also use file-level deduplication. A block-level deduplication provides object-level granularity by chunking the data file into blocks of fixed or variable size [23]. Since it provides fine-granularity chunking techniques to achieve high deduplication rates, it has been used for backup or file systems such as Venti [24] and the Data Domain File System (DDFS) [29], as well as for removing redundant network traffic, for example in the Low Bandwidth File System (LBFS) [23]. However, as a block-level deduplication technique, especially a variable-size one, requires a high cost in processing time and space, it often runs on specialized fast and high-capacity servers. Therefore, the Dropbox [5] cloud storage uses a very large fixed-size (4 MB) block-level deduplication.
Hybrid approaches have been proposed that adaptively use variable-size block-level deduplication and file-level deduplication, based either on a fixed policy or on dynamically changing file information [17], [21]. Min et al. [21] employ context-aware chunking, where a file-level deduplication is used for multimedia content, compressed files, or encrypted content, and a variable-size block-level deduplication is used for text files. The Hybrid Email Deduplication System (HEDS) [17] first separates

Fig. 1. SAFE deduplication architecture

the message body and individual attachments, and performs a variable-size block-level deduplication if the object size exceeds a predefined threshold; otherwise, a file-level deduplication is used. SAFE is different from the aforementioned hybrid approaches. SAFE first uses a file-level deduplication for unstructured file types such as multimedia content and encrypted content, like Min et al. [21]. However, SAFE uses a recursive object-level deduplication scheme for structured files such as email, docx, pptx, and pdf.
A few format-aware data deduplication techniques such as ADMAD [18], [15], and [28] have been proposed to simplify the chunking mechanism by using structured objects for traditional server-based backup applications. ADMAD [18] chunks a file into variable-size semantic segments, called meaningful chunks (MCs), based on the metadata of each file. Although the idea of ADMAD to decompose a file into objects according to the object structure is similar to the proposed SAFE approach, ADMAD is limited to a specific file format. For example, ADMAD does not deal with document file types such as docx, pptx, and pdf. In addition, ADMAD does not handle an email with multiple attachments. [15] and [28] show similar concepts in that they deduplicate structured objects. However, unlike SAFE, they do not present structures and dynamic policies based on specific document file types.
III. STRUCTURE-AWARE FILE AND EMAIL (SAFE) DEDUPLICATION ARCHITECTURE
In this section, we present the SAFE deduplication architecture and explain the SAFE modules as well as the decomposed object structures. We then show how SAFE can be embedded in cloud-storage services such as Dropbox.
A. SAFE Modules
The SAFE deduplication system consists of the Email parser, the File parser, the Object-level deduplication module, the Object manager, and the Store manager module. Figure 1 illustrates the overall SAFE deduplication architecture. SAFE first employs a file-level deduplication to eliminate unnecessary

parsing of an entirely duplicate file. As for emails, the Email parser intercepts incoming emails from an email server and parses each email based on the email policy. Even though each email is unique with a distinct email ID, it may have attachments that duplicate previously saved ones. Therefore, individual attachments are first divided into separate files, and the hash values of the attachments are added into an email index for reconstruction of the email. Files that users attempt to save, or that are separated from an email, are sent to the File-level deduplication module. If a file is of an unstructured type such as a text, image, or video file, the file-level deduplication is run first, and only a unique file is saved into storage after compression by the Store manager. If a file is a structured one such as docx, pptx, or pdf, the file is parsed into smaller objects by the File parser based on the file policy. The File parser can combine several small objects into a compound object based on the file policy. The File parser then sends objects to the Object manager, which temporarily holds them in the object buffer. The Object-level deduplication module checks the existence of objects using the object index table, and a unique object is saved into storage through the Store manager. We explain the detailed design of each module in the following subsections.
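To make the control flow described above concrete, the following is a minimal sketch (ours, not the authors' code) of the pipeline, with in-memory sets standing in for the persistent file and object index tables and a hypothetical parse_objects callback playing the role of the File parser:

```python
import hashlib

# In-memory stand-ins for the index tables of Figure 1 (illustrative only).
file_index = set()     # hashes of whole files seen so far
object_index = set()   # hashes of objects seen so far

STRUCTURED_TYPES = (".docx", ".pptx", ".pdf")   # types handled by the File parser

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def dedup_file(name: str, data: bytes, parse_objects):
    """Return the unique pieces of 'data' that still need to be stored/uploaded."""
    # 1) File-level deduplication avoids parsing an entirely duplicate file.
    file_hash = sha1_hex(data)
    if file_hash in file_index:
        return []
    file_index.add(file_hash)

    # 2) Unstructured files (text, images, video, ...) stop at the file level.
    if not name.lower().endswith(STRUCTURED_TYPES):
        return [data]

    # 3) Structured files are split into objects; only unique objects are kept.
    unique_objects = []
    for obj in parse_objects(data):        # format-specific parsing (file policy)
        obj_hash = sha1_hex(obj)
        if obj_hash not in object_index:
            object_index.add(obj_hash)
            unique_objects.append(obj)
    return unique_objects
```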

Fig. 2. Email parser

Fig. 3. Structure of an email

B. Email Parser
The Email parser runs as a light-weight mail filter on a sendmail server [26]. It intercepts an email using the Milter [20] APIs when the Mail Transfer Agent (MTA) of a sendmail server receives the email. The Milter API is a part of the Sendmail Content Management API that can look up, add, and modify email messages. Figure 2 shows how the Email parser works. When the parser receives an email, it separates the email into metadata, body, and attachments based on the email policy, which holds information about the structures to be separated. As depicted in Figure 3, an email is separated by the boundary string designated at "Boundary=" in the metadata. The current email policy of SAFE is based on the format of Multipurpose Internet Mail Extensions (MIME) [11]. Note that each attachment in an email is encoded, and the encoding type is specified at "Content-Transfer-Encoding" before the encoded attachment. The Email parser decodes each attachment with the corresponding decoding scheme, such as base64. The email indexer writes the SHA-1 hash values of the separated metadata, body, and attachments into the email index table. The key of an entry is an email ID, a 14-byte string. The buffer keeps the separated data in an array. The array also contains the content type of each attachment, which is used to check whether the data is a structured or an unstructured file in the file-level deduplication. The array of data is passed to the file-level deduplication module, which checks the existence of each data item.
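The following is a minimal sketch of this parsing step using Python's standard email package in place of the Milter-based filter described above; the metadata/body/attachments split, transfer decoding, and SHA-1 index entry follow the description, while the function name and return shape are our own assumptions:

```python
import email
import hashlib
from email import policy

def parse_email(raw_message: bytes):
    """Split a MIME message into metadata, body, and decoded attachments,
    and build the per-email index entry (a list of SHA-1 hash values)."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)

    metadata = "".join(f"{name}: {value}\n" for name, value in msg.items()).encode()
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content().encode() if body_part is not None else b""

    parts = [("metadata", None, metadata), ("body", None, body)]
    for att in msg.iter_attachments():
        # get_content() reverses the Content-Transfer-Encoding (e.g. base64).
        data = att.get_content()
        if isinstance(data, str):
            data = data.encode()
        parts.append(("attachment", att.get_content_type(), data))

    # Index entry: hashes of metadata, body, and each attachment, so the
    # original message can later be rebuilt from the stored objects.
    index_entry = [hashlib.sha1(data).hexdigest() for _, _, data in parts]
    return parts, index_entry
```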

C. File Parser
SAFE parses three structured document file types: Microsoft Word (docx), PowerPoint (pptx), and Adobe Portable Document Format (pdf). To parse each file into objects, SAFE considers the following key aspects: (1) how to extract objects from a file, and (2) what granularity is efficient for deduplication. The granularity of deduplication in SAFE is either an object or a combination of objects. We explain, in terms of these aspects, how the File parser works for the following example file types.
MS docx and pptx files follow the Office Open XML format (called Open XML), which is standardized as ECMA-376 [7] and ISO/IEC 29500 [10]. Based on this format, docx and pptx files are ZIP [14] archives, each of which consists of XML files and data files such as images. Figure 4(a) shows the structure of an MS Open XML file. The File parser tracks down from the end of a document file and extracts the XML files and data files, which are treated as objects in SAFE. The File parser inspects the "end of central directory record", where the offset of the "central directory header" is found. The central directory header contains offsets, file names, and other metadata for objects. The "local file header" is accessible through an offset in the corresponding file header of the central directory header. The "local file header" has specific information about a file, such as the compression method and file name.
A PDF file consists of a header, body, cross references, and a trailer, as depicted in Figure 4(b). The PDF format is defined in ISO 32000 [1]. The header indicates the version of the pdf document. The body includes a series of objects. The cross reference section has the offsets of the objects in the document, and the trailer has the offset of the cross reference section. Therefore, an object is accessed through a cross reference, starting from the trailer and moving upwards. An object may include a stream that can be either text or an image.
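As a rough illustration of the Open XML case, the sketch below relies on Python's zipfile module, which itself walks the end-of-central-directory record and central directory headers shown in Figure 4(a); the function name and object representation are our own. A PDF parser would analogously follow the trailer to the cross-reference table to reach each object.

```python
import hashlib
import zipfile

def openxml_objects(path: str):
    """Treat every entry of an Open XML container (docx/pptx) as a SAFE object."""
    objects = []
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():          # one entry per XML part or media file
            data = archive.read(info.filename)   # read through the local file header
            objects.append((info.filename, hashlib.sha1(data).hexdigest(), data))
    return objects
```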

Fig. 4. Physical file format. (a) Structure of an MS Office Open XML file: Gray bars are signatures. The signatures of the "end of central directory record", the "file header" in the central directory header, and the "local file header" are 0x06064b50, 0x02014b50, and 0x04034b50, respectively. The encryption header, which comes between the local file header and the file data, is not shown. (b) Structure of a PDF file: The body consists of many objects. The encoded data between the 'stream' and 'endstream' keywords is data such as text or an image. The encoded (or compressed) data are decoded by the decoding scheme given in the object's dictionary, "<</Type/.../Filter...>>".

Fig. 5. File parser: Dotted lines are control flows and solid lines are data flows. The output of the File parser is the indexes of all objects of a file, including individual objects and combined objects.

A stream is encoded by a compression algorithm and can be decoded by the corresponding decompression algorithm specified in the metadata of the object, called a 'dictionary'. According to ISO 32000, there are 10 different decompression algorithms, among which FlateDecode and DCTDecode are used to decode a text stream and a JPEG image stream, respectively.
As shown in Figure 5, the File parser receives a structured file and separates it into objects, and the decoder decodes an object if it is compressed. The combiner concatenates small objects into a large object to reduce object index overhead. A parsed object that is not combined consists of a 5-tuple: the hash value of the object, the length of the object, the ID of the container that contains the object (a file ID for the Open XML format and an obj ID for PDF), the decoding scheme (if specified), and the object itself. A combined object is a concatenation of 5-tuples. The object putter passes objects to the Object manager, which holds them in the object buffer temporarily until the deduplication process for the file is finished. The trigger combines all object indexes of a file and passes them to the Object-level deduplication module.
SAFE runs parsing and combining based on a file policy per file type. For parsing, SAFE has an abstract base class, FilePolicy, that specifies the functions to be implemented in derived classes such as DOCXFilePolicy, PPTXFilePolicy, and PDFFilePolicy. The File parser creates a derived class object corresponding to a file type and executes the functions of that class object.
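A compact sketch of the policy hierarchy and the combiner follows; apart from the names FilePolicy, DOCXFilePolicy, PPTXFilePolicy, and PDFFilePolicy taken from the paper, the method names, the 4 KB threshold, and the crude PDF object splitting are illustrative assumptions, not the actual implementation.

```python
import hashlib
from abc import ABC, abstractmethod

class FilePolicy(ABC):
    """Per-file-type parsing/combining policy (cf. DOCXFilePolicy, PPTXFilePolicy,
    PDFFilePolicy in the paper)."""

    @abstractmethod
    def parse(self, data: bytes):
        """Yield (container_id, decoding_scheme, object_bytes) for each object."""

    @staticmethod
    def make_record(container_id, scheme, obj: bytes):
        # The 5-tuple of a parsed object: hash value, length, container ID,
        # decoding scheme (if specified), and the object itself.
        return (hashlib.sha1(obj).hexdigest(), len(obj), container_id, scheme, obj)

class PDFFilePolicy(FilePolicy):
    def parse(self, data: bytes):
        # Sketch only: a real parser follows trailer -> xref -> objects and
        # applies FlateDecode/DCTDecode where the dictionary specifies a filter.
        for i, chunk in enumerate(data.split(b"endobj")[:-1]):
            yield (f"obj-{i}", None, chunk)

def combine_small(records, threshold=4096):
    """Concatenate small (metadata) objects into one compound object."""
    small = [r for r in records if r[1] < threshold]
    kept = [r for r in records if r[1] >= threshold]
    if small:
        combined = b"".join(r[4] for r in small)
        kept.append(FilePolicy.make_record("combined", None, combined))
    return kept
```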

Fig. 6. Logical structure of MS Office document files. (a) Word (docx). (b) PowerPoint (pptx).

Thus, a policy for a new file type can be added to the file policy by inheriting from (and implementing) the FilePolicy class. For combining, SAFE puts together metadata objects, which are small, but uses image and text (content) objects without combination, based on the logical structures per file type. Figure 6 illustrates the logical structure of docx and pptx files. As shown in Figure 6(a), the texts of a Word file are contained in a document.xml object, image objects are under a media directory, and the other directories shown in the figure contain metadata objects. Likewise, a PowerPoint file in Figure 6(b) has a media directory, but has different metadata objects. In addition, the texts of each slide are structured into an individual slide.xml. A presentation.xml holds the pointers to the slide objects.
D. Object-Level Deduplication and Store Manager
The Object-level deduplication module receives the object indexes of a file and checks whether each index is unique using the object index table. The Store manager stores either an unstructured file or the objects of a structured file after compression. An unstructured file is passed from the file-level deduplication, and objects are retrieved from the Object manager based on the unique object indexes passed from the object-level deduplication. The Store manager uses Berkeley DB as storage, and pairs of (index, data) are saved into the storage; an index is the hash value of either an individual object or a combined object. Note that an index can also be the hash value of an unstructured file. The Object manager removes the objects of a file from the object buffer when the Store manager finishes storing them to storage.
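A minimal stand-in for the store manager is sketched below; it uses Python's dbm module in place of the Berkeley DB instance described above, and the class and method names are our own.

```python
import dbm
import zlib

class StoreManager:
    """Stores (index, data) pairs; 'index' is the hash of an individual object,
    a combined object, or an unstructured file."""

    def __init__(self, path: str = "safe_store.db"):
        self.db = dbm.open(path, "c")      # stand-in for Berkeley DB

    def exists(self, index: str) -> bool:
        return index.encode() in self.db

    def put(self, index: str, data: bytes) -> None:
        if not self.exists(index):
            self.db[index.encode()] = zlib.compress(data)   # compress before storing

    def get(self, index: str) -> bytes:
        return zlib.decompress(self.db[index.encode()])

    def close(self) -> None:
        self.db.close()
```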

Fig. 7. Dropbox internal mechanism: Circled numbers give the order in which a file is saved. File-A is a file and Blk-X is a block separated from a file. h(Blk-X) denotes the hash value of a block. Thick h(Blk-X) and Blk-X denote hash values and blocks that already existed before the file is saved. A user's device is a mobile phone, tablet, laptop, or desktop.

E. SAFE in Dropbox
We now show how SAFE can be integrated into cloud-based storage services like Dropbox. We explain how Dropbox works and where the SAFE module can be embedded. A recent study [3] describes the internal mechanism of Dropbox by analyzing its network traffic. Dropbox has two kinds of servers: a control server that updates metadata such as the indexes of blocks and notifies Dropbox clients of changes in storage, and a storage server that saves data blocks. A user can access Dropbox using either a Dropbox client or a Web user interface (http://www.dropbox.com). A block is sent to a storage server after delta encoding and compression.
Figure 7 shows how a file is saved using the Dropbox client. As soon as a user saves a file (File-A) in a Dropbox folder, the hash values of its fixed-size blocks are computed if the file is larger than 4 MB; otherwise, a hash value of the whole file is computed. Suppose a file has two blocks, Blk-A and Blk-B. The Dropbox client computes the hash values of the two blocks and sends them to a control server. Dropbox uses the SHA-256 hash. Assuming that the control server already has the hash value of Blk-A, it returns the hash value of Blk-B, which does not yet exist in the servers. The Dropbox client subsequently sends Blk-B to a storage server (Amazon S3). Ultimately, File-A is synchronized between the client and the servers. Note that the storage saving occurs in the server (thanks to not saving Blk-A again), and the incurred network traffic is reduced thanks to sending Blk-B only.
SAFE can complement the fixed-size block-level deduplication in a Dropbox client as shown in Figure 8. Suppose that an unstructured file (File-A) and a structured file (File-B) are added into the Dropbox folder. The file-level deduplication module checks for duplicate files using the file index table, whose entries are keyed by the hash value of a file. For a duplicate file, an entry is added into the file index table without saving the file again in local storage.
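The block-hash exchange just described can be summarized as follows; server_has is a hypothetical stand-in for the control-server query (the real client talks to Dropbox's control servers over the network), while the 4 MB block hashing mirrors Figure 7.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB blocks, as used by the Dropbox client

def split_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return blocks or [b""]       # a file smaller than 4 MB is a single "block"

def sync_file(data: bytes, server_has):
    """server_has(hashes) -> subset of SHA-256 hashes the servers do not yet store."""
    blocks = split_blocks(data)
    hashes = [hashlib.sha256(b).hexdigest() for b in blocks]
    missing = set(server_has(hashes))                       # e.g. {h(Blk-B)} in Fig. 7
    to_upload = [b for b, h in zip(blocks, hashes) if h in missing]
    return hashes, to_upload     # hashes go to the control server, blocks to storage
```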

Fig. 8. SAFE integration with Dropbox: Control servers function as the object-level dedup module. Thick fonts such as h(Blk-X), h(Obj-X), Blk-X, and Obj-X denote items that already exist before File-A and File-B are saved.

An unstructured file follows the fixed-size block-level deduplication. A structured file is fed into the File parser, and the objects of the file are extracted. The trigger module calls the REST API [6] of Dropbox to send the hash values of the objects. The control servers act as the object-level dedup module. We use the SHA-256 hash function in SAFE for compatibility with Dropbox. The Store manager sends the objects corresponding to the hashes returned from a control server to a storage server through the REST API.
IV. EVALUATIONS
In this section, we first discuss the performance evaluation criteria and the data sets used. We then show the evaluation results of the performance and overhead of the proposed SAFE approach, compared with a file-level deduplication as used by JustCloud and Mozy, a fixed-size block-level deduplication as used by Dropbox, and variable-size block-level deduplication schemes.
A. Metrics and Used Data Sets
The major performance metrics are the deduplication ratio and the incurred data traffic amount. The deduplication ratio indicates how much storage space can be saved by removing redundancies, and is computed by Equation (1).

\[
\left( \frac{InputDataSize - ConsumedStorageSize}{InputDataSize} \right) \times 100 \tag{1}
\]

The incurred data traffic designates how much data are transferred to storage, that is, the amount of unique data out of the input data. As overhead metrics, we measure the processing time and index size. Since the overhead is proportional to the data size, we compare the processing time and index size overhead relative to the file-level deduplication, which has the least overhead.
We collected real data sets of structured files, including docx, pptx, and pdf, from the file systems and emails of five graduate students in the same department. Figure 9 summarizes the information on the data sets used. Each individual user's data set is labeled 'P-#'. For the experiments with the email data sets, we deployed two sendmail servers.
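For reference, Equation (1) can be expressed as the following small helper (illustrative only); the example numbers use the group email figures reported in Section IV-B.

```python
def dedup_ratio(input_data_size: float, consumed_storage_size: float) -> float:
    """Equation (1): percentage of storage saved by removing redundancies."""
    return (input_data_size - consumed_storage_size) / input_data_size * 100

# Example with the group email figures reported in Section IV-B: about 1.4 GB of
# the 2,542 MB input is redundant, so dedup_ratio(2542, 2542 - 1400) ~= 55%.
```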

Fig. 9. Used data sets: structured files that were collected from file systems and emails. 'Group' is the sum of all personal data sets and 'no.' is the number of structured files in each data set.

Data set   file systems          emails
           size (MB)   no.       size (MB)   no.
P-1            1,721   4,384         637     955
P-2              509     590         554     720
P-3              266     523         249     480
P-4              869   1,499         358     859
P-5              864   1,430         744     823
Group          4,229   8,426       2,542   3,837

Fig. 10. Distribution of the file sizes in the email data set (bin size: 512 KB; median: 263 KB, mean: 673 KB): 10 and 20 on the x-axis indicate 5 MB and 10 MB, respectively.

Structured files are attached to emails from a sending sendmail server, and the attached structured files are extracted by the Email parser at a receiving sendmail server. Structured files in the file system data sets are fed into the File parser directly. Figure 10 shows the distribution of the file sizes in the email group data set, whose mean value (673 KB) is relatively small compared to the 4 MB maximum block size of Dropbox. Meanwhile, we measured the percentages of structured files among all files attached to the five people's emails. As shown in Figure 11, structured files account for 89% of all attached files. PDF accounts for 44%, and docx and pptx together account for 11%. Despite the small size of the data sets, the high percentage of structured files (89% for all types of structured files and 55% for docx, pptx, and pdf) supports the prevalence of the structured file types on which SAFE is based. The data sets used may be considered relatively small. However, we note that the results obtained in this evaluation would only become stronger with larger data sets from an organization, since the redundancy levels would be greater.
For the variable-size block-level deduplication, we use 2 KB, 8 KB, and 64 KB as the minimum, average, and maximum chunk sizes, respectively. For the fixed-size block-level deduplication, we use 4 MB as the fixed block size, as Dropbox does. The fixed-size block-level deduplication is thus the same as the file-level deduplication for files smaller than 4 MB. We carried out the evaluations on the Fedora 16 Linux operating system with kernel 2.6.35.9 SMP on an Intel Core 2 Duo 3 GHz.
B. Performance Evaluations
We first evaluate the deduplication ratio for each data set. The deduplication ratio of a group is larger than that of each personal data set. For the file systems, the high deduplication ratio of the group is due to the same or similar content files shared among people in the same department. For emails, the high deduplication ratio of the group is due to duplicates of multiple-recipient emails as well as the same or similar attachments delivered and updated through email threads. Compared to the file-level deduplication in Figure 12, on average over the group data sets, SAFE removes a further 15% of redundancies and achieves about 40% better performance than the file-level deduplication.

Fig. 11. Percentage of structured files in the email data sets: structured files (pdf, zip, doc, docx, ppt, pptx, rar) account for 89% and unstructured files for 11%. Image files such as jpg, bmp, and png belong to the unstructured file types.

For the email data sets, SAFE shows almost 99% of the performance of the variable-size block-level deduplication. Furthermore, SAFE's deduplication ratio is better than that of the variable-size block-level deduplication in the file system data sets. This is because SAFE can find the boundaries of objects more efficiently in complicated structured files than the variable-size block-level deduplication, especially for PDF, which uses compression for more individual objects than other structured files such as docx and pptx. Note that the file system data sets have twice as many PDF files as the email data sets.
We next evaluate the incurred data traffic for the group data sets, as shown in Figure 13. For the file system data sets, SAFE shows the lowest data traffic among all deduplication types. This supports that SAFE can be used as a deduplication technique for personal cloud storage services like Dropbox, due to the expected decrease in network bandwidth consumption. In addition, for the email data sets, SAFE removes 56% of the data traffic of the email group data set (1.4 GB out of 2.5 GB). Compared to the file-level and fixed-size block-level deduplications, SAFE has lower data traffic by 30% for the email data sets (and 15% for the file system data sets), which indicates that SAFE efficiently reduces the network bandwidth required to store emails to cloud storages.
C. Overhead
We here show the assessments of the processing time and memory overhead. As shown in Figure 14, the file-level deduplication runs the fastest for both types of data sets, since it has no overhead for separating a file. The fixed-size block-level deduplication shows a processing time overhead close to that of the file-level deduplication. Even though it is slower than the file-level deduplication, SAFE processing is quite fast on average for the data sets, despite the fact that we do not use salient cache management schemes in our implementation. In addition, SAFE is faster by two orders of magnitude than the variable-size block-level deduplication.
We now compare the relative index overhead in Figure 15. SAFE shows 2 to 3 times less index overhead than the variable-size block-level deduplication. We use a 40-byte hexadecimal string of the SHA-1 hash value as a chunk index in all tested deduplication schemes. Though a smaller-sized

Fig. 12. Deduplication ratio for (a) file system data sets and (b) email data sets: Results of six data sets (five personal data sets and a group data set) per deduplication type are shown. File, Block-F, and Block-V denote file-level deduplication, fixed-size block-level deduplication, and variable-size block-level deduplication, respectively. SAFE achieves the highest deduplication ratio (32%) for the file system group data set, and shows a ratio (55%) close to the best one, the variable-size block-level deduplication (56%), for the group email data set. Deduplication ratios with the email data sets are higher than those with the file system data sets due to the frequent email threads in addition to shared attached files among people in the same department.

Fig. 13. Data traffic incurred (MB) for (a) file system data sets and (b) email data sets: SAFE has the lowest data traffic with the file system data sets, and the second lowest, after the variable-size block-level deduplication, with the email data sets.

chunk index can reduce the overhead of the variable-size block-level deduplication, the relative ratios shown in Figure 15 would be maintained. The index overhead increases in proportion to the number of unique chunks. For the email data sets, the numbers of unique chunks for the file-level deduplication, fixed-size block-level deduplication, SAFE, and variable-size block-level deduplication were 2.4K, 2.5K, 33K, and 92K, respectively. For the file system data sets, the numbers for each deduplication scheme were 5K, 5.5K, 155K, and 248K, respectively. SAFE shows a little more chunk index overhead with the file system data sets than with the email data sets. This is because the file system data sets have higher percentages of pdf files than the email data sets. PDF files have a relatively complex structure in which a file is divided into many small objects, and the current file policy we implemented for PDF saves each object individually without combining. By combining multiple small objects into a large object, as in the file policies for docx and pptx, SAFE would further reduce the chunk index overhead for PDF files.
V. CONCLUSIONS
We have proposed a novel deduplication scheme called SAFE that exploits the structures in prevalent structured files and emails, including attachments. Unlike traditional deduplication

schemes that bear a tradeoff between deduplication ratio and processing overhead, SAFE accomplishes both a high deduplication ratio and low processing overhead. Our experiments with real data sets and an implementation on a cloud storage client show that SAFE accomplishes 10% to 40% more storage savings and 20% less data traffic on average than the file-level and fixed-size block-level deduplications used in existing cloud-based storage services. In addition, SAFE shows a permissible processing time per file for use in in-line or cloud storage systems, and is faster by two orders of magnitude than a traditional variable-size block-level deduplication, with a comparable deduplication ratio for structured files. As for future work, we plan to extend our prototype system to incorporate more types of structured files, such as open office documents. Since SAFE focuses on exploiting the structures of files, it can be complementarily incorporated with other salient indexing and cache management techniques to further optimize the overall deduplication system performance.
REFERENCES
[1] Adobe. ISO 32000: Document management - Portable document format.

Fig. 14. Processing time overhead for (a) file system data sets and (b) email data sets: relative to the file-level deduplication. The file-level deduplication (whose value is 1) is shown as 0 because the y-axis is in log scale.

Fig. 15. Index overhead for (a) file system data sets and (b) email data sets: relative to the file-level deduplication.

[2] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur. Single instance storage in Windows 2000. In Proceedings of the 4th USENIX Windows Systems Symposium, pages 13–24. USENIX, 2000.
[3] Idilio Drago, Marco Mellia, Maurizio M. Munafo, Anna Sperotto, Ramin Sadre, and Aiko Pras. Inside Dropbox: understanding personal cloud storage services. In Proceedings of the 2012 ACM Internet Measurement Conference, IMC '12, pages 481–494. ACM, 2012.
[4] Google Drive. https://drive.google.com.
[5] Dropbox. http://www.dropbox.com.
[6] Dropbox. REST API. https://www.dropbox.com/developers/core/docs.
[7] European Computer Manufacturers Association (ECMA). Standard ECMA-376: Office Open XML File Formats.
[8] EMC. Centera: Content Addressed Storage System, Data Sheet.
[9] Kave Eshghi and Hsiu Khuern Tang. A framework for analyzing and improving content-based chunking algorithms. Technical Report HPL-2005-30(R.1), Hewlett-Packard Labs, 2005.
[10] The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). ISO/IEC 29500-1:2008.
[11] Ned Freed and Nathaniel S. Borenstein. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. http://tools.ietf.org/html/rfc2045.
[12] Wenjin Hu, Tao Yang, and Jeanna N. Matthews. The good, the bad and the ugly of consumer cloud storage. ACM SIGOPS Operating Systems Review, 44(3):110–115, Aug. 2010.
[13] iCloud. http://www.icloud.com.
[14] PKWARE Inc. ZIP File Format Specification.
[15] Jin Li, Li-wei He, Sudipta Sengupta, and Amitanand Aiyer. Multimodal Object De-duplication. Microsoft Corporation, Aug. 2009. Patent.
[16] JustCloud. http://www.justcloud.com/.
[17] Daehee Kim and Baek-Young Choi. HEDS: Hybrid Deduplication Approach for Email Servers. In Ubiquitous and Future Networks (ICUFN), 2012 Fourth International Conference on, pages 97–102.
[18] C. Liu, Y. Lu, C. Shi, G. Lu, D. H. C. Du, and D. S. Wang. ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System. In Storage Network Architecture and Parallel I/Os, SNAPI '08, Fifth IEEE International Workshop on, pages 29–35, 2008.
[19] Dutch T. Meyer and William J. Bolosky. A Study of Practical Deduplication. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST). USENIX, Feb. 2011.
[20] milter.org. https://www.milter.org/home.
[21] Jaehong Min, Daeyoung Yoon, and Youjip Won. Efficient Deduplication Techniques for Modern Backup Operation. IEEE Transactions on Computers, 2011.
[22] Mozy. http://mozy.com/.
[23] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A low-bandwidth network file system. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, volume 35, pages 174–187. ACM, Dec. 2001.
[24] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), volume 4, Jan. 2002.
[25] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University, 1981.
[26] sendmail.com. http://www.sendmail.com/sm/open_source/.
[27] SkyDrive. https://skydrive.live.com.
[28] Fang Yan and YuAn Tan. A Method of Object-based De-duplication. Journal of Networks, 6(12):1705–1712, 2011.
[29] Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), volume 18, 2008.