Comité de Riesgos
Principio 22: Código de Ética y conflictos de interés
We have implemented Sungem in Java and successfully integrated it with DSS. To evaluate the effectiveness of the design decisions and the implementation efficiency of this prototype, we collected a real-world backup trace from a production environment, and used it in a trace-driven evaluation study.
Two evaluation metrics were used. The first metric is the data deduplica- tion ratio, which is expressed as the percentage of duplicate fingerprints over the input fingerprints before deduplication. The second metric is the data deduplication throughput, which is measured in terms of the number of fin- gerprints that can be processed per second, including the overheads of both deduplication and garbage collection.
Trace Collection and Analysis
To derive a real-world trace of incremental data backup streams, we wrote a user-level tool that tracks and records changed files in a file system on a Windows machine within a period of time, and deployed this tool on 23 desktop machines of a research laboratory to collect the set of changed files every day on each machine for 10 weeks. Each of these 23 machines was used predominantly by a single user. We will refer to the resulting changed file trace as userT race in the following discussion. At the beginning of the trace collection period, the user-level tool traversed a file system, and for each file recorded into a database its last modify time and a 64-byte MD5 fingerprint for every 4KB block in it. At the end of every day, this tool traversed the file system, and compared the current modification time of each traversed file with its previously recorded modification time if it existed. If the previous modification time of a traversed file did not exist, the file was newly created. If the current and previous modification times of a traversed file were different, the file had been modified. In either case, the tool further computed a 64-byte MD5 fingerprint for every
File Conjectured Contribution
Type Append Behavior Percentage (%)
VM-related Files Append 35.1
Multimedia Files Overwrite 21.6
System Files Overwrite 21.9
User Documents Overwrite 11.5
Installation Media Overwrite 5.1
Log Files Append 1.6
Mails Overwrite 1.6
Database Files Overwrite 1.6
Table 3.1: The set of file types appearing in the collected userT race, their conjec- tured append behaviors, and their contribution percentages in terms of size.
4KB block in that file, and recorded into database its last modification time and all fingerprints computed this way.
Our tool tracks the file-level changes between consecutive versions of a file by comparing their constituent fingerprint sequences. It assumes that when a file is modified, the entire file is over-written. However, in some cases, modifications to a file could be in the form of appends. Because we had no way of knowing how applications actually modified files, we relied on the file type information to infer whether a file was overwritten or appended at the block level when its file-level change indicated an append-like pattern.
Table 3.1 shows the list of file types appearing in the raw collected trace, their contribution percentage in terms of size and our conjectures of whether they were overwritten or appended at the block level. VM-related files include files that support virtual machines, e.g., vmdk, vmem and vdi files. Multi- media files include all audio and video files. System files include files in the system directory, including Windows and ”Program Files” directories. User documents include Microsoft Office files, pictures, and development files. In- stallation media refer to those files that are meant to install programs, e.g., iso and msi files. Log files include system log files and application log files. Mails include files that are updated by Microsoft outlook, including files with the suffix pst and ost. Database files cover all database files used by either applications or the operating system. Eventually we decided to remove VM image-related files because we believe these files are relatively rare in a typical office environment. After all necessary pre-processing, we produced a trace of daily fingerprint streams, which was initially 10.8GB in size, and eventually grew to 43.7GB at the end of the trace collection period. From the collected trace, the size of the average daily increment change is about 1.5% of all the
files on these 23 machines taken together. However, in this trace when a file is modified, the delta consists of not only the changed data but the entire file data. Hence the deduplication ratio is much higher than a typical block level incremental snapshot.
To measure some precise characteristics of Sungem we use another trace which is created based on the idea proposed by Fanglu et al. [69] as follows: Create a vm image called vmImageBase using Vmplayer and use Windows 7 as the guest OS. To vmImageBase add language packs and some huge windows applications to generate another image vmImageM odif ied. The measured deduplication ratio in these traces are 20% in each trace. But when they are passed as input one after the other, deduplication ratio is around 34%. Meaning, around 14% of duplicates were found in modified vm image with respect to original vm image. But with this trace, the amount of data available is too short to test various features of a deduplication engine. Hence we use this trace sparingly only when the amount of input data doesn’t influence the analysis. Lets call this trace as V mT race.
While analyzing the userT race, we noticed several interesting data shar- ing patterns that eventually helped make better design decisions in Sungem. Interestingly, the finer details of the trace analysis not only helped fine-tune Sungem but also looked promising to be applied to any generic deduplication algorithm. Towards that goal, we collected yet another trace using more ma- chines, more users and collected lot more information from the user machines over an even larger period of time, to do a deeper analysis of the trace work- load. With such a detailed trace analysis we proposed several guidelines to help design any generic deduplication algorithm and are discussed in greater detail in Chapter 4. The core characteristics of the trace workload that we wish to exploit and demonstrate in this evaluation study do not differ much between the userT race, and the bigger and latest trace that we just described. Hence we just use userT race and V mT race in this evaluation study.
The hardware testbed used in the evaluation of the Sungem prototype consists of two quad-core 3.4GHz Intel Core i-7 processor, 14GB of RAM, five 7200-RPM WD Caviar blue hard disks of 1TB each, one for the system disk, two for storing the GC metadata P-array on a striped software RAID with a 64-Kbyte stripe unit size, and the remaining two for storing the deduplication fingerprints on a striped software RAID.