
The following factors influence how effectively TSM deduplication reduces the amount of data to be stored.

4.1.1

Characteristics of the data

4.1.1.1

Uniqueness of the data

The first factor to consider is the uniqueness of the data. Much of the savings from deduplication comes from repeated backups of the same objects. Some savings, however, result from data that an object has in common with backups of other objects, or that repeats within the same object. The uniqueness of the data is the portion of an object that has never been stored by a previous backup. Duplicate data can be found within the same object, across different objects stored by the same client, and across objects stored by different clients.
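The three sources of duplicate data named above can be illustrated with a small sketch. It is not TSM code: the fixed 4-byte chunk size, the SHA-256 digest, the `store` helper, and the client and file names are all assumptions chosen to keep the example readable.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed chunk size, for illustration only (an assumption)


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store(objects):
    """Store objects in order; return {name: (total_chunks, newly_stored_chunks)}."""
    seen = set()   # digests of every chunk stored so far, by any client
    report = {}
    for name, data in objects:
        total = new = 0
        for chunk in chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            total += 1
            if digest not in seen:
                seen.add(digest)
                new += 1
        report[name] = (total, new)
    return report


# Hypothetical objects from two hypothetical clients
backups = [
    ("clientA/file1", b"AAAABBBBAAAACCCC"),  # "AAAA" repeats within the object
    ("clientA/file2", b"BBBBDDDD"),          # shares "BBBB" with file1 (same client)
    ("clientB/file1", b"CCCCDDDDEEEE"),      # shares chunks with client A's objects
]
report = store(backups)
for name, (total, new) in report.items():
    print(f"{name}: {new} of {total} chunks had never been stored before")
```

In this toy run, the share of never-before-stored chunks drops for each later object, which is the sense in which uniqueness limits what deduplication can save.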

4.1.1.2

Response to fingerprinting

The next factor is how data responds to the deduplication fingerprinting processing used by TSM. During deduplication, TSM breaks objects into chunks, which are examined to determine whether they have been previously stored. These chunks are variable in size and are identified using a process called fingerprinting. The purpose of fingerprinting is to ensure that the same chunk will always be identified regardless of whether it shifts to different positions within the object between successive backups.

The TSM fingerprinting implementation uses a probability-based algorithm for identifying chunk boundaries within an object. The algorithm strives to have all of the chunks created for an object average out in terms of size to a target average for all chunks. The actual size of each chunk is variable within the constraints that it must be larger than the minimum chunk size and cannot be larger than the object itself. The fingerprinting implementation results in average chunk sizes that vary for different kinds of data. For data that fingerprints to average chunk sizes significantly larger than the target average, the deduplication efficiency is more sensitive to changes. More details are given in the later section that discusses tiering.
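To show why boundaries chosen from the data itself survive a shift in position, here is a minimal sketch of content-defined chunking. It is not the TSM algorithm, whose details are not given here: the CRC-32 window hash, the 16-byte window, the bit mask, and the minimum chunk size are all assumptions for illustration.

```python
import random
import zlib

WINDOW = 16     # bytes examined when testing for a boundary (assumed value)
MASK = 0xFF     # cut when the low 8 hash bits are zero: roughly 1 in 256 positions
MIN_SIZE = 64   # minimum chunk size (assumed value; must be >= WINDOW)


def cdc_chunks(data: bytes):
    """Content-defined chunking: cut where the hash of the trailing window matches."""
    out, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < MIN_SIZE:
            continue  # enforce the minimum chunk size
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            out.append(data[start:end])
            start = end
    if start < len(data):
        out.append(data[start:])  # final chunk, bounded by the object itself
    return out


def fixed_chunks(data: bytes, size: int = 256):
    """Fixed-size chunking, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old, new):
    """Fraction of chunks in `new` that already exist in `old`."""
    known = set(old)
    return sum(c in known for c in new) / len(new)


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"12345" + original  # five bytes inserted at the front

cdc_reuse = reused(cdc_chunks(original), cdc_chunks(shifted))
fixed_reuse = reused(fixed_chunks(original), fixed_chunks(shifted))
print(f"content-defined: {cdc_reuse:.0%} reused, fixed-size: {fixed_reuse:.0%} reused")
```

Because each boundary depends only on the bytes in the window before it, the inserted prefix disturbs the first chunk or so and the remaining boundaries fall on the same content as before. With fixed-size chunks, every chunk after the insertion point changes.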

4.1.1.3

Volatility of the data

The final factor is the volatility of the data. A significant share of deduplication savings comes from similar objects being backed up repeatedly over time. Objects that undergo only minor changes between backups will have a significant percentage of chunks that are unchanged since the last backup and hence do not need to be stored again. Conversely, an object can undergo a pattern of change that alters a large percentage of its chunks. In these cases, deduplication yields very little savings. It is important to note that this effect does not necessarily relate to the amount of data being written to an object. Instead, it depends on how pervasively the changes are scattered throughout the object. Some change patterns, such as appending new data at the end of an object, respond very favorably to deduplication.
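The difference between the amount of change and the spread of change can be made concrete with a short sketch. It assumes fixed 256-byte chunks and SHA-256 digests purely for simplicity (TSM uses variable-size chunks), and the helper names are hypothetical. Both modified objects differ from the base by exactly 64 bytes.

```python
import hashlib

CHUNK = 256  # fixed chunk size, an assumption made to keep the sketch short


def digests(data: bytes):
    """Digest of each fixed-size chunk of the object."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def new_chunks(previous: bytes, current: bytes) -> int:
    """Number of chunks in `current` that are not already stored from `previous`."""
    known = set(digests(previous))
    return sum(d not in known for d in digests(current))


# Base object: 64 distinct chunks, each filled with its own index byte
base = b"".join(bytes([i]) * CHUNK for i in range(64))

# Change pattern 1: 64 new bytes appended at the end
appended = base + b"\xff" * 64

# Change pattern 2: 64 bytes overwritten in place, one in every chunk
buf = bytearray(base)
for i in range(64):
    buf[i * CHUNK] = 0xFF
scattered = bytes(buf)

append_cost = new_chunks(base, appended)
scatter_cost = new_chunks(base, scattered)
print(f"appended: {append_cost} new chunk(s), scattered: {scatter_cost} new chunk(s)")
```

The append adds a single new chunk, while the scattered overwrite invalidates all 64 existing chunks, even though the same number of bytes changed.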

4.1.1.4

Examples of workloads that respond well to deduplication

The following are general examples of backup workloads that respond well to deduplication:

· Backup of workstations with multiple copies or versions of the same file.

· Backup of objects with regions that repeat the same chunks of data (for example, regions with zeros).

· Multiple full backups of different versions of the same database.

· Operating system files across multiple systems. For example, Windows system state backup is a common source of duplicate data. Another example is virtual machine image backups with TSM for Virtual Environments.

· Backup of workstations with versions or copies of the same application data (for example, documents, presentations, or images).

· Periodic full backups taken of systems using a new node name for the purpose of creating an out-of-cycle backup with special retention criteria.

4.1.1.5

Deduplication efficiency of some data types

The following table shows some common data types along with their expected deduplication efficiency.

Data type                                                               Deduplication efficiency

Audio (mp3, wma), Video (mp4), Images (jpeg)                            Poor

Human generated/consumer data: text documents, source code              Good

Office documents – spreadsheets, presentations                          Poor

Common operating system files                                           Good

Large repeated backups of databases (Oracle, DB2, etc.)                 Good

Objects with embedded control structures                                Poor

TSM data stored in non-native storage pools (for example, NDMP data)    None

4.1.2

Impacts from backup strategy decisions

The gains realized from deduplication are also influenced by two implementation choices in how backups are taken and managed.

4.1.2.1

Backup model

For TSM, a very common backup model is incremental-forever backup. In this case, each subsequent backup achieves significant storage savings by not sending unchanged objects. Objects that are not re-sent also do not need to go through deduplication processing, which makes this a very efficient method of reducing data. On the other hand, other data types use a backup model that always runs a full backup, or a periodic full backup. In these cases, deduplication typically reduces the data to be stored significantly, because of the substantial duplication across subsequent backups of similar objects. The following table illustrates some examples of deduplication savings between full and incremental backup models:

Does deduplication offer savings in the case where ...

File-level backups are taken using the backup-archive client.

Full backup: Yes when:

· There is data in common from other nodes, such as operating system files

· Periodic full backups are taken for a system. This is occasionally performed using a different node name for the purpose of establishing a different retention scheme

Incremental backup: Yes for files that are being re-sent due to changes (depends on volatility). No for new files that are being sent for the first time (depends on uniqueness).

Database backups are taken using a data protection client.

Full backup: Yes when:

· Subsequent full backups are taken (depends on volatility)

No when:

· The first backup is taken. Databases are typically unique

Incremental backup: Typically no. The database incremental mechanism is only sending changed regions of the object, which typically have not been stored before.

Virtual machine backups are taken using the Data Protection for VMware product.

Full backup: Yes. VMware full backups often experience savings with matches from the backups of other virtual machines, as well as from regions from the same virtual disk that are in common.

4.1.2.2

Retention settings

In general, the more versions you set TSM policy to retain, the more savings you will realize from TSM deduplication as a percentage of the total you would have needed to store without deduplication. Users who desire to retain more versions of objects in TSM storage find this to be more cost effective when using deduplication. Consider the example below, which shows the accumulated storage used over a series of backups using the Data Protection for Oracle product. You can see that ten backup versions are stored with deduplication using less capacity than three backup versions require without deduplication.
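The relationship between retained versions and percentage savings can also be worked through with a simple model. This is a sketch under stated assumptions, not data from the Data Protection for Oracle example: it assumes every version is a full backup of the same size, that a fixed fraction of each later backup consists of new chunks, and it uses hypothetical figures of 100 GB per full backup and a 10% change rate.

```python
def stored_without_dedup(full_size_gb: float, versions: int) -> float:
    """Every retained version is stored in full."""
    return full_size_gb * versions


def stored_with_dedup(full_size_gb: float, versions: int, change_rate: float) -> float:
    """First version stored in full; each later version stores only its new chunks."""
    return full_size_gb + (versions - 1) * change_rate * full_size_gb


FULL_GB = 100.0     # hypothetical size of one full backup
CHANGE_RATE = 0.10  # hypothetical fraction of new chunks per subsequent backup

for n in (1, 3, 10):
    plain = stored_without_dedup(FULL_GB, n)
    dedup = stored_with_dedup(FULL_GB, n, CHANGE_RATE)
    saving = 1 - dedup / plain
    print(f"{n:2d} versions: {plain:6.0f} GB without, {dedup:6.0f} GB with, {saving:.0%} saved")
```

Under these assumptions the percentage saved grows with the number of retained versions, and ten deduplicated versions occupy less space than three versions stored without deduplication. Actual results depend on the real change rate and on how the data responds to fingerprinting.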

4.2

Effectiveness of deduplication combined with progressive
