Hands-On Testing and Analysis

Why Primary Deduplication Is A Different Animal From Backup Dedupe

A decade ago, storage admins took a chance on purpose-built backup appliances (PBBAs) and the magical technology that made them cost effective: data deduplication.

It took a few years to convince the tape huggers (except maybe Jon Toigo), but the Society of Steely-Eyed Storage Professionals has acknowledged that deduplication is safe for backups.

In fact, they may have learned this lesson too well, as data deduplication for primary storage is being brought into the mainstream. The problem is that primary dedupe demands slightly different technologies.

I’ll look at the differences between post-process and inline deduplication for primary data, introduce the concept of dedupe on demote, and examine why block sizes may not have a significant impact on dedupe efficiency.


Defining Terms

Before we get to the specific requirements of dedupe for primary storage, we have to address the question of post-process versus inline deduplication.

The definition of these terms, and a particular vendor’s stretching of those definitions, generated some controversy, as well as an entertaining blog post from my friend Alex Galbraith, at Tech Field Day 9.

In the backup world, post-process deduplication is a design decision that gives up some data reduction in exchange for higher performance.

By writing the incoming backup stream to a dedicated landing area, post-process deduplication decouples the rate at which a PBBA can ingest data from how fast it can deduplicate it. It also speeds up restores from the last backup by restoring from the landing area, where the data is in its native format and doesn’t need rehydrating.

However, as we discovered when we, the TFD delegates, dove deep into Violin’s deduplication, inline deduplication is the speed play for primary data. Which brings us to the definition of inline.

Because I’m a well-respected industry analyst, part of my job is defining terms. For me, inline deduplication means that the data is not written to persistent storage media in its native, unreduced form. It may be temporarily held in a DRAM or NVRAM cache as the deduplication process does its thing, but if the data’s written to flash or spinning disk before it’s deduplicated, it’s not inline deduplication to me.

Since all incoming data is deduplicated before being written to flash, inline deduplication not only reduces the volume of data to be stored, but also the amount of data written to the flash devices, which in turn reduces the wear on those devices.

An incoming unique data chunk generates two I/Os, one to store the data block and one to update the system’s metadata, whereas a duplicate chunk generates just one I/O for the metadata update.
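To make that I/O accounting concrete, here is a minimal Python sketch of an inline deduplicating write path. It is a toy model, not any vendor's implementation: the class name `InlineDedupeStore`, its counters, and the choice of SHA-256 as the fingerprint are all assumptions made for illustration.

```python
import hashlib


class InlineDedupeStore:
    """Toy inline-dedupe store that counts back-end I/Os per incoming chunk."""

    def __init__(self):
        self.chunks = {}       # fingerprint -> chunk data, stored once
        self.block_map = {}    # logical block address -> fingerprint (metadata)
        self.data_ios = 0      # writes of actual chunk data
        self.metadata_ios = 0  # updates to the block map

    def write(self, lba, chunk):
        # The fingerprint is computed before anything reaches persistent media.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in self.chunks:
            # Unique chunk: one I/O to store the data itself.
            self.chunks[fingerprint] = chunk
            self.data_ios += 1
        # Unique or duplicate, the metadata is always updated.
        self.block_map[lba] = fingerprint
        self.metadata_ios += 1


store = InlineDedupeStore()
store.write(0, b"A" * 4096)  # unique: data I/O + metadata I/O
store.write(1, b"B" * 4096)  # unique: data I/O + metadata I/O
store.write(2, b"A" * 4096)  # duplicate: metadata I/O only
print(store.data_ios, store.metadata_ios)  # 2 3
```

Three logical writes cost two data I/Os and three metadata I/Os here, and only two chunks ever land on the (simulated) media.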

A system like Violin’s performs one I/O to write the data to a flash landing area, and then another to read the data out of that buffer for deduplication. As a result, the system can only deliver one-third as many IOPS with deduplication enabled as with it turned off.

Post-Process Vs. Inline For Primary Data

My problem with post-process deduplication for primary data is that it complicates capacity planning. If your storage array deduplicates data hours after it’s been written, your capacity requirements become dependent not just on how much data your users store but how much they write during the day.

For example, if desktop support uses a group policy object to turn on one-minute autosave in Microsoft Office, one user editing a 100MB PowerPoint presentation will save 6GB of data an hour, all of which will dedupe away at the end of the day.
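The arithmetic behind that figure is easy to check, and to rerun for your own environment. In this short Python sketch the function name and the eight-hour editing day are hypothetical; only the 100MB file and the one-minute autosave interval come from the example above.

```python
def pre_dedupe_writes_mb(file_size_mb, saves_per_hour, hours):
    """Data landed on the array before a post-process dedupe pass runs."""
    return file_size_mb * saves_per_hour * hours


# One user, 100MB presentation, autosave every minute (60 saves an hour).
hourly_mb = pre_dedupe_writes_mb(100, 60, 1)
# Hypothetical assumption: the user keeps the file open for an 8-hour day.
workday_mb = pre_dedupe_writes_mb(100, 60, 8)

print(hourly_mb / 1000, "GB per hour")          # 6.0 GB per hour
print(workday_mb / 1000, "GB before the pass")  # 48.0 GB before the pass
```

Multiply by the number of users doing the same thing and the landing capacity a post-process array needs starts to depend on the write rate, not on what ends up stored.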

Meanwhile, inline deduplication has some significant advantages, but it can be quite demanding on the storage controller’s CPU and memory.

To address this drawback, some storage systems use a design I call deduplication on demote. For example, purely software-defined storage systems that can’t count on NVRAM or integrated UPS protection have to write data to persistent media before acknowledging the write to the application anyway, so most SDS and hyperconverged solutions (including, as we saw at a later SFD9 presentation, VMware’s VSAN) use the deduplicate on demote design.

These systems have a high-performance flash tier that they use, at least in part, as a write cache. Data comes into the system, lands in the cache, and is acknowledged to the application. As the data cools, it’s demoted to a lower performance tier, and is deduplicated in the process of demotion.

While this means there are more I/Os than if the data were deduped inline directly to the capacity tier, these I/Os are required to demote anyway and aren’t added by the deduplication process.

The advantage of dedupe on demote is that frequently overwritten data blocks, like file system metadata or database indexes, don’t generate CPU load by deduplicating each state as they’re overwritten again and again. Those overwrites just happen in the performance tier.
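A small Python sketch shows why hot blocks are cheap under this design. It is a simplified model under stated assumptions: the class `DedupeOnDemote`, its two dictionaries standing in for the performance and capacity tiers, and the demote-everything-at-once policy are all invented for illustration and don't describe any particular product.

```python
import hashlib


class DedupeOnDemote:
    """Toy two-tier store that fingerprints data only when it is demoted."""

    def __init__(self):
        self.cache = {}      # performance tier: lba -> latest data
        self.capacity = {}   # capacity tier: fingerprint -> data
        self.block_map = {}  # lba -> fingerprint for demoted blocks
        self.hashes_computed = 0

    def write(self, lba, data):
        # Overwrites land in the cache and replace the old version; no hashing.
        self.cache[lba] = data

    def demote(self):
        # Only the final state of each cached block is hashed and deduplicated.
        for lba, data in self.cache.items():
            fingerprint = hashlib.sha256(data).hexdigest()
            self.hashes_computed += 1
            self.capacity.setdefault(fingerprint, data)
            self.block_map[lba] = fingerprint
        self.cache.clear()


tier = DedupeOnDemote()
for version in range(1000):  # a hot block overwritten 1,000 times
    tier.write(7, b"index v%d" % version)
tier.write(8, b"cold data")
tier.write(9, b"cold data")  # duplicate of block 8
tier.demote()
print(tier.hashes_computed, len(tier.capacity))  # 3 2
```

The 1,002 writes cost three hash calculations instead of 1,002, and the two identical cold blocks still collapse into one stored chunk.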

Dedupe on demote could get bottlenecked at times of high data ingest, such as when the cache overflows in a hyper-converged environment where every CPU cycle consumed by the deduplication engine is “stolen” from the host’s VMs. That said, dedupe on demote is a good compromise between CPU utilization, storage performance, and data reduction.

Block And Tackle

The other big difference between backup and primary storage deduplication is that primary storage data is almost always block-aligned, whereas backup data is a continuous stream. Every major operating system since 2008 has written data aligned on 4KB or larger block boundaries. That means if I have 200 Windows 2012 virtual servers, WINSOCK.DLL will start on a 4KB boundary for every one of them.

Backup applications create large aggregate files in proprietary formats. Those formats don’t pad out every file from every VM to a 4KB boundary, and the data shifts around within each block.

Backup appliances that do variable-block-size deduplication use triggers in the data stream to determine where each deduplication block begins and ends. This can make a system like a Data Domain or Quantum DXi that does variable-block dedupe several times more efficient than a fixed-block PBBA.
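To see why those triggers matter, consider what a single inserted byte does to a stream. The Python sketch below compares fixed 4KB blocks with a simple content-defined chunker. It is a generic illustration only: the function names, the 16-byte window, the 10-bit trigger mask, and the use of CRC32 over the window in place of a true rolling hash are all assumptions, and real appliances use their own algorithms.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=4096):
    """Split data into fixed-size blocks, as a fixed-block deduper would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3FF):
    """End a chunk wherever a hash of the last `window` bytes hits a trigger."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        # CRC32 of the window stands in for a rolling hash (an assumption).
        if zlib.crc32(data[i + 1 - window:i + 1]) & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(chunker, original, modified):
    """Fraction of the modified stream's chunks already stored from the original."""
    known = {hashlib.sha256(c).digest() for c in chunker(original)}
    new_chunks = chunker(modified)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(128 * 1024))
modified = b"\x00" + original  # one byte inserted at the front of the stream

fixed_hit = shared_fraction(fixed_chunks, original, modified)
variable_hit = shared_fraction(content_defined_chunks, original, modified)
print("fixed-block chunks already stored:     ", fixed_hit)
print("content-defined chunks already stored: ", variable_hit)
```

With fixed blocks, the one-byte shift changes the contents of every block, so essentially nothing matches. The content-defined chunker finds the same trigger points in the shifted data, so nearly every chunk after the first lines up with one already stored.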

PBBAs also need to understand the backup application’s proprietary format, so the backup application’s internal timestamps and metadata inserted in a stream of otherwise duplicate data don’t screw up the deduplication.

While some primary storage systems do variable-block-size deduplication, this is a very different process that calculates hashes on blocks at, say, 4, 8, 16, 32 and 64KB. If the 64KB block’s hash matches another 64KB block, they can store the data with 1/16th the metadata of a system using 4KB chunks alone. This makes the system more CPU and memory efficient, which (don’t get me wrong) is a good thing, but doesn’t have the same kind of impact on storage efficiency that variable-block dedupe has on a NetBackup stream.

Primary storage systems have it much easier; they only see block-aligned data, and as a result there’s much less variance in their deduplication efficiency. Where a good PBBA with variable-block dedupe could reduce data such as Exchange backups as much as four times better than a basic PBBA that only did fixed-block, the difference between primary storage systems on virtual machine images is usually within 20-25%.

For primary storage, I’m in favor of inline deduplication and don’t really buy the argument that you should be able to turn dedupe off. If you’ve sized your storage system to be able to handle your workloads with deduplication on, all turning it off does is save CPU cycles in the storage controller where they can’t be used for anything else.



Disclaimer: EMC/Data Domain, Quantum, VMware and Veritas are or have been clients of DeepStorage, LLC.