Hands-On Testing and Analysis

Reading Storage Benchmarks—The Details Matter

As I prepare to speak at Interop about all storage things flashy and software-defined, I’m quite disappointed by the way many vendors, and some of my fellow analysts, present their benchmark data. Given the sophistication of today’s storage systems and software, the details of how benchmarks were run matter now more than ever. Let’s review the details you should be looking for when evaluating storage performance claims.

First, let me say, as a guy who makes his living at least in part by testing storage systems and writing reports about them, that the best performance testing is the testing you do in your own datacenter, running your own applications. As I wrote three long years ago, Most of Our Benchmarks Are Broken when they’re used to evaluate modern storage systems that deduplicate data or use flash as a cache or tier of storage. The simplistic sequential or random I/O distributions and repeatable data these benchmarks use don’t resemble real workloads closely enough to predict real-world performance when the storage system is that smart. This isn’t the kind of build-a-product-to-max-the-benchmarks syndrome we saw many years ago with video cards; the common benchmarks simply haven’t caught up with storage technology.

That said, it’s clear that few, if any, real customers can afford to do a proof-of-concept and test more than one or two storage solutions in-house. Published benchmark results, if the tests are well designed and the test conditions are fully documented, can at least help you put together the short, or very short, list of products that are worth your closer consideration and, hopefully, in-house testing.

All IOPS are not created equal

One way vendors juke the stats is to manipulate the definition of an I/O operation. Benchmark results will vary, sometimes significantly, with the size of each I/O, the mix of reads and writes in the test, and, of course, whether the I/Os are random or sequential. Vendors shooting for a magic benchmark result (like the 1 million IOPS many all-flash array vendors were claiming until they realized that it made them sound too much like Dr. Evil) can specify 512-byte sequential I/O for the benchmark.

Because most of us are more interested in how a storage product performs when doing the sort of random I/O that typifies database applications, testers should try to tune their benchmarks to act more like those applications. Real-world database engines do their I/O semi-randomly in 4KB to 64KB chunks, and you should only give weight to benchmarks that use similar I/O sizes.

Most testers have settled on using 4KB I/Os with a 50/50 or 60/40 mix of reads and writes. Since flash devices are usually significantly faster at reading than at writing data, be wary of results with 100% reads or where the read/write mix isn’t specified.
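To see why the I/O size behind an IOPS figure matters, it helps to convert IOPS into the amount of data actually moved. The sketch below is plain arithmetic. The function name is hypothetical, and applying the 1 million IOPS headline to each block size is an illustrative assumption, not a measurement of any product.

```python
def throughput_mb_per_s(iops, io_size_bytes):
    """Data moved per second, in decimal megabytes, for a given IOPS and I/O size."""
    return iops * io_size_bytes / 1_000_000

# The same headline number at three I/O sizes (hypothetical workloads).
HEADLINE_IOPS = 1_000_000
for label, size_bytes in [("512 bytes", 512), ("4KB", 4096), ("64KB", 65536)]:
    mb_per_s = throughput_mb_per_s(HEADLINE_IOPS, size_bytes)
    print(f"{HEADLINE_IOPS:,} IOPS at {label}: {mb_per_s:,.0f} MB/s")
```

A million 512-byte I/Os per second moves 512 MB/s, while a million 4KB I/Os per second moves 4,096 MB/s, so the two claims describe an eightfold difference in work even though the IOPS number is identical.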

Aggregate vs. single app results

Many storage systems can only deliver a small fraction of their total performance to a single workload. When a vendor claims “a 16-node cluster of our scale-out storage system can deliver over 750,000 IOPS,” that doesn’t necessarily mean that the same storage system can provide the Oracle database behind your ecommerce site or ERP system with the 500,000 IOPS your DBAs are telling you it will need. It might take 10 or 50 workloads, accessing different logical volumes or virtual disks, to get the full performance of the system.

Traditionally, this was because different volumes used different disk drives, so accessing one volume would only get some of the drives in the system moving. In today’s flash-centric world, we may still have to run multiple workloads to access all the flash, or the system may just need to spread the load across multiple ports or processors to reach full performance.

If you have thousands of VMs, you’re probably more concerned with aggregate performance, but if you’re looking to speed up one or two big applications, make sure you’re looking at single-application performance and not just the aggregate, or you’ll be wondering why you’re only seeing a small fraction of the performance you were expecting.
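As a rough illustration of the gap between aggregate and single-application numbers, the sketch below divides the 750,000 IOPS claim from the example above evenly across the workloads needed to saturate the system. The even split is an assumption made for illustration; real systems rarely divide performance this neatly, and the function name is hypothetical.

```python
def per_workload_iops(aggregate_iops, workloads_to_saturate):
    """Even-split estimate of what one workload sees (a simplifying assumption)."""
    return aggregate_iops / workloads_to_saturate

AGGREGATE_IOPS = 750_000  # the vendor's cluster-wide claim in the example
REQUIRED_IOPS = 500_000   # what the DBAs say the database needs

for workloads in (1, 10, 50):
    share = per_workload_iops(AGGREGATE_IOPS, workloads)
    verdict = "enough" if share >= REQUIRED_IOPS else "short"
    print(f"{workloads:>2} workloads to saturate: ~{share:,.0f} IOPS each ({verdict})")
```

Under this assumption, a system that needs 10 workloads to reach its headline figure gives a single application about 75,000 IOPS, and one that needs 50 gives it about 15,000, both far below the 500,000 the database requires.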

What’s the latency, Kenneth?[1]

While IOPS are a good measure of how much random I/O the system can do, how fast any given application runs is going to be more affected by how long it takes to process each individual request, which is the latency. Think of it this way: IOPS are like passenger miles per hour, while latency is like the elapsed time to get from point A to point B. To get 400 passenger miles per hour, you can put two people in a Lamborghini Gallardo and drive 200 miles an hour or put eight people in a minivan and drive 50mph. The folks in the Gallardo will, of course, arrive several hours before the minivan.
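The car analogy can be worked through in a few lines, along with its storage equivalent. The sketch below assumes a hypothetical 200-mile trip, and it uses Little's Law (throughput equals requests in flight divided by time per request) to show two hypothetical queue depths that produce the same IOPS at very different latencies. All names and queue depths are illustrative.

```python
def passenger_miles_per_hour(passengers, mph):
    """Throughput of a vehicle, the analogue of IOPS."""
    return passengers * mph

def trip_hours(distance_miles, mph):
    """Elapsed time for one trip, the analogue of latency."""
    return distance_miles / mph

def iops_from_queue(outstanding_ios, latency_ms):
    """Little's Law: IOPS = I/Os in flight / time per I/O (latency given in ms)."""
    return outstanding_ios * 1000 / latency_ms

TRIP_MILES = 200  # hypothetical trip length
for name, people, mph in [("Gallardo", 2, 200), ("minivan", 8, 50)]:
    print(f"{name}: {passenger_miles_per_hour(people, mph)} passenger-mph, "
          f"{trip_hours(TRIP_MILES, mph)} hours")

# Same IOPS, but each request waits ten times longer in the second case.
print(iops_from_queue(32, 1), iops_from_queue(320, 10))
```

Both vehicles deliver 400 passenger miles per hour, yet the trip takes one hour in the Gallardo and four in the minivan. Likewise, 32 outstanding I/Os at 1ms and 320 outstanding I/Os at 10ms both work out to 32,000 IOPS, which is why an IOPS figure alone says little about how an application will feel.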

Many applications will only provide reasonable performance when storage latency stays below some threshold. Microsoft’s Jetstress load generator, which emulates Exchange using Exchange’s Jet database engine, will fail a test if latency ever exceeds 20ms. My Oracle DBA friends tell me that users will typically start to complain when latency exceeds 5-10ms.

When a vendor presents you with an IOPS performance figure, you should be asking what the corresponding latency is. We’ve started to see that in the all-flash array space, where vendors have begun making claims like 1 million IOPS at 1ms latency. In fact, 1ms is sort of a magic number, as that level of latency is unachievable with spinning disks. Even better is to have IOPS numbers for a given system at multiple levels of latency: 1 million IOPS at 1ms, 2 million at 5ms, and 2.2 million at 20ms.
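When a vendor does publish IOPS at several latency levels, the useful number is the one that fits under your application's latency ceiling. Here is a minimal sketch using the three example points from the paragraph above; the function name and the idea of simply picking the best published point (rather than interpolating between points) are assumptions made for illustration.

```python
# (latency in ms, IOPS) pairs taken from the example figures in the text.
CURVE = [(1, 1_000_000), (5, 2_000_000), (20, 2_200_000)]

def max_iops_within(curve, latency_ceiling_ms):
    """Highest published IOPS whose latency is at or below the ceiling, else None."""
    eligible = [iops for latency_ms, iops in curve if latency_ms <= latency_ceiling_ms]
    return max(eligible) if eligible else None

# Ceilings drawn from the thresholds mentioned earlier: 5-10ms for Oracle users,
# 20ms for Jetstress, plus a sub-millisecond target no published point satisfies.
for ceiling_ms in (0.5, 5, 10, 20):
    print(f"ceiling {ceiling_ms}ms -> {max_iops_within(CURVE, ceiling_ms)}")
```

With these figures, a database whose users complain past 5-10ms should be sized against the 2 million IOPS point, not the 2.2 million the system reaches at 20ms.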

The dataset size matters

While most of the serious results I’ve been reviewing do specify the IO size and read/write mix from their testing, the critical item that’s missing from most reports is the size of the dataset that’s being used. All but the lowest end of today’s storage systems have at least some flash that’s used as a cache or tier of storage.

If the test set is smaller than the flash component of the solution, the benchmark is going to see flash performance. Users, however, don’t buy a 10TB storage array with 1TB of flash to store 1TB of data. Their active data is somewhere between a quarter and half the total capacity of the system, or roughly two to five times the size of the flash.
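A back-of-the-envelope calculation shows how strongly the test dataset size skews results. The sketch below assumes the 10TB array with 1TB of flash from the example, uniformly random access across the dataset, and flash that holds as much of the dataset as will fit. Those are simplifying assumptions, and uniform access is exactly the simplistic distribution real workloads don't follow, so treat the output as a floor for comparison rather than a prediction.

```python
def flash_hit_fraction(flash_tb, dataset_tb):
    """Share of uniformly random I/Os served from flash (simplified model)."""
    return min(1.0, flash_tb / dataset_tb)

FLASH_TB = 1.0  # flash in the example 10TB array

# A small benchmark dataset versus the quarter-to-half-capacity active data
# sizes described above (2.5TB and 5TB on a 10TB array).
for dataset_tb in (0.5, 2.5, 5.0):
    hit = flash_hit_fraction(FLASH_TB, dataset_tb)
    print(f"{dataset_tb}TB dataset: {hit:.0%} of I/Os hit flash")
```

A 0.5TB test set is served entirely from flash, while the 2.5TB and 5TB active datasets a real user would bring see only 40% and 20% of their I/Os hit flash under this model.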

Back in the ’00s, when we were testing storage systems with spinning disks, the storage brain trust at NetworkComputing (basically Don MacVittie, Steven Schuchart and this humble reporter) settled on a dataset size of 100GB for our testing. Our thought was that since even a relatively high-end storage array of the era, like an EMC Clariion, had just 32GB of NVRAM cache, a 100GB test set would ensure we were measuring the performance of the underlying storage, not just cache performance. If I used 100GB datasets on today’s storage, the systems would perform much better in the test lab than in the real world, because the whole test set would fit in flash.

Really, correcting for this problem will require a new generation of benchmarks that, like real applications, create hotspots of very active data. While we wait for these to be developed, users should at least demand that performance numbers come with the latency and the dataset size, in addition to a detailed description of what kind of I/Os were used to test the system.
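To make the idea of hotspots concrete, here is a minimal sketch of an access-pattern generator in which a small region of the dataset receives most of the I/O. It is not taken from any existing benchmark; the function name, the 10% hot region, and the 90% hot probability are all hypothetical values chosen for illustration.

```python
import random

def hotspot_offsets(n_ios, n_blocks, hot_fraction=0.1, hot_probability=0.9, seed=0):
    """Yield block numbers where a small hot region receives most accesses.

    Assumes n_blocks is large enough that both regions are non-empty.
    """
    rng = random.Random(seed)  # seeded so a run is repeatable
    hot_blocks = max(1, int(n_blocks * hot_fraction))
    for _ in range(n_ios):
        if rng.random() < hot_probability:
            yield rng.randrange(hot_blocks)            # hot region: [0, hot_blocks)
        else:
            yield rng.randrange(hot_blocks, n_blocks)  # cold region: the rest

# Example: 10,000 I/Os over a 1,000-block dataset.
offsets = list(hotspot_offsets(10_000, 1_000))
hot_share = sum(1 for block in offsets if block < 100) / len(offsets)
print(f"{hot_share:.1%} of I/Os landed in the hottest 10% of blocks")
```

A load generator driven by a pattern like this would let a cache or flash tier help about as much as the skew allows, instead of either fitting the whole test set in flash or defeating caching entirely with uniform random access.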

[1] Refers to an incident in New York City in 1986, when two then-unknown assailants attacked journalist Dan Rather while repeating “Kenneth, what is the frequency?”
