Hands-On Testing and Analysis

On Calling Out and Being Called Out

It’s interesting how companies develop corporate personalities that extend not just to official communications but even to employees’ personal blogs. A few years ago, this was most apparent at NetApp and EMC. Today, the most visible example is how folks at Nutanix seem to take even the most tangential reference or minimal criticism as direct attacks that must be responded to.

As an example, my friend and fellow Tech Field Day delegate Keith Townsend tweeted a link to an English translation of a German Nutanix user’s blog. The blog post described the problems the user had with his Nutanix system running Hyper-V. Keith even mentioned that it was unusual to see negative feedback from Nutanix users.

Over the next few hours, several Nutanix employees tweeted responses to Keith, and despite all of the great work that Keith has produced, his simply drawing attention to a user’s blog became his most retweeted/commented-on Twitter conversation.

We at DeepStorage recently produced a report for Atlantis Computing where we used Microsoft’s Jetstress to drive one of their HyperScale CX-12 systems to support the I/O of 60,000 simulated mailboxes. In the report, we obliquely referred to Nutanix in two bullet items in the executive summary, which we call The Bottom Line:

  • 2.5 times the mailboxes of a leading HCI provider’s ESRP report
  • Five times the IOPS of that HCI provider’s system

And again in this paragraph where we’re talking about how to compare the results of different organizations’ Jetstress reports:

One leading vendor of hyperconverged infrastructure appliances used 0.05 IOPS per mailbox for its ESRP report. This would represent users averaging roughly 70 messages a day. Users comparing published Jetstress results should pay more attention to the total IOPS supported than just the number of mailboxes, by multiplying the number of mailboxes by the IOPS per mailbox.
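The conversion between an IOPS-per-mailbox profile and messages per day can be sketched in a few lines. This is a back-of-the-envelope illustration, not part of the report, and it assumes Microsoft’s published figure of 0.101 IOPS per mailbox for a 150-message/day profile scales roughly linearly:

```python
# Rough conversion between a Jetstress IOPS-per-mailbox profile and
# messages/day, assuming linear scaling from Microsoft's published
# figure of 0.101 IOPS/mailbox for a 150-message/day user profile.
IOPS_PER_150_MSG_PROFILE = 0.101

def messages_per_day(iops_per_mailbox: float) -> float:
    """Estimate the daily message load a given I/O profile represents."""
    return iops_per_mailbox / IOPS_PER_150_MSG_PROFILE * 150

def total_iops(mailboxes: int, iops_per_mailbox: float) -> float:
    """The comparison that matters: mailboxes times IOPS per mailbox."""
    return mailboxes * iops_per_mailbox

# The 0.05 IOPS/mailbox ESRP profile works out to roughly 70 messages/day:
print(round(messages_per_day(0.05)))  # ~74, i.e. "roughly 70"
```

Multiplying mailboxes by IOPS per mailbox, as the paragraph above suggests, is what puts two reports with different profiles on a comparable footing.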

In response, Josh Odgers, who works as a staff solution architect at Nutanix, wrote a blog post titled “Being called out on Exchange performance & scale. Close (well not really), but no cigar.” Now, in Queens, NY, where I grew up, being called out was a serious deal, because you usually had to answer being called out with your fists, but you pretty much had to be called out by name. Sure, I could be called out as the fat kid in the L.C. Greenwood jersey, but three oblique references to you as “A leading HCI provider” wouldn’t come close. Of course, salesmen will be salesmen, and Josh may be referring to some Atlantis folks who might be making hay at Nutanix’s expense, but we never called Nutanix out.

While the post didn’t call us out by name, it did refer to our work as “an April Fools joke.” We’re proud of our work and stand behind it, so we’re going to have to respond.

About the Report and Jetstress

Atlantis asked us to test their HyperScale appliance with Jetstress to demonstrate how much Exchange I/O traffic the system could handle. Even leaving aside the storage-colored glasses that are issued to all DeepStorage personnel, Jetstress is a storage benchmark. As the Jetstress 2013 Field Guide (the only official Microsoft documentation for Jetstress) describes it:

Jetstress is a tool for simulating Exchange database I/O load without requiring Exchange to be installed.  It is primarily used to validate physical deployments against the theoretical design targets that were derived during the design phase.

 Jetstress testing provides the following benefits prior to deploying live users.

  • Validates that the physical deployment is capable of meeting specific performance requirements
  • Validates that the storage design is capable of meeting specific performance requirements
  • Finds weak components prior to deploying in production
  • Proves storage and I/O stability

One of the things that makes Jetstress a good benchmark is that it produces a very clear pass or fail. The operator defines the number of mailboxes, their size, and the number of IOPS per mailbox. If the system can perform the specified I/O load, Jetstress will report a pass; if it can’t, the Jetstress report gets a very disappointing, red FAIL.

BUT Jetstress is NOT Exchange

As we say in the report: Jetstress uses the Jet database engine that is at the heart of the Exchange server. Jetstress performs read, insert, delete, and other operations against the Jet database to simulate Exchange users. By accessing the core database directly, Jetstress can create much higher levels of disk I/O than using full Exchange servers that have the overhead of processing email before writing to the database.

But Jetstress is not Exchange. It doesn’t actually process email and, as a result, Jetstress uses significantly less CPU and memory than real Exchange servers. It would be more realistic to use full-blown Exchange servers, but creating a full-blown Exchange configuration big enough to stress a modern storage system is a big job, and such a system would still need another tier of servers to emulate the thousands of users sending and reading email.

This very complexity is why Jetstress was created: so users could validate that a storage system can handle the demands of their Exchange estate before they actually build it.

Jetstress and DAGs

Jetstress does have an option to set the number of database copies, in order to simulate Exchange Database Availability Groups (DAGs), but it really doesn’t do much. With a two- or three-DAG configuration, Exchange ships logs to the secondary servers, which replay them, writing data to their databases. This should increase the number of IOPS a test configuration must perform to support a given number of mailboxes by 30% or more.

In reality: “This value simply simulates some LOG I/O reads to account for the log shipping between active and passive databases – it does NOT actually copy logs between servers.”[1]

Telling Jetstress to use two copies in our testing generated about 11 additional IOPS, an increase of roughly 0.3%.

Microsoft’s Exchange Solution Reviewed Program (ESRP)

In Microsoft’s own words: The Exchange Solution Reviewed Program (ESRP) – Storage is a Microsoft Exchange Server program designed to facilitate third-party storage testing and solution publishing for Exchange Server.

The program combines a storage testing harness (Jetstress) with solution publishing guidelines. Microsoft Gold Certified or Certified Storage Partners (storage original equipment manufacturers (OEMs) who are part of the Microsoft Certified Partner Program) can use the ESRP framework provided to test their storage solutions targeted for Microsoft Exchange deployment. Customers can use the solutions published here to help plan/design their own Exchange storage architectures. 

And Many ESRP Reports Do Not Run Jetstress On Exchange Reference Architectures

The first problem the post identifies about our report is that the “Atlantis document is not an ESRP, it’s a (paid for?) analysts report.” I can at least remove the question mark and say, yes, DeepStorage was paid, by Atlantis, to test the Atlantis HyperScale appliance and to produce the report.

However, strictly speaking, it’s not Atlantis’ document; it’s a sponsored publication of DeepStorage. Editorial control remains with DeepStorage. Atlantis hired us to test with Jetstress. They had limited input into the parameters used and did not set performance goals for us.

The post sets ESRP up as a model of virtue and, by comparison, DeepStorage as a coin-operated joint that will write anything the client wants. I know Josh on Twitter, and I know that’s not what he thinks of DeepStorage, but the I-wonder-if-it-was-paid-for aside definitely implies that a paid-for report is worth less than an ESRP report.

Again from the Microsoft site: Note: The ESRP – Storage program is not a Microsoft certification, qualification, or logo program. Microsoft makes no warranties or representations with regard to third-party storage solutions, including without limitation regarding the supportability of such third-party storage solutions. It is solely your responsibility to confirm the accuracy of the data and contents of the storage solution whitepaper you produce using the ESRP – Storage program testing framework.

Readers have to make their own decisions about author credibility. ESRP reports are written by vendors who are members of a Microsoft OEM program. As part of the ESRP program, Microsoft reviews ESRP reports to make sure they meet Microsoft’s standards. We like to believe that we have a reputation for independence in our testing and analysis.

Does Independence Have Value?

Does Microsoft’s review make what is really a vendor-written document more valid than an independent one? That’s for you to decide. You should note, however, that Microsoft just reviews the documents. It’s not like SPC, which sends an auditor to watch the actual testing and ensure that the reported results are the actual results.

As Tony Redmond points out, in a blog post Josh references, many of the ESRP configurations are nowhere close to what a decent Exchange architect would build to support the users and resiliency goals of a real organization. They have too few DAGs and server configurations that can run Jetstress fine but don’t provide enough server CPU and RAM to actually run Exchange.

We used those ESRP reports as guidance to how Jetstress testing was reported. Most of the complaints in the post are more about the general state of Jetstress—and, therefore, ESRP—benchmark reports than about the specifics of our report for Atlantis. Jetstress performance reports, including many, if not most, ESRP reports are about storage performance and storage performance only.

Sure, it would be nice if Jetstress testing weren’t about “hero numbers.” It would also be nice if Jetstress ran across multiple servers, actually shipping and posting logs like real Exchange servers, and produced consolidated reports. It would have been easy for us to use 50 rather than 150 messages a day as our target so we could have a hero number over 100,000, but we picked parameters that would show the limit of the system while being somewhat realistic.

Nobody’s Perfect—Errors We’re Correcting

We don’t pretend to be perfect, and after Josh’s post led us to take one more careful read of the report, we did identify some things that needed correcting:

  1. The Bottom Line of the report (the executive summary) does refer to 200 messages a day, whereas the testing and the rest of the report correctly say 150 messages a day. The 0.121 IOPS/mailbox we tested with is the 0.101 IOPS/mailbox that Microsoft says is generated by 150 messages a day, plus 20% headroom. The 20% headroom is common across most Jetstress testing reports.
  2. We said, “To keep our testing in line with real world,” describing why we used three Jetstress servers. We’ve reworded that to “To preserve a semblance of the resiliency we’d recommend for use in the real world, we created three Jetstress servers.”
  3. To clear up any confusion about Jetstress vs. Exchange we added:

Readers should not take this test configuration as a recommended Exchange server configuration. It is configured solely to support Jetstress generating load against the HyperScale CX-12’s storage layer.  Since Jetstress can’t coordinate testing across multiple servers and can generate the load we need from just three, that’s what we used. As we were using Jetstress, a storage benchmark, to test the HyperScale storage layer, we did not attempt to scale the servers with CPU and RAM as if they were Exchange servers.

  4. The HyperScale CX-12 we tested with had 384GB of RAM, not the 256GB originally reported.

Comparing Apples and Oranges

Josh complains that we compared an all-flash Atlantis system to the published results for the hybrid Nutanix system. While I first remind you, dear reader, that the extent of the comparison is two bullet items that refer to “A leading HCI provider” in a twelve-page report, we also reject the premise that all-flash and hybrids shouldn’t be compared.

Unlike Gartner, we at DeepStorage don’t think there’s a separate market for all-flash that’s distinct from hybrids. All-flash systems simply deliver more consistent performance and usually cost more.

My best understanding of Nutanix’s and Atlantis’ pricing is that the all-flash Atlantis and the hybrid Nutanix will cost a customer about the same amount of cash. The Nutanix system may offer a few more TB of capacity but not a significant amount after Atlantis’ data reduction. We’re not comparing apples and oranges but tangerines and clementines—not quite the same but your kid will welcome either in their lunchbox.

When we say 2.5 times the mailboxes and 5 times the IOPS, that’s hero number to hero number. Nutanix’s ESRP report says 24,000 mailboxes at 0.06 IOPS. We achieved 60,000 at 0.121.
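The arithmetic behind those two bullets is straightforward. As a sketch, using only the figures quoted above:

```python
# Hero-number to hero-number comparison, using the published figures:
# the ESRP report's 24,000 mailboxes at 0.06 IOPS/mailbox versus our
# tested 60,000 mailboxes at 0.121 IOPS/mailbox.
esrp_mailboxes, esrp_iops_per_mb = 24_000, 0.06
tested_mailboxes, tested_iops_per_mb = 60_000, 0.121

mailbox_ratio = tested_mailboxes / esrp_mailboxes  # 2.5x the mailboxes
iops_ratio = (tested_mailboxes * tested_iops_per_mb) / (
    esrp_mailboxes * esrp_iops_per_mb)             # ~5x the total IOPS

print(f"{mailbox_ratio:.1f}x the mailboxes, {iops_ratio:.1f}x the IOPS")
# prints "2.5x the mailboxes, 5.0x the IOPS"
```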

Not Real-World Exchange

OK, you got me on that one. We view Jetstress as a storage benchmark, which is all it really is.

Since we were testing the Atlantis system’s storage, we didn’t build Exchange servers as we would have to support 60,000 real users, and we didn’t validate that those Exchange servers would run on the HyperScale.

However, as the blog post points out, by quoting Tony Redmond, most actual ESRP reports don’t either.

If I were trying to deploy Exchange on a four-node HCI appliance, as opposed to Jetstress, I would use twelve virtual Exchange servers, so each host would run three under normal conditions and four in the event of a node failure.

It seems strange that the guy who, at the top of the blog post, says it’s “not a competition to see what vendor has the most hardware lying around the office to perform benchmarks” is later arguing that the servers running Jetstress should have had enough RAM to run Exchange or that Nutanix could have hit 60,000 mailboxes by scaling out.

Frankly, since, again, unlike SPC, there’s no cost analysis in our report or in ESRP reports, we could have asked Atlantis to add RAM to 768GB/server, but the Jetstress results wouldn’t have changed.

No Fault Tolerance

Most Exchange experts nowadays recommend using Exchange’s built-in DAGs for data protection, and many of Josh’s arguments about IOPS and capacity are based on his using the Exchange Server Role Requirements Calculator for a system with two DAGs. While most ESRP reports are based on two DAGs, we chose not to go there.

First, as we discussed above, Jetstress DAGs don’t really affect the results nearly as much as real log shipping would. More importantly, we firmly believe that DAGs only provide true protection when they’re on independent media. Since volumes in the Atlantis system share all the system’s media, having multiple DAGs would only very slightly increase resiliency and would still leave all the copies vulnerable to the same failures.

Even if the Atlantis storage system could isolate volumes so volume A was stored on nodes 1 and 2 while volume B was stored on nodes 3 and 4, we wouldn’t consider that sufficient isolation in a four-node cluster, just as we wouldn’t consider separate shelves on the same SAN array enough.

We considered setting database copies to two and claiming that it was a resilient configuration but decided against it, as a DAG would only be valid on a second appliance, just as we would tell an array vendor that two DAGs need two arrays. We ended up testing with copies set to two but are not calling this a resilient system with DAGs.

Readers should note that the June 2015 Nutanix ESRP report was tested with three servers and 8000 users per server, while the resilient configuration the report discussed was six servers each with 4000 active and 4000 passive mailboxes. They did test with mailbox copies set to two, as we did.

Jetstress on Dedupe

The post also complains that we ran Jetstress on a storage platform with deduplication, which is not recommended by Microsoft. As a consultant, I’ve installed and managed every version of Exchange since 4.0. As an analyst, I know more about deduplication than most. I don’t see any reason not to use good deduplication behind Exchange, and being independent, I can move faster than Redmond.

I think Microsoft has to stop thinking that their Windows deduplication is the only way it’s done and take a close look at storage systems like NetApp/SolidFire and EMC/XtremIO. These systems use the hashes generated in the deduplication system to determine data placement. Dedupe isn’t bolted on. It can’t be turned off. It’s part of how the system works.

I will admit, and we discuss at length in the report, that Jetstress data is significantly more reducible than real-world Exchange data. That has a significant impact on the amount of storage capacity consumed, but on a one-tier all-flash system like Atlantis, it shouldn’t have a significant impact on performance.

It does mean that, as with CPU, the CX-12 we tested on would be stressed to hold the data for 60,000 mailboxes at 800MB each. Perhaps we should have set the mailbox size to 200MB, which, as of a year ago, was the mailbox quota at one Fortune 500 company I was working with.

Frankly, since Atlantis guarantees that a CX-12 owner will be able to store 12TB of data and a CX-24 owner 24TB of data, we think most admins will be able to figure out if their data will fit.

It Can’t Do the IOPS

I have to admit that this one annoys me. The post’s problem is that Josh has gotten so deep into the Exchange sizing calculator for a 60,000-user, 150-message/day, two-DAG config that he’s forgotten we were writing about a no-DAG Jetstress report.

Yes, the Jetstress Field Guide says to use the calculator to see how many IOPS your DAG configuration will need, but I’ve never seen a Jetstress test report that does that, including the Nutanix June 2015 report.

For Jetstress, we take the IOPS number from TechNet (for 150 messages/day, that’s 0.101), multiply by 1.2 to provide 20% headroom, and plug that into Jetstress. In our case, that was 0.121. Jetstress says pass or fail. We needed to achieve 2420 IOPS per server and achieved 2937.
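That derivation works out as follows; this is just the numbers above, written out:

```python
# Deriving the Jetstress target from Microsoft's per-mailbox IOPS figure.
base_iops_per_mailbox = 0.101  # TechNet figure for 150 messages/day
headroom = 1.2                 # 20% headroom, common in Jetstress reports
target = round(base_iops_per_mailbox * headroom, 3)  # 0.121 IOPS/mailbox

mailboxes, servers = 60_000, 3
required_per_server = mailboxes // servers * target  # 2420 IOPS per server
achieved_per_server = 2937                           # measured result

# Jetstress-style verdict: pass if the achieved rate meets the target.
print(target, required_per_server,
      "PASS" if achieved_per_server >= required_per_server else "FAIL")
```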


“Don’t hate the player; hate the game.”  There are certainly times I do, but that’s another blog post.

All in the game, yo


[1] Jetstress 2013 Field Guide