In this episode of As the Cluster Turns we’re going to look at how HCI cluster size impacts the storage component of an HCI solution. While my Google Sheet from part one compared clusters with the same total SSD capacity, that analysis ignored data protection and the impact of cluster size on data protection efficiency.
Reviewing HCI protection
HCI systems protect data by either synchronously replicating data across two or three nodes, or by striping data with single or double parity across devices on as many as a dozen nodes. While it may seem obvious that 2-way and 3-way replication require a minimum of two and three nodes respectively, I and my paranoid storage administrator brethren would sleep much better at night if the system could rebuild data after a node failure and maintain the N+1 or N+2 level of protection I want.
The ability to rebuild means not only that the system needs access to a node’s worth of free space across the cluster, but also that it must be able to resolve any split-brain issues by communicating with a quorum or witness device. Some solutions manage with three nodes for 2-way mirroring or four nodes for 3-way, possibly with an external witness. Others require four nodes for a rebuildable 2-way replication cluster and five for minimum N+2 configurations.
Even after the cluster size exceeds the minimum, users must reserve one node’s capacity so the system always has enough space to rebuild. Unlike disk arrays that have dedicated hot spares, most HCI solutions leave this to the system administrator. This rebuild capacity can be included in a vendor’s headspace recommendation (see below).
As a result, a cluster’s usable capacity is the capacity of n-1 nodes, where n is the number of nodes in the cluster. In small clusters, this can be significant. A four-node cluster loses 25% of its capacity to this spare space, while a 16-node cluster only sacrifices 6¼%.
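To put numbers on that n-1 rule, here is a minimal Python sketch. The function names are mine, not anything from a vendor tool, and the 4 TB per node is just an illustrative figure; the only thing modeled is one node's worth of space held back for rebuilds.

```python
def spare_fraction(nodes: int) -> float:
    """Share of raw capacity reserved so the cluster can rebuild one failed node."""
    if nodes < 2:
        raise ValueError("need at least two nodes to reserve one as spare")
    return 1 / nodes


def usable_raw_tb(nodes: int, tb_per_node: float) -> float:
    """Raw capacity left after setting aside one node's worth of space (n-1 nodes)."""
    return (nodes - 1) * tb_per_node


# Illustrative assumption: 4 TB of raw capacity per node.
for n in (4, 8, 16, 32):
    print(f"{n:>2} nodes: {spare_fraction(n):.2%} reserved, "
          f"{usable_raw_tb(n, 4):g} TB of {n * 4} TB left")
```

The reserve shrinks as 1/n, which is why the penalty that hurts at four nodes is close to noise at 32.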
Calculating Storage Efficiency
This week’s Google Sheet calculates the storage efficiency of common HCI protection methods. Greyed out cells represent cases where the number of nodes is less than the minimum for that protection method. Pink cells mark configurations that may not be rebuildable.
The sheet assumes a capacity of 4TB per node and populates the TB capacity and N-1 columns with the raw capacity of the cluster. It then calculates the usable capacity of a rebuildable cluster using 2-way and 3-way replication, single parity erasure coding with three and five data strips, and double parity with four and six data strips. Finally, it divides the net capacity by the cluster’s raw capacity for each method and cluster size.
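The calculation can be approximated in a few lines of Python. This is a sketch of my reading of the model, not a copy of the sheet's formulas: it assumes the minimum node count equals the stripe width (data plus parity strips, with replication treated as one data strip plus extra copies), that a cluster at exactly that minimum has no spare node, and that larger clusters give up one node's capacity for rebuilds. The `SCHEMES` table holds only a few of the methods; other stripe widths can be added the same way.

```python
from typing import Optional

# A few protection schemes as (data strips, parity strips or extra copies).
SCHEMES = {
    "2-way": (1, 1),
    "3-way": (1, 2),
    "3+1": (3, 1),
    "4+2": (4, 2),
}


def efficiency(nodes: int, data: int, parity: int) -> Optional[float]:
    """Net capacity divided by raw capacity for one scheme and cluster size.

    Returns None when the cluster is too small for the scheme.
    Assumption: minimum node count == stripe width (no witness nodes modeled).
    """
    width = data + parity
    if nodes < width:
        return None  # too few nodes for this scheme
    theoretical = data / width
    if nodes == width:
        # Fits the stripe but has no spare node, so it may not be rebuildable.
        return theoretical
    # Rebuildable: one node's worth of capacity is held in reserve.
    return theoretical * (nodes - 1) / nodes


for name, (data, parity) in SCHEMES.items():
    cells = []
    for n in (4, 8, 16, 32):
        eff = efficiency(n, data, parity)
        cells.append("  n/a" if eff is None else f"{eff:5.1%}")
    print(f"{name:>5}: " + "  ".join(cells))
```

Solutions that need an extra node or an external witness before they can rebuild would shift the minimums, so treat the small-cluster rows as the optimistic case.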
There are some columns off to the right for double protection but they’re the subject of a later blog post. What, you thought this was a trilogy?
As you can see from the chart above, each protection method asymptotically approaches its theoretical efficiency as cluster size increases. The initial blip on each line is the cluster size that can accommodate the protection method but lacks enough resources to rebuild after a node failure.
The 5+1 line, for example, peaks at 80% efficiency for a six-node cluster and then re-approaches that efficiency level, hitting 78% at a cluster size of 32.
Implementation Specific Factors
Of course, all we’ve considered so far is the impact of an HCI system’s basic data protection method. Even before we consider data reduction, and have to go down the wormhole of how well vendor A’s system deduplicates vs vendor B’s system, how the system uses storage should get factored in along with the underlying capacity.
These factors include:
- File/object system metadata
The distributed file system, object store, or whatever else a vendor wants to call it that runs the HCI solution has metadata to manage VMs, snapshots and the like. Typically this will be less than 5% of total storage.
- Deduplication metadata
Any deduplication system has to store additional metadata, including a hash/block use count table. This is typically 2-3% of capacity.
- Deduplication realm scope
Data deduplication eliminates multiple copies of the data stored in whatever set of data makes up a deduplication realm. Breaking our servers, and therefore our VMs, into multiple clusters will reduce deduplication efficiency by creating multiple deduplication realms. If you have Windows servers on 20 clusters you’ll have 20 copies of Windows, one stored on each cluster.
Some HCI solutions create deduplication realms smaller than a cluster at a disk group or datastore/volume level while others offer global deduplication across a federation of clusters.
- Performance tier cache
On many HCI systems, some or all of the performance tier or SSD will be used as a cache of one sort or another. Since data in a cache is a copy of data stored in a separate endurance tier, SSDs used as cache shouldn’t be counted towards system capacity.
- Vendor headspace recommendations
To be perfectly honest, most storage systems start to lose performance as they fill up. Hot spots emerge, the system has to do more garbage collection to create free space for the new data, and so on, and so on. Distributed systems have the additional complication of having to rebalance when individual components like nodes or drives fill up.
VMware’s VSAN rebalances data whenever components exceed 80% full. To prevent the I/O overhead created by rebalancing, VMware recommends that a VSAN system not be filled beyond 70% of capacity.
- Mixed Protection Levels
While my little spreadsheet will calculate the efficiency of some common data protection schemes, it assumes that an entire cluster uses one, and only one, data protection scheme. Most HCI systems actually use some combination of data protection schemes, both because the administrator has selected different policies for different VMs and because the HCI system may replicate data across its performance tier and only erasure code data that hasn’t been accessed in 3, or 30, days.
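To see how these factors stack on top of the protection method, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the default percentages are simply the upper ends of the ranges quoted above, metadata is treated as a share of raw capacity, and the mix is expressed as shares of the data stored rather than shares of raw capacity. Cache SSDs should be left out of `raw_tb` altogether.

```python
import math


def blended_efficiency(mix):
    """Overall efficiency when data is split across protection schemes.

    mix: list of (share_of_stored_data, scheme_efficiency) pairs.
    Each unit of data consumes 1/efficiency units of raw capacity, so the
    blend is a weighted harmonic mean, not a simple average.
    """
    if not math.isclose(sum(share for share, _ in mix), 1.0):
        raise ValueError("data shares must sum to 1")
    if any(eff <= 0 for _, eff in mix):
        raise ValueError("efficiencies must be positive")
    raw_per_unit = sum(share / eff for share, eff in mix)
    return 1 / raw_per_unit


def net_capacity_tb(raw_tb, nodes, protection_eff,
                    fs_meta=0.05,        # assumed file/object system metadata share
                    dedup_meta=0.03,     # assumed deduplication metadata share
                    fill_limit=0.70,     # assumed vendor headspace recommendation
                    spare_in_headspace=True):
    """Rough usable capacity after metadata, headspace and protection.

    raw_tb should count capacity-tier devices only (no cache SSDs).
    """
    after_meta = raw_tb * (1 - fs_meta - dedup_meta)
    fillable = after_meta * fill_limit
    if not spare_in_headspace:
        # Headspace does not cover rebuilds: also hold back one node's worth.
        fillable *= (nodes - 1) / nodes
    return fillable * protection_eff


# Example: 16 nodes x 4 TB, 20% of data 2-way replicated, 80% on 3+1.
eff = blended_efficiency([(0.2, 0.5), (0.8, 0.75)])
print(f"blended efficiency: {eff:.1%}")
print(f"net capacity: {net_capacity_tb(16 * 4, 16, eff):.1f} TB")
```

The sketch says nothing about deduplication gains, which depend on the data and on how large a deduplication realm each system uses.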
The Economies of Scale
In part one we saw that clusters made of the biggest servers we could find would cost less per compute unit. Now we see that storage efficiency gets better with bigger clusters.
That means the HCI world is one where having enough scale to run clusters of ten to sixteen servers, each with 56 to 64 cores, will give you a cost advantage.
Enjoying this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.
The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, will go live Tuesday, January 22. Sign up here