Hands-On Testing and Analysis

Local RAID – Belt and braces for HCI

Replication and parity-based erasure coding may allow your HCI system to survive one or two node failures in a cluster. To get even greater resiliency, HCI vendors can layer another protection method under the HCI system's basic node-to-node replication, with local RAID in each node, or over it, by replicating data to another cluster or failure domain. The result works much like combination RAID.

Higher levels of resiliency become even more important when systems are in the remote sites where HCI is such a good fit. In your corporate data center, in a major metropolitan area, an HCI node that goes offline will get noticed within a few seconds to a few minutes. Once someone acknowledges the alert, they can probably add another node to the degraded cluster in a few minutes, or call your server vendor, who’s contractually required to send a tech within four hours.

When a server dies in the server closet of the SprawlMart in Truth or Consequences, NM at 4 PM on a Friday afternoon, it’s going to be down for the big Memorial Day barbecue grill sale even if SprawlMart’s server monitoring system alerts someone by 4:05. If it doesn’t, the store manager is going to be too busy to check that both of his servers are actually working until at least Tuesday.

Every hour a system spends running from a single copy of data is an hour during which even the smallest additional failure, like the failure of a drive to read a particular block, could trigger a total failure and, even worse, data loss. Larger HCI clusters address this problem by having enough nodes and storage capacity to rebuild their data. A SprawlMart, though, only runs half a dozen VMs, and putting four hosts, or three hosts with an external witness, in every store just so they can rebuild would get expensive.

Local RAID

If you want to increase the resiliency of any kind of cluster, you can either increase the number of replicas, which will require more nodes, or increase the resiliency of the individual nodes. Local RAID takes the second approach by implementing traditional RAID on each node.

Local RAID has been around since the days of the VSA. Since the early VSAs from VMware and StorMagic supported only two-node clusters, local RAID was the only way to keep any resiliency when a node went offline. Many VSAs relied on the hypervisor to manage the RAID controller, simply consuming VMDKs like any other VM. This allowed users to leverage RAID controller features like DRAM and SSD caches.

2-Node HCI with Local RAID

As HCI vendors added 3-way replication and double-parity erasure coding, local RAID fell out of favor with many vendors. Simplivity continues to use local RAID and 2-way replication as its primary data protection scheme, and that combination remains the most cost-effective way to get higher resiliency from small clusters.

A 2-node cluster of small Simplivity nodes would replicate 5+1 RAID5 sets for an effective resiliency of N+3. Their large nodes use 10+2 RAID6, which is effectively N+5 protection. Both are 42% efficient at any cluster size if rebuild space isn’t considered. Since the system will remain resilient after a node failure, I would consider leaving out rebuild space acceptable for environments where failed nodes can be replaced in a small number of days.
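
For anyone who wants to check that arithmetic, here is a minimal Python sketch. The `Layout` class and its field names are hypothetical, made up for illustration, and it assumes one RAID set per node and counts only drive failures. It reads "N+3" as "survives any three drive failures", which is my interpretation of the figures above rather than anything from Simplivity's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical model: one local RAID set per node, replicated across nodes."""
    name: str
    copies: int         # node-to-node replicas of the data
    data_drives: int    # data drives in each node's RAID set
    parity_drives: int  # parity drives in each node's RAID set

    def efficiency(self) -> float:
        # Usable fraction of raw capacity, ignoring any rebuild space.
        raid = self.data_drives / (self.data_drives + self.parity_drives)
        return raid / self.copies

    def drive_failures_survived(self) -> int:
        # One copy is lost only after parity_drives + 1 failures in its set,
        # and data is lost only when every copy is lost. The worst case that
        # still survives is therefore one failure short of that total.
        return self.copies * (self.parity_drives + 1) - 1


small = Layout("2 nodes, 5+1 RAID5", copies=2, data_drives=5, parity_drives=1)
large = Layout("2 nodes, 10+2 RAID6", copies=2, data_drives=10, parity_drives=2)

for layout in (small, large):
    print(f"{layout.name}: {layout.efficiency():.0%} usable, "
          f"survives any {layout.drive_failures_survived()} drive failures")
```

Both layouts come out at 5/6 ÷ 2, or about 42%, because 5+1 and 10+2 have the same data-to-parity ratio. The wider RAID6 set buys the extra resiliency without giving up any capacity.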

While local RAID may sound like an old-fashioned solution, Simplivity’s 42% efficiency is better than the 25-30% efficiency of 3-way replication, and by most measures it delivers a higher level of resiliency.
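
One way to arrive at a 25-30% figure for 3-way replication is to start from the raw one-third and hold back part of the capacity for rebuilds. The sketch below assumes that reading; the `usable_fraction` helper and the 10% and 25% reserve values are illustrative assumptions, not numbers published by any vendor.

```python
def usable_fraction(copies: int, rebuild_reserve: float) -> float:
    """Usable share of raw capacity for N-way replication.

    rebuild_reserve is the fraction of raw capacity kept free so the
    cluster can re-create lost replicas (an assumed, illustrative knob).
    """
    return (1.0 - rebuild_reserve) / copies


# Reserve values chosen only to bracket the 25-30% range quoted above.
for reserve in (0.0, 0.10, 0.25):
    print(f"3-way replication, {reserve:.0%} reserve: "
          f"{usable_fraction(3, reserve):.0%} usable")
```

With no reserve 3-way replication is 33% efficient, and it drops to 30% and 25% as the reserve grows, which is still well short of 42%.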



Do you like this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class on my friend Ivan’s ipSpace.net site.

The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, will go live Tuesday, January 22, and the final installment follows on February 5th. Sign up here