
All About Data Protection Part 6 ½ – Thinking About Parity and Read-Modify-Write


In Part 3 of this series, I described how single-parity RAID, levels 4 and 5 for those keeping track at home, worked. In that post, I talked about the I/O expansion and performance penalty created by the read-modify-write process that’s required for writes smaller than the size of a full RAID stripe.

In that post I wrote, “RAID4 performs one I/O per read, n I/Os (1 per drive) for a full stripe write and 2D-1 I/Os for a write smaller than a strip to accommodate a read, modify, write cycle.”

In this post, I want to examine that read-modify-write process and show a technique that could reduce those 2D-1 I/Os to just four I/Os, for any number of drives.

Let’s take the RAID4 set I used as an example in Part 2 ¾ – A Few Words on RAID. This set contains 4 data drives A-D and a parity drive P. Generically we’ll call this a 4D+P stripe set. A RAID5 set would rotate the parity strip from drive to drive, which may change which drives perform some of the I/Os, but our example, and the mathematical proof of the logic behind it, work out the same for RAID4 sets, RAID5 sets, and just about any distributed erasure coding scheme using single XOR parity.

Each data strip on our example array is 64KB, making the full stripe size 256KB.

The way most people (including me in Part 3) explain it, when an application performs an 8KB write to an LBA (Logical Block Address) that maps to drive C in our array, the system has to:

  • Store the 8KB new data into a memory buffer
  • Read the strip from drive C into a memory buffer (1 read I/O 64KB)
  • Modify the memory buffer to include the new data
  • Read the data from drives A, B, and D (D-1 = 3 read I/Os, 192KB)
  • Calculate the new parity data
  • Write the new data to drive C (1 write I/O 64KB)
  • Write the new parity data to drive P (1 write I/O 64KB)

That process performs six 64KB I/Os in total. The general case is D+2 I/Os, where D is the number of data strips or drives.
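
To make the I/O bookkeeping concrete, here’s a minimal Python sketch of that conventional approach. The strips dictionary, the drive names A-D and P, and the io_count tally are just illustrative stand-ins for real drives and a real controller, not anybody’s production code:

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Bitwise XOR of two equal-length buffers."""
        return bytes(x ^ y for x, y in zip(a, b))

    def full_stripe_rmw(strips, target, offset, new_data):
        """Update one data strip and parity by re-reading every data strip.

        Costs D reads (one per data drive) plus 2 writes = D + 2 I/Os.
        """
        io_count = 0

        # Read the target strip and splice in the new 8KB of data (1 read I/O).
        buf = bytearray(strips[target])
        io_count += 1
        buf[offset:offset + len(new_data)] = new_data
        new_strip = bytes(buf)

        # Read the other data strips (D - 1 read I/Os) and XOR them all
        # together with the modified strip to produce the new parity.
        parity = new_strip
        for drive in ('A', 'B', 'C', 'D'):
            if drive != target:
                parity = xor_blocks(parity, strips[drive])
                io_count += 1

        # Write the modified data strip and the new parity strip (2 write I/Os).
        strips[target] = new_strip
        strips['P'] = parity
        io_count += 2
        return io_count  # D + 2, i.e. 6 for our 4D+P set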

As I was rethinking all of this while writing this series, and discussing all the mathy parts with my favorite mathematician, Rachel Traylor, Ph.D., I started wondering whether all that I/O was really necessary. Why, I thought to myself, couldn’t I calculate the new parity data from the original data, the new data, and the original parity data?

After all, if I XORed the original data with the new data (C_old ⊕ C_new), the result would be the bitwise difference between the old data and the new. If I XORed that with the old parity data, I could calculate the new parity. The process then becomes:

  • Store the 8KB new data into a memory buffer
  • Read the current data from drive C into a memory buffer (1 read I/O 8KB)
  • Read the parity data from drive P (1 read I/O 8KB)
  • Calculate the new parity data P_new = (C_old ⊕ C_new) ⊕ P_old
  • Write the new data to drive C (1 write I/O 8KB)
  • Write the new parity data to drive P (1 write I/O 8KB)

This process requires only four I/Os (a read and a write each for the modified data strip and the parity strip), regardless of the number of drives in the stripe, as the sketch below illustrates.
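
Here’s the same toy model using the XOR-delta shortcut, reusing the xor_blocks helper from the sketch above. Again, the names are illustrative, and a real controller would transfer only the affected 8KB ranges over its back end rather than slicing Python byte strings:

    def splice(strip, offset, data):
        """Return `strip` with `data` overwriting the bytes at `offset`."""
        return strip[:offset] + data + strip[offset + len(data):]

    def delta_rmw(strips, target, offset, new_data):
        """Update one data strip and parity using P_new = (C_old XOR C_new) XOR P_old.

        Touches only the affected range of the data strip and the parity
        strip: 2 reads + 2 writes = 4 I/Os, regardless of stripe width.
        """
        lo, hi = offset, offset + len(new_data)

        old_data = strips[target][lo:hi]    # read the current data (1 read I/O, 8KB)
        old_parity = strips['P'][lo:hi]     # read the current parity (1 read I/O, 8KB)

        # New parity for the range: (old data XOR new data) XOR old parity.
        new_parity = xor_blocks(xor_blocks(old_data, new_data), old_parity)

        strips[target] = splice(strips[target], lo, new_data)   # 1 write I/O, 8KB
        strips['P'] = splice(strips['P'], lo, new_parity)       # 1 write I/O, 8KB
        return 4  # total I/Os, independent of the number of drives

    # Quick sanity check: after a 4-I/O update, P still equals A XOR B XOR C XOR D.
    import os
    strips = {d: os.urandom(64 * 1024) for d in 'ABCD'}
    strips['P'] = xor_blocks(xor_blocks(strips['A'], strips['B']),
                             xor_blocks(strips['C'], strips['D']))
    assert delta_rmw(strips, 'C', 8 * 1024, os.urandom(8 * 1024)) == 4
    assert strips['P'] == xor_blocks(xor_blocks(strips['A'], strips['B']),
                                     xor_blocks(strips['C'], strips['D']))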

It seemed too good to be true. For the next few hours, I was alternately sure that I was a genius one step down from Stephen Hawking or Sheldon Cooper, or an idiot on the scale of the guy who claims he has a 200 MPH carburetor and 10,000:1 lossless data reduction.

Have, I asked myself, the people who program RAID controllers and the Windows volume manager been mathematical illiterates? I know at least some of the systems I’ve used over the years got slower and slower at small random writes as the number of drives in the set grew, which would imply they’re reading all the data to recalculate parity.

So I wrote Dr. Rachel an email asking if XOR really was commutative enough for P_new = (C_old ⊕ C_new) ⊕ P_old. Math geek extraordinary and plenipotentiary that she is, Dr. Rachel not only confirmed, much to my surprise, that my math was right and only four I/Os are required regardless of the number of drives in the set, but also produced a formal proof of it. For those of you who, like Dr. Rachel, are enraptured by the beauty of pure mathematics, she’s posted the proof on her blog, The Math Citadel.
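
For those who just want the gist, the trick comes down to the fact that any value XORed with itself is zero, so the old data cancels itself out of the old parity. Sketching it for our 4D+P set (her proof covers the general case):

    P_old = A ⊕ B ⊕ C_old ⊕ D
    P_new = A ⊕ B ⊕ C_new ⊕ D
          = A ⊕ B ⊕ C_new ⊕ D ⊕ (C_old ⊕ C_old)      since x ⊕ x = 0
          = (A ⊕ B ⊕ C_old ⊕ D) ⊕ (C_old ⊕ C_new)    reordering; XOR is commutative and associative
          = P_old ⊕ (C_old ⊕ C_new)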

While waiting for her return message, I did a little Google searching and academic paper reading (see the sacrifices I make for you, dear reader) and discovered that I was far from the first to discover this little trick of mathematics. It is, in fact, so well known that no one is claiming it and calling it, say, the Marks-Traylor Method.

While I’ve discussed this in the context of traditional RAID5, the additional node-to-node latency of a scale-out or hyper-converged system using XOR parity as a distributed erasure code would make using this little trick even more important.

Even if I didn’t add to the world’s understanding of parity data protection, writing this series has gotten me rethinking storage down to basic principles. While it’s not the same as a Nobel prize, it is edifying to grok in fullness the tech so central to my work life.

Of course, every revelation just brings up new questions, so I have a few things to figure out:

    • Do the basic RAID products we use minimize I/O this way?
      • Run 4K random write benchmarks on 3-8 drive RAID5 sets on the SDS cluster server:
        • Windows 2003 volume manager
        • CentOS7 md RAID5
        • LSI 9270 RAID controller – All cache disabled
    • Apply XOR commutativity and I/O minimization to diagonal parity RAID6 (fixed and rotating) – does this really come out to 6 I/Os?

 

The Lab’s SDS cluster is busy with another project at the moment, but as soon as that’s done, I’ve got some testing to do.