
What is data durability
It’s an unfortunate fact of life that things often go wrong. Despite this, we’re all quite creative when it comes to lowering risks that we deem too high to stomach.

For example, I get incredibly nervous about losing my car keys, so I keep a spare at a friend’s place should something happen to mine. This idea of spreading multiple copies of the same thing across different physical locations is something we all probably do subconsciously, be it money in different bank accounts, backups of our most important files, or something as mundane as a second set of keys.

But the fundamental objective is the same under all scenarios – we are trying to make sure we don’t lose something.

In the world of data, this is called durability.


Data durability is a measure of how resilient your data is against loss.

We would all ideally want to live in a world where we can pull out our most precious photos at any moment and have a guarantee that they will still open, with the underlying data identical to what it was when we saved it all that time ago. Unfortunately, without additional consideration this is far from the case. Anyone who has had a hard drive fail, or lost a USB thumb drive, will know first hand how readily data can be lost. For clarity, I am using the term “lost” to mean the data is no longer there for you to use (it’s vanished), and the means by which it disappeared is not relevant. Lost, corrupted, eaten by the dog, whatever the scenario might be.

For a lot of us, the most common form of data loss is due to either human error or physical failure of the storage medium. Accidentally exposing the storage medium to water is a typical, relatable example. Others might be accidentally dropping the storage medium or simply misplacing it.

But let’s zoom in a bit closer to the storage medium itself. When data is stored on hard drives, the strings of zeros and ones are written to a rotating magnetic platter spinning around seven thousand times a minute. It’s easy to conceive that over time, the disks themselves will start to wear down, or sectors within the disk will no longer be able to receive new writes. Flash-based media (think SSDs, USB drives, or the fancy new NVMe/M.2 drives) have no moving parts, but are still prone to their own failure modes.

Then there’s the possibility of corruption. Even if we do everything correctly, there’s still a chance that a mechanical or electrical issue will cause data to be incorrectly written to the disk. Solid-state drives are particularly prone to this during a sudden power outage, and it’s the reason most enterprise SSDs come with built-in capacitors (little storage batteries) to ensure that any in-flight writes make it to disk, no matter what. Recall that data, such as those precious wedding photos, is really a long sequence of zeros and ones written to disk. Unfortunately, as disks degrade, there’s the possibility that some of these bits will “flip”. Colloquially, this is called bit rot.

But is one bit out of ten thousand changing really a big deal?
Let’s see. Here’s a nice picture of a little kid. From the computer’s point of view, this is just a string of zeros and ones, 326,000 digits long:

[Image: the original photo]

Let’s flip just one bit and see what it looks like:

[Image: the same photo after a single bit flip]

Yikes, that’s not ideal. So does one bit matter? Yes.

Truth be told, it also matters which bit is flipped, so really it’s a game of chance, but as you can see, bit rot in the wrong spot can make a big difference.
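To make the idea of a flipped bit concrete, here is a minimal Python sketch. The helper names (flip_bit, count_differing_bits) and the stand-in bytes are made up for illustration; a real photo would simply be a much longer byte string. It also shows one common way such damage is noticed, which is comparing a checksum (SHA-256 here, chosen purely as an example) taken before and after.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << offset  # XOR with a one-bit mask flips that bit
    return bytes(damaged)


def count_differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Stand-in for a real image file; any bytes will do for the illustration.
original = b"pretend these bytes are a photo"
damaged = flip_bit(original, 42)

print(count_differing_bits(original, damaged))  # 1
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(damaged).hexdigest())  # False
```

A checksum like this only tells you that something changed. It cannot tell you which bit, and it cannot repair it.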


So what is done to stop this from happening?

You’re probably wondering how all those online services are storing your information so perfectly. After all, my display picture has never changed to look like that poor kid’s photo above, so it’s natural to think they must be employing techniques to guard against these types of issues. In our newly learnt parlance, they are increasing the durability of your data.

There are really two ways to do this. The first is taking our example of the car keys and simply translating that into the virtual world: storing multiple copies of your file across different storage media. The natural consequence is that if one fails, you can hopefully recover the file from one of the remaining locations. This approach works well, except the price paid for durability is an increase in the storage used. So if I wanted to make my 1 GB photo library more durable, I might copy it to three different USB drives. Of course, I now need three of them, not one.

This idea is formally called redundancy. Storing the same thing in multiple locations lowers the chance of losing the data, as you’d have to lose it in all locations simultaneously.
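A rough way to see why redundancy helps is to multiply the odds. The sketch below assumes each copy fails independently and with the same probability, which is an idealisation (three USB drives in the same drawer share the same house fire). The 5% figure is a made-up number for illustration, not a measured failure rate.

```python
def chance_of_losing_every_copy(p_single: float, copies: int) -> float:
    """Probability that all copies are lost, assuming independent failures."""
    return p_single ** copies


# Hypothetical: each drive has a 5% chance of failing in a given year.
for copies in (1, 2, 3):
    print(copies, chance_of_losing_every_copy(0.05, copies))
```

Under these assumptions, going from one copy to three turns a 1-in-20 chance of loss into 1-in-8,000, at the cost of three times the storage.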

But that duplication cost can really add up. Luckily, it turns out that there is an alternative means of increasing the durability of your data without copying it like-for-like and peppering it across the world. At first blush it shouldn’t be possible, but thanks to some clever mathematics, it really does work. Enter error-correction coding.

An error-correcting code is a way of deriving a very short sequence of zeros and ones from a very large piece of data. For example, we could take the photo of the child we used above, all 326,000 zeros and ones of it, push it through an error-correcting code and out pop 6 new bytes of data. We then store the original image along with the 6 new bytes.

Now let’s say one of the zeros in the original image has changed to a one. The chances of us finding it at random are, well, practically zero, given we’re talking about more than 300,000 of them. But we can simply put the corrupt image through an algorithm, along with our sequence of six bytes from earlier, and not only detect which of the bits has flipped but also correct it.

That’s right: by storing a mere 6 additional bytes we can detect and correct any single bit flip among the 326,000 (even more impressive, we can also tell if those 6 bytes themselves have been corrupted).
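For a feel of how this can work, here is a sketch of Hamming(7,4), one of the simplest classic error-correcting codes. It is not necessarily the scheme behind the numbers above, and the function names are made up for illustration. It protects 4 data bits with 3 check bits, and the check bits are arranged so that recomputing them on a damaged block spells out the position of the flipped bit.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (repaired codeword, position of the flipped bit or 0 if none)."""
    syndrome = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions holding a 1
    repaired = list(codeword)
    if syndrome:
        repaired[syndrome - 1] ^= 1  # the syndrome is the damaged position
    return repaired, syndrome


def hamming74_data(codeword):
    """Pull the 4 data bits back out of a codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


stored = hamming74_encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1  # flip the bit at position 5

repaired, flipped_position = hamming74_correct(damaged)
print(flipped_position)          # 5
print(hamming74_data(repaired))  # [1, 0, 1, 1]
```

This toy version spends 3 check bits on every 4 data bits, but the overhead shrinks quickly as blocks grow: a Hamming code with r check bits can cover up to 2^r - r - 1 data bits, so 19 check bits are enough to locate a single flipped bit in a block of 326,000 bits. Note that it only handles one flipped bit per block; real storage systems use stronger codes.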

Unfortunately the actual mechanism by which this happens is quite complex, and relies on some fairly abstract branches of mathematics. But the result is truly astounding: you can store data across four disks and be able to lose any individual disk entirely. You might think the only way to guarantee this is to write the same thing on every disk, leaving you just 1/4 of the space. But by using error-correcting codes, you can still lose any one of the disks at random while using 3/4 of the space.
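The simplest scheme with this four-disk property is plain XOR parity, in the style of RAID 5. That is my assumption about the kind of scheme being described here, and the helper names and sample data below are made up for illustration. Three disks hold data and the fourth holds the XOR of the other three. If any one disk disappears, XOR-ing the three survivors gives its contents back.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def rebuild(disks, lost_index: int) -> bytes:
    """Recreate the contents of one lost disk from the three survivors."""
    survivors = [d for i, d in enumerate(disks) if i != lost_index]
    return xor_blocks(*survivors)


# Three data disks (equal-sized blocks) plus one parity disk.
data = [b"wedding!", b"holiday!", b"birthday"]
disks = data + [xor_blocks(*data)]

# Whichever single disk we lose, the other three are enough to rebuild it.
for lost in range(4):
    assert rebuild(disks, lost) == disks[lost]
```

This works because we know which disk is missing. Finding and fixing silent corruption, where nothing tells you which bits went wrong, needs the heavier machinery of error-correcting codes.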

Let’s talk cloud

So when you store data in the cloud, how durable is it? Well, that depends on where and how it’s stored. Amazon Web Services says its S3 service is designed for (wait for it) 99.999999999% durability per year. That is a one in 100 billion chance of losing any given file in a given year, so if you gave AWS 1 million files to store, statistically you would expect to lose one file every 100,000 years.
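The arithmetic behind a claim like this is short enough to check yourself. The sketch below assumes the durability figure is an annual, per-file probability and that losses are independent; the function name is made up for illustration. It uses Decimal rather than floats so that eleven nines do not get mangled by rounding.

```python
from decimal import Decimal


def expected_years_between_losses(annual_durability_percent: str, files: int) -> Decimal:
    """Average number of years you would wait to lose one file."""
    annual_loss_rate = 1 - Decimal(annual_durability_percent) / 100
    expected_losses_per_year = annual_loss_rate * files
    return 1 / expected_losses_per_year


years = expected_years_between_losses("99.999999999", 1_000_000)
print(int(years))
```

With eleven nines and one million files this comes to 100,000 years. Ten times as many files would mean ten times as many expected losses, so one every 10,000 years.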

You are about 411 times more likely to get hit by a meteor. So yeah, it’s pretty amazing.

But not everything can run on S3. Operating systems and servers cannot run on S3 for various technical reasons (they rely on block storage, which is fundamentally different from the object storage S3 provides). Instead, if we look at Amazon’s offering that servers can run on, called EBS, it has (only) up to 99.999% durability, depending on the volume type. It’s lower, but the point to note is that no one is saying 100%. It’s a physical impossibility.

So is there anything to worry about? I mean, if my data has such high durability, everything is fine, right?

Unfortunately not. While your data might be safe on, say, Amazon Web Services, they can still have an outage that could last a long time. Your data is still there, but it’s not available.

Or, you could delete your files only to realize you want them back a month later. Of course, by this time all the replicas and error-correcting codes have also been deleted.

Or, you could get hit by ransomware that encrypts all your data. It’s still there, and it’s still highly durable, but it’s a garbled mess.

So durability is important, but it’s not the only player in the story. If you’re anything like me, hopefully next time you see a cute cat picture on Facebook, you can smile knowing that however random that picture is, it has the full force of mathematics and engineering behind it, bit by bit (literally) ensuring it’s safe from harm, and ready whenever you want it. Just like those extra car keys at my friend’s place.
