How Do Apps Store Your Password Securely?

Ever wondered how web apps store your password securely? It’s quite an interesting topic, and the answer may surprise you. Let’s take a deep dive look at the mechanics that protect some of your most sensitive data.

First, how not to do it

Firstly, imagine your favourite website that you often login to. You might think that there’s a database somewhere that contains your username and your password, and when you attempt to login, the credentials you input are compared to those that are stored, and if they match, voila, you’re in.

From a purely pragmatic approach, this would work. Except it has one fundamental flaw – if anyone gets a hold of the database, all the passwords of all the users are compromised (1). You might think the solution is to prevent the database from being compromised in the first place, but this is actually far harder said than done. Further, it doesn’t obey the principle of defence in depth. The real solution is to not store the password in the first place.

But if we don’t store it, how do we know if what the user enters is correct? Welcome to the world of the cryptographic hash.

What is a cryptographic hash?

A cryptographic hash function takes some input string, such as a password, and outputs a corresponding sequence of bits, called a bitstring. They are designed in such a way that pushing through a password and getting a bitstring is easy but doing the reverse (getting a password from a bitstring) is computationally infeasible. And when I say infeasible, I mean impossible with current computing power and mathematical research. It’s the same as mixing coffee in water – easy to mix but near impossible to undo.

 

Visual example of a cryptographic hash called MD5

This property is central to the idea of a cryptographic hash function – it’s easy to compute in the forward direction but nigh on impossible to do it backwards. When a function is truly impossible to invert, it’s called a one-way function. We currently don’t know if true one-way functions exist – it’s an open mathematical problem – but we do know that current hash algorithms are pretty close; it would take millions of years to try and reverse current hashes.

How are hashes used

Let’s consider our scenario about storing passwords once again. Rather than storing the password, we could push the password through a cryptographic hash function and store the bitstring instead. When a user attempts to login, we simply compute the hash of the password they entered with the hash we have stored, and if they match, they’re the same (2).

This is quite a profound outcome: we can verify that a password is correct without actually knowing the password. Pretty neat, huh?

The benefit of storing the hash instead of the password directly is that if the database gets leaked, only the hashes are exposed, and recall, it’s not feasible to reverse a hash to obtain the original password.

So, are we done? Is that the problem solved? Unfortunately, not.

Time for some rainbows

The problem with our approach so far is that our hash function is deterministic. This means that when we push a password through it, we get the same output bitstring each time. This is, of course, a necessary feature in our construction so far because it’s this feature that allows us to compare to hashes and obtain certainty that their underlying inputs – the passwords – are the same.

But there is a subtle flaw here, one that can be exploited by a malicious actor. Let’s find the top ten million passwords used from around the world, and in our own time, push them through the hash function. We now have a list of each password and its corresponding hash. If we find a leaked database, we simply have to check each hash and see if it’s in our list. If it is, we can determine the password, because we generated it! This is called a rainbow attack. All we’ve done is:

  • Found the top ten million passwords (easy to do).
  • Calculate the hash in our own time (called an offline attack) – again, this is easy.
  • If we have a hash from the real world, look it up in our list. If we have it, we know the password.

Note that a rainbow attack doesn’t rely on reversing a hash, as that’s still impossible. Instead, it relies on having a massive dictionary of known password/hash pairs and simply looking up your hash in that list.

These “rainbow lists” are so common, and there’s free tools online to do it, so we’re not out of the woods yet. In fact, this tool here broke the example hash above in less than 2 seconds (I was using a weak hash algorithm called md5):

This free web tool reversed MD5 in seconds

We need a salt

We need to somehow break the deterministic nature of the hash, but still know with certainty that if two hashes are equal, then their inputs are most likely equal. There’s a clever way to do this, and we use a little cryptographic primitive called a salt.

A salt is nothing more than a random string of characters that we append onto the end of the password before we push it through the hash function. For example, instead of pushing my password “hello” through a hash to obtain “12a4c”, we would pick a salt value of “123” and then push the string “hello123” through the hash function. We then store the hash and the salt together in the database, so that in order to compute the comparison hash, you need to have the password and the salt (note that you would have a unique salt per user). Note that the password is still not being stored – just the hash.

The salt doesn’t need to be secret – its only role is to prevent a rainbow attack. It does this because it means an attacker cannot arrive pre-prepared with a dictionary of hashes. Instead, they would have to re-compute the dictionary for every single password, for every single salt. We’re making the burden on them significantly higher.

 

Example of adding a salt to a password before hashing it

This is looking pretty neat. So far, we have the following defences in place:

  • Obtaining the hash itself is useless, because it’s irreversible.
  • Having a dictionary of pre-computed passwords is useless, because we’ve got a salt

Unfortunately, we’re still not in the clear. Can you spot the problem?

More power, Scotty!

While the salt definitely makes brute force attacks harder, they’re still not impossible, and adversaries with enough will power and money are still able to break such systems. Indeed, a current GPU is capable of calculating 1-10 billion hashes per second. Scale that to thousands of computers and the chance of brute force working is pretty high.

Further, simply sticking a salt on the end of the password as described above also has some subtle problems. The specifics are rather technical, but the internal state of some hash functions can be exposed depending on how the salt is appended to the password. For example, the now abandoned MD5 hash is prone to a length extension attack by exploring exactly this idea. Yes, believe it or not, it matters how you mix the salt in with your password.

So how should you do it? Use a key derivation function. This takes an input password, a salt, and usually a computational complexity factor and outputs a value that is safe to hash. So the key derivation function is used as an input to the hash function.

The key derivation function blends the salt into the password in such a way that leaks no information about either the salt or the password. Further, they are designed to be slow. Recall that if a hash can be calculated too quickly, simple brute force attacks (trying every option) become feasible. To prevent this, key derivation functions involve internal loops and iterations to make sure they’re slow. Note that we’re talking about a few milliseconds here, so it won’t be noticeable when you’re logging in. But when you’re trying to compute hundreds of millions of them, this really adds up and makes the attack, once again, computationally infeasible.

But even more sneakily, they often shuffle memory around in such a way that makes them hard to parallelize on a GPU. That is, they’re deliberately constructed to both be slow and inefficient for modern GPU and CPUs.

Finally, any biases in the password and the salt can be apparent in the hash. Key derivation functions are designed to specifically remove these biases by mixing all the inputs around. For example, in English the letter “a” is used more than “z”. As such, adversaries can make assumptions around this and are able to exploit this asymmetry to break the password.

So key derivation functions do two main things:

  • Safely blend in a salt with a password, removing any biases and not leaking any information.
  • Increase the computation required so that the hashes are slower to compute.

Thank you, standards

This got complex quickly, and as you can imagine, there’s a lot of change for a developer to mis-implement a system like this and risk introducing a bug or security flaw. As a result of this, the development community has wrapped up the above into standard implementations for anyone to use. There are now functions that are available that include the ability to generate a salt, use an internal key derivation function and an internal hash all in one. One example is called bcrypt. There are others, but the main thing is the same: they make the above as simple as possible to implement.

We have to be safe now, right?

If you’re using salts, key derivation functions and hashes (directly or indirectly) then you’re doing things pretty well. Unfortunately, nothing will stop someone from finding a flaw in the underlying mathematics. Recall that it’s unknown if true one-way functions exist, and the hashes in use now are our best attempt at trying to create them. That means it’s a waiting game and one day someone might find a way to reverse them. This is exactly what happened with MD5.

In 1996, a flaw was identified in a part of the hash function called the compression function, which indicated that MD5’s life was nearing an end. At the time, it wasn’t feasible to exploit it, but quoting the author who discovered it:

The presented attack does not yet threaten practical applications of MD5, but it comes rather close … in the future MD5 should no longer be implemented … where a collision-resistant hash function is required.”

Hans Dobbertin

In 2005, enhancements in computing power and a deeper understanding of mathematics allowed researchers to deliberately craft two strings with the same hash. In 2012, the malware community took advantage of it and were able to forge digital signatures from Microsoft. The hash has long since been abandoned and is no longer used by the community because its cryptographic safety has been long-since undermined.

Summary

The good news is that the global community of cryptographic experts and computer scientists are constantly trying to break hashes. If a researcher even manages to bump the requirements down even a little – even if they can’t truly break it yet – the hash is abandoned globally. In some sense, the safety sticker is torn off and the world moves on to a new hash. Experience has taught us that as soon as one problem is found, the hash will likely fall within the next decade and we need to stop using it immediately. So where do the next hashes come from? The NIST, a government body, runs competitions periodically where researchers from around the world submit them and they go through rigorous trials over a period of years. Eventually a hash is crowned the winner and the world can gain some confidence in its safety given all the due diligence done on it up to that point. For example, the winner in 2012 has the fun name of KECCAK.

So finally, are we safe? For now, yes. But as with most things, we have to stay vigilant. Ask me in a decade, and there’ll almost surely be new things that would need to be added to an article like this.

Footnotes

  • (1) Unfortunately, the problem is even worse. If adversaries are able to exfiltrate data through an SQL injection, they don’t even need direct access to the database server itself.
  • (2) Interestingly – and quite an unknown fact – there’s more than one password that will get you in. That’s because the space of all possible hashes is less than the space of all possible inputs.
  • (3) Assuming MD5 or the SHA-1 family.
Back to Blog