What You Absolutely Must Know About Hash Functions

Ever wondered how, or why, your passwords are stored as hashes?

I believe a working knowledge of cryptographic hash functions is a must for everyone, even though they are primarily used by security practitioners. You probably hear the term often, or even use hash functions without realizing it.

I will be honest with you: these things come off as boring to everyday users. That is unfortunately why we read the term “hashing” but don’t dive into what’s happening behind the scenes.

That is why I will try to make this one interesting.

Let’s talk data. Data can be of any size. Open up your terminal and we will create hashes for several strings.

We will use the MD5 hashing algorithm; more on that later.

Fire up your terminal:

md5sum <<< "your_string"

Now let’s use one character:

md5sum <<< "s"

And now for my final trick, let’s take a huge string:

md5sum <<< "huge_string"

What we just did, and what to take away from it:

1. Fixed Size:

What we just did was create 128-bit hash values for several input strings of different sizes.

A cryptographic hash function, such as SHA-256 or MD5, takes binary data (typically bytes) as input and produces an output called the “hash” or “message digest”. For any particular hash function, the size of that output is the same no matter what the input is: 128 bits for MD5, 256 bits for SHA-256.
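The same fixed-size behaviour can be checked outside the terminal. Below is a minimal Python sketch using the standard `hashlib` module; the sample inputs and the helper name `digest_sizes` are made up for illustration. One caveat when comparing against the terminal: the `<<<` here-string appends a trailing newline, so `md5sum <<< "s"` hashes the two bytes `s\n`, not just `s`.

```python
import hashlib

# Illustrative inputs of very different sizes (chosen for this sketch).
inputs = [b"s", b"your_string", b"x" * 1_000_000]

def digest_sizes(data: bytes) -> tuple[int, int]:
    """Return the (MD5, SHA-256) digest lengths in bits for the given data."""
    md5_bits = len(hashlib.md5(data).digest()) * 8
    sha256_bits = len(hashlib.sha256(data).digest()) * 8
    return md5_bits, sha256_bits

for data in inputs:
    md5_bits, sha256_bits = digest_sizes(data)
    print(f"{len(data):>9} input bytes -> MD5: {md5_bits} bits, SHA-256: {sha256_bits} bits")
```

Whatever the input length, the MD5 column stays at 128 bits and the SHA-256 column at 256 bits.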

2. Uniqueness: All the hashes generated were unique.

Ideally, every distinct input produces a distinct hash. Because inputs can be of any size while the output is fixed, duplicates must exist in theory; a good hash function simply makes them infeasible to find.

An important point to note: the example uses MD5, which security practitioners no longer rely on because its collision resistance is badly broken.

3. Deterministic: this means that we should get the same hash with the same method and the same input.

Running MD5 on “same_test” twice produces the same hash both times.
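Here is a small Python sketch of determinism, again using `hashlib`. The helper `md5_hex` is a name invented for this example. It also shows the flip side: “same input” means the same bytes, so a trailing newline (which the shell’s `<<<` adds) counts as a different input.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of text (UTF-8 encoded) as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

first = md5_hex("same_test")
second = md5_hex("same_test")

# Same algorithm, same bytes: the digests are identical on every run.
print(first == second)

# A trailing newline changes the input bytes, so the digest changes too.
print(first == md5_hex("same_test\n"))
```

The first comparison holds on any machine and on any run; the second fails because the inputs are no longer byte-for-byte equal.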

4. Preimage Resistance: just a fancy term which means it should be computationally infeasible to work backward from the output hash to the input. This is why these functions are sometimes referred to as one-way hash functions.

5. Second preimage resistance: another fancy term, lads. Given one message, it should be infeasible to find a different message with the same hash value. (Finding any two different messages that share a hash is the closely related property called collision resistance.)

6. Avalanche effect: one small change to the input, one giant change in the hash generated.

A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value.

Let’s put this to the test:

Two inputs: “shreyash” and “shreyass”. Vastly different hashes are generated.

We can now generate MD5 hashes for some strings, but where do we use this? What’s the purpose?
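To put a number on “vastly different”, the sketch below counts how many of the 128 MD5 output bits differ between the two inputs. The helper name `md5_bits_changed` is made up for this example. For a well-behaved hash you would expect roughly half of the bits to flip, though the exact count depends on the inputs.

```python
import hashlib

def md5_bits_changed(a: str, b: str) -> int:
    """Count how many of the 128 MD5 output bits differ between two inputs."""
    da = int.from_bytes(hashlib.md5(a.encode("utf-8")).digest(), "big")
    db = int.from_bytes(hashlib.md5(b.encode("utf-8")).digest(), "big")
    # XOR leaves a 1 wherever the two digests disagree; count those 1s.
    return bin(da ^ db).count("1")

changed = md5_bits_changed("shreyash", "shreyass")
print(f"{changed} of 128 output bits differ")
```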

The two most important uses for Hash Functions are Password Storage and Data Integrity Check.

When using cryptographic hash functions, we always hope that the hashes generated are unique. Should two inputs yield the same output, the hash function is said to have a “collision.” In fact, MD5 has been deprecated because collisions are now easy to find: a collision attack exists that can find collisions within seconds on a computer with a 2.6 GHz Pentium 4 processor.[Link]

We can say that without uniqueness, the technology is rendered useless, at least for the purpose you generally have for it.
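To get a feel for what a collision looks like, here is a toy Python sketch. It is not the real MD5 attack: it deliberately weakens MD5 by keeping only the first 2 bytes (16 bits) of the digest, then hashes `0`, `1`, `2`, ... until two inputs share that truncated value. The function names and the truncation length are assumptions made for this illustration. With only 65,536 possible outputs, a collision is guaranteed within 65,537 inputs and typically shows up after a few hundred.

```python
import hashlib
from itertools import count

def truncated_md5(data: bytes, n_bytes: int = 2) -> bytes:
    """Return only the first n_bytes of the MD5 digest (a deliberately weak hash)."""
    return hashlib.md5(data).digest()[:n_bytes]

def find_collision(n_bytes: int = 2) -> tuple[bytes, bytes]:
    """Hash b"0", b"1", b"2", ... until two inputs share a truncated digest."""
    seen: dict[bytes, bytes] = {}  # truncated digest -> first input that produced it
    for i in count():
        msg = str(i).encode("ascii")
        h = truncated_md5(msg, n_bytes)
        if h in seen:
            return seen[h], msg
        seen[h] = msg

a, b = find_collision()
print(f"{a!r} and {b!r} both hash to {truncated_md5(a).hex()}")
```

The shorter the output, the cheaper the search, which is one reason real hash functions use long digests.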

Password Storage:

A database stores your ID and password as a key-value pair, which it needs in order to authenticate you. It should never store the password in clear text; at the very least, it stores the password’s hash.

Preimage resistance, which we discussed above, comes into full effect here.

If the database is ever exposed, the attacker will only see the hash of your password. They cannot log on using the hash, and they cannot feasibly work backward from the hash to the password, since the hash function is preimage resistant. (Weak passwords can still be guessed by hashing likely candidates and comparing, which is why a strong password still matters.)
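As a rough sketch of what storing a hash instead of the password can look like, the Python below uses `hashlib.pbkdf2_hmac` with a random per-user salt. Note the substitution: this is a salted, deliberately slow hash (PBKDF2), which is a choice made for this example rather than something prescribed above. The function names and the iteration count are assumptions, not recommendations.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch; tune for real systems

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) to store instead of the clear-text password."""
    salt = os.urandom(16)  # random salt, so equal passwords get different hashes
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key

def verify_password(password: str, salt: bytes, stored_key: bytes) -> bool:
    """Re-derive the key from the login attempt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_key)

salt, key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, key))
print(verify_password("wrong guess", salt, key))
```

The database keeps only `salt` and `key`. At login, the server repeats the computation on whatever the user typed and compares the results, so the clear-text password never needs to be stored.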

Data Integrity Check:

Another common application is data integrity check. This is to make sure that the piece of binary data that you were handed is the correct binary data that the sender intended you to have.

This data could be a text file, an Excel sheet, or a whole damn movie of several gigabytes. Did your mind just twitch a bit?

How in the world are you going to compare several gigabytes of binary data? It is slow and cumbersome, and you just wanna watch your movie.

Given two files several megabytes or gigabytes in size, you can produce hashes of them ahead of time and defer the comparisons to when you need them.

It is also easier to digitally sign hashes of data than the large sets of data themselves. This is such an important feature that one of the most common uses of hashes in cryptography is to generate digital signatures.

So you do not even need both copies of the file: one copy and the published hash are enough to check its integrity.

However, you have to be sure of the origin of both the file and the hash. In a MITM (man-in-the-middle) attack, for example, the attacker can modify the file, create a new hash for the modified file, and send you both.

Therefore you have to be sure of the origin of both.
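An integrity check along these lines can be sketched in Python as follows. It uses SHA-256 rather than MD5, reads the file in chunks so a multi-gigabyte movie never has to fit in memory, and compares the result with a hash you obtained from a source you trust. The function names and the chunk size are assumptions made for this example.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def file_matches(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one (case-insensitive hex)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `file_matches("movie.mkv", published_hash)` (both names hypothetical) returns `True` only if every byte of the file is what the publisher hashed. It says nothing about whether `published_hash` itself came from the right place.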