Understanding hash map by building a dummy one in python!

If like me, you have used dictionaries for years without understanding how they work. This is made for you!

What is a hash map and why we invented it?

A Hash Map is a data structure with an average insert and search time of a key in O(1). It is often used to build a dictionary, by associating arbitrary data to keys, allowing to retrieve corresponding data in O(1). To understand how a Hash Map achieves that let visualize and formalize the problem.

In order to have O(1) access time we need to use a vector which support random access in O(1). Random access mean that we can access any value by its position without having to iterate thought the structure. We can reinitialize a vector with empty data with a length equals to the total number of keys. Keys are mapped to the vector indices (positive integer) and allow insert and search of a key in O(1). The problem is that the total number of possible key is often several order of magnitude larger than the actual number of keys we are going to use. Leading to excessive memory usage. For example, if the key are drawn from any positive 32 bits integer (u32) that represent 4 294 967 295 possible keys.

A solution for that would be to initialize a vector of reasonable size and map indices to this vector. We are going to use the modulo operator. It is time to start coding ! At last. Let’s begin by coding a constructor and an insert method. We will solve problems as they come. If you are an observer you may ask where the hash function? We are going to talk about it very soon I promise

One problem stand out:
1- What if two key have the same position? (that is call a collision)
right now if two keys have the same position (modulo) the last one will erase the previous one.

There are two things we can do about collision:
1- minimize collision frequencies. (spoiler that where hash are useful)
2- manage it.

A well design “hash” function will help to minimize collision because they give a “random” output from an input such: the same input always give the same output. Similar input (but not equal) will give different output. The randomness allows minimizing collision.

Our Code become :

In order to manage collision we are going to use a container such a list at each position of our storage. When we insert a value we will have to test if the key exist. So we also have to store the keys. If the pair (keys, value) already exists we do nothing else if the key exists we override the value else we append the pair (key, value). Because looking if a value exist in a list may take up to O(n) (number of element in the list). In real life you will find double linked list or binary tree or others data structures may be used instead of list. But this is a Dummy Hash Map. While at it we also made a function that given a key return the position to look for.

A point I did not talk about is growth. At some point you may need to extend the storage size of the hash map to minimize collision and keep good performance. We won’t do it here. You can implement a naive strategy that consist of creating a new hash map twice as large and populate it with the old one. Here we are! We will finish by implementing the basics functionality of a hash map such as get, exist, delete.

I hope This helped you to understand how hash map works. The implementation is still very simplistic and unoptimized. This is for educational purpose only. Don’t hesitate to give me feedback! This was too long or too short ? Not enough details? too much prerequisite etc…

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store