Understanding hash map by building a dummy one in python!

lannes romain
4 min readFeb 16, 2021

If like me, you have used dictionaries for years without understanding how they work. This is made for you!

What is a hash map and why we invented it?

A Hash Map is a data structure with an average insert and search time of a key in O(1). It is often used to build a dictionary, by associating arbitrary data to keys, allowing to retrieve corresponding data in O(1). To understand how a Hash Map achieves that let visualize and formalize the problem.

In order to have O(1) access time we need to use a vector which support random access in O(1). Random access mean that we can access any value by its position without having to iterate thought the structure. We can reinitialize a vector with empty data with a length equals to the total number of keys. Keys are mapped to the vector indices (positive integer) and allow insert and search of a key in O(1). The problem is that the total number of possible key is often several order of magnitude larger than the actual number of keys we are going to use. Leading to excessive memory usage. For example, if the key are drawn from any positive 32 bits integer (u32) that represent 4 294 967 295 possible keys.

A solution for that would be to initialize a vector of reasonable size and map indices to this vector. We are going to use the modulo operator. It is time to start coding ! At last. Let’s begin by coding a constructor and an insert method. We will solve problems as they come. If you are an observer you may ask where the hash function? We are going to talk about it very soon I promise

class DummyHashMap():
def __init__(self, storage_size=100):
self.size = storage_size
self.storage = [None for i in range(storage_size)]

def insert(self, key, value):
"""
insert key to self, and associate value
:param: key: hashable value
:return: None
"""
position = key % self.size # modulo operation
self.storage[position] = value

One problem stand out:
1- What if two key have the same position? (that is call a collision)
right now if two keys have the same position (modulo) the last one will erase the previous one.

There are two things we can do about collision:
1- minimize collision frequencies. (spoiler that where hash are useful)
2- manage it.

A well design “hash” function will help to minimize collision because they give a “random” output from an input such: the same input always give the same output. Similar input (but not equal) will give different output. The randomness allows minimizing collision.

Our Code become :

class DummyHashMap():
def __init__(self, storage_size=100, hash_func=hash):
# by default we use the default python hash function
self.size = storage_size
self.storage = [None for i in range(storage_size)]
self.hash = hash_func

def insert(self, key, value):
"""
insert key to self, and associate value
:param: key: hashable value
:return: None
"""
hash_value = self.hash(key)
position = hash_value % self.size # modulo operation
self.storage[position] = value

In order to manage collision we are going to use a container such a list at each position of our storage. When we insert a value we will have to test if the key exist. So we also have to store the keys. If the pair (keys, value) already exists we do nothing else if the key exists we override the value else we append the pair (key, value). Because looking if a value exist in a list may take up to O(n) (number of element in the list). In real life you will find double linked list or binary tree or others data structures may be used instead of list. But this is a Dummy Hash Map. While at it we also made a function that given a key return the position to look for.

class DummyHashMap():    def __init__(self, storage_size=100, hash_func=hash):
# by default we use the default python hash function
self.size = storage_size
self.storage = [[]for i in range(storage_size)]
self.hash = hash_func
def position(self, key):
"""
return the position of key in self storage
:param: key: hashable value
:return: int
"""
hash_value = self.hash(key)
position = hash_value % self.size # modulo operation
return position
def insert(self, key, value):
"""
insert key to self, and associate value
:param: key: hashable value
:return: None
"""
position = self.position(key)
for e in self.storage[position]: if e[0] == key:
e[1] = value # We use the fact that lists are mutable
return
self.storage[position].append([key, value])

A point I did not talk about is growth. At some point you may need to extend the storage size of the hash map to minimize collision and keep good performance. We won’t do it here. You can implement a naive strategy that consist of creating a new hash map twice as large and populate it with the old one. Here we are! We will finish by implementing the basics functionality of a hash map such as get, exist, delete.

def get(self, key):
"""
If key in dict return the associated value else return an Error
:param: key: hashable value
:return: any | Error
"""
position = self.position(key)

for
e in self.storage[position]:
if e[0] == key:
return e[1]
raise(ValueError)def exist(self, key):
"""
If key in dict return TRUE else return FALSE
:param: key: hashable value
:return: bool
"""
position = self.position(key)

for
e in self.storage[position]:
if e[0] == key:
return True
return Falsedef delete(self, key):
"""
If key in dict delete the pair (key value)
:param: key: hashable value
:return: None
"""
position = self.position(key)
value = self.get(key) # we should take care of the error here
self
.storage[position].remove([key, value])

I hope This helped you to understand how hash map works. The implementation is still very simplistic and unoptimized. This is for educational purpose only. Don’t hesitate to give me feedback! This was too long or too short ? Not enough details? too much prerequisite etc…

--

--