Bloom Filters For System Design
Bloom filter (BF), a space- and time-efficient probabilistic data structure introduced by Burton Bloom in 1970. Bloom filters are used in writing complex applications which have very high traffic hits. Examples as the number of views on youtube videos, the number of active users on facebook, the number of devices users connected from.
What are bloom filters?
If we have to calculate the number of unique views for a particular video, we have to compare userId of a person every time to check if a person has previously watched the video. For this, you need to store all user Ids and then compare if that userId already exists to calculate unique views. This is inefficient in terms of space as well as time because we have to store n number of ids and have to do N number of comparisons. Bloom filter can solve this problem efficiently.
There are some properties of string that don’t require storing the entire string to compare to other strings and the way you do is using hashing. How hashing work is we pass a string through the hash function get the value and set the value in a bit vector. A simple hash function you can imagine is assigning each alphabet a value based on its position. A as 0, B as 1 … Z as 25. So to calculate the hash function we will add the values of each character in the string. Let's say we have 16 size bit vector. If the value of the string is higher than 15 we take the modulo of hash function value and set it in the bit vector. Some examples are given below. So whenever we get a string we calculate the hash function and see if the bit vector at that position is set. If it is set we can tell we have seen that string before. CAT produces a value of 5 so we set the bit at 5 to one. If we again see CAT, it is going to calculate value 5, and if we see the bit at 5 already set we know that previously we have seen CAT. In the case of DOG value is 7 we set the bit at 7 in vector. When we get the string GOD which is an anagram of DOG, the value is the same. We see that bit at 7 already set. This gives a false positive. This is called Hash Collision.
To fix this consider another hash function. Let's multiple the character value with the index of that character in the string. For CAT we get 6, DOG we get 10, and GOD we get 4. This is better than the previous hash function. But we may again run into the same problems for longer strings. To prevent Hash Collision we need a complex Hashing Function.
Coming back to space and time complexity, we reduce space size to the bit vector and time to only compute hashing. This is good for solving simple problems. But at the scale of the application grows this is not enough. Imagine we got so many strings that all bits in bit vector set to one. This will give incorrect results for all queries. Bloom Filters solve this problem by using the same logic of hashing. In Bloom Filters, we compute k values from k hash functions for a string and set bit vectors for all the k values. Taking the example above CAT we set at 5,6, DOG at 7,10, and GOD at 7,4. So when we see DOG again we see that bit at 7,10 set at 1. For string to be matched again we need to get 1 across all the bit arrays. If we get a single 0 at any bit index we guarantee that string has never been seen.
Bloom Filter is susceptible to Hash collision but having more numbers of hash functions and large bit vectors can solve this problem. It always guarantees that string was never been seen. And false results in case of collision occur. Due to this nature, it is called a probabilistic data structure. If bit arrays are set 70-80% of their capacity Bloom Flitters can reset the arrays. It can also increase the size of bit arrays and recompute the hash functions. We can perform insertion by setting bit-vectors and deletion by unsetting to 0.
System that used Bloom Filters