Count Min Sketch to solve Top K or Frequent K Heavy Hitters System Design problems
System design problems are more focused on addressing scalability, availability and performance for large amounts of data. To address these issues one needs to carefully decide on data structures and algorithms to use while having constraints on memory and CPU. In heavy hitters systems like most frequent keywords on Google search engine or most watched videos on Youtube or most played songs on Spotify or most shared posts on Facebook or most retweeted tweet on Twitter requires a lot of data processing. These are closer to stream processing problems. One might come with a solution using mapreduce and bigdata but that could be overkill to get simple functionality of finding most frequently used data. Let us address these issues one by one and see how we can use Count Min Sketch in this system design.
Hashmap based approach
The main intuition for this problem is to record the frequency of each identifier of the video and sort in decreasing order. We can use Hashmap to keep a record of the frequencies and as a new stream comes we calculate frequencies and sort it. This approach will take NlogN time and N space.
Heap based approach
We can use heap if we are required to find the top K most watched videos. For every stream we calculate frequencies of the identifier and compare it with K videos we stored in the max heap. Here since we are comparing on K identifiers time required is NlogK and K space. As K is always going to be smaller than N this approach is faster than the hashmap approach. This is because we only need to maintain data of K identifiers.
Issue with both approaches
In the first approach, we require all identifiers to calculate the most frequently watched videos. We have put all the data in memory to do the calculation. Because the data we will get is in the form stream there will be frequent queries for the data from the database. That means we will have higher I/O calls. As the number of records increases, we are soon going to run out of memory. So these won't scale well as data increases. Therefore we need to split the data into batches and calculate the frequencies on multiple servers using MapReduce and distributed systems.
In the map-reduce technique, we split the data stream into k parts. And we process each part on a single server. On each server, we calculate the frequency of identifiers. Then collect results from all servers and add it together. Though it solves the scalability it raises new issues. For every data partitioning, we have to deal with data replication. If we add new servers to the cluster we need to rebalance the load so data is equally shared. Also in distributed systems when we partition data across we lose consistency. That results in a sacrifice of accuracy. We can try another data structure which requires fixed size memory. i.e count-min sketch.
|Refer to CAP theorem
Count-min Sketch is a probabilistic data structure which uses fixed memory. It functions very similar to bloom filters.
|Refer to Bloom Filter
Count-min Sketch uses a 2-dimensional array. When a new element comes we calculate multiple hash functions. Then increment by 1 on all such positions returned by hash functions. There will be collisions while calculating the hash function resulting in the overestimation of frequencies. While retrieving data we take the minimum of all values at positions by hash functions. As this data structure is probabilistic, the values returned are approximate. Accuracy and probability can be adjusted by fixing the height and width 2-D array. The greater the height and width the lesser the collision and the higher the accuracy.
We replace hashmap with the count-min sketch to keep a record of all the frequencies but keep a heap for top K elements. This way we optimized the space and time complexity to get frequencies of identifiers