QuangPhamCS100W: Count-Min Sketch Data Structure

Data structure is the core of any software program. Without an appropriate data structure, any algorithm might suffer in accessing stored data and optimizing the running time. For example, the following tasks that need to build for a best timing in a data set:

To estimates frequencies of particular elements
To finds top-K most frequent elements
To performs a range of queries
To make a percentiles estimation.

As we can see, they are all related to a frequency or a range of queries in a data set. So, one of the structures sets up to get the best optimization for the tasks above is Count-Min Sketch, a method estimates frequency-related properties of a data set.

What is Count-Min Sketch Data Structure?
Count-Min Sketch is one of members in the family of memory efficient data structures to optimize the counting of the rate of an element in its lifetime.

Problem statement:
We have a set of duplicated values, and the issues is to estimate the frequency for each value. The estimation for relatively rare values can be imprecise, however, frequent values and their absolute frequencies should be determined accurately.

The basic idea of Count-Min Sketch:
As similar to Linear Counting, Count-Min sketch is designed as:

A 2-D array (d x w) of integer counters.
When a value is set, it is mapped to one position at each of d rows using d difference and an independent hash code.
Counters on each position will increase for the next arriving values.
Algorithm estimates the frequency of given value as a minimum of the corresponding counters in each row.

Below is an animation for Count-Min sketch data structure:

Below is a practical implementation of Count-Min sketch:

class CountMinSketch {

long estimators[][] = new long[d][w] // d and w are design parameters

long a[] = new long[d]

long b[] = new long[d]

long p // hashing parameter, a prime number. For example 2^31-1

void initialize() {

for(i = 0; i < d; i++) {

a[i] = random(p) // random in range 1..p

b[i] = random(p)

}

void add(value) {

for(i = 0; i < d; i++)

estimators[i][ hash(value, i) ]++

}

long estimate(value) {

long minimum = MAX_VALUE

for(i = 0; i < d; i++)

minimum = min(

minimum,

estimators[i][ hash(value, i) ]

)

return minimum

}

hash(value, i) {

return ((a[i] * value + b[i]) mod p) mod w

}

Accuracy

Accuracy of the Count-Min sketch depends on the ratio between the sketch size and the total number inserting values (not duplicated values). This means that Count-Min technique provides significant memory gains only for skew values, for example, the values have very different probabilities.

It is clear that Count-Min sketch can't track frequencies of 5900 elements using only 152 counters, in the case of low skewed values with high frequencies, so the histogram will be very inaccurate.

In general, the applicability of Count-Min sketches is not a straightforward question that can be recommended in the real life of experimental evaluation, but it can be use for some particular cases. However, the theory of Count-Min sketch is a base for accuracy on skewed data and measurements on a real data set.

References:
1. Count-Min Sketch
http://en.wikipedia.org/wiki/Count–min_sketch

2. Streaming Algorithms and Sketches
http://blog.aggregateknowledge.com/tag/count-min-sketch/

3. Count-min sketch & its applications
https://sites.google.com/site/countminsketch/

4 comments:

AnonymousNovember 15, 2013 at 8:54 AM
Wow, you did an awesome job with this post. It was great that you included an animation, a detailed explanation, source code, and all the rest. I thought it was a good idea that you focused on one data structure so that the reader can understand it in depth. I also like how you cited your sources, as some people don’t even do that. It gives the reader the opportunity to learn more about it, because blog posts are more of summaries than full information.
Joshua NgNovember 15, 2013 at 8:25 PM
Hi Quang. Your article on data structures is really great. You provide detailed explanations that aren't too long but detailed enough for unfamiliar users and you also provide great visuals that help clarify the definition of data structures. I like that you got into the Count-min sketch and you explained it very thoroughly. Great work.
Akshay KapoorNovember 16, 2013 at 12:59 PM
Hi Quang,
Thanks for making me learn a new data structure which I wouldn't have otherwise..Your explanations were very concise and it was a god thing to focus on just one data structure so that the reader learns something while reading the blog-post...I lke this way of writing and have also done it similarly in my bog-post..
One suggestion for you would be to check on your grammar...you have missed a few words here and there..rest was all good...
Keep up the good work!
UnknownNovember 17, 2013 at 3:23 PM
Hi Quang!

I liked that you picked a specific data structure to talk about for this topic. It helps us (your readers) learn more about Count Min. You did an excellent job describing it, and providing the gif and implementation really helped me a lot. I've seen this algorithm before , but never really focused too much on it. I can see count-min be useful for many other algorithms though, so it's good to know that I can always refer to this post if I ever needed a brief explanation of Count-Min.

Sunday, November 3, 2013

Count-Min Sketch Data Structure

4 comments: