Session 19

Bloom Filters and the Set Membership Problem

CS 3530
Design and Analysis of Algorithms

A Unique Puzzle

You have a long list of values -- maybe even 10^12 of them. You'd like to know if all the values are unique. You would also like to know the answer before your laptop overheats, so you need an efficient way of finding it.

Propose as many ways to solve your problem as you can think of.

How efficient is each of your ideas? Can you achieve O(n)?

Testing for Uniqueness

What are our options?

These have short implementations in most languages. The first is a three-line doubly-nested loop plus a return statement. sort my_list | uniq -c sorts and squashes a list in the Unix shell, and len(my_list) == len(set(my_list)) implements the set insertion idea in Python. You can implement the insertion idea in Python in two lines and achieve short-cut behavior as well!
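As a sketch of that short-cut behavior, the set-insertion idea can be written as a small loop that bails out at the first repeat (the function name all_unique is my own):

```python
def all_unique(values):
    """Return True iff no value repeats, stopping at the first repeat."""
    seen = set()
    for v in values:
        if v in seen:      # short-cut: stop as soon as a duplicate appears
            return False
        seen.add(v)
    return True
```

Unlike len(my_list) == len(set(my_list)), this version does not have to scan the whole list when a duplicate occurs early.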

How efficient computationally is each of these?

Can we achieve O(1) look-ups -- and thus O(n) overall? Maybe an idea from our previous session would be of use...

A Hashing Good Solution

Clearly, the brute-force solution -- a nested for loop that compares each item to every item after it in the list -- is a no-go: O(n²). As some of you found out on Homework 3, O(n²) is slow for a million items or more.

But what option do we have? We could use a data structure to help us. If we insert each item in the list into a binary search tree, we will recognize a repeat as soon as we insert it into the tree. In the average case, each insertion takes only O(log n), so the whole process is O(n log n). Unfortunately, the worst case is still O(n²).

But the idea of using a suitable data structure as a helper is a good one. For today, you read a little refresher on hashing. Hash tables provide a wonderful O(1) look-up time in their best and average cases, by distributing values evenly over an array. They do so by use of a well-chosen hashing function that maps values into the range [1..n], where n is the size of the table.

So, let's try this, which uses a hash table with buckets:

    input : array A[1..n] of values
    output: boolean -- all values are unique, or not

    table ← empty hash table with n buckets
    for i ← 1 to n
        let b ← the bucket for A[i]
        for each item in b
            if item = A[i]
               then return false
        insert A[i] into b
    return true
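As a sanity check, here is a short Python rendering of the bucketed-table idea. Python's built-in hash() stands in for the well-chosen hash function, and the function name is mine:

```python
def unique_with_buckets(A):
    """Hash each value into one of n buckets; a repeat shows up as a
    collision with an *equal* item already in the target bucket."""
    n = len(A)
    if n == 0:
        return True
    buckets = [[] for _ in range(n)]
    for v in A:
        bucket = buckets[hash(v) % n]
        if v in bucket:        # equal item already here: not unique
            return False
        bucket.append(v)       # collision with unequal items is harmless
    return True
```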

What is the run-time complexity of this algorithm in the average case? The worst case? Can we avoid the worst case?

The average case is Θ(n).
The worst case is still O(n²).
Would use of multiple hash tables work? What about space?

Think some about the trade-off between algorithms (behavior) and data structures (state)... In many cases, we can simply use a data structure to solve a problem. The price we pay is the space it requires. Other times, we can use the idea at the kernel of a data structure to create an algorithm that is more efficient.

Implementing Membership Tests Using a Bloom Filter

Last time, which seems so long ago, we considered an interesting question: How efficient an algorithm can we develop for solving the set membership problem if we are willing to accept occasional false positives? We learned about one candidate, a Bloom filter, that supports membership tests using relatively little time and space.

A Bloom filter consists of two components: a bit vector of length m and a set of k hash functions that return values between 1 and m. This technique exacts two kinds of cost:

  1. It admits false positives.
  2. It does not support deletions.

Recall the basic idea: to insert a key, we apply each of the k hash functions to it and set the corresponding k bits in the vector to 1. To test a key for membership, we hash it the same way; if all k of its bits are 1, then the key may be in the set.

Why only may? Some bits may record information about multiple values, so we can't be certain that we've inserted this key before. This means that answering yes may signal a false positive. The same problem affects deletions. Setting any bit to 0 could affect more than one item's membership in the set.
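A minimal sketch in Python may make the two operations concrete. Deriving the k hash functions from salted SHA-256 digests is my own assumption; any family of independent hash functions would do:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit vector and k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m              # the bit vector

    def _positions(self, key):
        # k hash values in [0, m), derived from salted SHA-256 digests
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def may_contain(self, key):
        # "no" is always correct; "yes" may be a false positive
        return all(self.bits[p] for p in self._positions(key))
```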

Unfortunately, bit overlap is also what makes a Bloom filter so efficient in space and time. What are we to do?

One way to work around the deletion problem is to save all deletions to be done later. Then, at some suitable point, we build a new filter off-line containing only the items still in the set. This simulates the deletions in a batch process. It may be expensive, but after encountering enough deletions it may be worth it, and the off-line process can perhaps shield users from any time delay.
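The batch rebuild can be sketched in a few lines. The salted-hash scheme and parameter names here are my assumptions, not part of the original notes:

```python
import hashlib

def rebuild_filter(live_keys, m, k):
    """Build a fresh m-bit vector containing only the keys still in
    the set, simulating all pending deletions at once."""
    bits = [0] * m
    for key in live_keys:
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            bits[int(digest, 16) % m] = 1
    return bits
```

An off-line process can run this over the set's surviving members and then swap the new vector in for the old one.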

The primary cost of this solution is the risk of false positives. Last time, we also heard promise of a way to minimize this problem by choosing our level of risk. That is, we can put an upper bound on the false positive rate at the time we construct the filter. This is good news indeed. Today, we'll see this technique.

In any case, we may well be willing to pay the prices associated with a Bloom filter. Many situations do not require deletions, at least not frequently. And the cost of a few false positives is often dwarfed by the extraordinary gain we receive in the space and time needed to process the set.

Now let's look at how we can minimize the cost of false positives...

Making Bloom Filters for Efficient Membership Testing

Suppose that we are willing to settle for a false positive rate of, say, 1%. That is, we are willing for 1 out of 100 database queries to come up empty. How do we design a Bloom filter to accomplish this, and how much benefit do we receive?

We stumbled across a couple of ideas last time:

The false positive rate depends primarily on the relationship between the length of the bit vector and the number of keys stored. These both affect the density of the vector, which is the percentage of bits set to 1. The less dense the vector, the less likely that a key not in the vector will hash to slots that are all 1. We also suspected that we should keep the number of hash functions small, since each hash function sets another bit per key.

The nugget above about the number of hash functions, k, is only partly right. It's certainly true that we don't want to use too many hash functions, because that leads to a denser vector and thus more false positives. But we also don't want k to be too small, because then we won't be able to differentiate very well among the keys, which increases the chance of collisions -- which also increases the number of false positives! What is the right value for k?

The beauty of Bloom filters is that, with a little arithmetic, we can calculate the relationship among k, m, and the false positive rate, which we will call c.

After inserting n keys, the chance that a given bit is still 0 is:

    (1 - 1/m)^(kn)

So the probability of a false positive is:

    c = [1 - (1 - 1/m)^(kn)]^k ≈ (1 - e^(-kn/m))^k
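We can check that the approximation tracks the exact form numerically. A small sketch, with illustrative parameter values of my own choosing:

```python
import math

def fp_exact(n, m, k):
    """Exact false positive rate: c = (1 - (1 - 1/m)^(kn))^k."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k

def fp_approx(n, m, k):
    """Approximate rate: c ≈ (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k
```

For n = 1000, m = 9600, and k = 7, both forms give roughly 0.01.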

In practice, you know the number of keys you want to store and the false positive rate you're willing to accept. What you'd like to do is choose k and m in a way that minimizes m. So, we can solve the false-positive rate expression above for m, which gives:

    m = -kn / ln(1 - c^(1/k))

This formula still has k as a free variable. Calculus to the rescue! With a little more arithmetic, we see that this expression is minimized at k = ln 2 * m/n. You could also find this solution experimentally with a little program that considers all reasonable combinations of k and m for your input parameters.
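Such an experimental search might look like this sketch (the function name and the search bound on k are mine):

```python
import math

def smallest_filter(n, c):
    """Brute-force search: for each candidate k, solve the
    false-positive formula m = -kn / ln(1 - c^(1/k)) for m,
    and keep the (m, k) pair with the smallest m."""
    best_m, best_k = None, None
    for k in range(1, 50):
        m = math.ceil(-k * n / math.log(1 - c ** (1 / k)))
        if best_m is None or m < best_m:
            best_m, best_k = m, k
    return best_m, best_k
```

For n = 1000 and c = 0.01, this search settles on k = 7 hash functions, which agrees with k = ln 2 * m/n.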

The results are remarkable. For 1000 keys and a false positive rate of 1%, you need a vector of only ~9,600 bits. For a 0.1% rate, you need ~14,400 bits. For 0.01%, ~19,200 bits -- roughly 9.6, 14.4, and 19.2 bits per key.

How about a 0.01% false positive rate over 1 million keys? The vector need only contain ~19.2 million bits. That may sound like a lot, but it's really not so much... about 2.4 megabytes. And this is true no matter how big the keys themselves are!

Hashing can produce remarkably efficient algorithms. Bloom filters are an example of how to use hashing as leverage against a real problem.

Bloom Filters in the WWW Era

Why are Bloom filters popular now? With the advent of the World Wide Web came an explosion of on-line social networks -- communities of people who build up friendships with one another, or who refer to one another's work. These networks are large, and they create new issues of privacy and security.

Suppose that you would like to contact the friends of one of your friends, whom you think might like some information. Most approaches to on-line sharing used to require that each person "publish" her list of contacts, either on a community server (say, Facebook) or on the Internet (say, FOAF). This compromises the privacy of the publisher's friends, as well as that of the publisher!

Suppose, though, that two people exchange Bloom filters containing their acquaintances. Folks can now share information about their friend networks with others without having to tell the world who their friends are. You can use a Bloom filter to check to see if one of your friends is in the set of friends of another person, but you can't extract the list of values stored in the set. And, given that the Bloom filter will return false positives, there will always be a reasonable doubt about whether the person really is in the set.

We can even use the false positive rate as a feature. If I am especially paranoid, I can generate several Bloom filters, using different sets of hash functions, each with a high false positive rate. I then give a different filter to each of my acquaintances. This allows them to query my set of friends without my publishing the list. These acquaintances will know only that there is some chance that another person is a friend of mine.

As I grow to trust a person, I can give him more filters, all of which have that high false positive rate. As that person collects more filters, he will be able to make queries against them all and drive the false positive rate he sees down.

For example, suppose that my recently-acquired friend collects 4 of my filters, each of which has a false positive rate of 50%. His combined false positive rate will be only (1/2)^4, or about 6%. If I give him another filter, I can drive his false positive rate down to approximately 3% -- without sharing more information with anyone else.

If a thief intercepts one of these filters in transit, then he will see a 50% false positive rate. This means I can control my privacy risk by spreading information over multiple interactions. My friends will have a high degree of confidence in their uses of multiple filters, but an interloper who illicitly collects one or two filters will learn very little.
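The arithmetic behind this trust scheme fits in a one-line sketch (the function name is mine):

```python
def combined_fp_rate(per_filter_rate, num_filters):
    """A non-member slips through only if *every* independent
    filter reports a (false) positive."""
    return per_filter_rate ** num_filters
```

combined_fp_rate(0.5, 4) gives the ~6% in the example above, and each additional filter halves the rate again.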

Here are a couple of other uses of Bloom filters in this spirit:

... is this just the beginning?

A Few References

My primary source for our discussion of Bloom filters was a fine article by Maciej Ceglowski, Using Bloom Filters. I have a local mirror of this paper on the course web site.

Burton Bloom published the original article on this idea, Space-Time Trade-offs in Hash Coding with Allowable Errors, back in 1970. By 2000, there were many new applications of Bloom filters!

Here is a good paper on applications to networks, not all of which are about social networks. I have a local mirror of this paper as well.

Wrap Up

Eugene Wallingford ..... ..... March 13, 2014