CS 3530 Session 18

Session 18

A Set Membership Problem

CS 3530
Design and Analysis of Algorithms

Challenge... Accepted!

Today's challenge won't wear us out, but it will warm up your algorithm design muscles for our exam. Let's call it My Own Private Pandora.

You and your friends decide to open an on-line music store. Users will come to your site, request information about a song, and then choose to buy it or not. In the beginning, your licensing arrangements are limited, which means that a user will occasionally request a song you can't sell them.

You are going to out-source your song database to another vendor, and each query against the database costs you. You would like to minimize the number of database queries by asking for song information only when you know you have licensed it.

Work with someone else, if you like, to...

Describe a solution that assumes unlimited local resources (time, space, ...).

Describe a solution that uses limited resources.

What concession are you willing to make if you can't eliminate all of the unsuccessful queries?

Implementing Membership Tests

We need a way to determine if a song is in a set of songs. Your entire course in data structures has prepared you for problems of this sort.

We could maintain a flat file of song titles, or even a local database of song titles, and search that data structure. But the performance of these solutions is undesirable, especially on the web where we need to maintain service performance even at peak usage times.

We could use a tree to improve look-up times, at the cost of the space overhead for the tree.

We could keep a hash table of song names. That would use O(n) space and give O(1) time performance for look-ups. But if the number of keys becomes too large, or if the size of keys is large, a hash table in memory becomes untenable.
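As a point of comparison, here is a minimal sketch of the hash-table approach in Python, whose built-in `set` is backed by a hash table. The name `should_query` and the two-song catalog are assumptions made for illustration only.

```python
# Hypothetical catalog of licensed titles, held entirely in memory.
licensed = {"Only the Good Die Young", "I Need a Lover"}

def should_query(title):
    """Return True only when the title is known to be licensed."""
    # Expected O(1) time, but the set stores every full title: O(n) space.
    return title in licensed

print(should_query("I Need a Lover"))
print(should_query("Afternoon Delight"))
```

The answer is always exact, which is the benefit. The cost is that every key lives in memory in full.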

What if we are willing to trade an occasional unsuccessful database search for size and speed in nearly all cases?

This question brings us to a new idea: a probabilistic data structure.

Implementing Membership Tests Using a Bloom Filter

A Bloom filter is a data structure that supports membership tests using relatively little time and space. It consists of two components:

a bit vector of length m and
a set of k hash functions that return values between 0 and m-1.

In order for a Bloom filter to work well, we need good hash functions, ones that:

spread the keys evenly throughout their range, and
are different enough from one another that they don't produce the same values for the same keys very often.

Initially, the bit vector is zeroed out. For example, if m = 11, our vector begins its life as:

00000000000
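The two components might be sketched in Python as follows. The list `bits` plays the role of the bit vector. `make_hashes` is a hypothetical helper that derives k hash functions by salting SHA-256 with an index. That is one convenient assumption, not the only way to get k hash functions, and it will not reproduce the hash values used in the examples that follow.

```python
import hashlib

def make_hashes(k, m):
    """Build k hash functions, each mapping a string to 0..m-1."""
    def make(i):
        def h(key):
            # Salting with the index i makes each function behave differently.
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            return int.from_bytes(digest, "big") % m
        return h
    return [make(i) for i in range(k)]

m, k = 11, 3
bits = [0] * m              # the zeroed-out bit vector
hashes = make_hashes(k, m)  # the k hash functions

print("".join(str(b) for b in bits))
print([h("Only the Good Die Young") for h in hashes])
```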

Inserting Keys

The algorithm for inserting a key into a Bloom filter is straightforward:

Apply each hash function to the key.
Set the bit corresponding to each hash function's value to 1.

For example, suppose that we would like to add "Only the Good Die Young" to the collection. We run the hash functions on the title, getting, say,

    hash1( "Only the Good Die Young" ) = 3
    hash2( "Only the Good Die Young" ) = 10
    hash3( "Only the Good Die Young" ) = 8

So we set the corresponding bits to 1:

00010000101

Suppose that next we would like to add John Cougar's "I Need a Lover" to the collection. We do the same thing: run the hash functions...

    hash1( "I Need a Lover" ) = 5
    hash2( "I Need a Lover" ) = 8
    hash3( "I Need a Lover" ) = 1

... and set the corresponding bits:

01010100101

Notice that bit 8 (counting positions from 0 on the left) had already been set. It stays on. In effect, this bit now plays two roles: It tells us that the set contains "Only the Good Die Young", and it tells us that the set contains "I Need a Lover". As we add more items to the set, many bits will have to record information about two or more items.
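The two insertions above can be replayed in a few lines of Python. This is only a sketch: the table `EXAMPLE_HASHES` hard-codes the hash values from the example in place of real hash functions, and the names `insert` and `show` are made up for illustration.

```python
M = 11

# Hash values copied from the worked example; a real filter would compute them.
EXAMPLE_HASHES = {
    "Only the Good Die Young": (3, 10, 8),
    "I Need a Lover": (5, 8, 1),
}

def insert(bits, key):
    """Set the bit at each position the hash functions return for key."""
    for position in EXAMPLE_HASHES[key]:
        bits[position] = 1  # a bit that is already 1 simply stays 1

def show(bits):
    return "".join(str(b) for b in bits)

bits = [0] * M
insert(bits, "Only the Good Die Young")
after_first = show(bits)
insert(bits, "I Need a Lover")
after_second = show(bits)

print(after_first)   # 00010000101
print(after_second)  # 01010100101
```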

Deleting Keys, Deleted

The fact that Bit 8 corresponds to two songs exposes one of the prices associated with using Bloom filters. If a single bit can record information about two songs, we cannot remove an item from the collection. Consider: If we just set the bits corresponding to the deleted value back to 0, we might reset one of the bits that is storing information about another song, too. And we have no way of knowing that without doing an exhaustive search over the entire collection.

Keep this characteristic in mind. It turns out to be a feature in many applications of Bloom filters, not a bug!

In any case, we may well be willing to pay this price. Many problems do not require deletions, at least not frequent deletions. And if we do need to delete items occasionally, we can save a bunch of deletions to process in batch... and then build a new Bloom filter that contains only the items still in the set.
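A small Python sketch shows both the hazard and the batch rebuild. It reuses the hash values from the insertion example as a hard-coded table. The names `build` and `might_contain` are hypothetical.

```python
M = 11

# Hash values copied from the insertion example.
EXAMPLE_HASHES = {
    "Only the Good Die Young": (3, 10, 8),
    "I Need a Lover": (5, 8, 1),
}

def build(keys):
    """Build a fresh bit vector containing exactly the given keys."""
    bits = [0] * M
    for key in keys:
        for position in EXAMPLE_HASHES[key]:
            bits[position] = 1
    return bits

def might_contain(bits, key):
    return all(bits[p] == 1 for p in EXAMPLE_HASHES[key])

bits = build(["Only the Good Die Young", "I Need a Lover"])

# Naive deletion: clear the bits for "I Need a Lover" (5, 8, and 1).
for position in EXAMPLE_HASHES["I Need a Lover"]:
    bits[position] = 0

# Bit 8 was shared, so the other song now looks absent: a false negative.
broken = might_contain(bits, "Only the Good Die Young")

# Batch approach: rebuild the filter from the keys that remain.
rebuilt = build(["Only the Good Die Young"])

print(broken)
print(might_contain(rebuilt, "Only the Good Die Young"))
```

Rebuilding requires keeping the list of surviving keys somewhere outside the filter, since the filter itself cannot enumerate its contents.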

Looking Up Items

The algorithm for looking up a key is similar:

Apply each hash function to the key.
If every bit corresponding to a hash function's value is 1, then say that the set contains the key. If any of those bits is 0, then the set does not contain it.

So, to see if the set contains "Only the Good Die Young", we run the hash functions ...

    hash1( "Only the Good Die Young" ) = 3
    hash2( "Only the Good Die Young" ) = 10
    hash3( "Only the Good Die Young" ) = 8

... and check to see whether bits 3, 8, and 10 of the filter are set to 1. They are, so we conclude that the set contains this title.

If any bit is 0, we know that the item has never been inserted into the collection, so it is not a member. Suppose we look up "Oops!...I Did It Again". We run the hashes...

    hash1( "Oops!...I Did It Again" ) = 2
    hash2( "Oops!...I Did It Again" ) = 3
    hash3( "Oops!...I Did It Again" ) = 4

... and check the bit string. The presence of a 0 bit in position 2 of our bit vector (counting from 0) tells us that this song has never been added to the collection. The 0 in position 4 does, too. If the song had been added to the set, then both of those bits would be 1s.
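Both look-ups can be checked against the bit string with a short Python sketch. The hash values are passed in directly rather than computed, and the function name `might_contain` is an assumption for illustration.

```python
# The filter after inserting both songs, position 0 on the left.
bits = [int(c) for c in "01010100101"]

def might_contain(bits, positions):
    """True when every bit named by the hash values is 1."""
    return all(bits[p] == 1 for p in positions)

good_die_young = might_contain(bits, (3, 10, 8))  # all three bits are set
oops = might_contain(bits, (2, 3, 4))             # bits 2 and 4 are 0

# Which positions rule out "Oops!...I Did It Again"?
zero_positions = [p for p in (2, 3, 4) if bits[p] == 0]

print(good_die_young, oops, zero_positions)
```

A single 0 is enough to answer "no", so an implementation can stop at the first unset bit, which is what `all` does here.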

In our music store scenario, we would like to be able to proceed with an off-line database look-up, secure in the knowledge that the query will succeed, and serve up song information to the user.

Unfortunately, we cannot be sure. The look-up algorithm can only really say "the filter may contain the song" when all of the song's bits are set to 1 -- because we don't know for sure that the set contains the key.

How so? Suppose that we do a look up for the key "Afternoon Delight" and the hash functions return:

    hash1( "Afternoon Delight" ) = 3
    hash2( "Afternoon Delight" ) = 1
    hash3( "Afternoon Delight" ) = 5

We check our bit vector, find that these three bits are set, and conclude that the set contains "Afternoon Delight". But it doesn't. We never added it to the set. Those bits were set by my Billy Joel and John Cougar songs.

This is an example of a false positive. We will not know that our collection doesn't contain this song until the database query fails.
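Here is how the three outcomes might play out in the music store, sketched in Python. Everything here is a stand-in: `EXAMPLE_HASHES` hard-codes the hash values from the examples, `LICENSED` plays the role of the vendor's database, and `remote_lookup` and `handle_request` are hypothetical names.

```python
M = 11

# Hash values copied from the examples; a real filter would compute them.
EXAMPLE_HASHES = {
    "Only the Good Die Young": (3, 10, 8),
    "I Need a Lover": (5, 8, 1),
    "Afternoon Delight": (3, 1, 5),
    "Oops!...I Did It Again": (2, 3, 4),
}

# Stand-in for the vendor's database: the songs we actually licensed.
LICENSED = {"Only the Good Die Young", "I Need a Lover"}

bits = [0] * M
for title in LICENSED:
    for position in EXAMPLE_HASHES[title]:
        bits[position] = 1

queries = []  # records every paid query we make

def remote_lookup(title):
    """Hypothetical paid query against the out-sourced database."""
    queries.append(title)
    return title in LICENSED

def handle_request(title):
    if not all(bits[p] == 1 for p in EXAMPLE_HASHES[title]):
        return "not licensed, no query made"
    if remote_lookup(title):
        return "found"
    return "false positive, query wasted"

for title in ("Only the Good Die Young",
              "Oops!...I Did It Again",
              "Afternoon Delight"):
    print(title, "->", handle_request(title))
print("queries made:", len(queries))
```

Only the false positive costs us a query that returns nothing. A "no" from the filter never costs a query at all.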

If we design our Bloom filter well, we should be able to say more than "maybe". We'd like to be able to say at least that the set probably contains the key, for some degree of "probable". But we can't be sure.

Bit overlap is what makes a Bloom filter so efficient in its use of space. The primary cost of this solution is the risk of false positives, those cases in which the filter says that the set contains an item but it does not. This may seem like a foolish price to pay in order to save space, but...

we can select the level of this risk when we create our Bloom filters, and
we can save a lot of space.

So, we can allow as few or as many false positives as we like, trading this risk for more or less space.

Constructing a Bloom Filter

Suppose that we are willing to settle for a false positive rate of 1%. That is, we are willing for about one out of every one hundred look-ups of songs we have not licensed to pass the filter and trigger an off-line database query that comes up empty. How do we design a Bloom filter to accomplish this? What size and time benefit do we receive?

These are questions we'll explore next time. But we can already see some of the key issues:

We don't want too many bits to be set to 1. Otherwise, nearly every look-up will signal membership. Too many bits end up recording information about multiple songs.

As a result, we probably want the number of hash functions, k, to be much smaller than the size of the bit string, m. Otherwise, we risk setting too many bits to 1 for any given song.
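To get a feel for how quickly the vector fills up, here is a small experiment sketched in Python. The sizes (1000 bits, 100 keys), the made-up keys "song-0", "song-1", ..., and the SHA-256-based hash functions are all assumptions chosen for illustration. They are not a recommendation for a real design.

```python
import hashlib

def positions(key, k, m):
    """k bit positions for key, from SHA-256 salted with an index."""
    return [
        int.from_bytes(
            hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest(), "big"
        ) % m
        for i in range(k)
    ]

def fill_ratio(n, k, m):
    """Insert n made-up keys and return the fraction of bits set to 1."""
    bits = [0] * m
    for j in range(n):
        for p in positions(f"song-{j}", k, m):
            bits[p] = 1
    return sum(bits) / m

sparse = fill_ratio(n=100, k=3, m=1000)    # few hash functions
crowded = fill_ratio(n=100, k=50, m=1000)  # many hash functions

print(f"k=3:  {sparse:.2f} of the bits are set")
print(f"k=50: {crowded:.2f} of the bits are set")
```

With k = 50, nearly every bit ends up set, so nearly every look-up would signal membership.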

What else do you think we need to consider?

Wrap Up

Reading -- Here are a few pages to refresh your memory about hash tables and the functions that make them work:
- the entry on hash tables at Wikipedia
- a short summary on hash tables and the key terms involved
- a short tutorial on hash functions
You'll have to look up the songs yourself!

Homework -- Homework 4 was due today.

Exam 2 -- Exam 2 was this session.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... March 13, 2014