Design and Analysis of Algorithms

You have a long list of values -- maybe even 10^{12}
of them. You'd like to know if all the values are unique.
You would also like to know the answer before the fan on your
laptop overheats, so you need an efficient way of finding it.

*Propose as many ways to solve your problem as you can
think of*.

How efficient is each of your ideas? Can you achieve
O(*n*)?

What are our options?

- For each item in the list, check to see if it occurs later in the list.
- Sort the list. Check to see if any item is the same as the item that follows it.
- Put the items in a set. Check to see if the list is the same size as the set.

These have short implementations in most languages. The first
is a three-line doubly-nested loop plus a return statement.
`sort my_list | uniq -c` sorts and squashes
a list in the Unix shell, and putting the items into a set is a
one-liner in most scripting languages.
How computationally efficient is each of these?

Can we achieve even O(1)? Maybe an idea from our previous session would be of use...

Clearly, the brute-force solution -- a nested `for`
loop that compares each item to every item after it in the
list -- is a no-go: O(*n*²). As some of you
found out on
Homework 3,
O(*n*²) is slow for a million items or more.

But what option do we have? We could use a data structure to
help us. If we insert each item in the list into a *binary
search tree*, we will recognize a repeat as soon as we
insert it into the tree. In the average case, each insertion
takes only O(log *n*), so the whole process is
O(*n* log *n*). Unfortunately, the worst case is
still O(*n*²).
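
A minimal sketch of the binary-search-tree idea, assuming the values are mutually comparable. The tree here is left unbalanced, which is exactly why the worst case degrades to O(*n*²); a balanced tree would guarantee O(*n* log *n*).

```python
# Detect a repeat at the moment of insertion into an (unbalanced) BST.

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Insert value into the tree; return (root, saw_duplicate)."""
    if root is None:
        return Node(value), False
    if value == root.value:
        return root, True                  # the repeat is recognized here
    if value < root.value:
        root.left, duplicate = insert(root.left, value)
    else:
        root.right, duplicate = insert(root.right, value)
    return root, duplicate

def all_unique_bst(values):
    root = None
    for v in values:
        root, duplicate = insert(root, v)
        if duplicate:
            return False
    return True
```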

But the idea of using a suitable data structure as a helper
is a good one. For today,
you read
a little refresher on **hashing**. Hash tables provide a
wonderful O(1) look-up time in their best and average cases,
by distributing values evenly over an array. They do so by use
of a well-chosen hashing function that maps values into the
range [1..*n*], where *n* is the size of the
table.

So, let's try this, which uses a hash table with buckets:

    input : array A[1..n] of values
    output: boolean -- all values are unique, or not

    table ← empty hash table of size n
    for i ← 1 to n
        if the bucket for A[i] is not empty     -- a possible collision
            for each item in the bucket
                if item = A[i]
                    return false
        insert A[i] into table
    return true
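
One way the pseudocode might look in Python, with the buckets made explicit; using Python's built-in `hash` as the hashing function is my own simplification.

```python
# Uniqueness check with an explicit bucketed hash table of size n.

def all_unique_hashed(values):
    n = len(values)
    table = [[] for _ in range(n)]        # n empty buckets
    for v in values:
        bucket = table[hash(v) % n]       # map the value to its bucket
        if v in bucket:                   # scan the bucket on a collision
            return False
        bucket.append(v)
    return True
```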

What is the run-time complexity of this algorithm in the average case? The worst case? Can we avoid the worst case?

The average case is Θ(n).

The worst case is still O(n²).

Would use of multiple hash tables work? What about space?

Think some about the trade-off between algorithms (behavior)
and data structures (state)... In many cases, we can simply
use a data structure to solve a problem. The price we pay
is the space it requires. Other times, we can use the
*idea* at the kernel of a data structure to create an
algorithm that is more efficient.

Last time,
which seems so long ago, we considered an interesting
question: How efficient an algorithm can we develop for
solving the set membership problem if we are willing to accept
occasional false positives? We learned about one candidate, a
**Bloom filter**, that supports membership tests using
relatively little time and space.

A Bloom filter consists of two components: a bit vector of
length *m* and a set of *k* hash functions that
return values between 1 and *m*. This technique exacts
two kinds of cost:

- It admits false positives.
- It does not support deletions.

Recall the basic idea (a code sketch follows the list):

- A Bloom filter begins with a bit vector of length *m*, with every bit set to 0.
- To insert a key into the filter, we apply each hash function to the key and set the corresponding bits in the vector to 1.
- To look up a key, we begin with the same first step as for an add, by applying each hash function to the key. If any of the corresponding bits in the vector are 0, then the filter doesn't contain the key. Otherwise, it may.
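
Here is a minimal Bloom filter sketch in Python. The class name and the trick of salting Python's built-in `hash` to simulate *k* hash functions are my own choices for illustration.

```python
# A minimal Bloom filter: a bit vector of length m plus k hash functions.

class BloomFilter:
    def __init__(self, m, k, salt=0):
        self.m, self.k, self.salt = m, k, salt
        self.bits = [0] * m                  # every bit starts at 0

    def _positions(self, key):
        # k "hash functions", simulated by salting Python's built-in hash().
        # hash() is randomized per process, so a filter built this way is
        # meaningful only within one run -- fine for a sketch.
        return [hash((self.salt, i, key)) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1                 # set each hashed bit to 1

    def might_contain(self, key):
        # Any 0 bit proves the key was never added; all 1s means only "maybe".
        return all(self.bits[p] for p in self._positions(key))
```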

Why only *may*? Some bits may record information about
multiple values, so we can't be *certain* that we've
inserted this key before. This means that answering yes may
signal a *false positive*. The same problem affects
deletions. Setting any bit to 0 could affect more than one
item's membership in the set.

Unfortunately, bit overlap is also what makes a Bloom filter so efficient in space and time. What are we to do?

One way to work around the deletion problem is to save all deletions to be done later. Then, at some suitable point, we build a new filter off-line containing only the items still in the set. This simulates the deletions in a batch process. It may be expensive, but after encountering enough deletions it may be worth it, and the off-line process can perhaps shield users from any time delay.
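
Assuming we keep the full list of keys somewhere off-line, the batch rebuild might look like this, reusing the `BloomFilter` sketch above:

```python
# Simulate deletions in batch: build a fresh filter from the surviving keys.

def rebuild_without(all_keys, pending_deletions, m, k):
    survivors = set(all_keys) - set(pending_deletions)
    fresh = BloomFilter(m, k)
    for key in survivors:
        fresh.add(key)
    return fresh      # swap this in for the old filter at a convenient time
```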

The primary cost of the Bloom filter itself is the risk of false
positives. Last time, we also heard promise of a way to
minimize this problem by choosing our level of risk. That
is, we can put an *upper bound* on the false positive
rate at the time we construct the filter. This is good news
indeed. Today, we'll see this technique.

In any case, we may well be willing to pay the prices associated with a Bloom filter. Many situations do not require deletions, at least not frequently. And the cost of a few false positives is often dwarfed by the extraordinary gain we receive in the space and time needed to process the set.

Now let's look at how we can minimize the cost of false positives...

Suppose that we are willing to settle for a false positive rate of, say, 1%. That is, we are willing for 1 out of 100 database queries to come up empty. How do we design a Bloom filter to accomplish this, and how much benefit do we receive?

We stumbled across a couple of ideas last time:

- We don't want too many bits to be set to 1, or nearly every look-up will signal positive. That would give a lot of false positives.
- As a result, we probably want *k* << *m*, or we risk setting too many bits to 1 for any given key.

The false positive rate depends primarily on the relationship between length of the bit vector and the number of keys stored. These both affect the density of the vector, which is the percentage of bits set to 1. The less dense the vector, the less likely that a key not in the vector will hash to slots that are all 1.

The nugget above about the number of hash functions, *k*,
is only partly right. It's certainly true that we don't want
to use too many hash functions, because that leads to a denser
vector and thus more false positives. But we also don't want
*k* to be **too** small, because then we won't be able
to differentiate very well among the keys, which increases the
chance of collisions -- which also increases the number of false
positives! What is the right value for *k*?

The beauty of Bloom filters is that, with a little arithmetic,
we can calculate the relationship among *k*, *m*,
and the false positive rate, which we will call *c*.

After inserting *n* keys, the chance that a given bit is
still 0 is:

(1 - 1/m)^{kn}

So the probability of a false positive is:

c = (1 - (1 - 1/m)^{kn})^{k} ≈ (1 - e^{-kn/m})^{k}

In practice, you know the **number of keys** you want to
store and the **false positive rate** you're willing to
accept. What you'd like to do is choose *k* and
*m* in a way that minimizes *m*. So, we can
solve the false-positive rate expression above for *m*,
which gives:

m = -kn / ln(1 - c^{1/k})

This formula still has *k* as a free variable. Calculus
to the rescue!
With a little more arithmetic, we see that this expression is
minimized at `k = ln 2 * m/n`. Substituting that value of *k*
back in gives m = n ln(1/c) / (ln 2)^{2}, which works out to
about 9.6 bits per key for a 1% false positive rate. You could also
find this solution experimentally with a little program that
considers all reasonable combinations of *k* and
*m* for your input parameters.
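
Here is a little program in that experimental spirit: a sketch that, for given *n* and *c*, tries all reasonable values of *k* and reports the pair (*m*, *k*) that minimizes *m*.

```python
import math

def bloom_parameters(n, c):
    """Smallest bit-vector length m, and the k that achieves it, for n keys
    and false positive rate c, using m = -kn / ln(1 - c^(1/k))."""
    best_m, best_k = None, None
    for k in range(1, 64):                  # all reasonable numbers of hashes
        m = -k * n / math.log(1.0 - c ** (1.0 / k))
        if best_m is None or m < best_m:
            best_m, best_k = m, k
    return math.ceil(best_m), best_k

print(bloom_parameters(1000, 0.01))      # roughly (9600, 7)
print(bloom_parameters(1000, 0.0001))    # roughly (19200, 13)
```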

The results are remarkable. For 1000 keys and a false positive rate of 1%, you need a vector of only ~9,600 bits. For a 0.1% rate, you need only ~14,400 bits. For 0.01%, only ~19,200 bits.

How about a 0.01% false positive rate over 1 *million*
keys? The vector need only contain ~19.2 million bits. That
may sound like a lot, but it's really not so much... about
2.4 megabytes. And this is true no matter how big the keys
themselves are!

Hashing can produce remarkably efficient algorithms. Bloom filters are an example of how to use hashing as leverage against a real problem.

Why are Bloom filters popular now? With the advent of the World Wide Web came an explosion of on-line social networks -- communities of people who build up friendships with one another, or who refer to one another's work. These networks are large, and they create new issues of privacy and security.

Suppose that you would like to contact the friends of one of your friends, whom you think might like some information. Most approaches to on-line sharing used to require that each person "publish" her list of contacts, either on a community server (say, Facebook) or on the Internet (say, FOAF). This compromises the privacy of the publisher's friends, as well as that of the publisher!

Suppose, though, that two people exchange Bloom filters
containing their acquaintances. Folks can now share information
about their friend networks with others *without having to
tell the world who their friends are*. You can use a
Bloom filter to check to see if one of your friends is in the
set of friends of another person, but you can't extract the
list of values stored in the set. And, given that the Bloom
filter will return false positives, there will always be a
reasonable doubt about whether the person really is in the
set.

We can even use the false positive rate as a feature. If I am especially paranoid, I can generate several Bloom filters, using different sets of hash functions, each with a high false positive rate. I then give a different filter to each of my acquaintances. This allows them to query my set of friends without my having to reveal who those friends are. These acquaintances will know only that there is some chance that another person is a friend of mine.

As I grow to trust a person, I can give him more filters, all of which have that high false positive rate. As that person collects more filters, he will be able to make queries against them all and drive the false positive rate he sees down.

For example, suppose that my recently-acquired friend collects
4 of my filters, each of which has a false positive rate of 50%.
His false positive rate will be only (1/2)^{4}, or
about 6%. If I give him another filter, I can drive his false
positive rate down to approximately 3% -- without sharing more
information with anyone else.
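
A sketch of how the multi-filter scheme might look, reusing the `BloomFilter` class from above; the names and parameters are made up, and the per-filter `salt` stands in for "different sets of hash functions".

```python
# Several filters over the same friend list, each with its own hash functions.

friends = ["alice", "bob", "carol"]                      # hypothetical names
filters = [BloomFilter(m=32, k=2, salt=s) for s in range(4)]
for f in filters:
    for name in friends:
        f.add(name)

def is_probably_friend(name, filters):
    # A querier holding several filters gets "yes" only if every filter agrees.
    return all(f.might_contain(name) for f in filters)

# If each filter alone has false positive rate p and the filters use
# independent hash functions, then t filters together give about p^t:
# 0.5^4 = 0.0625, and a fifth filter drives that to about 0.03.
```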

If a thief intercepts one of these filters in transit, then
*he* will see a 50% false positive rate. This means
I can control my privacy risk by spreading information over
multiple interactions. My friends will have a high degree of
confidence in their uses of multiple filters, but an interloper
who illicitly collects one or two filters will learn very little.

Here are a couple of other uses of Bloom filters in this spirit:

- Two friends use the same-length bit vector and the same set of hash functions. Now they can compare their filters bitwise to determine how much their sets of friends overlap. The number of shared bits gives a rough but useful measure of the "distance" between the sets.
- Same scenario. The friends can create the union of the two sets by computing the bitwise OR of the two bit vectors. They might do this to create a "white list" from two or more address books. No one person will know the contents of anyone else's address list, but the composite filter will perform as expected. Both operations are sketched below.
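
Both operations have one-line implementations over the bit vectors, assuming the two filters share the same *m*, *k*, and hash functions (here, the same salt); another sketch built on the `BloomFilter` class above.

```python
# Bitwise operations on two compatible Bloom filters.

def shared_bits(f1, f2):
    # A rough "distance" signal: how many positions are 1 in both filters.
    return sum(a & b for a, b in zip(f1.bits, f2.bits))

def union(f1, f2):
    # A filter for the union of the two sets: bitwise OR of the bit vectors.
    merged = BloomFilter(f1.m, f1.k, f1.salt)
    merged.bits = [a | b for a, b in zip(f1.bits, f2.bits)]
    return merged
```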

... is this just the beginning?

My primary source for our discussion of Bloom filters was a fine article by Maciej Ceglowski, Using Bloom Filters. I have a local mirror of this paper on the course web site.

Burton Bloom published the original article on this idea, Space/Time Trade-offs in Hash Coding with Allowable Errors, back in 1970. By 2000, there were many new applications of Bloom filters!

Here is a good paper on applications to networks, not all of which are about social networks. I have a local mirror of this paper as well.

- Reading -- Study these lecture notes and at least browse the references listed above.
- Homework -- Homework 5 is available and due next week.