## Homework Assignment 5

### Empirical Analysis of a Bloom Filter

#### Due: Thursday, April 3, at 8:00 AM

Start small. Test your program on a small set of keys and a small bit vector to be sure that it performs as you expect. Only then try larger data sets and vectors.

Use temporary code. For example, you might want to create an inefficient but very simple implementation of a bit vector in order to get your Bloom filter working. Then, after the rest works, improve the implementation to support larger data sets efficiently enough.

Lists are O(n). Linked lists, such as lists in Scheme, are not a great way to implement Bloom filters and other array-based structures, because reaching the i-th element takes O(n) time. Python lists are array-backed, but they store a full object reference per element, which is wasteful for a bit vector. Use a true array in your language of choice, whether that be a Python array or a Racket bit vector. Arrays in languages such as Java, Ada, and C should work fine. Use this homework as an opportunity to learn or polish your skills with bit manipulation in the language you choose.
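As a starting point for the bit manipulation mentioned above, here is a minimal sketch of a bit vector packed eight bits per byte on top of Python's built-in `bytearray`. The class name `BitVector` and the method names `set` and `test` are hypothetical, chosen only for illustration.

```python
class BitVector:
    """A fixed-size bit vector packed eight bits per byte (illustrative sketch)."""

    def __init__(self, size):
        self.size = size
        # Round up so that every bit index below `size` has a byte to live in.
        self.data = bytearray((size + 7) // 8)

    def set(self, i):
        # i >> 3 selects the byte; i & 7 selects the bit within that byte.
        self.data[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        # Shift the wanted bit down to position 0 and mask off the rest.
        return bool((self.data[i >> 3] >> (i & 7)) & 1)
```

A vector of 20 bits built this way occupies 3 bytes, rather than 20 list slots.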

Resources. Use Session 18 and Session 19 as source material. You may also use the resources suggested at the end of Session 19, if you'd like. See below for a bit more on optimal values for m and k.

1. Implement a simple hash function factory. Given an argument m, the desired hash table size, the factory should return a hash function that maps integers into a table of that length.

You can do this in one of two ways:

• Do a little research to create a generator that is capable of creating any number of hash functions, as needed.

• Create by hand five function templates that you can specialize with the table size. This will place an upper limit of five on the number of hash functions that your factory can produce for any given table size.
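For the first option, one common choice (not necessarily the one discussed in class) is the multiply-add family h(x) = ((a·x + b) mod p) mod m, with a and b drawn at random and p a prime larger than any key. The sketch below assumes keys smaller than 2³¹ − 1. The names `make_hash_function` and `P` are hypothetical.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; assumed larger than every key


def make_hash_function(m, rng=random):
    """Return one randomly chosen hash function mapping integers to 0..m-1."""
    a = rng.randrange(1, P)  # multiplier, never zero
    b = rng.randrange(0, P)  # offset

    def h(key):
        return ((a * key + b) % P) % m

    return h
```

Each call to the factory draws a fresh (a, b) pair, so it can supply as many hash functions as a filter needs. Passing a seeded `random.Random` instance as `rng` makes a run repeatable.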

2. Implement a Bloom filter as a class, package, or abstract data type.

• A Bloom filter's primary constructor receives two arguments: the desired false positive rate, c, and the expected number of keys to store, n. This constructor should use the formulas from Session 19 to select an appropriate bit vector size, m, and number of hash functions, k.

• A second constructor receives those two arguments plus the size of the bit vector to use, m. This constructor should use the formulas from Session 19 to select an appropriate number of hash functions, k.

In both cases, if your hash function factory is limited to five functions, as in the second option of Task 1, then cap k at five. In the primary constructor, compute m from the formula. In the secondary constructor, use the m requested.
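The two constructors might be sketched in Python as below. This is only one possible shape, with several assumptions: m is rounded up and k is rounded to the nearest integer (the assignment does not say how to round), the hash functions come from a multiply-add family that may differ from the one used in class, and the names `BloomFilter`, `with_size`, `max_k`, and `seed` are all hypothetical.

```python
import math
import random


class BloomFilter:
    """Sketch of a Bloom filter with a primary and a secondary constructor."""

    PRIME = 2_147_483_647  # 2**31 - 1; assumed larger than every key

    def __init__(self, c, n, m=None, max_k=None, seed=0):
        if m is None:
            # Primary constructor: choose m from c and n (rounded up, an assumption).
            m = math.ceil(-n * math.log(c) / math.log(2) ** 2)
        # Both constructors: choose k from m and n (rounded to nearest, an assumption).
        k = max(1, round(m / n * math.log(2)))
        if max_k is not None:
            k = min(k, max_k)  # cap for a factory limited to a fixed number of functions
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(k)]

    @classmethod
    def with_size(cls, c, n, m, **kwargs):
        """Secondary constructor: the caller fixes m; only k is computed."""
        return cls(c, n, m=m, **kwargs)

    def _positions(self, key):
        # One bit position per hash function.
        for a, b in self.coeffs:
            yield ((a * key + b) % self.PRIME) % self.m

    def add(self, key):
        for i in self._positions(key):
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, key):
        # A key is reported present only if all k of its bits are set.
        return all((self.bits[i >> 3] >> (i & 7)) & 1 for i in self._positions(key))
```

With this sketch, `BloomFilter(0.01, 10000)` selects m and k itself, while `BloomFilter.with_size(0.01, 10000, 47925, max_k=5)` uses the requested m and caps k at five.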

3. Generate two sets of data for testing your Bloom filter:

• a membership set: a list of 10,000 unique integers selected randomly from the range [10000..99999]

• a test set: a list of 10,000 unique integers not in the membership set, also selected randomly from the range [10000..99999]
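One way to get both sets, sketched in Python: draw a single random sample of 20,000 distinct integers and split it in half, which guarantees that the two sets are disjoint. The function name `make_data_sets` and its parameters are hypothetical.

```python
import random


def make_data_sets(count=10_000, low=10_000, high=99_999, seed=None):
    """Return (membership_set, test_set): two disjoint lists of unique integers."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so all 2 * count values are distinct.
    sample = rng.sample(range(low, high + 1), 2 * count)
    return sample[:count], sample[count:]
```

Passing a fixed `seed` reproduces the same data across runs, which helps when comparing results for different values of c and m.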

4. Demonstrate your Bloom filter for each of these false positive rates:
• 0.01
• 0.001
• 0.0001

In all cases, insert the items in the membership set into a Bloom filter using its primary constructor. Then look up all the items in the test set and compute the false positive rate.
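The measurement itself can be sketched as a small Python function. It assumes only that the filter object offers an `add` method and supports the `in` operator; both are assumptions about your own interface, and the function name `false_positive_rate` is hypothetical.

```python
def false_positive_rate(filt, members, tests):
    """Insert `members` into `filt`; return the fraction of `tests` reported present."""
    for key in members:
        filt.add(key)
    # Every test key is known to be absent, so each hit is a false positive.
    hits = sum(1 for key in tests if key in filt)
    return hits / len(tests)
```

A plain Python `set` satisfies the same interface and never yields a false positive, so it makes a handy sanity check: `false_positive_rate(set(), [1, 2, 3], [4, 5, 6])` returns 0.0.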

5. Repeat the process of Task 4 for each of these bit vector sizes:
• 1.50m
• 0.75m
• 0.50m
where m is the ideal vector size based on n and c.

Use the Bloom filter's secondary constructor to parameterize the bit vector size.
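To have something to compare the measurements against, the sketch below estimates the expected false positive rate for each scaled vector size. It assumes the standard approximation (1 − e^(−kn/m))^k, which may be stated differently in Session 19, and it rounds m and k to the nearest integer. The function names are hypothetical.

```python
import math


def expected_fp_rate(m, k, n):
    """Standard approximation of the false positive rate after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_m(n, c):
    """Ideal bit vector size for n keys and target false positive rate c."""
    return -n * math.log(c) / math.log(2) ** 2


n, c = 10_000, 0.01
for scale in (1.00, 1.50, 0.75, 0.50):
    m = round(scale * optimal_m(n, c))
    k = max(1, round(m / n * math.log(2)))  # k is recomputed for each m
    print(f"{scale:.2f}m  m={m:6d}  k={k}  expected rate={expected_fp_rate(m, k, n):.4f}")
```

Shrinking the vector below the ideal m should push the expected rate above the target c, and enlarging it should push the rate below c.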

Create a readme.txt file that presents:

• anything I need to know to compile and run your program
• a table summary of the results of your experiments
• a discussion of how the empirical data match up with the expected values
(If they don't match well, suggest possible reasons.)
• a discussion of the efficiency of your implementation

#### Deliverables

By the due time and date, submit a zipped archive named homework05 containing:

• all of your source files

Be sure that your submission follows all homework submission requirements. Note that the only file you need to print is your readme.txt.

#### Values for m and k

Session 19 includes a brief discussion of formulas for computing the best m and k to use for a given key set size n and maximum false positive rate c. But they leave you with a bit more arithmetic to do.

The Wikipedia page for Bloom filters has a section on the optimal number of bits and hash functions that finishes off some of the arithmetic. To save you the computation, or to double-check your own computation, here are the final formulae:

$$
m = -\frac{n \ln c}{(\ln 2)^2}
\qquad \text{then} \qquad
k = \frac{m}{n} \ln 2
$$

So, if n = 10,000 and c = 0.01, then the optimal m is 95850.58 and the optimal k is 6.64.

Recall that ln refers to the natural logarithm. Many languages provide a library function for computing it. The natural log of 2, which is so important that it has its own Wikipedia page, is approximately 0.69314718055994.
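In Python, for instance, `math.log` computes the natural logarithm when called with one argument. The short sketch below reproduces the worked example for n = 10,000 and c = 0.01. The function name `optimal_parameters` is hypothetical.

```python
import math


def optimal_parameters(n, c):
    """Return the unrounded optimal (m, k) for n keys and false positive rate c."""
    m = -n * math.log(c) / math.log(2) ** 2
    k = m / n * math.log(2)
    return m, k


m, k = optimal_parameters(10_000, 0.01)
print(f"m = {m:.2f}, k = {k:.2f}")  # m = 95850.58, k = 6.64
```

Both values must become integers before use. Whether to round up, down, or to the nearest integer is left to your implementation.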

Eugene Wallingford ..... wallingf@cs.uni.edu ..... April 2, 2014