Design and Analysis of Algorithms

Spring Semester 2014

*Start small*. Test your program on a small set of keys
and a small bit vector to be sure that it performs as you expect.
Only then try larger data sets and vectors.

*Use temporary code*. For example, you might want to
create an inefficient but very simple implementation of a bit
vector in order to get your Bloom filter working. Then, after
the rest works, improve the implementation to support larger
data sets efficiently enough.

*Lists are O(n)*. Linked lists, such as lists in Scheme, are
not a great way to implement Bloom filters and other
array-based structures, because reaching the *i*th element
takes O(*i*) time. Python lists do offer constant-time
indexing, but they spend a full object reference on every
element, which is wasteful for a vector of bits.
Use a true array in your language of choice, whether that be
a Python array or a Racket bit vector.
Arrays in languages such as Java, Ada, and C should work fine.
Use this homework as an opportunity to learn or polish your
skills with bit manipulation in the language you choose.
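
As one possible starting point for the bit manipulation mentioned above, here is a minimal sketch of a bit vector packed into a Python `bytearray`. The class name `BitVector` and its `set` and `test` methods are illustrative names chosen for this sketch, not part of the assignment, and `bytearray` is only one of several reasonable backing stores.

```python
class BitVector:
    """Fixed-size vector of bits, packed eight to a byte (illustrative sketch)."""

    def __init__(self, size):
        self.size = size
        # One byte holds eight bits; round up so every index has a slot.
        self.data = bytearray((size + 7) // 8)

    def set(self, i):
        # i >> 3 selects the byte, i & 7 selects the bit within that byte.
        self.data[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        return bool(self.data[i >> 3] & (1 << (i & 7)))


bits = BitVector(20)
bits.set(3)
bits.set(17)
print(bits.test(3), bits.test(4), bits.test(17))  # True False True
```

Both operations take constant time, and a vector of *m* bits occupies about *m*/8 bytes.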

*Resources*. Use Session 18 and Session 19 as source material.
You may also use the resources suggested at the end of
Session 19, if you'd like. **See below** for a bit more on
optimal values for *m* and *k*.

- Implement a simple hash function factory. Given an argument
  *m*, the desired hash table size, the factory should return a hash function that maps integers into a table of that length. You can do this in one of two ways:

  - Do a little research to create a generator that is capable of creating any number of hash functions, as needed.
  - Create by hand five function templates that you can specialize with the table size. This will place an upper limit of five on the number of hash functions that your factory can produce for any given table size.

- Implement a Bloom filter as a class, package, or abstract
  data type.
  - A Bloom filter's primary constructor receives two
    arguments: the desired false positive rate, *c*, and the expected number of keys to store, *n*. This constructor should use the formulas from Session 19 to select an appropriate bit vector size, *m*, and number of hash functions, *k*.
  - A second constructor receives those two arguments plus
    the size of the bit vector to use, *m*. This constructor should use the formulas from Session 19 to select an appropriate number of hash functions, *k*.

  In both cases, if your hash function factory has a limit of five from above, then you will limit *k* to five. In the primary constructor, adjust *m* according to the formula. In the secondary, use the *m* requested.
- Generate two sets of data for testing your Bloom filter:
  - a **membership set**: a list of 10,000 unique integers selected randomly from the range [10000..99999]
  - a **test set**: a list of 10,000 unique integers *not* in the membership set, also selected randomly from the range [10000..99999]

- Demonstrate your Bloom filter for each of these false
  positive rates:
  - 0.01
  - 0.001
  - 0.0001

  In all cases, insert the items in the membership set into a Bloom filter using its primary constructor. Then look up all the items in the test set and compute the false positive rate.

- Repeat the process of Task 4 for each of these bit vector
  sizes:
  - 1.50 *m*
  - 0.75 *m*
  - 0.50 *m*

  *m* is the ideal vector size based on *n* and *c*. Use the Bloom filter's secondary constructor to parameterize the bit vector size.
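
Two of the pieces described in the tasks above can be sketched briefly: a hash function factory of the "generator" kind, and the two disjoint data sets. This is a sketch under stated assumptions, not the course's prescribed method. The hash family used here, h(x) = ((a·x + b) mod p) mod m with random a and b and a prime p larger than any key, is one well-known universal hashing scheme that a little research might turn up. The names `make_hash_functions`, `make_data_sets`, `PRIME`, and the `seed` parameter are all hypothetical.

```python
import random

# A prime larger than any key in [10000..99999]; 2**31 - 1 is a Mersenne prime.
PRIME = 2_147_483_647


def make_hash_functions(m, k, seed=None):
    """Return k functions, each mapping an integer key to an index in [0, m)."""
    rng = random.Random(seed)
    functions = []
    for _ in range(k):
        a = rng.randrange(1, PRIME)  # multiplier, never zero
        b = rng.randrange(0, PRIME)  # offset
        # Default arguments bind this iteration's a and b to the lambda.
        functions.append(lambda x, a=a, b=b: ((a * x + b) % PRIME) % m)
    return functions


def make_data_sets(size=10_000, low=10_000, high=99_999, seed=None):
    """Return (membership, test_set): two disjoint lists of unique integers."""
    rng = random.Random(seed)
    # Draw 2 * size unique values at once, then split them,
    # so the two halves cannot share an element.
    values = rng.sample(range(low, high + 1), 2 * size)
    return values[:size], values[size:]


hashes = make_hash_functions(m=95_851, k=5, seed=1)
membership, test_set = make_data_sets(seed=1)
print([h(12345) for h in hashes])
print(len(membership), len(set(membership) & set(test_set)))  # 10000 0
```

Drawing both sets in a single sample is a design choice that guarantees the test set contains no member keys, so every positive lookup on the test set is a false positive.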

Create a `readme.txt` file that presents:

- anything I need to know to compile and run your program
- a table summary of the results of your experiments
- a discussion of how the empirical data match up with
  the expected values
  (If they don't match well, suggest possible reasons.)
- a discussion of the efficiency of your implementation

By the due time and date, submit a zipped archive named
**homework05** containing:

- your `readme.txt` file
- all of your source files

Be sure that your submission follows all
homework submission requirements.
*Note that the only file you need to print* is your
`readme.txt`.

Session 19 includes a brief discussion of formulas for
computing the best *m* and *k* to use for
a given key set size *n* and maximum false positive
rate *c*. But they leave you with a bit more
arithmetic to do.

The Wikipedia page for Bloom filters has a section on the optimal number of bits and hash functions that finishes off some of the arithmetic. To save you the computation, or to double-check your own computation, here are the final formulae:

$$
m = -\frac{n \ln c}{(\ln 2)^2} \qquad \text{then} \qquad k = \frac{m}{n} \ln 2
$$

So, if *n* = 10,000 and *c* = 0.01, then the
optimal *m* is 95850.58 and the optimal *k*
is 6.64.
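
To double-check the arithmetic, the two formulas can be evaluated directly. The function name `optimal_parameters` below is a hypothetical helper for this sketch; it returns the real-valued results without rounding.

```python
import math


def optimal_parameters(n, c):
    """Return (m, k) as real numbers for n keys and false positive rate c."""
    m = -(n * math.log(c)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return m, k


m, k = optimal_parameters(10_000, 0.01)
print(f"m = {m:.2f}, k = {k:.2f}")  # m = 95850.58, k = 6.64
```

A working filter needs whole numbers for both values. Rounding *m* up and *k* to the nearest integer is one common choice, though how to round is left to you.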

Recall that **ln** refers to the
natural logarithm.
Many languages provide a library function for computing it.
The natural log of 2, which is so important that it has
its own Wikipedia page,
is approximately 0.69314718055994.