Homework Assignment 5

Empirical Analysis of a Bloom Filter


CS 3530
Design and Analysis of Algorithms
Spring Semester 2014


Due: Thursday, April 3, at 8:00 AM


Programming Advice

Start small. Test your program on a small set of keys and a small bit vector to be sure that it performs as you expect. Only then try larger data sets and vectors.

Use temporary code. For example, you might want to create an inefficient but very simple implementation of a bit vector in order to get your Bloom filter working. Then, after the rest works, improve the implementation so that it handles larger data sets efficiently.

Avoid lists. Lists are not a great way to implement Bloom filters and other array-based structures. Scheme lists are linked lists, so reaching position i takes O(n) time. Python lists are dynamic arrays with constant-time indexing, but they store a full object reference in every slot, which wastes a great deal of space when each slot needs only one bit. Use a compact array in your language of choice, whether that be a Python array or a Racket bit vector. Arrays in languages such as Java, Ada, and C should work fine. Use this homework as an opportunity to learn or polish your skills with bit manipulation in the language you choose.
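As one illustration of the bit manipulation involved, here is a minimal sketch of a bit vector in Python. It assumes a bytearray as the underlying storage (a Python array of unsigned bytes would work similarly), and the class name BitVector and its methods set and test are made up for this example, not required by the assignment.

```python
class BitVector:
    """Fixed-size bit vector packed eight bits per byte (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        # Round up so that every bit index has a byte to live in.
        self.data = bytearray((size + 7) // 8)

    def set(self, i):
        # i >> 3 selects the byte; i & 7 selects the bit within that byte.
        self.data[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        return bool(self.data[i >> 3] & (1 << (i & 7)))


bits = BitVector(20)
bits.set(3)
bits.set(17)
print(bits.test(3), bits.test(4), bits.test(17))  # True False True
```

A vector of 20 bits occupies only three bytes here, which is the kind of saving that matters once m grows large.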

Resources. Use Session 18 and Session 19 as source material. You may also use the resources suggested at the end of Session 19, if you'd like. See below for a bit more on optimal values for m and k.



Tasks

  1. Implement a simple hash function factory. Given an argument m, the desired hash table size, the factory should return a hash function that maps integers into a table of that length.

    You can do this in one of two ways:


  2. Implement a Bloom filter as a class, package, or abstract data type.

    In both cases, if your hash function factory has a limit of five from above, then you will limit k to five. In the primary constructor, adjust m according to the formula. In the secondary, use the m requested.


  3. Generate two sets of data for testing your Bloom filter:

  4. Demonstrate your Bloom filter for each of these false positive rates:

    In all cases, insert the items in the membership set into a Bloom filter using its primary constructor. Then look up all the items in the test set and compute the false positive rate.


  5. Repeat the process of Task 4 for each of these bit vector sizes: where m is the ideal vector size based on n and c.

    Use the Bloom filter's secondary constructor to parameterize the bit vector size.
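
The overall shape of Tasks 1 through 5 might be sketched in Python as follows. This is only one possible design, not the required one. The hash family ((a*x + b) mod p) mod m, the names make_hash, BloomFilter, with_size, and false_positive_rate, the constructor signatures, the rounding of m and k, and the randomly generated integer keys are all assumptions made for illustration. The sketch also ignores any cap on k imposed by your hash function factory.

```python
import math
import random

PRIME = 2**61 - 1  # a Mersenne prime; assumed larger than every key


def make_hash(m):
    """Hash function factory: returns h with 0 <= h(x) < m (assumed family)."""
    a = random.randrange(1, PRIME)
    b = random.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % m


class BloomFilter:
    def __init__(self, n, c):
        """Primary constructor (assumed signature): derive m from n and c."""
        m = math.ceil(-n * math.log(c) / math.log(2) ** 2)
        self._setup(n, m)

    @classmethod
    def with_size(cls, n, m):
        """Secondary constructor (assumed signature): use the m requested."""
        bloom = cls.__new__(cls)
        bloom._setup(n, m)
        return bloom

    def _setup(self, n, m):
        self.m = m
        self.k = max(1, round(m / n * math.log(2)))
        self.hashes = [make_hash(m) for _ in range(self.k)]
        self.bits = bytearray((m + 7) // 8)  # one bit per slot

    def _positions(self, key):
        return [h(key) for h in self.hashes]

    def add(self, key):
        for i in self._positions(key):
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, key):
        # Reported present only if every hashed position has its bit set.
        return all(self.bits[i >> 3] & (1 << (i & 7))
                   for i in self._positions(key))


def false_positive_rate(bloom, members, test_keys):
    """Fraction of non-member test keys that the filter reports as present."""
    non_members = [x for x in test_keys if x not in members]
    hits = sum(1 for x in non_members if x in bloom)
    return hits / len(non_members)


keys = random.sample(range(10**9), 2000)  # distinct made-up integer keys
members, others = set(keys[:1000]), keys[1000:]
bloom = BloomFilter(n=len(members), c=0.01)
for key in members:
    bloom.add(key)
print(bloom.m, bloom.k, false_positive_rate(bloom, members, others))
```

Calling BloomFilter.with_size(n, m) instead of BloomFilter(n, c) builds the same structure with a bit vector of the requested length, which is how a run with a deliberately small or large vector could be set up.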

Create a readme.txt file that presents:



Deliverables

By the due time and date, submit a zipped archive named homework05 containing:

Be sure that your submission follows all homework submission requirements. Note that the only file you need to print is your readme.txt.



Values for m and k

Session 19 includes a brief discussion of formulas for computing the best m and k to use for a given key set size n and maximum false positive rate c. But they leave you with a bit more arithmetic to do.

The Wikipedia page for Bloom filters has a section on the optimal number of bits and hash functions that finishes off some of the arithmetic. To save you the computation, or to double-check your own computation, here are the final formulae:

    m = -(n ln c) / (ln 2)²

    then

    k = (m / n) ln 2

So, if n = 10,000 and c = 0.01, then the optimal m is 95850.58 and the optimal k is 6.64.

Recall that ln refers to the natural logarithm. Many languages provide a library function for computing it. The natural log of 2, which is so important that it has its own Wikipedia page, is approximately 0.69314718055994.
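
If you would like to double-check the arithmetic by machine, the two formulae translate directly into a few lines of Python. This is a small sketch; the function name optimal_m_k is made up for the example, and math.log is Python's natural logarithm.

```python
import math


def optimal_m_k(n, c):
    """Return the (unrounded) optimal m and k for n keys and false positive rate c."""
    m = -n * math.log(c) / math.log(2) ** 2
    k = m / n * math.log(2)
    return m, k


m, k = optimal_m_k(10_000, 0.01)
print(f"m = {m:.2f}, k = {k:.2f}")  # m = 95850.58, k = 6.64
```

Both results are real numbers, so a working filter still has to settle on whole-number values for m and k.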



Eugene Wallingford ..... wallingf@cs.uni.edu ..... April 2, 2014