perl.com: Using Bloom Filters

Using Bloom Filters

by Maciej Ceglowski
April 08, 2004

Anyone who has used Perl for any length of time is familiar with the lookup hash, a handy idiom for doing existence tests:

foreach my $e ( @things ) { $lookup{$e}++ }

sub check {
	my ( $key ) = @_;
	print "Found $key!" if exists( $lookup{ $key } );
}

As useful as the lookup hash is, it can become unwieldy for very large lists or in cases where the keys themselves are large. When a lookup hash grows too big, the usual recourse is to move it to a database or flat file, perhaps keeping a local cache of the most frequently used keys to improve performance.

Many people don't realize that there is an elegant alternative to the lookup hash, in the form of a venerable algorithm called a Bloom filter. Bloom filters allow you to perform membership tests in just a fraction of the memory you'd need to store a full list of keys, so you can avoid the performance hit of having to use a disk or database to do your lookups. As you might suspect, the savings in space come at a price: you run an adjustable risk of false positives, and you can't remove a key from a filter once you've added it. But in the many cases where those constraints are acceptable, a Bloom filter can be a useful tool.

For example, imagine you run a high-traffic online music store along the lines of iTunes, and you want to minimize the stress on your database by only fetching song information when you know the song exists in your collection. You can build a Bloom filter at startup, and then use it as a quick existence check before trying to perform an expensive fetching operation:

use Bloom::Filter;

my $filter = Bloom::Filter->new( error_rate => 0.01, capacity => $SONG_COUNT );
open my $fh, '<', "enormous_list_of_titles.txt" or die "Failed to open: $!";

while (<$fh>) {
	chomp;
	$filter->add( $_ );
}
close $fh;

sub lookup_song {
	my ( $title ) = @_;
	# A miss in the filter is definitive, so skip the database entirely.
	return unless $filter->check( $title );
	# A hit may be a false positive, so the query can still come back empty.
	return expensive_db_query( $title );
}

In this example, there's a 1% chance that a title missing from your collection will pass the test as a false positive, which means the program will perform the expensive fetch operation and eventually return a null result. Still, you've managed to avoid the expensive query for 99% of the titles you don't carry, using only a fraction of the memory you would have needed for a lookup hash. As we'll see further on, a filter with a 1% error rate requires just under 2 bytes of storage per key.

How Bloom Filters Work

A Bloom filter consists of two components: a set of k hash functions and a bit vector of a given length. We choose the length of the bit vector and the number of hash functions depending on how many keys we want to add to the set and how high an error rate we are willing to put up with -- more on that a little bit further on.

All of the hash functions in a Bloom filter are configured so that their range matches the length of the bit vector. For example, if a vector is 200 bits long, the hash functions return a value between 1 and 200. It's important to use high-quality hash functions in the filter to guarantee that output is equally distributed over all possible values -- "hot spots" in a hash function would increase our false-positive rate.
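One common way to get k hash functions with a matching range is to salt a single strong digest k different ways and fold each result into the vector length. The following is a minimal sketch under that assumption; the helper name `hash_positions` and the salting scheme are invented for illustration, and the modulo fold introduces a slight bias that a production filter might want to avoid.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Hypothetical helper: derive $k positions in 1..$m for $key by
# hashing the key with $k different salts.
sub hash_positions {
    my ( $key, $k, $m ) = @_;
    return map {
        # Read the first 32 bits of the digest as an unsigned integer,
        # then fold it into the range 1..$m.
        my $n = unpack( "N", sha1( "$_:$key" ) );
        ( $n % $m ) + 1;
    } 1 .. $k;
}

my @positions = hash_positions( "apples", 3, 200 );
print "apples -> @positions\n";
```

The same key always yields the same positions, which is what lets a later membership test find the bits that an earlier insertion turned on.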

To enter a key into a Bloom filter, we run it through each one of the k hash functions and treat the result as an offset into the bit vector, turning on whatever bit we find at that position. If the bit is already set, we leave it on. There's no mechanism for turning bits off in a Bloom filter.

As an example, let's take a look at a Bloom filter with three hash functions and a bit vector of length 14. We'll use spaces and asterisks to represent the bit vector, to make it easier to follow along. As you might expect, an empty Bloom filter starts out with all the bits turned off, as seen in Figure 1.

Figure 1. An empty Bloom filter.

Let's now add the string apples into our filter. To do so, we hash apples through each of our three hash functions and collect the output:

hash1("apples") = 3
hash2("apples") = 12
hash3("apples") = 11

Then we turn on the bits at the corresponding positions in the vector -- in this case bits 3, 11, and 12, as shown in Figure 2.

Figure 2. A Bloom filter with three bits enabled.

To add another key, such as plums, we repeat the hashing procedure:

hash1("plums") = 11
hash2("plums") = 1
hash3("plums") = 8

And again turn on the appropriate bits in the vector, as shown with highlights in Figure 3.

Figure 3. The Bloom filter after adding a second key.
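The two additions above can be replayed in a few lines of Perl. This is a minimal sketch rather than a real filter: the hash outputs are copied from the walkthrough into a lookup table (`%positions`, a stand-in for real hash functions), and the helper names `add_key` and `render` are invented for illustration.

```perl
use strict;
use warnings;

my $LENGTH = 14;                                 # bits in the vector
my $bits   = "\0" x int( ( $LENGTH + 7 ) / 8 );  # all bits off

# Stand-in for hash1..hash3: the outputs quoted in the walkthrough.
my %positions = (
    apples => [ 3, 12, 11 ],
    plums  => [ 11, 1, 8 ],
);

# Turn on the bit for each of the key's positions (positions are 1-based).
sub add_key {
    my ( $key ) = @_;
    vec( $bits, $_ - 1, 1 ) = 1 for @{ $positions{$key} };
}

# Draw the vector with asterisks for set bits and spaces for clear ones.
sub render {
    return join '', map { vec( $bits, $_ - 1, 1 ) ? '*' : ' ' } 1 .. $LENGTH;
}

add_key( 'apples' );
add_key( 'plums' );
print '[', render(), "]\n";    # bits 1, 3, 8, 11, and 12 are on
```

Six hash outputs produce only five set bits, because both keys land on position 11.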

Notice that the bit at position 11 was already turned on -- we had set it when we added apples in the previous step. Bit 11 now does double duty, storing information for both apples and plums. As we add more keys, it may store information for some of them as well. This overlap is what makes Bloom filters so compact -- any one bit may be encoding multiple keys simultaneously. This overlap also means that you can never take a key out of a filter, because you have no guarantee that the bits you turn off don't carry information for other keys. If we tried to remove apples from the filter by reversing the procedure we used to add it in, we would inadvertently turn off one of the bits that encodes plums. The only way to strip a key out of a Bloom filter is to rebuild the filter from scratch, leaving out the offending key.
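The removal problem is easy to reproduce. The sketch below again uses the walkthrough's hash outputs as a fixed table (a stand-in for real hash functions) and two invented helpers, `set_bits` and `check_key`; it clears the bits for apples and then shows that plums no longer passes the membership test.

```perl
use strict;
use warnings;

my $bits = "\0" x 2;    # room for the 14-bit example vector

# Hash outputs copied from the walkthrough (stand-ins for real hashes).
my %positions = (
    apples => [ 3, 12, 11 ],
    plums  => [ 11, 1, 8 ],
);

# Set (1) or clear (0) every bit belonging to a key.
sub set_bits {
    my ( $key, $value ) = @_;
    vec( $bits, $_ - 1, 1 ) = $value for @{ $positions{$key} };
}

# A key is reported present only if all of its bits are on.
sub check_key {
    my ( $key ) = @_;
    return !grep { !vec( $bits, $_ - 1, 1 ) } @{ $positions{$key} };
}

set_bits( $_, 1 ) for qw( apples plums );
print "plums before: ", ( check_key( 'plums' ) ? "found" : "missing" ), "\n";

set_bits( 'apples', 0 );    # naive "removal" of apples
print "plums after:  ", ( check_key( 'plums' ) ? "found" : "missing" ), "\n";
```

Clearing bit 11 on behalf of apples is what breaks plums, since that bit was shared by both keys.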

Checking to see whether a key already exists in a filter is exactly analogous to adding a new key. We run the key through our set of hash functions, and then check to see whether the bits at those offsets are all turned on. If any of the bits is off, we know for certain the key is not in the filter. If all of the bits are on, we know the key is probably there.

I say "probably" because there's a certain chance our key might be a false positive. For example, let's see what happens when we test our filter for the string mango. We run mango through the set of hash functions:

hash1("mango") = 8
hash2("mango") = 3
hash3("mango") = 12

And then examine the bits at those offsets, as shown in Figure 4.

Figure 4. A false positive in the Bloom filter.

All of the bits at positions 3, 8, and 12 are on, so our filter will report that mango is a valid key.

Of course, mango is not a valid key -- the filter we built contains only apples and plums. The fact that the offsets for mango point to enabled bits is just coincidence. We have found a false positive -- a key that seems to be in the filter, but isn't really there.
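The membership test and the mango false positive can be sketched the same way. The hash outputs for apples, plums, and mango come from the walkthrough; the entry for kiwi is invented for this sketch to show a definite miss, and `add_key` and `check_key` are illustrative helper names.

```perl
use strict;
use warnings;

my $bits = "\0" x 2;    # 14-bit example vector, all bits off

# Hash outputs copied from the walkthrough, standing in for real
# hash functions. The kiwi values are invented for this sketch.
my %positions = (
    apples => [ 3, 12, 11 ],
    plums  => [ 11, 1, 8 ],
    mango  => [ 8, 3, 12 ],
    kiwi   => [ 5, 3, 12 ],
);

sub add_key {
    my ( $key ) = @_;
    vec( $bits, $_ - 1, 1 ) = 1 for @{ $positions{$key} };
}

# "Probably present" only if every one of the key's bits is on;
# a single clear bit proves the key was never added.
sub check_key {
    my ( $key ) = @_;
    return !grep { !vec( $bits, $_ - 1, 1 ) } @{ $positions{$key} };
}

add_key( $_ ) for qw( apples plums );

for my $key ( qw( apples plums mango kiwi ) ) {
    printf "%-6s %s\n", $key, check_key( $key ) ? "probably present" : "absent";
}
```

Only apples and plums were added, yet mango is reported as probably present because its three bits were all set by the other two keys. Kiwi is correctly rejected because bit 5 is still off.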

As you might expect, the false-positive rate depends on the bit vector length and the number of keys stored in the filter. The roomier the bit vector, the smaller the probability that all k bits we check will be on, unless the key actually exists in the filter. The relationship between the number of hash functions and the false-positive rate is more subtle. If you use too few hash functions, there won't be enough discrimination between keys; but if you use too many, the filter will be very dense, increasing the probability of collisions. You can calculate the false-positive rate for any filter using the formula:

$$c = \left( 1 - e^{-kn/m} \right)^{k}$$

Where c is the false positive rate, k is the number of hash functions, n is the number of keys in the filter, and m is the length of the filter in bits.
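As a quick sanity check, the formula translates directly into Perl. The function name `false_positive_rate` is invented for this sketch, and the sample call plugs in the small walkthrough filter (three hash functions, two keys, 14 bits).

```perl
use strict;
use warnings;

# c = ( 1 - e^(-kn/m) )^k
sub false_positive_rate {
    my ( $k, $n, $m ) = @_;
    return ( 1 - exp( -$k * $n / $m ) ) ** $k;
}

# The walkthrough filter: 3 hash functions, 2 keys, 14 bits.
printf "%.3f\n", false_positive_rate( 3, 2, 14 );    # about 0.042
```

Even this tiny filter keeps the estimated false-positive rate to roughly 4%, and the rate climbs quickly as more keys are added to the same 14 bits.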

When using Bloom filters, we very frequently have a desired false-positive rate in mind and we are also likely to have a rough idea of how many keys we want to add to the filter. We need some way of finding out how large a bit vector we need to make sure the false-positive rate never exceeds our limit. The following equation will give us vector length from the error rate and number of keys:

$$m = \frac{-kn}{\ln\left( 1 - c^{1/k} \right)}$$

You'll notice another free variable here: k, the number of hash functions. It's possible to use calculus to find the value of k that minimizes m, but there's a lazier way to do it:

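sub calculate_shortest_filter_length {
	my ( $num_keys, $error_rate ) = @_;
	my $lowest_m;
	my $best_k = 1;

	# Try each candidate number of hash functions and keep the one
	# that needs the shortest bit vector.
	foreach my $k ( 1..100 ) {
		# m = -kn / ln( 1 - c^(1/k) )
		my $m = (-1 * $k * $num_keys) /
			( log( 1 - ($error_rate ** (1/$k))));

		if ( !defined $lowest_m or ($m < $lowest_m) ) {
			$lowest_m = $m;
			$best_k   = $k;
		}
	}
	# $lowest_m is a fractional bit count; round up before allocating.
	return ( $lowest_m, $best_k );
}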
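For comparison, the calculus route has a well-known closed form. Holding m and n fixed and minimizing c over k in the false-positive formula gives the relations below. Note that k comes out as a real number here, whereas the loop only tries whole numbers, so treat these as approximations to be rounded.

```latex
k = \frac{m}{n} \ln 2,
\qquad
c = \left( \tfrac{1}{2} \right)^{k},
\qquad
m = -\frac{n \ln c}{(\ln 2)^{2}}
```

At the optimum, each bit in the vector is on with probability about one half, which is why the false-positive rate reduces to a power of 1/2.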

To give you a sense of how error rate and number of keys affect the storage size of Bloom filters, Table 1 lists some sample vector sizes for a variety of capacity/error rate combinations.

Error Rate	Keys	Required Size (bytes)	Bytes/Key
1%	1K	1.87 K	1.9
0.1%	1K	2.80 K	2.9
0.01%	1K	3.74 K	3.7
0.01%	10K	37.4 K	3.7
0.01%	100K	374 K	3.7
0.01%	1M	3.74 M	3.7
0.001%	1M	4.68 M	4.7
0.0001%	1M	5.61 M	5.7

You can find further lookup tables for various combinations of error rate, filter size, and number of hash functions at Bloom Filters -- the math.
