As promised in my previous post, I spent some time today understanding Bloom filters. While [the wiki page]((http://en.wikipedia.org/wiki/Bloom_filter) intimidated me a bit when I first looked at it, it’s actually pretty simple once you sit down and patiently read through the whole thing.
n
elements, if you have the ability to devote m
bits of memory to maintaining the bloom filter, you will need k
different hash functions that hash each of the elements in your data set with a uniform random distribution.k
hash functions to get k
array positions in your map of size m
. Set the bits at all of these positions to 1.k
hash functions to get k
array positions. If any of the bits at these k
positions are 0, then the element is definitely not in the set. If all are 1, then the element is either in the set, or you have a false positive.See, pretty simple.
For a given m
and n
, the value of k
that minimizes the probability of a false positive is
and the relationship between the length of the bloom filter m
, the number of elements in the data set n
and the probability of false positives p
is given by
Now, the reason I set out to learn about bloom filters is because I came across a couple of blog posts on Hackers News about using them to implement search on static websites, such as those generated by Jekyll. Search on dynamic sites is usually implemented as a server-side feature, where the client javascript sends the search query to the server, which then quickly searches in the content of all the webpages it serves and returns the list of results to the client. Since this is not possible in static websites, there’s only one option for implementing search: do it clientside. Now one could simply brute-force this by sending the entire site over to the client and then searching through it in Javascript, but that’s the stupid way of doing it. The more elegant way is to send as little data to the client as possible, only enough for the client to determine which pages have the results of the search query, and then to show those pages as the search results. And given the space-efficiency of bloom filters, they appear to be a natural fit for this problem.
I’m not sure what the ideal way of using bloom filters would be for implementing a site-wide search, but the best implementation I’ve come up with is to create one filter per page and then to send those filters to the client and have the client look up the search query in each filter. I’m sure there are more efficient ways of doing this, but just for fun I’m going to do it my way first, and then look at other solutions only if I’m not happy with mine.
So, time to learn some ruby.