Search for most dynamically-generated websites is usually implemented as a server-side feature. The server generates and maintains an index of all unique words and their occurrences on the site. When a client runs a search query, the query is sent to the server which then looks up the query in its index and returns a list of results, which are then rendered on the client.
An approach like this is clearly infeasible on static websites such as those generated using a static site generator like Jekyll, so one has to come up with alternative solutions. One possible alternative is to send the content of the entire site to the client, and to then search through it clientside. This approach is infeasible for two reasons:
So, we need some sort of optimization during the site’s generation process that will optimize on both the network usage (the content that the client has to download) and execution time (how long it takes the client to search through the downloaded content). With respect to the second requirement, a hash map would be the fastest possible data structure for the client to search through, but it would be ever worse than downloading the entire site in terms of bandwidth usage. Enter bloom filters.
(I won’t go into details about how bloom filters work here, for that have a look at my previous post)
So, for the purposes of implementing search for a statically-generated site (namely this one), here is the approach that I followed:
Below I’ll describe my implementation of each of the above steps in some detail.
Once I had my implementation of the bloom filter class ready, creating one for each page was relatively trivial. Jekyll plugins have a Generator
category that lets you generate new pages besides the templates that you already have. If you implement a def generator
, that method gets invoked with the entire site’s contents just prior to the actual generation. site.posts
contains a list of all posts and site.pages
contains a list of all pages. All I had to do was iterate over each page and
python
and Python
will not be treated as unique elements in the set{% raw %} {% and %} {% endraw %}
and between {% raw %}{{ and }}{% endraw %}
. This strips out liquid template variables and code segmentsThe following two lines accomplished this:
content = post.content.downcase.gsub(/{% raw %}{%.*?%}{% endraw %}/, " ").gsub(/{{.*?}}/, " ").gsub(/[^a-zA-Z]/, " ")
words = content.split.uniq
After that, I iterate over each word and call filter.insert
where filter
is the bloom filter created for that page. Once I’m done with the page, the final filters is inserted into an array.
This was slightly trickier. I need a template to fill up with the generated JSON, so I obviously need a file somewhere in my site that will be filled up. I opted for a file named search.json
in the root of my site. I started off filling it manually by iterating over the generated filters via liquid tags and then slotting member variables into different fields, but then I thought of an easier approach: have the template be empty, and fill it entirely from ruby!
In my def generate
method, once I’m done creating the array of bloom filters, I convert the array to json and insert it directly into the template:
json_page = site.pages.detect {|page| page.name == "search.json"}
json_page.data['filters'] = filters.to_json
Of course this needed a corresponding to_json
for each of my filter objects, which was also trivial:
def to_json(options = {})
{
'u' => @url,
'l' => @length,
'h' => @hash_count,
'f' => @filter
}.to_json
end
Now, running jekyll build
generates _site/search.json
which contains the bloom filters for all posts in my site.
Since I want my page loads to appear faster than they actually are, I opted for fetching the search JSON asynchronously after pageload was done. This was easily done with a simple jquery $.get
call that fetches the json, iterates over each item in the JSON (which will be defined by the to_json
method above) and creates a bloom filter out of it.
var initialize = function() {
var index;
$.get("/search.json", function(data) {
for ( index = 0; index < data.length; index += 1 ) {
filters.push( BloomFilter.create(data[index].u,
data[index].l,
data[index].h,
data[index].f) );
}
})
.fail(function(jqXHR, textStatus, errorThrown) {
console.log("textStatus = " + textStatus + ", errorThrown = " + errorThrown);
});
};
The BloomFilter.create
method uses the exact same hashing functions as were used to create the filter in the Jekyll plugin. I ended up using Murmur hashes.
The search function has to do the same sort of setup that was required for inserting words into the filters. It converts the query to lowercase and replaces all non-alphabetic characters to spaces. It then iterates over bloom filter (and hence over each post) and returns a list of words that were found in that bloom filter. The return-type of this function is a list of JSON objects that have two fields - the URL of the page and a list of matches for that page.
var search = function(query) {
var matches = [];
var i;
query = query.toLowerCase().replace(/[^a-zA-Z]/, " ");
var words = query.split(/\s+/);
for ( i = 0; i < filters.length; i += 1 ) {
var results = words.filter( function(word) {
return filters[i].has(word);
} );
matches.push({
url: filters[i].url(),
words: results
});
}
return matches.filter(removeMisses).sort(rsortMatches);
};
The removeMisses
function on the last line of the function removes any object whose list of matched words is empty. That’s required because the var results = words.filter
will be an empty array if there was no match for any of the words in words
, and that’s therefore a page that is of no interest to us. The rsortMatches
function sorts the matches in reverse order by:
Once have a list of all the posts that matched the user’s query and also the words that matched for each page, I display the results to the user in a div. This div starts off hidden on each page and sits alongside the main #content
div. When I have a list of search results, I hide the #content
div, append the results to the #search-results
div and unhide it. Clearing all text from the search input box hides the #search-results
div and unhides the #content
div. Pretty simple.
While the approach outlined above is pretty solid in its own right, it has two shortcomings:
The solution to both of the above involves downloading a bit more content, but I believe it’s worth it. What I do is, I display the results of the user’s query as expected the moment I’ve checked through all the bloom filters and got the results. However, as soon as that is done, I iterate over all the matched pages and I fire off ajax requests to my webserve to fetch the post itself. In the callback for the request:
indexOf
on the fetched content, and if that returns -1
then I know that the page was a false positive, so I remove it from the search results. This fixes the first drawback mentioned above.indexOf
returned a valid index into the content, I run a regexp that extracts the entire sentence in which the match was found, up to a maximum of ten words before and ten words after the match. I match the regexp a maximum of two times over the post’s content, highlight the searched word in the result of the regexp, and then replace the simple matched: keyword1 keyword2
of the search result with a ... and in the context of blah and bloop, the awesome keyword1 was unmatched ... however, some people have recommended keyword2 as a highly viable alternative, and I am highly tempted to ...
. This (I hope) addresses the second drowback by making it easier for users to judge for themselves if the content on that page is what they are looking for.