Faceted Search with Lucene.NET
March 27, 2014
Solr and newer versions of Lucene offer easy ways to let you add faceted navigation to your website, but sometimes you’re in a situation where you’re tied to an earlier version and can’t easily upgrade. For example, if you’re a Sitecore developer, you’re probably very familiar with Lucene 2.3.1 and 220.127.116.11.
Let’s look at an example scenario inspired by recent experiences:
Imagine you’re a developer working on a Sitecore 6.5 site, and the marketing department wants to add faceted navigation to a page displaying a significant amount of your content. Your company plans to upgrade to Sitecore 7 which would give you access to a version of Lucene.NET which would make your job much easier, but that’s not on the roadmap until the following year, and the marketing department wants faceted navigation up next week.There are thousands of content entries, hundreds of facets, and dozens of facet groupings. Starting with a naive approach, you decide to simply write logic to perform a search for documents matching each facet. While this may be a little expensive, you figure you can just cache the results and everything will work out.
You decide to go ahead and implement this for the first page, and everything looks great. You can see all of the facets, you can see how many documents belong in each facet, the counts look right, and you filter the results down to a facet. The problem is solved before noon, it’s time to go out lunch and have a few mojitos then call it a day, right?But wait, not so fast! Facets aren’t mutually exclusive, so when you filter down to items that belong in facet ‘A’, you notice the counts for all of the other facets are off. Your UI can’t continue showing that facet ‘B’ has 297 documents and facet ‘C’ has 50 documents – you actually need to show that only 100 documents are in both facets A and B, and only 9 are in A and C.
If only marketing were content to restrict end users’ filter by one facet at a time, you could probably get away with “brute forcing” this. You quickly realize that the effective combination of facets is astronomical, and you’re probably not going to get away with running expensive queries and throwing the results in cache. There are a very large number of possible combinations of 300 facets, and you’ll have a lot of queries for every single new combination a user wants to explore on every cache miss. That is to say, If a user has already narrowed their search down to documents in facets 1, 3 and 11, you’ll need to run an additional query to see what documents are in facets 1,2,3 and 11 to know how many documents will match if the user selects ‘2’ and show that to them in the navigation. You’ll also have to run queries for “1,3,4,11”, “1,3,5,11” … all the way up to “1,3,11,300”.
Let’s illustrate this with some simple examples. Imagine you have an index covering 5 documents, and run a search against it that documents 1, 3 and 4 match. Lucene represents these results as a bit array that looks like this:
Once you have a couple of these BitArray objects, you can start to do some interesting things with them. Ordinarily if you wanted to know the documents that matched searches ‘A’ and ‘B’, you’d have to build and execute a new search specifically. However, if we have the BitArrays for searches A and B, we can perform some simple and extremely fast Boolean logic on them to get the documents that match both queries without hitting the index again.
In order to count up how many bits are flipped on in a BitArray, you need to get its cardinality. I don’t believe .NET offers a built in function to do this , but I found a very convenient method to do so here.
Going back to our original scenario, we now have all of the tools we need to be able to efficiently display counts for any number of facets. We can perform one search per facet value, and cache the resulting BitArray. On the first visit to the search page, this is all we will need to display the faceted navigation. When a user selects a facet (or combination of facets), all we need to do is loop over the remaining facets and AND each one together with their selection to be able to quickly and efficiently get a count for how many documents would still be in the results if the user drilled down to that facet.
I haven’t yet found a quick and easy way to go back from a BitArray to the corresponding documents – conventional wisdom seems to be to loop over the array and fetch the documents one by one, but I personally find it easier just to build and execute the corresponding query instead.
Stay up to date with our email updates!