Faceted Search with Lucene.NET

March 27, 2014

Blog | Technology | Faceted Search with Lucene.NET
Faceted Search with Lucene.NET

Solr and newer versions of Lucene offer easy ways to let you add faceted navigation to your website, but sometimes you’re in a situation where you’re tied to an earlier version and can’t easily upgrade. For example, if you’re a Sitecore developer, you’re probably very familiar with Lucene 2.3.1 and 2.9.4.1.

In situations like these, it’s helpful to understand a little bit more about how these tools work under the hood and be able to work within their capabilities and around their limitations.

Let’s look at an example scenario inspired by recent experiences:

Imagine you’re a developer working on a Sitecore 6.5 site, and the marketing department wants to add faceted navigation to a page displaying a significant amount of your content. Your company plans to upgrade to Sitecore 7 which would give you access to a version of Lucene.NET which would make your job much easier, but that’s not on the roadmap until the following year, and the marketing department wants faceted navigation up next week.There are  thousands of content entries, hundreds of facets, and dozens of facet groupings. Starting with a naive approach, you decide to simply write logic to perform a search for documents matching each facet. While this may be a little expensive, you figure you can just cache the results and everything will work out.

You decide to go ahead and implement this for the first page, and everything looks great. You can see all of the facets, you can see how many documents belong in each facet, the counts look right, and you filter the results down to a facet. The problem is solved before noon, it’s time to go out lunch and have a few mojitos then  call it a day, right?But wait, not so fast! Facets aren’t mutually exclusive, so when you filter down to items that belong in facet ‘A’, you notice the counts for all of the other facets are off. Your UI can’t continue showing that facet ‘B’ has 297 documents and facet ‘C’ has 50 documents – you actually need to show that only 100 documents are in both facets A and B, and only 9 are in A and C.

If only marketing were content to restrict end users’ filter by one facet at a time, you could probably get away with “brute forcing” this.  You quickly realize that the effective combination of facets is astronomical, and you’re probably not going to get away with running expensive queries and throwing the results in cache. There are a very large number of possible combinations of 300 facets, and you’ll have a lot of queries for every single new combination a user wants to explore on every cache miss. That is to say, If a user has already narrowed their search down to documents in facets 1, 3 and 11, you’ll need to run an additional query to see what documents are in facets 1,2,3 and 11 to know how many documents will match if the user selects ‘2’ and show that to them in the navigation. You’ll also have to run queries for “1,3,4,11”, “1,3,5,11” … all the way up to “1,3,11,300”.

You quickly realize that this approach isn’t going to scale, and wonder if there’s a better way.

Although you probably don’t have to deal with it on a daily basis, Lucene actually represents the results of a search as a bit array.

Let’s illustrate this with some simple examples. Imagine you have an index covering 5 documents, and run a search against it that documents 1, 3 and 4 match. Lucene represents these results as a bit array that looks like this:

01101 (search A – documents 1, 3 and 4 match)

 

A search matching results 1, 2 and 3 would look like this:

00111 (search B – documents 1, 3 and 3 match)

 

A search that matches documents 2 and 4 looks like this:

01010 (search C – documents 2 and 4 match)

 

A search that matches… well, you get the point by now.

Normally, you’re interested in the documents that match a search and these BitArrays are an implementation detail that get ignored, but you from calling the QueryFilter’s .Bits method.

Once you have a couple of these BitArray objects, you can start to do some interesting things with them. Ordinarily if you wanted to know the documents that matched searches ‘A’ and ‘B’, you’d have to build and execute a new search specifically. However, if we have the BitArrays  for searches A and B, we can perform some simple and extremely fast Boolean logic on them to get the documents that match both queries without hitting the index again.

You can do that by calling the .And method of the BitArray:

BitArray combinedResults = searchABits.And(searchBBits);

The result of this would be a new BitArray with the bits flipped on where the corresponding bits of A and B were on, and off otherwise:

00101 (documents 1 and 3).

If we were to AND this new array with the bits for search ‘C’, we’d get back a new BitArray of:

00000 (no documents match)

In order to count up how many bits are flipped on in a BitArray, you need to  get its cardinality. I don’t believe .NET offers a built in function to do this , but I found a very convenient method to do so here.

Going back to our original scenario, we now have all of the tools we need to be able to efficiently display counts for any number of facets. We can perform one search per facet value, and cache the resulting BitArray. On the first visit to the search page, this is all we will need to display the faceted navigation. When a user selects a facet (or combination of facets), all we need to do is loop over the remaining facets and AND each one together with their selection to be able to quickly and efficiently get a count for how many documents would still be in the results if the user drilled down to that facet.

If we have 300 facets, this will still require us to hit the index 300 times initially, but once we have the BitArrays we can combine them any way we want to retrieve facet counts.

I haven’t yet found a quick and easy way to go back from a BitArray to the corresponding documents – conventional wisdom seems to be to loop over the array and fetch the documents one by one, but I personally find it easier just to build and execute the corresponding query instead.

Jay Oliver, Chief Technology Officer, GeekHive

Jay Oliver

Chief Technology Officer
Tags
  • Faceted Search
  • Lucene
  • Sitecore

Recent Work

Check out what else we've been working on