
Adding External Datasources to Lucene Scoring

September 22nd, 2009 Posted in java

Here is a common scenario that a lot of websites encounter. Say you have a nice lucene index set up to handle searching on your site. As an example, let’s use an ecommerce site where all the products are stored in a lucene index. You’ve tweaked the query parameters and you think the results are fairly accurate. Now you want to make your results even better by boosting products that are popular in your store. How do you do this?

The simplest, but least scalable, solution is to add a popularity field to your lucene index. Periodically you would run a job that ranks all your products by popularity in 1 to X order and then saves this rank as a field in each lucene document. Then, using lucene’s FieldCacheSource, ValueSourceQuery and CustomScoreQuery, you can make this popularity field part of the query score.

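import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.function.DocValues;
import org.apache.lucene.search.function.FieldCacheSource;

public class PopularityFieldSource extends FieldCacheSource {

    public PopularityFieldSource(String field) {
        super(field);
    }

    @Override
    public DocValues getCachedFieldValues(FieldCache cache,
            String field, IndexReader reader) throws IOException {
        int[] popularities = cache.getInts(reader, field);

        final float[] weights = new float[popularities.length];

        for (int i = 0; i < popularities.length; i++) {
            // create an inverse of the popularity value; use float arithmetic,
            // because the int expression 1 / rank truncates to 0 for any rank above 1
            if (popularities[i] > 0) {
                weights[i] = 1f + 1f / popularities[i];
            } else {
                weights[i] = 1f;
            }
        }

        return new DocValues() {

            public float floatVal(int doc) {
                return weights[doc];
            }

            public int intVal(int doc) {
                return (int) weights[doc];
            }

            public String toString(int doc) {
                // print the float weight; intVal would truncate 1.5 to 1
                return description() + '=' + floatVal(doc);
            }
        };
    }

    // FieldCacheSource also declares these two abstract methods
    @Override
    public boolean cachedFieldSourceEquals(FieldCacheSource other) {
        return other.getClass() == PopularityFieldSource.class;
    }

    @Override
    public int cachedFieldSourceHashCode() {
        return PopularityFieldSource.class.hashCode();
    }
}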

What happens in the line with

int[] popularities = cache.getInts(reader, field);

is that lucene will create an array holding the popularity field of every document in the index. This is a key point: the array follows the document order within the lucene index. If the document order changes (you add or delete documents from the index), the order of this array will change too.

Now you just need to wrap this PopularityFieldSource in a ValueSourceQuery.

Query query = ... your lucene query
PopularityFieldSource fieldSource = new PopularityFieldSource("name_of_popularity_field");
ValueSourceQuery valueQuery = new ValueSourceQuery(fieldSource);
CustomScoreQuery customQuery = new CustomScoreQuery(query, valueQuery);

Now your customQuery will incorporate popularity into your search results. I based this on Rob Young’s blog post about Extending Lucene’s Scoring, which uses the document creation date to boost newer documents.

As I said earlier, this works, but it is not the most scalable solution. First, you probably keep all your analytics data in a separate database instead of in lucene. Second, your lucene index is changing all the time, so you cannot constantly run a job to update the popularity field in your index.

Don’t worry, there is a way to include data external to the lucene index at query time.

The first assumption is that each document in your lucene index has a field that is used as the “id” for the document. In our ecommerce example, that field would normally be the “sku” or product id. The second assumption is that we can create a Map<String, Float> of the popularity rankings for our products. This is done outside of lucene and can be a simple database call that ranks all your products and then stores that ranking in a map with the product sku as the map key.
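As a rough illustration of that external ranking step, here is a minimal sketch that needs no lucene at all. It assumes the database call has already produced a map of sku to units sold; the class name PopularityRanker, the method rankBySales and the sample skus are all made up for this example. It only produces rank positions (1 is the best seller); turning a rank into a weight is a separate step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PopularityRanker {

    // Hypothetical helper: turns raw sales counts (sku -> units sold)
    // into rank positions, where 1 is the best seller.
    static Map<String, Integer> rankBySales(Map<String, Integer> salesBySku) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(salesBySku.entrySet());
        // Highest sales first; ties are broken by sku so the order is deterministic.
        entries.sort((a, b) -> {
            int bySales = Integer.compare(b.getValue(), a.getValue());
            return bySales != 0 ? bySales : a.getKey().compareTo(b.getKey());
        });

        // LinkedHashMap keeps the entries in rank order when iterated.
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 1;
        for (Map.Entry<String, Integer> entry : entries) {
            ranks.put(entry.getKey(), rank++);
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the analytics database.
        Map<String, Integer> sales = new LinkedHashMap<>();
        sales.put("sku-100", 12);
        sales.put("sku-200", 85);
        sales.put("sku-300", 40);
        System.out.println(rankBySales(sales)); // {sku-200=1, sku-300=2, sku-100=3}
    }
}
```

In a real store the sales counts would come from whatever analytics query you already run; the only thing the rest of the approach needs is a value keyed by the same sku that is stored in the index.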

Now we can change our PopularityFieldSource to use this map of rankings.

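import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.function.DocValues;
import org.apache.lucene.search.function.FieldCacheSource;

public class PopularityFieldSource extends FieldCacheSource {

    private static final String POPULAR_FIELD = "_popular";
    private static final String ID_FIELD = "sku";

    // sku -> popularity weight, built outside of lucene
    private final Map<String, Float> values;

    public PopularityFieldSource(Map<String, Float> values) {
        super(POPULAR_FIELD);
        this.values = values;
    }

    @Override
    public DocValues getCachedFieldValues(FieldCache cache,
            String field, IndexReader reader) throws IOException {
        String[] skus = cache.getStrings(reader, ID_FIELD);

        final float[] weights = new float[skus.length];

        if (values != null) {
            for (int i = 0; i < skus.length; i++) {
                if (values.get(skus[i]) != null) {
                    weights[i] = values.get(skus[i]);
                } else {
                    // products without a ranking keep a neutral weight
                    weights[i] = 1f;
                }
            }
        } else {
            Arrays.fill(weights, 1f);
        }

        return new DocValues() {

            public float floatVal(int doc) {
                return weights[doc];
            }

            public int intVal(int doc) {
                return (int) weights[doc];
            }

            public String toString(int doc) {
                // print the float weight; intVal would truncate 1.5 to 1
                return description() + '=' + floatVal(doc);
            }
        };
    }

    // FieldCacheSource also declares these two abstract methods
    @Override
    public boolean cachedFieldSourceEquals(FieldCacheSource other) {
        if (!(other instanceof PopularityFieldSource)) {
            return false;
        }
        Map<String, Float> otherValues = ((PopularityFieldSource) other).values;
        return values == null ? otherValues == null : values.equals(otherValues);
    }

    @Override
    public int cachedFieldSourceHashCode() {
        return values == null ? 0 : values.hashCode();
    }
}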

Let’s go over the few changes.

First, since the popularity field is not stored in our lucene index, we have to fiddle with the “field” name used. In this example we are hard coding our id field, in this case “sku”. We are also saying that our PopularityFieldSource will be used for field “_popular”. The “_popular” field doesn’t exist, but don’t worry, that field name is only used for debugging, so you can name it whatever you want.

When we create the PopularityFieldSource, we pass in our map of weights. The weight values are based on our popularity rankings: weight = 1 + 1/ranking. I wanted the weight to be non-zero because I found that when the weight was zero, documents would be excluded from the search results. This formula is just a simple way to keep the weights in the range from 2 (the highest, for the top-ranked product) down toward 1 (the lowest). We also handle the case where a document does not appear in the popularity rankings: it still gets a weight of 1.
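To make the formula concrete, here is a small standalone sketch of weight = 1 + 1/ranking. The class and method names are invented for illustration. It also shows a pitfall worth watching for: if the division is done with int arithmetic, 1/ranking truncates to 0 for every rank above 1, so every product except the top one would end up with a weight of exactly 1.

```java
class PopularityWeight {

    // weight = 1 + 1/rank, computed in floating point.
    // Ranks below 1 (treated here as "unranked") fall back to the neutral weight 1.
    static float weightForRank(int rank) {
        if (rank < 1) {
            return 1f;
        }
        return 1f + 1f / rank;
    }

    // The same formula with int arithmetic, shown only to illustrate the pitfall:
    // 1 / rank is integer division and truncates to 0 for every rank above 1.
    // (It would also throw ArithmeticException for rank 0.)
    static float truncatedWeightForRank(int rank) {
        return 1 + 1 / rank;
    }

    public static void main(String[] args) {
        for (int rank : new int[] {1, 2, 4}) {
            System.out.println("rank " + rank
                    + ": float weight = " + weightForRank(rank)
                    + ", int-arithmetic weight = " + truncatedWeightForRank(rank));
        }
    }
}
```

With float arithmetic, ranks 1, 2 and 4 give weights of 2.0, 1.5 and 1.25, while the int version gives 2.0, 1.0 and 1.0.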

As in the first version of PopularityFieldSource, the document order within the lucene index is important. So we have to find a way to relate our weightings to the particular document in the lucene index.

In the line:

String[] skus = cache.getStrings(reader, ID_FIELD);

we get an array containing the sku field of every document in the lucene index. This array will change if we alter our lucene index, so we have to get this array at query time. But this is also what makes the process nice: we can continually modify both our lucene index and our product popularity rankings at different times.
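The effect of a changing index can be simulated without lucene. In the sketch below, a plain String[] stands in for the sku array in document order, and the class name DocOrderWeights, the method alignWeights and the sample skus are assumptions made for this illustration. The point is that the same weight map yields a different per-document array once documents are deleted or added, which is why the lookup has to happen against the current array.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class DocOrderWeights {

    // skusInDocOrder[doc] is the sku of the document with internal id doc,
    // standing in for the array the field cache returns in the article.
    static float[] alignWeights(String[] skusInDocOrder, Map<String, Float> weightBySku) {
        float[] weights = new float[skusInDocOrder.length];
        for (int doc = 0; doc < skusInDocOrder.length; doc++) {
            Float weight = weightBySku.get(skusInDocOrder[doc]);
            // Documents with no ranking keep the neutral weight 1.
            weights[doc] = (weight != null) ? weight : 1f;
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Float> weightBySku = new HashMap<>();
        weightBySku.put("sku-200", 2.0f);
        weightBySku.put("sku-300", 1.5f);

        String[] before = {"sku-100", "sku-200", "sku-300"};
        // Simulated index change: sku-100 was deleted and sku-400 added,
        // so the remaining documents now sit at different positions.
        String[] after = {"sku-200", "sku-300", "sku-400"};

        System.out.println(Arrays.toString(alignWeights(before, weightBySku))); // [1.0, 2.0, 1.5]
        System.out.println(Arrays.toString(alignWeights(after, weightBySku)));  // [2.0, 1.5, 1.0]
    }
}
```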

So once we have an array of all our lucene documents, we loop through the sku array and pull in the corresponding weight from our values map.

for (int i = 0; i < skus.length; i++) {
   if (values.get(skus[i])!=null) {
      weights[i] = values.get(skus[i]);
   }
   else {
      weights[i] = 1;
   }
}

This loop has successfully applied our external popularity weights to each document in the lucene index. The final query is exactly the same as above:

Query query = ... your lucene query
Map<String, Float> popularityWeights = ... external process to create weights for each product
PopularityFieldSource fieldSource = new PopularityFieldSource(popularityWeights);
ValueSourceQuery valueQuery = new ValueSourceQuery(fieldSource);
CustomScoreQuery customQuery = new CustomScoreQuery(query, valueQuery);

Since ValueSourceQuery extends the base lucene Query class, you can tweak the query even more by applying boosts to the valueQuery, or you can adjust how you calculate your popularityWeights. And if you want to see exactly how the scores are calculated, you can use IndexSearcher.explain(customQuery, docid) to see the full details.
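For intuition about what the combined score looks like, here is a tiny lucene-free sketch. It assumes the default behaviour of CustomScoreQuery, which as far as I know multiplies the sub-query score by the value source score; check the explain() output for your lucene version to confirm. The class name CombinedScore and the scores used are made up.

```java
class CombinedScore {

    // Assumed default combination: text relevance multiplied by the popularity weight.
    static float combine(float subQueryScore, float popularityWeight) {
        return subQueryScore * popularityWeight;
    }

    public static void main(String[] args) {
        // Document A matches the text slightly better but has no popularity ranking.
        float scoreA = combine(0.625f, 1.0f);
        // Document B matches slightly worse but is the second most popular product.
        float scoreB = combine(0.5f, 1.5f);

        System.out.println("A = " + scoreA); // A = 0.625
        System.out.println("B = " + scoreB); // B = 0.75
        System.out.println("B outranks A: " + (scoreB > scoreA)); // B outranks A: true
    }
}
```

Under this assumption, a weight of zero would produce a combined score of zero, which is consistent with the observation above that zero-weight documents dropped out of the results.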

There you go. Your search engine just got a little smarter and can continually adjust itself based on your website traffic. In a later post, I will tell how you can create a custom sorter so you can find the most popular items that match a query.
