
Adding External Datasources to Lucene Scoring

September 22nd, 2009 | No Comments | Posted in java

Here is a common scenario that a lot of websites encounter. Say you have a nice lucene index set up to handle searching on your site. As an example, let’s use an ecommerce site where all the products are stored in a lucene index. You’ve tweaked the query parameters and you think the results are fairly accurate. Now you want to make your results even better by adding a boost to products that are popular in your store. How do you do this?

The simplest, but least scalable, solution is to add a popularity field to your lucene index. Periodically you would run a job that ranks all your products by popularity in 1 to X order and saves that rank as a field in each lucene document. Then, using lucene’s FieldCacheSource and ValueSourceQuery, you can make this popularity field part of the query score.

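import java.io.IOException;

// package names as in the Lucene 2.x function package
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.function.DocValues;
import org.apache.lucene.search.function.FieldCacheSource;

public class PopularityFieldSource extends FieldCacheSource {

    public PopularityFieldSource(String field) {
        super(field);
    }

    @Override
    public DocValues getCachedFieldValues(FieldCache cache,
            String field, IndexReader reader) throws IOException {
        // one entry per document, in index order
        int[] popularities = cache.getInts(reader, field);

        final float[] weights = new float[popularities.length];

        for (int i = 0; i < popularities.length; i++) {
            // create an inverse of the popularity value; use float
            // arithmetic, because the int expression 1 / n truncates to 0
            if (popularities[i] > 0) {
                weights[i] = 1f + 1f / popularities[i];
            } else {
                weights[i] = 1f;
            }
        }

        return new DocValues() {

            public float floatVal(int doc) {
                return weights[doc];
            }

            public int intVal(int doc) {
                return (int) weights[doc];
            }

            public String toString(int doc) {
                // print the float value so fractional weights stay visible
                return description() + '=' + floatVal(doc);
            }
        };
    }

    // FieldCacheSource declares these two methods abstract
    @Override
    public boolean cachedFieldSourceEquals(FieldCacheSource other) {
        return other instanceof PopularityFieldSource;
    }

    @Override
    public int cachedFieldSourceHashCode() {
        return PopularityFieldSource.class.hashCode();
    }
}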

What happens in the line with

int[] popularities = cache.getInts(reader, field);

is that lucene creates an array holding the popularity field value of every document in the index. This is a key point: the array is ordered by document position within the lucene index. If the document order changes (you add or delete documents from the index), the order of this array changes too.

Now you just need a ValueSourceQuery to use this PopularityFieldSource.

Query query = ... your lucene query
PopularityFieldSource fieldSource = new PopularityFieldSource("name_of_popularity_field");
ValueSourceQuery valueQuery = new ValueSourceQuery(fieldSource);
CustomScoreQuery customQuery = new CustomScoreQuery(query, valueQuery);

So now your customQuery will incorporate popularity into your search results. I based this on Rob Young’s blog about Extending Lucene’s Scoring to use the document creation date to boost newer documents.
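To get a feel for what the boost does to the ordering, here is a small stand-alone sketch in plain Java with no lucene dependency. It assumes CustomScoreQuery keeps its default behaviour of multiplying the sub-query score by the value source score. The class name BoostDemo, the skus and the scores are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: mimics "text score * popularity weight".
class BoostDemo {

    // Returns the skus ordered by (textScore * weight), highest first.
    static List<String> rank(Map<String, Float> textScores, Map<String, Float> weights) {
        final Map<String, Float> combined = new LinkedHashMap<String, Float>();
        for (Map.Entry<String, Float> e : textScores.entrySet()) {
            Float w = weights.get(e.getKey());
            // Documents without a weight keep their original score (weight 1).
            combined.put(e.getKey(), e.getValue() * (w == null ? 1f : w));
        }
        List<String> skus = new ArrayList<String>(combined.keySet());
        Collections.sort(skus, new Comparator<String>() {
            public int compare(String a, String b) {
                return Float.compare(combined.get(b), combined.get(a));
            }
        });
        return skus;
    }

    public static void main(String[] args) {
        Map<String, Float> textScores = new LinkedHashMap<String, Float>();
        textScores.put("sku-a", 0.80f);
        textScores.put("sku-b", 0.70f);

        Map<String, Float> weights = new LinkedHashMap<String, Float>();
        weights.put("sku-b", 1.5f); // sku-b is popular

        // sku-b: 0.70 * 1.5 = 1.05 beats sku-a: 0.80 * 1 = 0.80
        System.out.println(rank(textScores, weights));
    }
}
```

Without any weights the text score alone decides the order; with the weight applied, the popular product moves ahead of a slightly better text match.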

As I said earlier, this works, but it is not the most scalable solution. First, you probably keep all your analytics data in a separate database instead of in lucene. Second, your lucene index is changing all the time, so you cannot constantly run a job to update the popularity field in your index.

Don’t worry, there is a way to include data external to the lucene index at query time.

The first assumption is that each document in your lucene index has a field that is used as the “id” for the document. In our ecommerce example, that field would normally be the “sku” or product id. The second assumption is that we can create a Map<String, Float> of the popularity rankings for our products. This is done outside of lucene and can just be a simple database call that ranks all your products and then stores that ranking in a map with the product sku as the map key.

Now we can change our PopularityFieldSource to use this map of rankings.

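import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

// package names as in the Lucene 2.x function package
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.function.DocValues;
import org.apache.lucene.search.function.FieldCacheSource;

public class PopularityFieldSource extends FieldCacheSource {

    private static final String POPULAR_FIELD = "_popular";
    private static final String ID_FIELD = "sku";

    // sku -> popularity weight, built outside of lucene
    private final Map<String, Float> values;

    public PopularityFieldSource(Map<String, Float> values) {
        super(POPULAR_FIELD);
        this.values = values;
    }

    @Override
    public DocValues getCachedFieldValues(FieldCache cache,
            String field, IndexReader reader) throws IOException {
        // the sku of every document, in index order
        String[] skus = cache.getStrings(reader, ID_FIELD);

        final float[] weights = new float[skus.length];

        if (values != null) {
            for (int i = 0; i < skus.length; i++) {
                if (values.get(skus[i]) != null) {
                    weights[i] = values.get(skus[i]);
                } else {
                    // not ranked: neutral weight
                    weights[i] = 1;
                }
            }
        } else {
            Arrays.fill(weights, 1);
        }

        return new DocValues() {

            public float floatVal(int doc) {
                return weights[doc];
            }

            public int intVal(int doc) {
                return (int) weights[doc];
            }

            public String toString(int doc) {
                // print the float value so fractional weights stay visible
                return description() + '=' + floatVal(doc);
            }
        };
    }

    // FieldCacheSource declares these two methods abstract
    @Override
    public boolean cachedFieldSourceEquals(FieldCacheSource other) {
        return other instanceof PopularityFieldSource
                && (values == null
                    ? ((PopularityFieldSource) other).values == null
                    : values.equals(((PopularityFieldSource) other).values));
    }

    @Override
    public int cachedFieldSourceHashCode() {
        return values == null ? 0 : values.hashCode();
    }
}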

Let’s go over the few changes.

First, since the popularity field is not stored in our lucene index, we have to fiddle with the “field” name used. In this example we are hard coding our id field, in this case “sku”. We are also saying that our PopularityFieldSource will be used for field “_popular”. The “_popular” field doesn’t exist, but don’t worry, that field name is only used for debugging, so you can name it whatever you want.

When we create the PopularityFieldSource, we pass in our map of weights. The weight values are based on our popularity rankings. In this case, weight = 1 + 1/ranking. I wanted the weight to be non-zero because I found that when the weight was zero, documents would be excluded from the search results. This formula is a simple way to keep the weights in the range from 2 (the highest, for the top-ranked product) down to just above 1 (the lowest). We also handle the case where a document does not appear in the popularity rankings: it still gets a weight of 1.
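As a rough sketch of that external step, the map could be built from a list of skus that is already sorted most popular first. The helper name PopularityWeights and the sample skus are hypothetical; in practice the list would come from your analytics database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a ranked list of skus into popularity weights.
class PopularityWeights {

    // rankedSkus is ordered most popular first, so index 0 has rank 1.
    static Map<String, Float> fromRanking(List<String> rankedSkus) {
        Map<String, Float> weights = new HashMap<String, Float>();
        for (int i = 0; i < rankedSkus.size(); i++) {
            int rank = i + 1;
            // Float arithmetic on purpose: with ints, 1 / rank is 0 for rank > 1.
            weights.put(rankedSkus.get(i), 1f + 1f / rank);
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Float> weights =
                fromRanking(Arrays.asList("sku-42", "sku-7", "sku-19", "sku-3"));
        System.out.println(weights.get("sku-42")); // rank 1 -> 2.0
        System.out.println(weights.get("sku-7"));  // rank 2 -> 1.5
        System.out.println(weights.get("sku-3"));  // rank 4 -> 1.25
    }
}
```

Any sku that is missing from the list simply has no entry in the map, which is the case the weight of 1 covers.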

As in the first version of PopularityFieldSource, the document order within the lucene index is important. So we have to find a way to relate our weightings to the particular document in the lucene index.

In the line:

String[] skus = cache.getStrings(reader, ID_FIELD);

we get an array of the sku field values for all documents in the lucene index. This array will change if we alter our lucene index, so we have to get this array at query time. But this also makes the process nice, because we can modify our lucene index and our product popularity rankings independently, at different times.
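A tiny simulation shows why looking the weight up by sku keeps things aligned. This is plain Java, not lucene: the String arrays stand in for what the field cache returns, and the class name DocOrderDemo is made up. It also simplifies how lucene really renumbers documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup; skusInDocOrder plays the role of the
// array returned by the field cache, indexed by document number.
class DocOrderDemo {

    static float[] weightsFor(String[] skusInDocOrder, Map<String, Float> weightsBySku) {
        float[] weights = new float[skusInDocOrder.length];
        for (int doc = 0; doc < skusInDocOrder.length; doc++) {
            Float w = weightsBySku.get(skusInDocOrder[doc]);
            weights[doc] = (w == null) ? 1f : w; // unranked documents stay neutral
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Float> weightsBySku = new HashMap<String, Float>();
        weightsBySku.put("sku-b", 2f);

        String[] before = {"sku-a", "sku-b", "sku-c"};
        String[] after = {"sku-b", "sku-c"}; // e.g. sku-a was removed

        // The weight of 2 follows sku-b from position 1 to position 0.
        System.out.println(Arrays.toString(weightsFor(before, weightsBySku)));
        System.out.println(Arrays.toString(weightsFor(after, weightsBySku)));
    }
}
```

Because the weights are keyed by sku and only turned into a positional array at query time, neither the index job nor the ranking job needs to know about the other.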

So once we have an array of all our lucene documents, we loop through the sku array and pull in the corresponding weight from our values map.

for (int i = 0; i < skus.length; i++) {
   if (values.get(skus[i])!=null) {
      weights[i] = values.get(skus[i]);
   }
   else {
      weights[i] = 1;
   }
}

This loop has successfully applied our external popularity weights to each document in the lucene index. The final query is exactly the same as above:

Query query = ... your lucene query
Map<String, Float> popularityWeights = ... external process to create weights for each product
PopularityFieldSource fieldSource = new PopularityFieldSource(popularityWeights);
ValueSourceQuery valueQuery = new ValueSourceQuery(fieldSource);
CustomScoreQuery customQuery = new CustomScoreQuery(query, valueQuery);

Since ValueSourceQuery extends the base lucene Query class, you can tweak the query even more by applying boosts to the valueQuery, or you can adjust how you calculate your popularityWeights. And if you want to see exactly how the scores are calculated, you can use IndexSearcher.explain(customQuery, docid) to see the full details.

There you go. Your search engine just got a little smarter and can continually adjust itself based on your website traffic. In a later post, I will show how you can create a custom sorter so you can find the most popular items that match a query.

Jackrabbit 1.5 vs 1.6 Query Performance

September 2nd, 2009 | No Comments | Posted in java

Yes, I’m still talking about Jackrabbit query performance. But this time, I finally have something positive to say.

In our existing Jackrabbit setup, we are using version 1.5.0. I thought I would try out version 1.6 to see if it provides any query performance boosts. The short answer: yes, it does.

Test Setup

My test setup is really basic. I created a simple program that would create 100 threads, each running the same query at the same time. I then measured how long it took for all 100 queries to complete. You might say this vaguely represents 100 concurrent connections, but I just intended the test to run the same query over and over. For each query type (more on that later), I ran the test program 3 separate times for Jackrabbit 1.5.0 and 3 separate times for Jackrabbit 1.6.
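A harness along these lines can be sketched as follows. This is not the original test program, just a minimal reconstruction of the idea: the class name ConcurrentQueryTimer is made up, and the Runnable is a placeholder for "run one query against the repository".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical harness: the Runnable stands in for "execute one query".
class ConcurrentQueryTimer {

    // Releases `threads` workers together and returns the elapsed time in
    // milliseconds until every one of them has finished.
    static long timeAll(int threads, final Runnable query) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        start.await(); // hold every worker until the gate opens
                        query.run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                }
            });
        }
        long begin = System.nanoTime();
        start.countDown(); // open the gate for all workers
        done.await();      // wait for the slowest one
        long elapsed = System.nanoTime() - begin;
        pool.shutdown();
        return TimeUnit.NANOSECONDS.toMillis(elapsed);
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder workload; a real run would execute the XPath query here.
        long ms = timeAll(100, new Runnable() {
            public void run() {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        System.out.println("100 queries took " + ms + " ms");
    }
}
```

Running it three times per query type and averaging the returned times gives numbers of the kind reported in the results.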

Query Types

Looking through our application code, I came up with some basic query types that we use. These are very general queries intended to help point out what types of queries perform better in version 1.6. All the queries tested are written in XPath.

Single Property
//element(*,my:type)[@property='value']

Two Properties
//element(*,my:type)[@property1='value1' and @property2='value2']

Like on Property
//element(*,my:type)[jcr:like(@property,'value%')]

Like on Child Property
//element(*,my:type)[jcr:like(./child/@property,'value%')]

Likes on Two Child Properties
//element(*,my:type)[jcr:like(./child/@property1,'value1%') and jcr:like(./child/@property2,'value2%')]

If Child Property Exists Or Is Not
//element(*,my:type)[not(./child/@property) or ./child/@property!='value']

Results

Query Type                    | v1.5 Avg | v1.6 Avg | % Improvement
Single Property               | 28.5 s   | 20.3 s   | 29 %
Two Properties                | 16.7 s   | 9.7 s    | 42 %
Like on Property              | 17.8 s   | 10.2 s   | 43 %
Like on Child Property        | 94.5 s   | 42.8 s   | 55 %
Like on Two Child Properties  | 65.3 s   | 34.3 s   | 47 %
If Child Exists Or Is Not     | 137.4 s  | 55.4 s   | 60 %
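
Assuming the last column is the relative drop in average time, (v1.5 - v1.6) / v1.5, the figures can be reproduced with a few lines of Java. The class name Improvement is only illustrative.

```java
// Hypothetical helper reproducing the "% Improvement" column.
class Improvement {

    // Percentage by which `after` is lower than `before`.
    static double percent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Single Property: 28.5 s down to 20.3 s
        System.out.println(Math.round(percent(28.5, 20.3)) + " %");
        // If Child Exists Or Is Not: 137.4 s down to 55.4 s
        System.out.println(Math.round(percent(137.4, 55.4)) + " %");
    }
}
```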

Summary

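So what do the results show us? First, if you want increased query performance, moving to v1.6 is something you should really consider. Second, v1.6 shows its largest gains on queries that traverse the child axis: the three child-property queries improved by 47 % to 60 %, compared with 29 % to 43 % for the queries on a node’s own properties.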


A Java Plugin Framework Wishlist

September 1st, 2009 | 2 Comments | Posted in java

Yes, there are times that I am jealous of the apparent simplicity of php driven sites. Take Drupal, for instance: I really like all the functionality it has. I know there has been a ton of work and support to get Drupal where it is today, but the thing that makes me really jealous is their plugin framework. Anyone can write a plugin and distribute it to other Drupal users. This just seems so simple. And if you look at other php systems like Joomla, they all have similar plugin/module frameworks. So where is the java equivalent?

Now from a technical aspect, I understand why it is easier to write a plugin system in php. Since php is a scripting language, everything happens at runtime; there is no pre-compile stage. With java, everything has to be compiled first before deploying it. So in java, it is a little harder to add things at runtime that have not been pre-compiled with the rest of your app. There are definitely ways around this; it is just a big nightmare to handle all the different configuration settings to make it work.
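One of those ways, sketched very roughly: the host application can instantiate a plugin class by name through reflection, so the name can come from configuration rather than from the host’s own source. The plugin still has to be compiled, just not together with the host. Plugin, HelloPlugin and PluginLoader are hypothetical names, and this sketch ignores class loaders and jar dependencies, which is where the real pain starts.

```java
// Hypothetical plugin contract; the names are illustrative only.
interface Plugin {
    String name();
}

// A sample plugin. In a real setup this would live in its own jar.
class HelloPlugin implements Plugin {
    public String name() {
        return "hello";
    }
}

class PluginLoader {

    // Instantiates a plugin from its class name, which could be read from a
    // config file instead of being referenced in the host application code.
    static Plugin load(String className) throws Exception {
        Class<?> type = Class.forName(className);
        return (Plugin) type.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Plugin plugin = load(HelloPlugin.class.getName());
        System.out.println("loaded plugin: " + plugin.name());
    }
}
```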

I’ve looked at a few existing java projects that use plugins: Hudson, Magnolia and OpenCMS. All of them work; I just haven’t immersed myself enough to completely understand how each of the different systems works. With java, unlike php, you will sooner or later have to address dependency management and how to organize all the different versions of jars that get thrown into your app. This goes back to handling the configuration of all your different plugins.

So finally, to my wish list. What I would love to have is a basic framework that accepts different plugins that will ultimately build an app. An easy example is a simple website. I would want one module for the admin section, one module for blogs, one module for message boards, one module for comments. I think you get the picture. Everything is modular; this way you can just stack together your modules to create the functionality of your site.

In my research, OSGi almost looks like what I am looking for. I say almost, because one of the requirements I have is to make it easy to use within a servlet container. With OSGi, your servlet container is actually a module itself. To me, this just adds another layer to the stack. And most developers already know how to use a servlet container, so I don’t want to make them learn how to use OSGi.

Now to my wishlist of what I want the plugin framework to do:

1. Be able to assign a plugin to act like a servlet filter

This functionality is used to do things before or after a web request. You could use this to intercept incoming request parameters and do something with them, like set a cookie based on the referring source of the request.

2. Be able to register new url actions

I want to be able to add new pages to the site. So I would need to add all the actions associated with viewing a blog, for instance. This would include both the page logic and the view layer (images, templates, etc).

3. Be able to assign plugins to a specific lifecycle phase

This is just like “hooks” in php. Say I want certain logic to fire every time I save something, or some special code to run when the page is rendered.
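
To make the third item a bit more concrete, here is a minimal sketch of what a hook registry could look like in plain Java. None of these names (HookRegistry, Hook, the phase strings) come from an existing framework; they are just one guess at a shape for the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of lifecycle hooks; not taken from any real framework.
class HookRegistry {

    // A plugin callback that receives the object the phase is about.
    interface Hook {
        void run(Object subject);
    }

    private final Map<String, List<Hook>> hooksByPhase = new HashMap<String, List<Hook>>();

    // Plugins call this once, e.g. at startup, to attach to a phase.
    void register(String phase, Hook hook) {
        List<Hook> hooks = hooksByPhase.get(phase);
        if (hooks == null) {
            hooks = new ArrayList<Hook>();
            hooksByPhase.put(phase, hooks);
        }
        hooks.add(hook);
    }

    // The host application calls this whenever it reaches a phase.
    // Returns how many hooks were run.
    int fire(String phase, Object subject) {
        List<Hook> hooks = hooksByPhase.get(phase);
        if (hooks == null) {
            return 0;
        }
        for (Hook hook : hooks) {
            hook.run(subject);
        }
        return hooks.size();
    }

    public static void main(String[] args) {
        HookRegistry registry = new HookRegistry();
        registry.register("beforeSave", new Hook() {
            public void run(Object subject) {
                System.out.println("audit plugin saw: " + subject);
            }
        });
        registry.fire("beforeSave", "blog post #1");
    }
}
```

The host only needs to know the phase names; which plugins listen on them is decided when the plugins register.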

So there you have it. Sounds simple right? Well, hopefully I can find something that will meet my needs, otherwise I may have to start writing my own.