Jackrabbit Query Tips: Better Where Clauses
If you’ve paid attention, you have probably noticed that I have a love/hate relationship with Jackrabbit. Luckily, this week I ran into a developer who has been successfully running a high-traffic Jackrabbit site for several years. One of the major tips he gave me was to look at how I structured my queries. This was something that I toyed with a few months ago but never put into production. So for the last few days I’ve been tweaking Jackrabbit queries. Everything that I’m doing can be found on the Jackrabbit mailing list, but I thought I would summarize it here for those who are interested.
Note: All my queries are in XPath. I’m sure these same ideas apply to SQL queries; I just haven’t done the conversions myself.
Use Meaningful Where Clauses
Where clauses are a must in Jackrabbit queries. The way that a Jackrabbit query works is that it first finds all entries that match the where clause, then filters those results by any path limitations. So if your where clauses are not restrictive, Jackrabbit will have to do a lot of extra work to find the desired results.
Say we have blog post data mixed in with product review data. If our content is organized using Rule 2 of David’s Model, it would look something like:
/mysite/mycontent/blogs/2009/08/27/…
/mysite/mycontent/reviews/2009/08/27/…
In this setup, our content is organized hierarchically by content type and then date.
Now we can query the content to find all blogs by doing:
/jcr:root/mysite/mycontent/blogs//*
The downside is that this query will actually get ALL elements, blogs and reviews, then loop through those to find which ones belong in the /mysite/mycontent/blogs path. So what you can do is add a property to your content. I use something like @contentType. In your app, you would assign values to this property like 'blog' or 'review'. So all blog entries would get @contentType='blog' and all reviews would get @contentType='review'. This will greatly help our query because now we can do:
/jcr:root/mysite/mycontent/blogs//*[@contentType='blog']
What happens in this query is that Jackrabbit first matches all elements with @contentType='blog', then filters them by the path /mysite/mycontent/blogs. Say you have 1,000 blogs and 1,000 reviews. Just by adding @contentType='blog', you cut in half the number of nodes that Jackrabbit has to analyze during the final part of this query: 1,000 instead of 2,000.
So look at your queries. Are there any other properties that you can add to the where clause? Possibly a date field like start date or created date?
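To make the arithmetic concrete, here is a toy model in Python of the evaluation order described above. It is only a sketch of the idea, not how Jackrabbit is actually implemented, and the node names (post0, review0, and so on) are made up for the illustration.

```python
# Toy model of the two-phase evaluation described in this post.
# This is NOT Jackrabbit code; node names below are hypothetical.

def count_path_checks(nodes, path_prefix, predicate=None):
    """Return (matches, path_checks) for a two-phase query.

    Phase 1 keeps the nodes that satisfy the property predicate
    (or every node when there is no predicate).
    Phase 2 checks the path of each node that survived phase 1.
    """
    candidates = [n for n in nodes if predicate is None or predicate(n)]
    matches = [n for n in candidates if n["path"].startswith(path_prefix)]
    return len(matches), len(candidates)


# 1,000 blogs and 1,000 reviews, as in the example above.
nodes = [
    {"path": f"/mysite/mycontent/blogs/post{i}", "contentType": "blog"}
    for i in range(1000)
] + [
    {"path": f"/mysite/mycontent/reviews/review{i}", "contentType": "review"}
    for i in range(1000)
]

blogs_path = "/mysite/mycontent/blogs/"

# No where clause: every node reaches the path filter.
print(count_path_checks(nodes, blogs_path))  # (1000, 2000)

# With [@contentType='blog']: only the blogs reach the path filter.
print(count_path_checks(nodes, blogs_path, lambda n: n["contentType"] == "blog"))  # (1000, 1000)
```

Both queries return the same 1,000 blogs, but the second one only has to path-check 1,000 nodes instead of 2,000.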
Move Some Path Data to Properties
The mailing list mentions that there are ways to have Jackrabbit index the full path of a node, but it isn’t an easy thing to change and it also makes moving nodes around harder. So what I would suggest is to look for parts of your path that can work as properties, like we did with @contentType above.
The system I am using hosts multiple websites within the same Jackrabbit workspace. Each site is separated into a different path.
/sites/site1
/sites/site2
One thing that we did is add the site as a property. So all nodes for site “site1” have the property @site='site1'. Then in our query, we are able to add that property to the where clause:
//*[@site='site1' and @contentType='blog']
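If your app builds these queries in code, a small helper keeps the where clause consistent. The following is a minimal sketch in Python; the function names build_query and xpath_literal are my own for this example and are not part of any Jackrabbit API. The apostrophe doubling follows the XPath 2.0 convention, so check it against your Jackrabbit version before relying on it.

```python
def xpath_literal(value):
    """Quote a string for use in an XPath predicate.

    Assumption: apostrophes are escaped by doubling them, as in XPath 2.0.
    """
    return "'" + value.replace("'", "''") + "'"


def build_query(properties, path=""):
    """Build an XPath query for all nodes under `path` whose properties
    match every name/value pair in `properties`.

    An empty `path` searches the whole workspace.
    """
    if not properties:
        # The whole point of this post: never query without a where clause.
        raise ValueError("add at least one property to the where clause")
    predicate = " and ".join(
        f"@{name}={xpath_literal(value)}" for name, value in properties.items()
    )
    prefix = f"/jcr:root{path}" if path else ""
    return f"{prefix}//*[{predicate}]"


print(build_query({"site": "site1", "contentType": "blog"}))
# //*[@site='site1' and @contentType='blog']

print(build_query({"contentType": "blog"}, "/mysite/mycontent/blogs"))
# /jcr:root/mysite/mycontent/blogs//*[@contentType='blog']
```

Refusing to build a query with no properties is a deliberate choice here: it makes it hard to accidentally ship a path-only query.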
Debugging Help
A great way to find what queries are running is to turn on DEBUG logging for org.apache.jackrabbit.core.query.QueryImpl. Every time a query is executed, the log will show the query and how long it took to execute. By watching the logs, you can focus your attention on queries that take a long time to run.
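If your setup logs through log4j 1.x with a properties file (an assumption on my part; adjust for whatever logging backend you actually use), the line would look something like this:

```properties
# Assumes a log4j 1.x properties file; only the logger name comes from this post.
log4j.logger.org.apache.jackrabbit.core.query.QueryImpl=DEBUG
```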
Summary
As you can see, just by tweaking your queries you can greatly improve your Jackrabbit performance. One thing that helped me a lot was a script I created that runs the same query 100 times simultaneously and records how long it takes to run all 100. I then keep tweaking the query and re-running the script until I find the version that works best.
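A rough sketch of that kind of timing script, written in Python for illustration, might look like the following. It is not the script I actually use, and run_query is a placeholder: you would swap in whatever function sends a query to your repository. The fake_run_query stand-in just sleeps so the example runs on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_concurrent_runs(run_query, query, runs=100):
    """Submit `run_query(query)` `runs` times to a thread pool with up to
    `runs` worker threads and return the total elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=runs) as pool:
        futures = [pool.submit(run_query, query) for _ in range(runs)]
        for future in futures:
            future.result()  # re-raises any error from a worker thread
    return time.perf_counter() - start


def fake_run_query(query):
    """Hypothetical stand-in for a real query call; it only sleeps."""
    time.sleep(0.01)


if __name__ == "__main__":
    candidates = [
        "/jcr:root/mysite/mycontent/blogs//*",
        "/jcr:root/mysite/mycontent/blogs//*[@contentType='blog']",
    ]
    for query in candidates:
        elapsed = time_concurrent_runs(fake_run_query, query)
        print(f"{elapsed:.3f}s  {query}")
```

Run it a few times per query before comparing numbers, since caches and other load on the machine can skew a single run.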
Tags: jackrabbit