
Jackrabbit Query Tips: Better Where Clauses

August 27th, 2009 | No Comments | Posted in java

If you’ve paid attention, you have probably noticed that I have a love/hate relationship with Jackrabbit. Luckily, this week I ran into a developer who has been successfully running a high-traffic Jackrabbit site for several years. One of the major tips he gave me was to look at how I structured my queries. This was something that I toyed with a few months ago but never put into production. So for the last few days I’ve been tweaking Jackrabbit queries. Everything that I’m doing can be found on the Jackrabbit mailing list, but I thought I would summarize it here for those who are interested.

Note: All my queries are in XPath. I’m sure these same ideas apply to SQL queries; I just haven’t done the conversions myself.

Use Meaningful Where Clauses

Where clauses are a must in Jackrabbit queries. The way that a Jackrabbit query works is that it first finds all entries that match the where clause, then filters those results by any path limitations. So if your where clauses are not restrictive, Jackrabbit will have to do a lot of extra work to find the desired results.

Say we have blog post data mixed in with product review data. If our content is organized using Rule 2 of David’s Model, it would look something like:

/mysite/mycontent/blogs/2009/08/27/…
/mysite/mycontent/reviews/2009/08/27/…

In this setup, our content is organized hierarchically by content type and then date.

Now we can query the content to find all blogs by doing:

/jcr:root/mysite/mycontent/blogs//*

The downside is that this query will actually fetch ALL elements, blogs and reviews, and then loop through them to find which ones belong under the /mysite/mycontent/blogs path. So what you can do is add a property to your content. I use something like @contentType. In your app, you would assign values to this property like ‘blog’ or ‘review’, so all blog entries get @contentType='blog' and all reviews get @contentType='review'. This greatly helps our query because now we can do:

/jcr:root/mysite/mycontent/blogs//*[@contentType='blog']

What happens in this query is that Jackrabbit first matches all elements with @contentType='blog', then filters by the path /mysite/mycontent/blogs. Say you have 1,000 blogs and 1,000 reviews. Just by adding @contentType='blog', you essentially cut in half the number of nodes that Jackrabbit has to analyze during the final part of this query.
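To make the arithmetic concrete, here is a minimal Java sketch that models the two-step evaluation described above: property match first, path filter second. It is a toy model, not Jackrabbit code, and it only counts how many nodes reach the path check. The class, the method names, and the node names (post0, review0, ...) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the evaluation order described above. This is NOT Jackrabbit
// code; every name here is hypothetical and exists only for illustration.
final class WhereClauseModel {

    // Stand-in for a repository node: a path plus a contentType property.
    static final class Node {
        final String path;
        final String contentType;

        Node(String path, String contentType) {
            this.path = path;
            this.contentType = contentType;
        }
    }

    // Builds 1,000 blog nodes and 1,000 review nodes, as in the example above.
    static List<Node> sampleNodes() {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            nodes.add(new Node("/mysite/mycontent/blogs/2009/08/27/post" + i, "blog"));
            nodes.add(new Node("/mysite/mycontent/reviews/2009/08/27/review" + i, "review"));
        }
        return nodes;
    }

    // Step 1: count the nodes that survive the where clause and therefore
    // still need a path check. A null contentType means "no where clause".
    static int candidatesForPathCheck(List<Node> nodes, String contentType) {
        int count = 0;
        for (Node n : nodes) {
            if (contentType == null || contentType.equals(n.contentType)) {
                count++;
            }
        }
        return count;
    }

    // Step 2: count the candidates that also sit under the given path.
    static int matches(List<Node> nodes, String contentType, String pathPrefix) {
        int count = 0;
        for (Node n : nodes) {
            boolean passesWhere = contentType == null || contentType.equals(n.contentType);
            if (passesWhere && n.path.startsWith(pathPrefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Node> nodes = sampleNodes();
        String blogs = "/mysite/mycontent/blogs/";
        System.out.println("path checks, no where clause: " + candidatesForPathCheck(nodes, null));
        System.out.println("path checks, @contentType='blog': " + candidatesForPathCheck(nodes, "blog"));
        System.out.println("final matches: " + matches(nodes, "blog", blogs));
    }
}
```

In this model the final result is the same 1,000 blog nodes either way; the where clause only changes how many nodes have to be path-checked (2,000 without it, 1,000 with it).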

So look at your queries. Are there any other properties that you can add to the where clause? Possibly a date field like start date or created date?

Move Some Path Data to Properties

The mailing list mentions that there are ways to have Jackrabbit index the full path of a node, but it isn’t an easy thing to change, and it also makes moving nodes around harder. So what I would suggest is to look for parts of your path that can work as properties, like we did with @contentType above.

The system I am using hosts multiple websites within the same Jackrabbit workspace. Each site is separated into a different path.

/sites/site1
/sites/site2

One thing that we did is add the site as a property. So all nodes for site “site1” have the property @site='site1'. Then in our query, we are able to add that property to the where clause:

//*[@site='site1' and @contentType='blog']
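If your app builds these queries from several properties, a small helper keeps the where clause consistent. The following is a minimal sketch in plain Java that only assembles the XPath string; XPathWhere, quote and build are hypothetical names, not part of Jackrabbit. It assumes that a single quote inside a string literal is escaped by doubling it, which you should verify against your Jackrabbit version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles an XPath statement with a where clause.
// It only builds a string; it does not touch the repository.
final class XPathWhere {

    // Wraps a value in single quotes, doubling any embedded single quote
    // (assumed escaping rule for XPath string literals).
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Builds pathPrefix//*[@p1='v1' and @p2='v2' ...] in map iteration order.
    // An empty pathPrefix yields a query over the whole workspace.
    static String build(String pathPrefix, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder(pathPrefix).append("//*");
        if (!properties.isEmpty()) {
            sb.append('[');
            boolean first = true;
            for (Map.Entry<String, String> e : properties.entrySet()) {
                if (!first) {
                    sb.append(" and ");
                }
                sb.append('@').append(e.getKey()).append('=').append(quote(e.getValue()));
                first = false;
            }
            sb.append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> where = new LinkedHashMap<>();
        where.put("site", "site1");
        where.put("contentType", "blog");
        System.out.println(build("", where));
    }
}
```

The resulting string can then be passed to the standard JCR call QueryManager.createQuery(statement, Query.XPATH), the same way as a hand-written query.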

Debugging Help

A great way to find out which queries are running is to turn on DEBUG logging for org.apache.jackrabbit.core.query.QueryImpl. Every time a query is executed, the log shows the query and how long it took to execute. By watching the logs, you can focus your attention on the queries that take a long time to run.

Summary

As you can see, just by tweaking your queries you can greatly improve your Jackrabbit performance. One thing that helped me a lot was a script I wrote that runs the same query 100 times simultaneously and records how long it takes to run all 100 queries. I then keep tweaking the query and re-running the script until I find the query that works best.
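My script isn’t shown here, but a harness along these lines can be sketched in a few lines of Java. This is an assumption-laden sketch, not the original script: QueryBenchmark and timeConcurrentRuns are hypothetical names, and the query is passed in as a Runnable so you can plug in whatever code executes your query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical harness: runs the same task N times at once and times the batch.
final class QueryBenchmark {

    // Starts `runs` threads, releases them together, waits for all of them,
    // and returns the elapsed wall-clock time in milliseconds.
    static long timeConcurrentRuns(Runnable query, int runs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(runs);
        CountDownLatch startGate = new CountDownLatch(1);
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < runs; i++) {
                futures.add(pool.submit(() -> {
                    startGate.await();   // hold every worker until the gate opens
                    query.run();
                    return null;
                }));
            }
            long start = System.nanoTime();
            startGate.countDown();       // release all workers at the same time
            for (Future<?> f : futures) {
                f.get();                 // rethrows if a query failed
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder task; replace the body with code that executes your query.
        Runnable query = () -> { };
        long elapsedMs = timeConcurrentRuns(query, 100);
        System.out.println("100 concurrent runs took " + elapsedMs + " ms");
    }
}
```

Timings from a single batch are noisy, so it is worth running the batch several times for each query variant before comparing numbers.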


High Performance Jackrabbit, Where Are You?

August 14th, 2009 | 4 Comments | Posted in java

So I’ve had a good amount of time running a high-traffic content site using Apache Jackrabbit as the content store. Jackrabbit provides a nice, flexible way to store a variety of content. The one thing that is lacking for me is performance.

I’ve looked around the Jackrabbit mailing list and wiki and there are a few points about how to get better performance out of Jackrabbit. Most of these center around how you structure your nodes and how to write better “optimized” queries. That is all fine and dandy, but my problem comes when Jackrabbit is put under heavy load from many concurrent connections.

With lots of concurrent queries, I noticed the site response time dropping dramatically. I tweaked the queries as much as I could, but I soon figured that I would have to get under the hood of Jackrabbit to make any gains. And just to give you the short answer, I didn’t find any answers.

First, Jackrabbit does not have a pluggable cache system, so the idea that “maybe if I just tweak the cache, things will get better” doesn’t go very far. I’ve read many postings on the mailing list saying that search results are tied to a search session, so even if you could cache search results, you could run into problems with that session down the line. Fixing this is very hard unless you are willing to change the cache code within org.apache.jackrabbit itself. I didn’t feel like maintaining a custom version of Jackrabbit just to play with caching, so I soon backed off the caching idea.

Another thing I thought about was increasing the number of connections accessing the Jackrabbit repository. Well, Jackrabbit isn’t able to use a connection pool. Instead, it opens a handful of persistent connections to the database (in my case, MySQL). So just adding more connections is out.

I asked on the mailing list several times about how Jackrabbit handles concurrent query requests. I never got a straight answer. But I was lucky enough to talk with two other people who had previously used Jackrabbit in similar projects. Through them I got the answer I didn’t want to hear: Jackrabbit isn’t actually able to handle concurrent queries well. One of the previous Jackrabbit users told me that deep within the bowels of the Jackrabbit code, there are bits of synchronized code that ultimately turn Jackrabbit into a single-threaded process. So there goes your ability to handle simultaneous queries. The few answers I got from the mailing list did mention that most Jackrabbit queries actually hit the internal cache, not the database. So I don’t know whether these synchronized bits of code affect this or not.
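I haven’t verified where those synchronized sections are, so the following is not Jackrabbit code. It is only a generic Java sketch of the effect: when every caller has to pass through one synchronized block, no more than one thread is ever inside it, no matter how many threads you start. SynchronizedBottleneck and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Generic illustration of a shared lock serializing callers.
// NOT taken from Jackrabbit; all names are hypothetical.
final class SynchronizedBottleneck {
    private final Object lock = new Object();
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    // Stand-in for a query whose whole body is guarded by one lock.
    void query() {
        synchronized (lock) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max); // remember the peak
            // ... the real work would happen here ...
            inside.decrementAndGet();
        }
    }

    // Starts `threads` callers at once and reports how many were ever
    // inside query() at the same moment.
    static int maxConcurrentCallers(int threads) throws InterruptedException {
        SynchronizedBottleneck repo = new SynchronizedBottleneck();
        CountDownLatch startGate = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    startGate.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                repo.query();
            });
            workers[i].start();
        }
        startGate.countDown();
        for (Thread t : workers) {
            t.join();
        }
        return repo.maxInside.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak concurrent callers: " + maxConcurrentCallers(50));
    }
}
```

With the lock in place the peak is always 1, which is what “effectively single-threaded” means here; extra threads just wait in line.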

Well, maybe there is a way to have a read-only version of Jackrabbit to speed things up? Nope. As of version 1.5, this isn’t available.

So where does that leave me? I’ve had to start splitting my data between Jackrabbit and a traditional database structure fronted by Hibernate. I put all content where the schema is flexible, like articles, in Jackrabbit. For content that has a rigid schema, like comments, I put those in the traditional database.

I know that Magnolia uses Jackrabbit, but I haven’t spent a good deal of time with their code. For my system, I am using Spring and Spring Modules to access Jackrabbit. Magnolia doesn’t use Spring, and I thought I saw a class that mentioned something about multi-threaded requests. So maybe they have figured out a way around the performance problems.

Until then, I will just have to keep banging on Jackrabbit in hopes that it will speed up.
