So I’ve spent a good amount of time running a high-traffic content site using Apache Jackrabbit as the content store. Jackrabbit provides a nice, flexible way to store a variety of content. The one thing that is lacking for me is performance.
I’ve looked around the Jackrabbit mailing list and wiki and there are a few points about how to get better performance out of Jackrabbit. Most of these center around how you structure your nodes and how to write better “optimized” queries. That is all fine and dandy, but my problem comes when Jackrabbit is put under heavy load from many concurrent connections.
With lots of concurrent queries, I noticed the site's response times degrading dramatically. I tweaked the queries as much as I could, but I soon figured that I would have to get under the hood of Jackrabbit to make any gains. And just to give you the short version up front, I didn’t find any answers.
First, Jackrabbit does not have a pluggable cache system, so the idea of “maybe if I just tweak the cache, things will get better” doesn't get far. I’ve read many postings on the mailing list saying that search results are tied to a search session. So even if you could cache search results, you could run into problems with that session down the line. Fixing this is very hard unless you want to actually change the cache code within org.apache.jackrabbit. I didn’t feel like maintaining a custom fork of Jackrabbit just to play with caching, so I soon backed off the caching idea.
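To make the session problem a bit more concrete, here is a minimal sketch of the kind of application-level cache I had in mind. It is not part of Jackrabbit and it is not something I ended up running. The class name QueryPathCache and the loader function are hypothetical, and the sketch assumes that you cache plain strings (such as node paths) instead of session-bound result objects, so nothing in the cache holds on to a session. Cached paths can of course go stale if the content changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical application-level cache; NOT part of Jackrabbit.
public class QueryPathCache {

    private final Map<String, List<String>> cache;

    public QueryPathCache(final int maxEntries) {
        // An access-ordered LinkedHashMap drops the least recently used entry
        // once the map grows past maxEntries.
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    // Returns the cached paths for a query statement and calls the loader
    // only on a miss. The loader is an assumption: it stands in for code
    // that runs the real query and reduces the results to path strings.
    public List<String> paths(String statement, Function<String, List<String>> loader) {
        List<String> hit = cache.get(statement);
        if (hit != null) {
            return hit;
        }
        // Copy into an unmodifiable list so callers cannot alter cached data.
        List<String> loaded =
            Collections.unmodifiableList(new ArrayList<>(loader.apply(statement)));
        cache.put(statement, loaded);
        return loaded;
    }

    public int size() {
        return cache.size();
    }
}
```

Two threads that miss on the same statement at the same moment will both run the loader, which is wasteful but harmless here. The bigger open question is invalidation, which this sketch does not attempt.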
Another thing I thought about was increasing the number of connections accessing the Jackrabbit repository. Well, Jackrabbit isn’t able to use a connection pool. Instead, it opens a handful of persistent connections to the database (in my case, MySQL). So just adding more connections is out.
I asked on the mailing list several times about how Jackrabbit handles concurrent query requests. I never got a straight answer. But I was lucky enough to talk with two other people who had previously used Jackrabbit in similar projects. Through them I got the answer I didn’t want to hear: Jackrabbit isn’t actually able to handle concurrent queries well. One of the previous Jackrabbit users told me that deep within the bowels of the Jackrabbit code, there are bits of synchronized code that ultimately turn Jackrabbit into a single-threaded process. So there goes your ability to handle simultaneous queries. The few answers I did get from the mailing list mentioned that most Jackrabbit queries actually hit the internal cache, not the database. So I don’t know whether these synchronized bits of code affect that path or not.
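If you haven't run into this before, here is a small, self-contained sketch of why one synchronized method can make a multi-threaded server behave like a single-threaded one. This is not Jackrabbit code. The class SynchronizedBottleneck and its methods are made up for illustration, and the 20 ms sleep is just a stand-in for the work of one query. It starts several threads at once and records the highest number of "queries" that were ever running at the same time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a repository; this is NOT Jackrabbit code.
public class SynchronizedBottleneck {

    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxInFlight = new AtomicInteger();

    // Simulates one query and records how many are running at once.
    private void doQuery() throws InterruptedException {
        int now = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(now, Math::max);
        Thread.sleep(20); // pretend to search an index
        inFlight.decrementAndGet();
    }

    // Every caller needs the same lock, so queries run one at a time.
    public synchronized void lockedQuery() throws InterruptedException {
        doQuery();
    }

    // No shared lock, so callers are free to overlap.
    public void unlockedQuery() throws InterruptedException {
        doQuery();
    }

    // Fires `threads` queries together and returns the highest overlap seen.
    public static int maxOverlap(int threads, boolean locked) throws InterruptedException {
        SynchronizedBottleneck repo = new SynchronizedBottleneck();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await(); // wait so all threads begin together
                    if (locked) {
                        repo.lockedQuery();
                    } else {
                        repo.unlockedQuery();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return repo.maxInFlight.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("locked overlap:   " + maxOverlap(8, true));
        System.out.println("unlocked overlap: " + maxOverlap(8, false));
    }
}
```

With the lock in place the overlap can never exceed 1, no matter how many threads you throw at it, so eight requests take roughly eight times as long as one. Without the lock the overlap will normally be higher, depending on how the threads get scheduled. If the synchronized sections in Jackrabbit sit on the query path, that would match what I saw under load.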
Well, maybe there is a way to have a read-only version of Jackrabbit to speed things up? Nope. As of version 1.5, this isn’t available.
So where does that leave me? I’ve had to start splitting my data between Jackrabbit and a traditional database structure fronted by Hibernate. I put all content where the schema is flexible, like articles, in Jackrabbit. For content that has a rigid schema, like comments, I put those in the traditional database.
I know that Magnolia uses Jackrabbit, but I haven’t spent a good deal of time with their code. For my system, I am using Spring and Spring Modules to access Jackrabbit. Magnolia doesn’t use Spring, and I thought I saw a class that mentioned something about multi-threaded requests. So maybe they have figured out a way around the performance problems.
Until then, I will just have to keep banging on Jackrabbit in hopes that it will speed up.