Friday, February 27, 2009

Clojure + Terracotta = Yeah, Baby!

Update: I've gotten a REPL running with Terracotta. http://paul.stadig.name/2009/03/clojure-terracotta-next-steps.html.

What is Terracotta?

Terracotta provides a network-attached, virtual, persistent heap and transparent inter-JVM thread coordination. With Terracotta, you no longer need to map your objects to database tables and back. You simply hand your object to Terracotta and it will cache your data. Not only does it cache your data, but it will make your object available to a cluster of networked JVMs. Not only that, but it will also spill your objects to disk if necessary (just like Virtual Memory), so you need not worry about having gobs of memory to hold all of your objects.

What is Clojure?

Clojure is a Lisp for the JVM with a software transactional memory, and agents (asynchronous, message based concurrency). It is a functional language with immutable datatypes. It can also inter-operate with any existing Java code.

NOTE: you need to use Clojure r1310 or later, because the Keyword class needs to have hashCode implemented to play nicely with Terracotta.

Clojure + Terracotta = ?

These two seem like an interesting combination. Imagine the possibilities...kill your database, simple POJO applications, free distributed transactions, clustered JVMs with limitless memory...it would make your hair grow back, you'd get women, and become filthy rich...well...maybe not, but at least you'd have more fun writing software.

After some initial setup (the code and instructions are at http://github.com/pjstadig/terraclojure/tree/master/), there are two things that need to be done to integrate Clojure and Terracotta: 1) instead of running Clojure with the 'java' command, you run it with the 'dso-java.{sh,bat}' script provided with Terracotta, and 2) you need to create a configuration file that defines how your objects will be shared between JVMs.

Configuration

The configuration for Terracotta (at least in our case) consists of defining: roots, instrumented classes, auto-locks, additional boot jar classes, and servers. At this point it's probably helpful to take a peek at the config.xml file that comes with the code and follow along.

  • Roots. A root is an object that is shared between JVMs. Any object that can be reached from the root through its object graph (for example, an object assigned to one of its data members) is also shared. A common use case is to have a ConcurrentHashMap (or in our case a PersistentHashMap from Clojure) that is shared as a root. This creates a flexible hierarchy of shared objects. In Clojure's case, we also share clojure.lang.Keyword.table, so that our keywords are unique across all of the JVMs; otherwise, inserting into a hash map would create multiple entries for the same keyword.
  • Instrumented classes. Any class that is shared (either directly as a root, or indirectly as a part of a root's object graph) must be instrumented. I made all of the clojure.lang.* classes instrumented. It's a bit of a broad stroke, but I didn't see any performance problems that result from instrumenting too many classes. Terracotta is helpful here: if you end up inserting an uninstrumented class into the object graph, it'll throw a RuntimeException that explains exactly how to modify your config file to instrument that class.
  • Auto-locks. Terracotta will transparently convert your synchronized blocks into distributed transactions across all the JVMs in the cluster. Again, I made broad strokes here and just defined auto-locks for all of the methods on any clojure.lang.* class, and again, there aren't any performance penalties for auto-locking methods that don't have any synchronized blocks. I used write locks, and Terracotta has a few different types of locks that are worth looking into if you need to do something more serious. In the case of auto-locks, Terracotta will also help you out by throwing a RuntimeException if you leave out anything.
  • Additional boot jar classes. Frankly, this was something Terracotta told me to do, and I don't know exactly what is going on here. (Perhaps someone else can explain?) I think what happens is that by default Terracotta instruments the java.lang.* and java.util.concurrent.* classes, but to instrument other Java core classes you have to add them in this configuration element.
  • Servers. Terracotta is very easy to work with, and by default will just run a single server on localhost. You can define more than one server in a cluster. In my case, I only wanted one server, but I wanted to change the persistence mode. By default the persistence mode is temporary-swap-only. The objects will be preserved across stopping and starting clients, but once the server is stopped, the data disappears. To have the objects persisted across restarting the server, you have to set the persistence mode to permanent-store. The temporary swap mode will be faster for data like the intermediate results of calculations, caching, etc., but if you need to permanently persist the data, then you need to use permanent-store.
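
To make the roots and auto-locks ideas a bit more concrete, here is a minimal sketch in plain Java. It is not taken from terraclojure: the class name SharedRegistry, the field counts, and the method increment are all made up for illustration. The point is only that the code contains nothing Terracotta-specific. The static field is the kind of thing you would name as a root in the config file, and the synchronized block is the kind of thing an auto-lock would cover. Run on a single JVM without Terracotta, it behaves like ordinary Java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; it is not part of terraclojure.
class SharedRegistry {
    // A static field like this is what a Terracotta config would name as a root.
    // Everything reachable from it would then belong to the shared object graph.
    static final Map<String, Integer> counts = new HashMap<>();

    // Ordinary Java synchronization. Under Terracotta, an auto-lock that
    // matches this method is what would extend the lock across the cluster.
    // Without Terracotta it is just a local monitor.
    static int increment(String key) {
        synchronized (counts) {
            int next = counts.getOrDefault(key, 0) + 1;
            counts.put(key, next);
            return next;
        }
    }
}
```

Calling SharedRegistry.increment("visits") twice on a fresh JVM returns 1 and then 2. The clustering behavior itself comes entirely from the configuration, not from the code.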

There are instructions about how to run this example in the README with the code, so I won't bother to duplicate that here. I'd just like to share some of the issues I encountered, the results, and any future direction that could be taken.

Issues

The first major issue that I encountered was that Keywords weren't unique across JVMs, so I had to make clojure.lang.Keyword.table a root. This ensured that keywords are unique across JVMs, but I still ran into an issue when using keywords as keys for a PersistentHashMap. The result of identical? was true for keywords from two JVMs, but I was still getting duplicate entries in my hash map. After some debugging, I was able to determine that the issue was that the Keyword class did not override the default implementation of hashCode. After I mentioned this to Rich, and he made a quick fix in r1310, it worked nicely.
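
Here is a small Java sketch of why a missing hashCode override can matter. Both classes are hypothetical stand-ins that I made up for illustration; neither is Clojure's actual Keyword class, and the name-based hash is not necessarily what r1310 does. The default Object.hashCode() is typically derived from object identity, and the Java API does not require it to be the same from one JVM execution to another. A hash computed from the name, on the other hand, is the same everywhere, because String.hashCode() is fully specified.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: inherits Object.hashCode(), which is typically
// identity-based and need not be consistent between JVM executions.
class IdentityHashedKey {
    final String name;

    IdentityHashedKey(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in: the hash depends only on the name, so every JVM
// computes the same value for the same key.
class NameHashedKey {
    final String name;

    NameHashedKey(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof NameHashedKey
                && ((NameHashedKey) other).name.equals(name);
    }
}

class HashCodeDemo {
    public static void main(String[] args) {
        Map<Object, Integer> map = new HashMap<>();
        map.put(new NameHashedKey("foo"), 1);
        map.put(new NameHashedKey("foo"), 2); // same hash and equal: replaces the entry
        System.out.println("entries with name-based hash: " + map.size());

        // This value may change from one run of the program to the next.
        System.out.println("identity hash: " + new IdentityHashedKey("foo").hashCode());
    }
}
```

My guess is that this is what was going on with the shared keywords: each JVM computed its own hash for the same shared object, so the same key could land in different places in the map.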

The only other major issue was how to reference Clojure vars and refs from the Java side. The main reason for this is to define a root that will be shared by Terracotta. When Clojure code gets compiled some Java classes get generated with mangled names. As far as I can tell, there isn't a good predictable way to get at a Clojure var, because Clojure will generate a class for each namespace called my/namespace/namespace__init.class and it creates static fields on that class for various definitions (functions, vars, etc.). Those fields are named const_1, const_2, const_3, etc. There is no reliable, flexible way to predict the name of a particular Var.

My solution was to create a simple Java class called terraclojure.Root with a couple of static fields containing refs. At first I just used that class directly to access the refs, but then I decided to actually assign the static fields to some vars in my namespace, i.e. (def *hash* terraclojure.Root/hash). This works and it makes it a little more transparent on the Clojure side. I would be happy to hear if there is another way to do this.
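
A holder class along these lines might look roughly like the sketch below. This is not the actual terraclojure.Root source. The field name hash comes from the example above, but I have left out the package declaration and used java.util.concurrent.atomic.AtomicReference as a stand-in for clojure.lang.Ref so that the sketch compiles without the Clojure jar. The idea it shows is just that a hand-written static field has a stable, predictable name that a config file can point at, unlike the generated const_N fields.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: the real class lives in the terraclojure package and holds
// Clojure refs. AtomicReference is an assumption used here as a placeholder
// for clojure.lang.Ref.
class Root {
    // A fixed, human-chosen field name that can be named as a root,
    // and that Clojure code can read as a static field.
    public static final AtomicReference<Object> hash = new AtomicReference<>(null);
}
```

On the Clojure side, the static field is then read once and bound to a var, as in the (def *hash* terraclojure.Root/hash) form above.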

Result

The result of this whole experiment was that I was able to use the Software Transactional Memory with a couple of refs, and to have my changes shared across multiple JVMs. I didn't do any extensive testing to verify that transaction retries work as expected, but since Clojure uses the java.util.concurrent.* classes and standard synchronization, I don't expect there would be any issues.

Where do we go from here?

I only experimented with the STM. I didn't experiment with Agents, so that is certainly an area for future work. On the Terracotta side, I only used one server. I didn't set up a whole array of servers, nor did I try using one or more servers on different machines. All my testing was local, so the performance reflected that (it was pretty good! :)). If you do any further experimentation, then please share it on a blog or on the Google group.

Conclusion

I don't have a lot of experience with Terracotta, but it seems to be quite mature and easy-to-use. I also think that Clojure is a very exciting language, and the combination of the two opens up some interesting possibilities for how to architect highly available, scalable, database-less applications.

P.S. I have a B.S. in Computer Science and will have an M.S. in Computer Science in May. I don't do anything near this interesting at my job. If you have any need for consulting, or if you'd like to offer me a job ;), then feel free to contact me at paul@stadig.name.

2 comments:

Paul Dorman said...

Hi Paul, thanks for the write-up! Let us all know how you go if you continue with this experiment.

pveentjer said...

What kind of performance do you get? I have no idea about the STM performance of Clojure and in combination with the remote communication needed for Terracotta, the performance could be an issue.

Very interesting though :) I have been playing with the idea to cluster Multiverse with terracotta
http://multiverse.googlecode.com/