Friday, February 27, 2009

Clojure + Terracotta = Yeah, Baby!

Update: I've gotten a REPL running with Terracotta. http://paul.stadig.name/2009/03/clojure-terracotta-next-steps.html.

What is Terracotta?

Terracotta provides a network-attached, virtual, persistent heap and transparent inter-JVM thread coordination. With Terracotta, you no longer need to map your objects to database tables and back. You simply hand your object to Terracotta and it will cache your data. Not only does it cache your data, but it will make your object available to a cluster of networked JVMs. Not only that, but it will also spill your objects to disk if necessary (just like Virtual Memory), so you need not worry about having gobs of memory to hold all of your objects.

What is Clojure?

Clojure is a Lisp for the JVM with a software transactional memory, and agents (asynchronous, message based concurrency). It is a functional language with immutable datatypes. It can also inter-operate with any existing Java code.

NOTE: you need to use Clojure r1310 or later, because the Keyword class needs to have hashCode implemented to play nicely with Terracotta.

Clojure + Terracotta = ?

These two seem like an interesting combination. Imagine the possibilities...kill your database, simple POJO applications, free distributed transactions, clustered JVMs with limitless memory...your hair would grow back, you'd get women, and become filthy rich...well...maybe not, but at least you'd have more fun writing software.

After some initial setup (the code and instructions are at http://github.com/pjstadig/terraclojure/tree/master/), there are two things that need to be done to integrate Clojure and Terracotta: 1) instead of running Clojure with the 'java' command, you run it with the 'dso-java.{sh,bat}' script provided with Terracotta, and 2) you need to create a configuration file that defines how your objects will be shared between JVMs.

Configuration

The configuration for Terracotta (at least in our case) consists of defining: roots, instrumented classes, auto-locks, additional boot jar classes, and servers. At this point it's probably helpful to take a peek at the config.xml file that comes with the code and follow along.

  • Roots. A root is an object that is shared between JVMs. Any object reachable from the root through its object graph (for example, an object assigned to one of its data members) is shared as well. A common use case is to have a ConcurrentHashMap (or in our case a PersistentHashMap from Clojure) that is shared as a root. This creates a flexible hierarchy of shared objects. In Clojure's case, we also share clojure.lang.Keyword.table, so that our keywords are unique across all of the JVMs; otherwise inserting into a hash map would create multiple entries for the same keyword.
  • Instrumented classes. Any class that is shared (either directly as a root, or indirectly as a part of a root's object graph) must be instrumented. I made all of the clojure.lang.* classes instrumented. It's a bit of a broad stroke, but there aren't any performance problems that result from instrumenting too many classes. Terracotta is helpful here: if you end up inserting an uninstrumented class into the object graph, it'll throw a RuntimeException that explains exactly how to modify your config file to instrument that class.
  • Auto-locks. Terracotta will transparently convert your synchronized blocks into distributed transactions across all the JVMs in the cluster. Again, I made broad strokes here and just defined auto-locks for all of the methods on any clojure.lang.* class, and again, there aren't any performance penalties for auto-locking methods that don't have any synchronized blocks. I used write locks, and Terracotta has a few different types of locks that are worth looking into if you need to do something more serious. In the case of auto-locks, Terracotta will also help you out by throwing a RuntimeException if you leave out anything.
  • Additional boot jar classes. Frankly, this was something Terracotta told me to do, and I don't know exactly what is going on here. (Perhaps someone else can explain?) I think what happens is that by default Terracotta instruments the java.lang.* and java.util.concurrent.* classes, but to instrument other Java core classes you have to add them in this configuration element.
  • Servers. Terracotta is very easy to work with, and by default will just run a single server on localhost. You can define more than one server in a cluster. In my case, I only wanted one server, but I wanted to change the persistence mode. By default the persistence mode is a temporary-swap-only mode. The objects will be preserved across stopping and starting clients, but once the server is stopped, the data disappears. To have the objects persisted across restarting the server, you have to set the persistence mode to permanent-store. The temporary swap mode will be faster for data like the intermediate results of calculations, caching, etc., but if you need to permanently persist the data, then you need to use permanent-store.

There are instructions about how to run this example in the README with the code, so I won't bother to duplicate that here. I'd just like to share some of the issues I encountered, the results, and any future direction that could be taken.

Issues

The first major issue that I encountered was that Keywords weren't unique across JVMs, so I had to make clojure.lang.Keyword.table a root. This ensured that keywords are unique across JVMs, but I still ran into an issue when using keywords as keys for a PersistentHashMap. The result of identical? was true for keywords from two JVMs, but I was still getting duplicate entries in my hash map. After some debugging, I was able to determine that the issue is that the keyword class did not override the default implementation of hashCode. After mentioning this to Rich, and a quick fix in r1310, it worked nicely.
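The same failure mode is easy to reproduce outside the JVM. Here is a small Ruby analogy (not the Clojure or Terracotta code; both class names are made up for illustration): a key class that defines equality but keeps the default identity-based hash ends up with duplicate entries, and defining a hash derived from the name fixes it.

```ruby
# Analogy only: Ruby's Hash relies on #hash and #eql? in much the same way
# a Java hash map relies on hashCode and equals.

# Hypothetical key class that defines equality but keeps the default,
# identity-based #hash inherited from Object.
class IdentityHashedKeyword
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def eql?(other)
    other.is_a?(IdentityHashedKeyword) && name == other.name
  end
  alias_method :==, :eql?
end

# Same class, but the hash is derived from the name, so equal keywords
# always land in the same bucket.
class NameHashedKeyword < IdentityHashedKeyword
  def hash
    name.hash
  end
end

# Insert two "equal" keywords into a Hash and report how many entries result.
def entries_for(klass)
  table = {}
  table[klass.new("foo")] = 1
  table[klass.new("foo")] = 2
  table.size
end

puts entries_for(IdentityHashedKeyword) # duplicate entries for the "same" keyword
puts entries_for(NameHashedKeyword)     # a single entry
```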

The only other major issue was how to reference Clojure vars and refs from the Java side. The main reason for this is to define a root that will be shared by Terracotta. When Clojure code gets compiled some Java classes get generated with mangled names. As far as I can tell, there isn't a good predictable way to get at a Clojure var, because Clojure will generate a class for each namespace called my/namespace/namespace__init.class and it creates static fields on that class for various definitions (functions, vars, etc.). Those fields are named const_1, const_2, const_3, etc. There is no reliable, flexible way to predict the name of a particular Var.

My solution was to create a simple Java class called terraclojure.Root with a couple of static fields containing refs. At first I just used that class directly to access the refs, but then I decided to actually assign the static fields to some vars in my namespace, i.e. (def *hash* terraclojure.Root/hash). This works and it makes it a little more transparent on the Clojure side. I would be happy to hear if there is another way to do this.

Result

The result of this whole experiment was that I am able to use the Software Transactional Memory with a couple of refs, and to have my changes shared across multiple JVMs. I didn't do any extensive testing to verify that transaction retries work as expected, but since Clojure uses the java.util.concurrent.* classes and standard synchronization, I don't expect there would be any issues.

Where do we go from here?

I only experimented with the STM. I didn't experiment with Agents, so that is certainly an area for future work. On the Terracotta side, I only used one server. I didn't set up a whole array of servers, nor did I try using one or more servers on different machines. All my testing was local, so the performance reflected that (it was pretty good! :)). If you do any further experimentation, then please share it on a blog or to the Google group.

Conclusion

I don't have a lot of experience with Terracotta, but it seems to be quite mature and easy-to-use. I also think that Clojure is a very exciting language, and the combination of the two opens up some interesting possibilities for how to architect highly available, scalable, database-less applications.

P.S. I have a B.S. in Computer Science and will have an M.S. in Computer Science in May. I don't do anything near this interesting at my job. If you have any need for consulting, or if you'd like to offer me a job ;), then feel free to contact me at paul@stadig.name.

Friday, February 20, 2009

Rails, respond_to, IE6, and the Accept Header

Pain...much pain caused by IE6.

If you've worked with respond_to in Rails, you know what a cool idea it is. It provides access to the same resource in different formats based on either the extension on the URL (e.g. http://something/people/1.xml), or an HTTP header that your browser sends to the web server, called the Accept header.

It sounds good, but in practice there is one particular browser (*cough* IE) that causes problems. I got into it thinking, "I don't need to worry about this 'Accept' header thing. If a user pulls up http://something/people/1 they'll get an HTML version and if they pull up http://something/people/1.xml they'll get an XML version." This fallacious (?!) reasoning works like a champ with Firefox and IE7 (I think it's getting hazy at this point), but IE6 FAIL!

Change the order of my respond_to block? FAIL! How about forcing a sane Accept header? Sweet! It works, until I upgrade rails and now the request headers are frozen. FAIL! (This may have been my problem because I wasn't doing it right, but it doesn't matter, there is a better way.) How about just explicitly specifying the :format for every URL in the application? Annoying, tedious, but it works, until I get a call from a user, "When I search here and click there I get a 'data dump.'" FAIL!

At this point, I may be doing something wrong. Perhaps one of the above solutions "should" have worked, but I'm mad...there has to be a better way. Can't Rails just serve HTML by default and some other format when you specify the extension? Can't Rails just ignore the Accept header? It turns out that there was a commit on June 27, 2008 that did just that. This was supposedly done for Rails 2.2, and I'm running 2.2.2, so why am I not benefiting from it? Because two weeks later it was undone. However, we're on the right track now.

Given that this is such a widely known issue, I don't know why someone hasn't posted the magic solution until now, but here it is...Are you ready? Add this line to config/environments/{test,development,production}.rb:

config.action_controller.use_accept_header = false

There...that was simple. You're welcome.
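To make the effect of that setting concrete, here is a rough sketch in plain Ruby of the kind of decision involved. This is not Rails's actual code: pick_format, EXTENSIONS, and MIME_TYPES are names invented for the illustration, and the real negotiation is more involved.

```ruby
# Hypothetical lookup tables for this sketch; Rails keeps its own registry.
EXTENSIONS = { "html" => :html, "xml" => :xml }.freeze
MIME_TYPES = { "text/html" => :html, "application/xml" => :xml }.freeze

# Pick a response format for a request path and Accept header value.
def pick_format(path, accept_header, use_accept_header: true)
  # 1. An explicit extension on the URL always wins.
  ext = File.extname(path).delete_prefix(".")
  return EXTENSIONS[ext] if EXTENSIONS.key?(ext)

  # 2. Otherwise, optionally take the first recognized type in the header.
  if use_accept_header
    accept_header.to_s.split(",").each do |entry|
      mime = entry.split(";").first.to_s.strip
      return MIME_TYPES[mime] if MIME_TYPES.key?(mime)
    end
  end

  # 3. Fall back to HTML.
  :html
end

# A header that happens to lead with something other than text/html.
odd_header = "application/xml, */*"

puts pick_format("/people/1.xml", "text/html")                    # extension wins
puts pick_format("/people/1", odd_header)                         # header decides
puts pick_format("/people/1", odd_header, use_accept_header: false) # header ignored
```

With the header ignored, a plain URL gets HTML and only an explicit extension asks for anything else, which is the behavior I wanted all along.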

Friday, February 13, 2009

To 'and' or not to 'and'

Ruby has two 'or' operators ('||' and 'or'). It also has two 'and' operators ('&&' and 'and'). This can be confusing to people, but especially to those learning the language. There is a temptation to use 'and' and 'or' because it is more readable, and I can certainly appreciate that. However, there are some serious differences between these operators, and I recommend only using '&&' and '||' in boolean expressions.

Of the two, 'and' has lower precedence than '&&', and it is the same with 'or' and '||'. This means that there is a difference between:

irb(main):001:0> true || false && false
=> true

and:

irb(main):002:0> true || false and false
=> false

You might then be tempted to just adopt the practice of always using 'or' and 'and', but that also might surprise you:

irb(main):023:0> true or false and false
=> false

This surprising result follows from the fact that, whereas '&&' has a higher precedence than '||', 'or' has the same precedence as 'and', so Ruby just evaluates the statement left to right handling first the 'or' then the 'and'.
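If it helps, here is a short script that spells out how I understand Ruby to group these expressions, by pairing each one with a fully parenthesized equivalent. The last line is a related gotcha with assignment, which binds tighter than 'and'. Ruby may warn about that line in verbose mode, which is rather the point.

```ruby
# '&&' binds tighter than '||', so the '&&' is grouped first.
a = (true || false && false)
b = (true || (false && false))

# 'and' binds looser than '||', so the '||' is grouped first.
c = (true || false and false)
d = ((true || false) and false)

# 'or' and 'and' share one precedence level and group left to right.
e = (true or false and false)
f = ((true or false) and false)

# Assignment binds tighter than 'and': this is (g = true) and false.
g = true and false

puts [a, b, c, d, e, f, g].inspect
```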

Even though you may understand the nuances between these operators, not everyone may understand, and the fact is that 99% of programmers in the world (really 100% I would hope) can understand statements involving '&&' and '||'. So let's just stick with the traditional boolean operators, because in the end it is actually more readable.

Friday, February 6, 2009

CSS: multiple class selection

I don't know how many times I've wished that I could select an HTML element that has two classes. I want to select the table rows that are both 'odd' and 'awesome'. So instead of doing this:

...
<tr class="even awesome">...</tr>
<tr class="odd awesome">...</tr>
...

I end up doing this:

...
<tr class="even awesome even_awesome">...</tr>
<tr class="odd awesome odd_awesome">...</tr>
...

I always felt dirty doing something like that, and thought there had to be a better way to do it. Well there is! It turns out that '.odd.awesome' will select those elements with both the 'odd' and 'awesome' classes.

This stylesheet:

.odd_awesome {
  color: red;
  font-size: 48pt;
}

Has now become:

.odd.awesome {
  color: red;
  font-size: 48pt;
}

And the HTML is simply:

...
<tr class="even awesome">...</tr>
<tr class="odd awesome">...</tr>
...

Now that I know this secret, I vaguely remember having known it many years ago (like when I was first introduced to CSS), but somehow I had forgotten it. It's like running into an old friend. "Hello Mr. CSS Selector! It's been a long time."

Now go simplify your HTML/CSS!