Saturday, November 28, 2009

Scala: Under The Hood of Hello World

The exercise of printing 'Hello World' from a new programming language is so popular because it allows you to see quite a few things about a language straight away, including the style of the syntax, how to print output and the amount of boilerplate required just to start a simple application.

After getting Hello World to work, there are really two directions you could go: you can go forward, and start doing other stuff with the language, or you can go down, drill deeper, by asking the question "How does it work?"

While moving forward is certainly more productive and exciting in the short term, understanding what's going on underneath the covers of your code can be very beneficial and can lead to insights that may affect the way you write code for the rest of your programming life. For instance, if you've done a bit of Java and a bit of Scala, it may have occurred to you that Scala allows you to define objects (as opposed to classes), while Java has no concept of compile-time objects, so how does the Scala compiler map the concept of an object into Java's single-minded view of classes? We'll find out…

Opening the Hood with Scala
Unless you've managed to run a Scala Hello World program accidentally, you have probably read enough of some Scala tutorial, book or blog to know that Scala is compiled to Java Bytecode - the binary language that is read and executed by the Java Virtual Machine (JVM). This means that a good way to get an idea of what actually happens when we run a Scala program is to have a look at the bytecode that is generated when we compile the application.

So, here's your standard Scala Hello World program:

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello World!")
  }
}

If we compile this program and have a look at the output directory for classes, we'll find two Java class files:

-rw-r--r-- 1 graham staff 778 28 Nov 09:43 HelloWorld$.class
-rw-r--r-- 1 graham staff 607 28 Nov 09:43 HelloWorld.class

Two classes? Yep, two classes. We compile one "object" and we end up with two classes. Let's have a look at what's inside these classes.

If you want to see what's going on inside a Java class (or a Scala class or object that's been compiled to a Java class) you can inspect the class file with the JDK tool 'javap'. Depending on the options you provide to javap, it can print out just a summary of the class's method signatures or a pseudo-English translation of all the compiled bytecodes. Let's start by having a look at just the signatures:

[scala-tests] graham$ javap -classpath . HelloWorld HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld extends java.lang.Object{
public static final void main(java.lang.String[]);
public static final int $tag() throws java.rmi.RemoteException;
}

Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
public static final HelloWorld$ MODULE$;
public static {};
public HelloWorld$();
public void main(java.lang.String[]);
public int $tag() throws java.rmi.RemoteException;
}

Wow! So, we wrote one method - HelloWorld.main() - and we've ended up with 6 methods and one static field across two classes. Obviously not all of this code is relevant to the running of our HelloWorld program, so let's discuss some of the surrounding fluff and then put it out of mind.

What Is This $tag() Thing?
Probably the first obvious thing is that both classes have a method called $tag(). If we have a look at the Scaladocs for the ScalaObject trait, the base trait of all classes and objects compiled by Scala, we'll see no mention of the $tag() method. However, if you have a look at the source of ScalaObject.scala, you'll see this definition, which doesn't show up in the generated Scaladoc:

/** This method is needed for optimizing pattern matching expressions
* which match on constructors of case classes.
*/
@remote
def $tag(): Int = 0

Basically, the $tag() method is a simple categorisation method akin to java.lang.Object's hashCode() that is used by Scala to make its much-lauded pattern matching perform better. Interestingly, in the 2.8.0 branch of Scala, $tag() has been removed, so now that we understand it, we know that we don't really have to worry about understanding it any more!

What Makes It Run?
Having learnt enough to ignore the $tag() method, let's have a look at how our program runs. Your keen eye may have noticed that we have two main() methods - one on the HelloWorld class and one on the HelloWorld$ class. If you're lucky enough to have two keen eyes, you will have noticed that the main() method on HelloWorld is static, but the main() method on HelloWorld$ is not. The significance of this is that, though there are two main() methods, only one of the classes - HelloWorld - can be used to start the application. If I try to start the application by telling the java command to start with the HelloWorld$ class, the JVM will happily tell me that I'm an idiot:

[scala-tests] graham$ java -cp .:$HOME/Library/Scala/Current/lib/scala-library.jar HelloWorld$
Exception in thread "main" java.lang.NoSuchMethodError: main
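
As an aside, the rule the launcher is applying here can be sketched in plain Java with a bit of reflection. This is only an illustration, not the JVM's actual lookup code: EntryPointCheck, StaticEntry, InstanceEntry and hasStaticMain are hypothetical names, with the two small classes standing in for HelloWorld and HelloWorld$, and the check mirrors the classic "public static void main(String[])" requirement that applied when this was written.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class EntryPointCheck {
    // True only if the class has a public static void main(String[]).
    static boolean hasStaticMain(Class<?> c) {
        try {
            Method m = c.getMethod("main", String[].class);
            return Modifier.isStatic(m.getModifiers())
                && m.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasStaticMain(StaticEntry.class));   // prints true
        System.out.println(hasStaticMain(InstanceEntry.class)); // prints false
    }
}

// Hypothetical stand-in for HelloWorld: main() is static.
class StaticEntry {
    public static void main(String[] args) { }
}

// Hypothetical stand-in for HelloWorld$: main() is an instance method.
class InstanceEntry {
    public void main(String[] args) { }
}
```

Both stand-ins have a method called main() that takes a String[], so the name and signature alone aren't enough; it's the static modifier that decides whether the class can be used as an entry point.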

Instead of acting like an idiot, let's have a look at what this HelloWorld.main() method does. The javap command allows us to see the actual bytecode instructions that are contained within the class if we pass the -c option:

[scala-tests] graham$ javap -c -classpath . HelloWorld
Compiled from "HelloWorld.scala"
public final class HelloWorld extends java.lang.Object{
public static final void main(java.lang.String[]);
Code:
0: getstatic #11; //Field HelloWorld$.MODULE$:LHelloWorld$;
3: aload_0
4: invokevirtual #13; //Method HelloWorld$.main:([Ljava/lang/String;)V
7: return
...

If you've learnt a little bit about Java bytecode, reading what this method does is pretty simple…

0: This instruction retrieves the value of the static field HelloWorld$.MODULE$ and pushes it onto the stack. We can see both from this line (after the colon) and from the signatures we looked at above that the type of the MODULE$ field is HelloWorld$.
3: This instruction takes the first argument to the method - the String[] that represents the command-line arguments - and pushes it onto the stack.
4: This instruction invokes the main() instance method on the HelloWorld$ object that was pushed onto the stack at 0, passing it the String[] pushed onto the stack at 3.

At the heart of it, this is a pretty simple operation. If we were to write this method ourselves in Java, it would look like this:

public static void main(String[] args) {
    HelloWorld$.MODULE$.main(args);
}

What Is This MODULE$ Thing?
Moving on, I think the next thing we want to find out is, what is MODULE$ and how is it initialised? Chances are you're pretty smart and you've probably figured this out already, so let's just cut straight to the bytes:

[scala-tests] graham$ javap -c -classpath . HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
public static final HelloWorld$ MODULE$;

public static {};
Code:
0: new #10; //class HelloWorld$
3: invokespecial #13; //Method "<init>":()V
6: return

public HelloWorld$();
Code:
0: aload_0
1: invokespecial #17; //Method java/lang/Object."<init>":()V
4: aload_0
5: putstatic #19; //Field MODULE$:LHelloWorld$;
8: return
...

Again, reading this is pretty simple. Basically it's creating a singleton instance of HelloWorld$ which is stored in a public static field. There's a static block that creates a new HelloWorld$ object and there's a HelloWorld$() constructor, which does something which I find a little bit odd. If we translated it into Java, we'd have something like this:

public final class HelloWorld$ {
    public static final HelloWorld$ MODULE$;

    static {
        new HelloWorld$();
    }

    public HelloWorld$() {
        super();
        MODULE$ = this;
    }
}

Does that assignment in the constructor look a little weird to you? The constructor is setting the value of the static MODULE$ field. This is especially strange seeing as MODULE$ is a final static field. If you tried to use this Java code you'd find that it wouldn't even compile. At first I thought that this was so odd that I must have interpreted the bytecode incorrectly, so I wrote this little Java class to test out what happens when you call the HelloWorld$() constructor directly:

public class HelloWorldTest {
    public static void main(String[] args) {
        System.out.println(System.identityHashCode(HelloWorld$.MODULE$));
        new HelloWorld$();
        System.out.println(System.identityHashCode(HelloWorld$.MODULE$));
    }
}

Sure enough, the output of running this code shows that every invocation of the HelloWorld$() constructor does indeed change the value of MODULE$, even though it's final:

2114843907
1179703452

That's a bit of an oddity, but there's a lesson there - never call the constructor of a Scala object from Java code - there is no telling what kind of havoc you may wreak if you instantiate a second instance of an object that is assumed to be a singleton within your Scala code.
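
To see the same effect without scalac in the picture, here is a minimal Java sketch of the pattern. It makes two assumptions that differ from the real bytecode: the class is called HelloModule (a hypothetical stand-in for HelloWorld$, as is SingletonOverwriteDemo), and the INSTANCE field is not final, because javac refuses to compile a constructor that assigns a static final field.

```java
class SingletonOverwriteDemo {
    public static void main(String[] args) {
        // First access runs the static initialiser, which creates the "singleton".
        HelloModule original = HelloModule.INSTANCE;

        // The call the post warns against: a second instance replaces the first.
        HelloModule intruder = new HelloModule();

        System.out.println(original == HelloModule.INSTANCE); // prints false
        System.out.println(intruder == HelloModule.INSTANCE); // prints true
    }
}

// Hypothetical stand-in for HelloWorld$.
final class HelloModule {
    // Assumption: non-final so that javac accepts the assignment in the constructor.
    static HelloModule INSTANCE;

    static {
        new HelloModule(); // result is discarded; the constructor stores it
    }

    HelloModule() {
        INSTANCE = this;
    }
}
```

Any code still holding a reference to the original instance now disagrees with the static field about which object is "the" singleton, which is exactly the kind of havoc to avoid.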

Dude, Where's My Code?
Okay, so nothing we've examined up until now has had ANYTHING to do with the one line of imperative code that we wrote in Scala. But we're almost there, as there's only one piece of the double-main() puzzle left to examine - HelloWorld$.main():

[scala-tests] graham$ javap -c -classpath . HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
...
public void main(java.lang.String[]);
Code:
0: getstatic #26; //Field scala/Predef$.MODULE$:Lscala/Predef$;
3: ldc #28; //String Hello World!
5: invokevirtual #32; //Method scala/Predef$.println:(Ljava/lang/Object;)V
8: return

And here we finally see the code that we actually wanted to run. This method is extremely similar to the main() method we saw in HelloWorld: It pushes the singleton instance of Scala's Predef object onto the stack, pushes the "Hello World!" constant onto the stack and then invokes the Predef.println() method.

Can You Repeat That In English?
Let's summarise what we found in the bytecode of this HelloWorld program. When you define a main() method on a Scala object, the Scala compiler splits the responsibility of running your application into two classes. One class has the same name as your object and contains a static method that is invoked by the JVM to start the application. The other class, which has the same name as your object except with a dollar sign ($) at the end, contains the actual code of your object's main() method in an instance method called main(), as well as a constructor and a public static field for creating and accessing a singleton instance of this special class.
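
Put together, the layout looks roughly like this in Java. Treat it as a sketch of the shape rather than what scalac emits: Greeter and GreeterModule are hypothetical names (a dollar sign is legal in a Java identifier, but it's conventionally left to generated code), the singleton is created with an ordinary field initialiser and a private constructor instead of the constructor trick seen earlier, and message() is a helper added purely so the result is easy to check.

```java
// Plays the role of HelloWorld: the class you hand to the java command.
class Greeter {
    public static void main(String[] args) {
        // Static forwarder: fetch the singleton, call its instance main().
        GreeterModule.MODULE.main(args);
    }
}

// Plays the role of HelloWorld$: holds the singleton and the real code.
final class GreeterModule {
    static final GreeterModule MODULE = new GreeterModule();

    // Assumption: private here, whereas the generated constructor is public.
    private GreeterModule() { }

    // Hypothetical helper, not part of the generated class.
    String message() {
        return "Hello World!";
    }

    // Instance method containing the body of the object's main().
    void main(String[] args) {
        System.out.println(message());
    }
}
```

Running Greeter prints "Hello World!", with the static main() doing nothing except handing control to the singleton.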

If you want to read some more about why the code from your object ends up in a class that has a dollar sign at the end of its name, you might like to do some reading on companion objects in Scala.

Bonus Marks for Pointing Out Something Nerdy
There is one other little surprise in here that I skipped over and which I only noticed while looking at a decompilation of a Java HelloWorld program. The HelloWorld class generated by the Scala compiler has no constructor at all. Not one. It is im-possi-ble to create an object of type HelloWorld. This is a minor departure from the convention of Java where every class has at least one constructor, by virtue of the Java compiler generating one for you if you decide not to define one. This little difference has absolutely no effect on anything at all, but I always find it interesting when the Scala guys decide to contradict ideas that have been Java "laws" since what seems like the beginning of time. It almost feels like they're slaying Jack Harkness.
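
The Java half of that claim is easy to check with reflection. A small sketch, where ConstructorCount and NoExplicitConstructor are hypothetical names: the second class declares no constructor in its source, yet javac still gives it one. Pointing the same helper at the HelloWorld class from this post should, going by the javap listing earlier, report zero, but that isn't shown here since it needs the compiled Scala output on the classpath.

```java
class ConstructorCount {
    // Number of constructors present in the compiled class, of any visibility.
    static int declaredConstructors(Class<?> c) {
        return c.getDeclaredConstructors().length;
    }

    public static void main(String[] args) {
        // No constructor in the source, but javac adds a default one.
        System.out.println(declaredConstructors(NoExplicitConstructor.class)); // prints 1
    }
}

// Deliberately empty: relies on the compiler-generated default constructor.
class NoExplicitConstructor {
}
```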

Friday, November 6, 2009

XML Generation with Scala

When I first heard about a proposal to add native XML (a.k.a. XML literals) to Java, my first thought was: who writes XML in their code? I've created plenty of services that generate XML in my time, but they've all either used a template engine like JSPs or Velocity or generated the XML from an object tree, à la JAXB. So my question became: who would even WANT to write XML in their code? It stank of a poor separation of concerns.

Today, however, I am eating my words. I've just written a small utility that uses Scala to generate the HTML of a simple web page, so I used Scala's native XML and it is Good. It's remarkably simple and intuitive, so I thought I'd share the love.

XML Literal Coercion

So, Scala essentially has XML literals, which means that you can just start typing XML in the middle of your code and the Scala compiler will automatically coerce it into an object that can be used for performing xml-ish operations. If you try this using the 'scala' interpreter, you can see some of what happens under the hood:

scala> val myXml = <test><tag/></test>
myXml: scala.xml.Elem = <test><tag></tag></test>

Note that there are no quotes or special characters around the XML - it's just XML. Scala automatically turns it into an Elem object.

But this isn't where the fun ends! The power of native XML comes from the fact that it's very easy to escape out of XML and into Scala code, right in the middle of your XML. You can escape into Scala simply to include some computed output in the XML, or you can do more crazy things like looping through a list and yielding a list of more Elem objects to be included.

Escaping to Scala: Text

From what I've seen so far, there are two slightly different ways to escape into Scala from an XML literal. The first is used when you want to escape into Scala to include text or another XML element and it looks like this:

val name = "Graham"
val xmlOne = <test>{name}</test>
println(xmlOne)

Output:

<test>Graham</test>

Escaping to Scala: Attributes

If, however, you want to escape at the XML attribute level, you would be WRONG if you tried to do this (like I did):

val xmlTwo = <test name="{name}"/>
println(xmlTwo)

You don't get a compilation error, but the "escaping" is not interpreted, so you get this:

<test name="{name}"></test>

The syntax for escaping an attribute properly looks like this:

val xmlThree = <test name={name}/>
println(xmlThree)

All we've done is drop the quotes, and now we get the output we want:

<test name="Graham"></test>

Escaping to Scala: Elements and Looping

Finally - this is where the power becomes really evident - you can escape into Scala and evaluate an expression that yields a list of Elems in order to include an iteration of values in your XML. Observe:

val hobbies = List("Scala", "Photography", "Cycling")
val xmlFour =
  <test name={name}>
    {for (hobby <- hobbies) yield <hobby>{hobby}</hobby>}
  </test>
println(xmlFour)

Output:

<test name="Graham">
<hobby>Scala</hobby><hobby>Photography</hobby><hobby>Cycling</hobby>
</test>

Did you notice the really cool part of that? After we escaped out of the XML literal into Scala so that we could loop through the list, we then started another XML literal and then escaped out of that one to print the value 'hobby'! At that point we are nested 4-deep: Our outer XML literal contains nested Scala, which contains a nested XML literal, which contains more nested Scala. And yes, this could go on forever, down and down and down (just like the turtles). The great thing is that the syntax is so intuitive that you probably hardly noticed that we had gone that deep in this example - the code is simple and clear.

A Little Warning

Note that, when you're looping and yielding, there's nothing to stop you returning something that's not an Elem - Scala will simply call toString on whatever you return and insert it into your XML as text. For example, the following compiles and runs, but doesn't produce what we want:

object Test {
  case class Scala;
  case class Photography;
  case class Cycling;

  def main(args: Array[String]) {
    val name = "Graham"
    val xmlFive =
      <test name={name}>
        {for (hobby <- List(Scala, Photography, Cycling)) yield hobby}
      </test>
    println(xmlFive)
  }
}

The output from this?

<test name="Graham">
&lt;function&gt;&lt;function&gt;&lt;function&gt;
</test>

Ewwww… (and, yes, the entities do appear in the output)

In conclusion: I used to think that native XML in Java was a stupid idea. Now I don't really care, because Scala has it, it looks good and it works fine, so whenever I need to produce XML from within code I'll just use Scala.

A Side-Note

If you're slightly genius, you're probably wondering why, in the last example, I showed my whole 'Test' object, whereas in all the other examples I just showed the contents of the main() method. The simple explanation is that I had originally defined the case classes in the main method, but I got this compile error:

Test.scala:19: error: forward reference extends over definition of value xmlFive
{for (hobby <- List(Scala, Photography, Cycling)) yield hobby}
^

To be honest, I wasn't (and I'm still not) exactly sure why this was a problem, but it seemed obvious that scalac was not happy for me to define a case class inside a method and then use that class in the same method (well, actually, I'm using its companion object). I wonder if the compiler injects the instantiation of the companion object at the end of the method, rendering it unusable throughout? That would seem a bit silly. Anyway, the obvious fix was to move the declaration of the case classes out of the definition of main(), which is why I posted the whole object as the example.