Saturday, November 28, 2009

Scala: Under The Hood of Hello World

The exercise of printing 'Hello World' from a new programming language is so popular because it allows you to see quite a few things about a language straight away, including the style of the syntax, how to print output and the amount of boilerplate required just to start a simple application.

After getting Hello World to work, there are really two directions you could go: you can go forward, and start doing other stuff with the language, or you can go down, drill deeper, by asking the question "How does it work?"

While moving forward is certainly more productive and exciting in the short term, understanding what's going on underneath the covers of your code can be very beneficial and can lead to insights that may affect the way you write code for the rest of your programming life. For instance, if you've done a bit of Java and a bit of Scala, it may have occurred to you that Scala allows you to define objects (as opposed to classes), while Java has no concept of compile-time objects, so how does the Scala compiler map the concept of an object into Java's single-minded view of classes? We'll find out…

Opening the Hood with Scala
Unless you've managed to run a Scala Hello World program accidentally, you have probably read enough of some Scala tutorial, book or blog to know that Scala is compiled to Java Bytecode - the binary language that is read and executed by the Java Virtual Machine (JVM). This means that a good way to get an idea of what actually happens when we run a Scala program is to have a look at the bytecode that is generated when we compile the application.

So, here's your standard Scala Hello World program:

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello World!")
  }
}

If we compile this program and have a look at the output directory for classes, we'll find two Java class files:

-rw-r--r-- 1 graham staff 778 28 Nov 09:43 HelloWorld$.class
-rw-r--r-- 1 graham staff 607 28 Nov 09:43 HelloWorld.class

Two classes? Yep, two classes. We compile one "object" and we end up with two classes. Let's have a look at what's inside these classes.

If you want to see what's going on inside a Java class (or a Scala class or object that's been compiled to a Java class) you can inspect the class file with the JDK tool 'javap'. Depending on the options you provide to javap, it can print out just a summary of the class's method signatures, or it can show you a pseudo-English translation of all the compiled bytecode. Let's start by having a look at just the signatures:

[scala-tests] graham$ javap -classpath . HelloWorld HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld extends java.lang.Object{
public static final void main(java.lang.String[]);
public static final int $tag() throws java.rmi.RemoteException;
}

Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
public static final HelloWorld$ MODULE$;
public static {};
public HelloWorld$();
public void main(java.lang.String[]);
public int $tag() throws java.rmi.RemoteException;
}

Wow! So, we wrote one method - HelloWorld.main() - and we've ended up with 6 methods and one static field across two classes. Obviously not all of this code is relevant to the running of our HelloWorld program, so let's discuss some of the surrounding fluff and then put it out of mind.

What Is This $tag() Thing?
Probably the first obvious thing is that both classes have a method called $tag(). If we have a look at the Scaladocs for the ScalaObject trait, the base trait of all classes and objects compiled by Scala, we'll see no mention of the $tag() method. However, if you have a look at the source of ScalaObject.scala, you'll see this definition, which has been left out of the Scaladoc:

/** This method is needed for optimizing pattern matching expressions
 *  which match on constructors of case classes.
 */
@remote
def $tag(): Int = 0

Basically, the $tag() method is a simple categorisation method akin to java.lang.Object's hashCode() that is used by Scala to make its much-lauded pattern matching perform better. Interestingly, in the 2.8.0 branch of Scala, $tag() has been removed, so now that we understand it, we know that we don't really have to worry about understanding it any more!
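
To make the idea of a "categorisation method" a bit more concrete, here is a minimal Java sketch of dispatching on an integer tag instead of walking a chain of instanceof tests. Everything in it (Shape, Circle, Square, tag(), TagDispatch) is a hypothetical name invented for illustration, and it is only an assumption about the general technique; it is not a description of how the Scala compiler actually used $tag().

```java
// Hypothetical illustration of tag-based dispatch; NOT how scalac used $tag().
abstract class Shape {
    // Small integer identifying the concrete subclass.
    abstract int tag();
}

final class Circle extends Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    int tag() { return 0; }
}

final class Square extends Shape {
    final double side;
    Square(double side) { this.side = side; }
    int tag() { return 1; }
}

final class TagDispatch {
    // One virtual call plus a switch, rather than a series of instanceof checks.
    static double area(Shape s) {
        switch (s.tag()) {
            case 0: {
                Circle c = (Circle) s;
                return Math.PI * c.radius * c.radius;
            }
            case 1: {
                Square q = (Square) s;
                return q.side * q.side;
            }
            default:
                throw new IllegalArgumentException("unknown tag " + s.tag());
        }
    }

    public static void main(String[] args) {
        System.out.println(area(new Square(2.0))); // prints 4.0
    }
}
```

The appeal of this shape is that a switch over small integers can be compiled to a jump table, whereas a chain of type tests has to be tried one at a time.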

What Makes It Run?
Having learnt enough to ignore the $tag() method, let's have a look at how our program runs. Your keen eye may have noticed that we have two main() methods - one on the HelloWorld class and one on the HelloWorld$ class. If you're lucky enough to have two keen eyes, you would have noticed that the main() method on HelloWorld is static, but the main() method on HelloWorld$ is not. The significance of this is that, though there are two main() methods, only one of the classes - HelloWorld - can be used to start the application. If I try to start the application by telling the java command to start with the HelloWorld$ class, the JVM will happily tell me that I'm an idiot:

[scala-tests] graham$ java -cp .:$HOME/Library/Scala/Current/lib/scala-library.jar HelloWorld$
Exception in thread "main" java.lang.NoSuchMethodError: main
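
If you'd rather check this without provoking the JVM, the static-versus-instance distinction can be tested with reflection. The sketch below is a rough approximation of the launcher rule described above, not the JVM's real lookup code; StaticEntry, InstanceEntry and EntryPointCheck are hypothetical stand-ins I've made up for the two generated classes.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for HelloWorld: main() is static.
final class StaticEntry {
    public static void main(String[] args) { }
}

// Hypothetical stand-in for HelloWorld$: main() is an instance method.
final class InstanceEntry {
    public void main(String[] args) { }
}

final class EntryPointCheck {
    // True when the class has a public static void main(String[]) method.
    static boolean hasLaunchableMain(Class<?> type) {
        try {
            Method m = type.getMethod("main", String[].class);
            int mods = m.getModifiers();
            return Modifier.isPublic(mods)
                    && Modifier.isStatic(mods)
                    && m.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasLaunchableMain(StaticEntry.class));   // prints true
        System.out.println(hasLaunchableMain(InstanceEntry.class)); // prints false
    }
}
```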

Instead of acting like an idiot, let's have a look at what this HelloWorld.main() method does. The javap command allows us to see the actual bytecode instructions that are contained within the class if we pass the -c option:

[scala-tests] graham$ javap -c -classpath . HelloWorld
Compiled from "HelloWorld.scala"
public final class HelloWorld extends java.lang.Object{
public static final void main(java.lang.String[]);
Code:
0: getstatic #11; //Field HelloWorld$.MODULE$:LHelloWorld$;
3: aload_0
4: invokevirtual #13; //Method HelloWorld$.main:([Ljava/lang/String;)V
7: return
...

If you've learnt a little bit about Java bytecode, reading what this method does is pretty simple…
0: This instruction retrieves the value of the static field HelloWorld$.MODULE$ and pushes it onto the stack. We can see both from this line (after the colon) and from the signatures we looked at above that the type of the MODULE$ field is HelloWorld$.
3: This instruction takes the first argument to the method - the String[] that represents the command-line arguments - and pushes it onto the stack.
4: This instruction invokes the main() instance method on the HelloWorld$ object that was pushed onto the stack at 0, passing it the String[] pushed onto the stack at 3.

At the heart of it, this is a pretty simple operation. If we were to write this method ourselves in Java, it would look like this:

public static void main(String[] args) {
    HelloWorld$.MODULE$.main(args);
}

What Is This MODULE$ Thing?
Moving on, I think the next thing we want to find out is, what is MODULE$ and how is it initialised? Chances are you're pretty smart and you've probably figured this out already, so let's just cut straight to the bytes:

[scala-tests] graham$ javap -c -classpath . HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
public static final HelloWorld$ MODULE$;

public static {};
Code:
0: new #10; //class HelloWorld$
3: invokespecial #13; //Method "<init>":()V
6: return

public HelloWorld$();
Code:
0: aload_0
1: invokespecial #17; //Method java/lang/Object."<init>":()V
4: aload_0
5: putstatic #19; //Field MODULE$:LHelloWorld$;
8: return
...

Again, reading this is pretty simple. Basically it's creating a singleton instance of HelloWorld$ which is stored in a public static field. There's a static block that creates a new HelloWorld$ object and there's a HelloWorld$() constructor, which does something which I find a little bit odd. If we translated it into Java, we'd have something like this:

public final class HelloWorld$ {
    public static final HelloWorld$ MODULE$;

    static {
        new HelloWorld$();
    }

    public HelloWorld$() {
        super();
        MODULE$ = this;
    }
}

Does that assignment in the constructor look a little weird to you? The constructor is setting the value of the static MODULE$ field. This is especially strange seeing as MODULE$ is a final static field. If you tried to use this Java code you'd find that it wouldn't even compile. At first I thought that this was so odd that I must have interpreted the bytecode incorrectly, so I wrote this little Java class to test out what happens when you call the HelloWorld$() constructor directly:

public class HelloWorldTest {
    public static void main(String[] args) {
        System.out.println(System.identityHashCode(HelloWorld$.MODULE$));
        new HelloWorld$();
        System.out.println(System.identityHashCode(HelloWorld$.MODULE$));
    }
}

Sure enough, the output of running this code shows that every invocation of the HelloWorld$() constructor does indeed change the value of MODULE$, even though it's final:

2114843907
1179703452

That's a bit of an oddity, but there's a lesson there - never call the constructor of a Scala object from Java code - there is no telling what kind of havoc you may wreak if you instantiate a second instance of an object that is assumed to be a singleton within your Scala code.
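
For comparison, here is a minimal sketch of how the same singleton might be written by hand in Java so that it does compile. This is not what the Scala compiler emits; Greeter, MODULE, greeting() and GreeterDemo are hypothetical names of my own. The two differences from the translation above are that the final field is assigned in the static initializer rather than in the constructor, and that the constructor is private.

```java
// Hypothetical hand-written counterpart of HelloWorld$; not compiler output.
final class Greeter {
    // Assigned exactly once, in the static initializer, so javac accepts `final`.
    static final Greeter MODULE;

    static {
        MODULE = new Greeter();
    }

    // Private, so ordinary Java code outside this class cannot call it.
    private Greeter() { }

    String greeting() {
        return "Hello World!";
    }
}

final class GreeterDemo {
    public static void main(String[] args) {
        // Both lines print the same number: nothing here can replace MODULE.
        // Writing `new Greeter()` in this class would be a compile error.
        System.out.println(System.identityHashCode(Greeter.MODULE));
        System.out.println(System.identityHashCode(Greeter.MODULE));
        System.out.println(Greeter.MODULE.greeting()); // prints Hello World!
    }
}
```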

Dude, Where's My Code?
Okay, so nothing we've examined up until now has had ANYTHING to do with the one line of imperative code that we wrote in Scala. But we're almost there, as there's only one piece of the double-main() puzzle left to examine - HelloWorld$.main():

[scala-tests] graham$ javap -c -classpath . HelloWorld$
Compiled from "HelloWorld.scala"
public final class HelloWorld$ extends java.lang.Object implements scala.ScalaObject{
...
public void main(java.lang.String[]);
Code:
0: getstatic #26; //Field scala/Predef$.MODULE$:Lscala/Predef$;
3: ldc #28; //String Hello World!
5: invokevirtual #32; //Method scala/Predef$.println:(Ljava/lang/Object;)V
8: return

And here we finally see the code that we actually wanted to run. This method is extremely similar to the main() method we saw in HelloWorld: It pushes the singleton instance of Scala's Predef object onto the stack, pushes the "Hello World!" constant onto the stack and then invokes the Predef.println() method.

Can You Repeat That In English?
Let's summarise what we found in the bytecode of this HelloWorld program. When you define a main() method on a Scala object, the Scala compiler splits the responsibility of running your application into two classes. One class has the same name as your object and contains a static method that is invoked by the JVM to start the application. The other class, which has the same name as your object except with a dollar sign ($) at the end, contains the actual code of your object's main() method in an instance method called main(), as well as a constructor and a public static field for creating and accessing a singleton instance of this special class.

If you want to read some more about why the code from your object ends up in a class that has a dollar sign at the end of its name, you might like to do some reading on companion objects in Scala.

Bonus Marks for Pointing Out Something Nerdy
There is one other little surprise in here that I skipped over and which I only noticed while looking at a decompilation of a Java HelloWorld program. The HelloWorld class generated by the Scala compiler has no constructor at all. Not one. It is im-possi-ble to create an object of type HelloWorld. This is a minor departure from the convention of Java where every class has at least one constructor, by virtue of the Java compiler generating one for you if you decide not to define one. This little difference has absolutely no effect on anything at all, but I always find it interesting when the Scala guys decide to contradict ideas that have been Java "laws" since what seems like the beginning of time. It almost feels like they're slaying Jack Harkness.
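
The Java half of that claim is easy to confirm with reflection. In this small sketch, Plain and ConstructorCount are hypothetical names; Plain has no constructor in its source, yet the compiled class still reports one, because javac adds the default constructor for you. Going by the javap output earlier, I'd expect the same call on the Scala-generated HelloWorld class to report zero, but treat that as an assumption rather than something demonstrated here.

```java
// Hypothetical class with no constructor written in its source.
class Plain { }

final class ConstructorCount {
    // Number of constructors present in the compiled class, of any visibility.
    static int declaredConstructors(Class<?> type) {
        return type.getDeclaredConstructors().length;
    }

    public static void main(String[] args) {
        System.out.println(declaredConstructors(Plain.class)); // prints 1
    }
}
```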

3 comments:

  1. Another way to see under the hood is -Xprint:phase, which prints the tree at the completion of the specified compiler phase. The phases can be listed with scalac -Xshow-phases.

    To see the tree after every phase, use -Xprint:all

    e.g. scala -Xprint:all -e 'println("Hello, World!")'

    This is especially useful for seeing which implicit conversions are applied and which implicit arguments are passed into implicit parameter lists.

  2. Hi,
    Nice article.

    I see that the article has some updates for 2.8.0 - I just wanted to add another bit there - now the constructor of HelloWorld$ is private and the HelloWorldTest bit of your article, where you call "new HelloWorld$()" to create multiple instances of the singleton object, is not really valid anymore.

    Cheers,
    Roshan

  3. @Roshan, but then you can always use reflection and suppress the normal java access restrictions
