Tuesday, January 18, 2011

Scala Pragmatism: Ignore CanBuildFrom

After a recent blog of mine about converting two lists to a map, a commentor wrote that they were...
trying to understand the code I read [in] the ScalaDoc for the zip() [function]:
def zip[A1 >: A,B,That](that: Iterable[B])(implicit bf: scala.collection.generic.CanBuildFrom[Repr,(A1, B),That]): That
would you please help us to read/understand this?

This is indeed curly stuff which, I'll have to admit, I wouldn't be able to fully explain without consulting a text book or two to get the terms right. Then you'd also have to read the same text books before you could understand my explanation.

If I can take some poetic license with the question, though, I think what the poster really wants to know is why the signature is so complex when the intent of the function is so simple. So instead of explaining the intricate details of what this signature means, I'm going to explain why, to make practical use of the zip() function, you don't need to understand the above signature, as well as a little bit about what the added complexity actually achieves.

Firstly: Why you don't need to bother too much.

If you look at the Scala API for List, you'll see that as well as the above signature for zip(), it also lists a "[use case]" signature for zip, which is simply this:
def zip [B] (that: Iterable[B]) : List[(A, B)]
The purpose of this 'Use Case' signature is to let you know you that, most of the time, you will be able to call zip() on a List and pass in just another List (or some other concrete subclass of Iterable) and get back a List of Tuple2s. Just like this:
scala> val list1 = List(1, 2, 3)
list1: List[Int] = List(1, 2, 3)

scala> val list2 = List('a', 'b', 'c')
list2: List[Char] = List(a, b, c)

scala> val result = list1.zip(list2)
result: List[(Int, Char)] = List((1,a), (2,b), (3,c))
That's the pragmatic part, and the most important part to understand. If you cast your eyes down the List scaladoc, you'll notice that there's lots of these use cases. (You should understand why this is the case once you understand why the extra complexity is there.)

To sum that up: to use the zip function, you just have to pass in an Iterable. You will rarely, if ever, need to know what CanBuildFrom is or what all those type parameters refer to.

Secondly: Why then have all that complexity in the signature, when I'm not using it?

Whoa! That's actually not true! The fact that we've called the function without explicitly providing that second parameter doesn't mean that you haven't used the extra complexity. In fact, you have used it, and it's provided you a small benefit, all without you realising. This is where a little bit of magic happens...

In understanding why that signature is so complex, the key is actually in the explanation above of what zip does. Note that I said that you can call zip() on a List and pass in just another List, and you will get back another List. This is actually a little bit amazing. It probably doesn't seem that amazing, but then you have to consider that the implementation of the zip function is not actually defined on the List class, but is defined in a trait that List inherits from called IterableLike. What is surprising here is that the IterableLike trait doesn't know anything about the List class, and yet its zip function is able to return an instance of List. (And remember that Scala's List is a concrete class, very different from Java's List interface.)

This is what the CanBuildFrom parameter is all about: it allows functions like zip to be defined once, high up in the collections API hierarchy, but then to also return a different type of result depending on the type of the collection on which the method is actually invoked. To put that in practical terms: if I call zip on a List, I'll get back a List of tuples, but if I call zip on a mutable.HashSet, I'll get back a mutable.HashSet of tuples. The same goes for map() and filter() and countless other functions. This is the magic: these functions don't know anything about these concrete types they are returning, but you haven't had to tell it about the concrete types either! To achieve the same result in Java without any casting, you would have to override the zip function in each and every concrete subclass of IterableLike from which you wished to return a specific collection type. (Edit: Actually, I don't think this is true. I think I've figured out in my head a way to do it that doesn't require casting or overriding.)

There is just one mystery left: why don't you have to pass in an instance of CanBuildFrom? As you can see, the parameter list containing this parameter is implicit. This means that, if this parameter list is not explicitly defined in any invocation, the Scala compiler will to attempt to find a variable or object that is in scope, and declared to be implicit, and also meets the type criteria of the parameter. So you don't have to provide a CanBuildFrom instance because scalac is tracking one down for you.

But where is it tracking it down from? All code has to come from somewhere, right? At first, I thought that there might be one CanBuildFrom instance in the Predef object that served the needs of nearly every collection. The real answer is a bit less exciting: each concrete subclass provides its own instance of CanBuildFrom in its companion object, though in many cases this definition is as simple as creating a new instance of GenericCanBuildFrom. I have to be honest and admit that I haven't yet figured out how this definition in the companion object makes its way into the zip function and other functions. While you unwittingly import the companion object whenever you import a collection type, you haven't imported all the companion object's functions into scope. Perhaps someone with a bit more patience for searching the Scala source code will nice enough to explain it in the comments.

So, to sum up:
  • You don't need to understand CanBuildFrom in order to use functions that have one in their signature.
  • Look for "[use case]" entries in the scaladoc for the collections API and use those as the basis for function calls unless the compiler seems to be telling you to do otherwise
  • The advantage of the complex function signatures is that the definition can be written once, but can return many different concrete types of collection, without you having to tell it how to do that.

Want to learn more?

If you'd like to get a bit more detail about how Scala's collections and the related classes like CanBuildFrom operate under the covers, there is a series of pages on scala-lang describing some of the implementation details of Scala's new collections API.

And if you really want to know every last thing about why this signature looks the way it looks and how it works, you might want to read the academic Scala paper, 'Generics of a Higher Kind', but please don't post questions about that on my blog! ;)

If you just want to learn a bit more about Scala, try one of these:

From Amazon...

From Book Depository...

Monday, January 10, 2011

Scala == Effective Java ?

I started reading Joshua Bloch's Effective Javalast week. I'll have to admit that I haven't read it before, but only because I've been told by several people, "you already do most of what's in there anyway." Seeing as we tell all the new recruits to read it, I thought I should actually flip through it myself so I know what's in there.

Books of best practices are always written in relation to domains that have many possibilities for bad practices (choosing otherwise would make for a very short book). Reading the first chapter of Effective Java, I was amused as I realised that, if you're coding in Scala instead of Java, many of the book's recommendations are either unnecessary, because Scala doesn't permit the corollary bad practice, or built into the language of Scala, or made easier to implement than they are in Java. This isn't a criticism of the book, but an observation that the state of the art is moving on, and Java is being left behind.

From the first 25 items in the book, here are my notes on practices that either become easier to follow or unnecessary if you are using Scala:

Item 1: Consider static factory methods
One of the four advantages given for static factory methods is that construction of classes with type parameters is briefer through static methods because the type parameters can be inferred on a parameterised method call. Scala solves this one by inferring type parameters on constructors, but not only that. It infers LOTS of types, almost everywhere: fields, local variables, method return types, anonymous function parameters - all can usually have their type inferred from the context, meaning that Scala code has a lot less declarative noise, while still remaining statically type-checked.

Item 2: Consider the Builder pattern instead of long constructors
Joshua writes himself that the Builder pattern simulates the combination of named parameters and default parameters that are found in other languages, and Scala supports named and default parameters. What he means by "simulates" is that you need to write a whole extra Builder class in Java to get the nice Builder effect at the client site, whereas in Scala, you essentially get a Builder for free every time you write a constructor.

Item 3: Enforce Singletons
While code-enforced singletons have gone out of fashion a bit with the popularity of IoC containers, they're still around a lot and these rules are still important. Scala supports singletons at the language level, providing the 'object' keyword which, substituted for the word 'class' in a declaration, will compile a singleton object into your code instead of a class. The generated object follows all of Josh's recommendations, including deserializing to the same instance (as long as you use Scala's @serializable rather than extends java.io.Serializable)

Item 4: Enforce non-instantiability with private constructors
There's really only two cases where you want to do this: singletons (see above), and utility classes, which are really just singletons that have no state. Either way, Scala's 'object' keyword for creating singletons is the simple answer.

Item 5: Avoid auto-boxing
The recommended panacea for unwanted auto-boxing is to prefer primitives wherever possible. Now, Scala is a purely object-oriented language, which means there are no primitives, and all numbers, characters and booleans are represented at the language level as objects. However, Scala uses of these primitive wrapper objects compile to byte code which uses Java's primitives wherever possible, so this recommendation is implemented for you by the Scala compiler.

Items 7, 8 and 9: Overriding toString, hashCode and equals
If you're authoring an immutable data structure, declaring your Scala class as a case class will tell the compiler to automatically implement toString, hashCode and equals for you, along with an unapply() method that can be used in a pattern matching clause. (There are some disadvantages to using Scala's case classes, but I believe they work well for most situations.)

Item 11: Override clone() judicously
While Scala as a language doesn't provide an answer to this one, it's considered best-practice Scala to favour immutable types, with transformation being much favoured over mutation. Following this principle will reduce the need to ever use clone(), because immutable objects can be shared among many clients and threads without the shared-mutable-state worry that might cause you to consider cloning something.

Item 14: Use public accessors methods rather than public fields
While Scala appears to have public fields - and indeed to make fields public by default - in fact Scala implements Bertrand Myer's "Uniform Access Principle", with all access to fields (which are in fact all private) being made through accessor and mutator functions that have the same name as the field. In other words, the compiler writes get and set methods for you and everything that looks like field access in your code actually goes through these methods.

Item 15: Minimize mutability
As already mentioned, it's considered Scala best practice to shun mutable state as much as possible. One of Josh's four recommendations for decreasing mutability is to make values final wherever possible. All Scala fields and local variables must be preceded by either 'val', indicating immutability (of the reference), or 'var', indicating that the reference can change. (Function parameters are always vals, and hence final.) Forcing programmers to make this choice when each variable is declared encourages the practice of using a lot of immutability, compared to final which is an optional modifier and seen by many as extra noise in most declarations.

Item 18: Prefer interfaces to abstract classes
Scala has traits - abstract classes that essentially allow multiple inheritance of function definitions (as opposed to interfaces, which only inherit function declarations). The possibility of multiple inheritance discounts quite a few of the disadvantages Josh raises against using abstract classes in Java. Of course, it also introduces some new issues, and exacerbates some others, like those Josh lists in Item 16 about preferring composition+delegation over inheritance. Multiple inheritance is a double-edged sword, for sure.

Item 21: Use function pointers
Scala, as a functional language, has functions as first-class members of the language, with every function naturally being available as an object (same as what Josh calls a function pointer) should the need arise. It also supports anonymous, inline functions, which, if available in Java, could reduce current "function pointer" logic like this:
new Thread(new Runnable() { public void run() { System.println("Running"); } } );
down to something like this:
new Thread({System.println("Running")})

Item 23: Don't use raw generic types
Scala doesn't allow you to write code that uses raw generic types, even if those types are defined in Java: it just won't compile. For what it's worth, raw generic types are not a feature, but merely an artefact of backwards-compatibility. Scala, not trying to be backwards-compatible with Java 4, just doesn't need raw types and, as a result, is able to provide stricter type safety for type-parameterised classes and functions.

Item 25: Prefer Lists over Arrays
While you should probably still prefer Scala's List class to Arrays for most applications, Scala prevents the chief problem cited with Java's arrays. In Java, you can cast a String[] to an Object[] and then assign a Long to one of the entries. This compiles without error, but will fail at runtime. In Scala, however, arrays are represented in source code as a parameterized class, Array[T], where T is an invariant type parameter. This basically just means that you can't assign an Array[String] to an Array[Object], and so Scala prevents this problem at compile time rather than choking at runtime.

Scala == Effective Java ?
So I'm going to put this question out there:
If 'Effective Java' is considered essential reading, and the best practices in it are the de facto standard for writing good programs, shouldn't we all be giving serious consideration to switching to a language that is so very close to Java, but makes good programming even easier?

Want to learn more?
From Amazon...
From Book Depository...