Tuesday, January 18, 2011

Scala Pragmatism: Ignore CanBuildFrom

After a recent blog of mine about converting two lists to a map, a commentor wrote that they were...
trying to understand the code I read [in] the ScalaDoc for the zip() [function]:
def zip[A1 >: A,B,That](that: Iterable[B])(implicit bf: scala.collection.generic.CanBuildFrom[Repr,(A1, B),That]): That
would you please help us to read/understand this?

This is indeed curly stuff which, I'll have to admit, I wouldn't be able to fully explain without consulting a text book or two to get the terms right. Then you'd also have to read the same text books before you could understand my explanation.

If I can take some poetic license with the question, though, I think what the poster really wants to know is why the signature is so complex when the intent of the function is so simple. So instead of explaining the intricate details of what this signature means, I'm going to explain why, to make practical use of the zip() function, you don't need to understand the above signature, as well as a little bit about what the added complexity actually achieves.

Firstly: Why you don't need to bother too much.

If you look at the Scala API for List, you'll see that as well as the above signature for zip(), it also lists a "[use case]" signature for zip, which is simply this:
def zip [B] (that: Iterable[B]) : List[(A, B)]
The purpose of this 'Use Case' signature is to let you know you that, most of the time, you will be able to call zip() on a List and pass in just another List (or some other concrete subclass of Iterable) and get back a List of Tuple2s. Just like this:
scala> val list1 = List(1, 2, 3)
list1: List[Int] = List(1, 2, 3)

scala> val list2 = List('a', 'b', 'c')
list2: List[Char] = List(a, b, c)

scala> val result = list1.zip(list2)
result: List[(Int, Char)] = List((1,a), (2,b), (3,c))
That's the pragmatic part, and the most important part to understand. If you cast your eyes down the List scaladoc, you'll notice that there's lots of these use cases. (You should understand why this is the case once you understand why the extra complexity is there.)

To sum that up: to use the zip function, you just have to pass in an Iterable. You will rarely, if ever, need to know what CanBuildFrom is or what all those type parameters refer to.

Secondly: Why then have all that complexity in the signature, when I'm not using it?

Whoa! That's actually not true! The fact that we've called the function without explicitly providing that second parameter doesn't mean that you haven't used the extra complexity. In fact, you have used it, and it's provided you a small benefit, all without you realising. This is where a little bit of magic happens...

In understanding why that signature is so complex, the key is actually in the explanation above of what zip does. Note that I said that you can call zip() on a List and pass in just another List, and you will get back another List. This is actually a little bit amazing. It probably doesn't seem that amazing, but then you have to consider that the implementation of the zip function is not actually defined on the List class, but is defined in a trait that List inherits from called IterableLike. What is surprising here is that the IterableLike trait doesn't know anything about the List class, and yet its zip function is able to return an instance of List. (And remember that Scala's List is a concrete class, very different from Java's List interface.)

This is what the CanBuildFrom parameter is all about: it allows functions like zip to be defined once, high up in the collections API hierarchy, but then to also return a different type of result depending on the type of the collection on which the method is actually invoked. To put that in practical terms: if I call zip on a List, I'll get back a List of tuples, but if I call zip on a mutable.HashSet, I'll get back a mutable.HashSet of tuples. The same goes for map() and filter() and countless other functions. This is the magic: these functions don't know anything about these concrete types they are returning, but you haven't had to tell it about the concrete types either! To achieve the same result in Java without any casting, you would have to override the zip function in each and every concrete subclass of IterableLike from which you wished to return a specific collection type. (Edit: Actually, I don't think this is true. I think I've figured out in my head a way to do it that doesn't require casting or overriding.)

There is just one mystery left: why don't you have to pass in an instance of CanBuildFrom? As you can see, the parameter list containing this parameter is implicit. This means that, if this parameter list is not explicitly defined in any invocation, the Scala compiler will to attempt to find a variable or object that is in scope, and declared to be implicit, and also meets the type criteria of the parameter. So you don't have to provide a CanBuildFrom instance because scalac is tracking one down for you.

But where is it tracking it down from? All code has to come from somewhere, right? At first, I thought that there might be one CanBuildFrom instance in the Predef object that served the needs of nearly every collection. The real answer is a bit less exciting: each concrete subclass provides its own instance of CanBuildFrom in its companion object, though in many cases this definition is as simple as creating a new instance of GenericCanBuildFrom. I have to be honest and admit that I haven't yet figured out how this definition in the companion object makes its way into the zip function and other functions. While you unwittingly import the companion object whenever you import a collection type, you haven't imported all the companion object's functions into scope. Perhaps someone with a bit more patience for searching the Scala source code will nice enough to explain it in the comments.

So, to sum up:
  • You don't need to understand CanBuildFrom in order to use functions that have one in their signature.
  • Look for "[use case]" entries in the scaladoc for the collections API and use those as the basis for function calls unless the compiler seems to be telling you to do otherwise
  • The advantage of the complex function signatures is that the definition can be written once, but can return many different concrete types of collection, without you having to tell it how to do that.

Want to learn more?

If you'd like to get a bit more detail about how Scala's collections and the related classes like CanBuildFrom operate under the covers, there is a series of pages on scala-lang describing some of the implementation details of Scala's new collections API.

And if you really want to know every last thing about why this signature looks the way it looks and how it works, you might want to read the academic Scala paper, 'Generics of a Higher Kind', but please don't post questions about that on my blog! ;)

If you just want to learn a bit more about Scala, try one of these:

From Amazon...


From Book Depository...


4 comments:

  1. Hi Graham - check out
    http://www.scala-lang.org/docu/files/collections-api/collections-impl.html and also
    http://lampwww.epfl.ch/~odersky/papers/fsttcs2009.pdf.

    Both go into detail regarding the design of the 2.8 collections API

    ReplyDelete
  2. Very good explanation, Graham! To answer your question at the end: If the compiler searches for an implicit parameter with type T, it will first consider all implicits in scope. If none is found, it will next consider all implicits that are contained in a companion object of some part of T.

    In the concrete case you mention, you are looking for an implicit value of type

    CanBuildFrom[List[A], (A, B), ?]

    where the ? can be any type. List is a part of this type, so the compiler will look up the implicit canBuildFrom in its companion object. And that's how it is found.

    But I want to really stress here that the user of the collection library does not need to know anything about all this - it just works. As you write, concentrate on the [use case] defs and you are fine.

    ReplyDelete
  3. @Eric P: Thanks very much for those links.
    The 'collections-impl' pages on scala-lang in particular look like they give a much more practically-minded description of the underlying magic than Adrian's paper. (That's not a criticism of the paper, of course - it's for a different audience.) I think I'll add that one into the post.

    ReplyDelete
  4. @martin: Thanks very much for your compliment, and even more so for solving the final mystery. I never knew that implicit resolution extended to companion objects, so that's a nice little nugget.

    Your explanation that it searches companion objects for "some part of T" is interesting in its lack of specificity. When you say "some part", it seems you are saying that the compiler flattens out the required type and its parameters. Is that right? So if the type of some more restrictive implicit parameter were
    CanBuildFrom[List[Lead], (Lead,Gold), ?]
    then would it flatten that type out to:
    CanBuildFrom, List, Lead, Gold
    and then search any companion objects of all four of those?

    Cheers,

    Graham.

    ReplyDelete