A faster `List.map` for Elm

I just ran some benchmarks on an alternative List.map implementation in JavaScript and it looks really fast!

There’s an old LISP trick from the 70’s called “tail recursion modulo cons” that allows you to optimise certain linked list functions that are “not quite” tail recursive. “Not quite” in this case means they do a “cons” before the tail call. The trick allows you to get around that and turn it into a while loop as the compiler does for proper tail recursive functions.

@jreusch mentioned this to me once in the comments of a post about my Elm in WebAssembly project over a year ago now.

Then recently I was writing a JS function a bit like List.map as part of my project, remembered the idea, and tried it out. I got curious and decided to benchmark it against the core List.map and see what happened. Well… good things happened!

| length | Core ns/cell | Modulo Cons ns/cell |
| ------ | ------------ | ------------------- |
| 10     | 74.5         | 11.3                |
| 100    | 63.3         | 8.5                 |
| 1000   | 65.1         | 8.4                 |
| 10000  | 75.2         | 9.8                 |
| 20000  | 85.0         | 11.4                |
| 50000  | 100.0        | 16.5                |

chart

The benchmark code is here if you want to try it out yourself. It involves a bit of hacking the compiled JS but there are instructions in the comments in the Elm file.

Generality

I think the optimisation should work for most of the core List functions that are currently based on foldr, which is quite a few of them:

  • List.map
  • List.indexedMap
  • List.filter
  • List.append
  • List.concat
  • List.intersperse
  • List.partition
  • List.unzip

However the foldr function itself seems to be too general. This optimisation relies on knowing the constructor for the return type.

List.partition and List.unzip each return a tuple of lists rather than a single list, but it should be easy to extend the optimisation to cover that case.

26 Likes

Very cool!

I’d be open to adding this to elm-optimize-level-2 if you’re interested!
Robin mentioned this optimization, (we have it described here on our list of interesting transformations) but we didn’t quite get around to adding it. Cool to see the difference in the benchmarks!

We also have a pretty good sized benchmarking suite for realworld packages, so we could see how this affects their performance as well.

8 Likes

This is a neat optimization! I think the most generic version I have heard of is to do this for any direct use of a constructor. For example:

type Pairs = Nil | Cons Int Int Pairs

swap : Pairs -> Pairs
swap pairs =
  case pairs of
    Nil         -> Nil
    Cons x y ps -> Cons y x (swap ps) 

Should be able to do this optimization in that case as well. And any other “list-like” constructor. I think any constructor should work to some extent though. For example:

type Tree = Empty | Node Int Tree Tree

double : Tree -> Tree
double tree =
  case tree of
    Empty      -> Empty
    Node n l r -> Node (2 * n) (double l) (double r)

I think it’d be possible to figure out some variation that’d partially handle this kind of scenario. Like recursing normally on r and doing some sort of trick for l. I am sure I have read something about this before, but I cannot remember now! Maybe in OCaml or SML? I think there are additional techniques that can handle even more cases (like mutually recursive types and functions) but I am drawing a blank on what terms to search for. If someone can find the papers on “generalizing tail recursion” or whatever, please share them!

Anyway, I would be curious the effect this has on speed/locality of allocating data structures. Maybe things end up in a nicer order in the heap, but I would want to see a breakdown.

Would be really cool to explore in elm-optimize-level-2 for now I think. Probably combines in interesting ways with deforestation as well!

6 Likes

I thought a lot about this kind of optimization and if/how it could be implemented in current Elm.
tl;dr: You only need to completely rewrite the compiler!

First, we need to formalize the not-quiteness of this kind of tail-recursion. We know it should work for Cons, but the List type is implemented as a library, not compiler magic!

Here is a basic template of what a not-quite-tail-recursive function might look like in Elm:

notQuite x =
   f (notQuite (loopBody x))

f (...) here is some function application that traditionally would prevent tail recursion. As I tried to explain in my example in that other thread, the important fact here is to be able to prove that this function application does not dereference the argument. If it doesn’t, it would’nt be able to tell if we just called it with a pointer to some uninitialized memory, allowing us to initialize that memory later, for example in a tail-recursive call!

Since Elm is such a simple language, here are the rules:

  • Primitive types (Int, Float, String as well as “newtype” wrappers around those) are always dereferenced, since they are not references in the first place.
  • An if ... then ... else ... expression dereferences a value if any of the ... deferences the value.
  • A case expression always dereferences the value being examined, but follows the same rules as if for all of its branches
  • Calling constructors, including the record literal syntax, never dereferences the fields (except when they are primitives). This is the magic sauce that makes tail recursion after cons work!
  • Record extension dereferences the extended record, but not the fields.
  • Function application dereferences the function, but only dereferences the applied value if the function itself dereferences it.
  • Accessing a field of a record dereferences that record.

Note that this is not full lazyness - an expression either fully dereferences a value (think deepSeq), or leaves it fully lazy. It also does not evaluate a thunk - the value has to already be computed prior to executing that expression! There are also no partially computed values, since we try to track everything at compile time. This is especially important in function application - f x may not read x at all, even though f would, because it could also construct a closure intead! For the sake of sanity I strongly advice ignoring this.

Everything else seems pretty manageable, we just track a deref annotation for every expression, and then check if that holds for the f in our example - If it only accesses notQuite (loopBody x) by reference, we can reverse the execution order, as long as we can provide the memory address (the reference) of the result before computing it!

I minor detail I hinted at earlier is that we now have 2 different kinds of functions:

&a -> b -- only access reference to a
!a -> b -- access value to a

I just made up that reference annotation, and it would not be part of the user-facing language, but that means we need to possibly generate multiple versions of our not-quite-tail-recursive functions, for all different kinds of parameters. This also has to do with a limitation in Javascript, where objects are always passed by reference, but primitive values never are.

So now we need to additionally implement monomorphisation and inlining (or alternatively make everything run slow because you box every integer), because you may need to generate multiple versions for foldr, depending on the kind of function f is. Also, type information in the Elm compiler does not live long enough to do any of this…

I gave up.


Edit: This may be one of the things that may be easier in WebAssembly than Javascript - I still haven’t found a good encoding for references to values (numbers) on the stack in Javascript yet, or rather a general encoding for references that isn’t slow. If there was one, the first rule would be lifted, meaning all values follow the same rules and can be references. This in turn would then allow us to only generate 1 JS function per Elm function again.

Thanks for the replies everyone!

To be honest I just thought of this as a potential enhancement to a few functions in List.js. I brought up more general stuff and of course we’re going to chat about that. But realistically I thought this was about that one JS file.

I was only vaguely aware of the elm-optimize-level-2 project, looks very good! I wouldn’t have thought of code mods in this case so I’m curious. A code mod script sounds like something for modifying generated JS compiled from Elm, rather than code that’s written in JS in the first place. So I’m wondering why this is the approach being suggested. I haven’t been on the Elm community sites much this year so I might be out of touch.

@evancz you mention allocation and heap layout.
If you were wondering whether there’s just loads of allocations or loads of garbage or something, then no, there isn’t. Just one garbage Cons cell. Otherwise similar to foldl.

Here’s the code:

var $author$project$Main$moduloConsMap = F2(function(f, xs) {
  var tmp = _List_Cons(undefined, _List_Nil);
  var end = tmp;
  while (xs.b) {
    end.b = _List_Cons(f(xs.a), _List_Nil);
    xs = xs.b;
    end = end.b;
  }
  return tmp.b;
});

It pre-allocates a Cons cell (tmp) before starting. This is an extra Cons in front of the list that we’re going to return, so the initial Cons ends up being garbage. We just use it to keep track of the head of the list.

The while loop iterates down the xs list from head to tail, creating a new Cons for the output list at every step. Each step mutates the tail of the previous step. That’s how it avoids having to reverse the list, or iterate over it backwards.

So for every step it’s just allocating a Cons cell plus whatever the function f allocates. That’s why I say it’s like foldl.

The main difference is that the tail of each Cons cell points to a newer Cons cell, rather than an older one. This is the less common direction for pointers to go. v8’s GC will track old-to-new pointers if they cross from one generation to another. But that seems unlikely within one list. Should be easily worth being able to use a while loop!

1 Like

In elm-optimize-level-2 we can essentially adjust any of the JS that Elm ultimately produces. There’s not really a distinction about who writes it originally, compiler or human. It just reads the JS output of the Elm compiler and makes a bunch of adjustments.

Code mods can do fancy AST transformations, but they can also just do a simple find-replace when we want to try out a new handwritten implementation for something.

The general benefits being:

  1. People can use the optimization now in real projects
  2. We can gather real world data on performance.
  3. We can screen for issues in people’s real Elm projects.
4 Likes

Thanks Matt,
Yep I understood that it was possible for JS source, I was wondering why it was considered more desirable even in that case. Thanks for the explanation, I’ll look into it!

1 Like

Just to close off on this, it’s been merged into elm-optimize-level-2

19 Likes