Elm core libs in WebAssembly

I’ve been working on an experiment to implement parts of Elm in WebAssembly. I’ve discussed it with a couple of people, but wanted to share around a bit more and maybe spark some discussions.

This repo implements parts of elm/core in C rather than JS, and compiles it to WebAssembly. It includes:

  • Byte level implementations of all of Elm’s data types
  • First-class functions (currying, higher-order functions, etc.)
  • The basic arithmetic operators (+,-,*,/)
  • Record updates and accessors
  • A working garbage collector designed for Elm

Demos (very low level and basic ones), and docs, are linked in the README.

I’m also working on a fork of the Elm compiler but I don’t really have anything worth sharing on that just yet.

By the way, I should probably also be clear that this is an experimental hobby project by some random guy. There’s nothing official about it!

Let me know what you think!

55 Likes

I’ll maybe highlight two parts of the docs that people might have missed if they just scanned the README.

There’s a discussion on using UTF16 vs UTF8 for strings that might interest some people. One of the things I’m thinking about is that browser Web APIs are implemented using UTF16, so does that push us that direction even in Wasm? There’s a bit of guesswork in that so if anyone has knowledge on this and sees a gap in the reasoning, let me know! It’s the first link below.

There’s also an analysis of what kind of Garbage Collector makes sense for Elm. There have been a few lively threads on here about that topic in the past. Anyone see an angle I’ve missed on this? See second link below.

3 Likes

Very interesting!

I remember hearing about a StringView proposal for JS. As I recall, the goal was to let folks work with UTF-8 strings and give them to DOM APIs directly. I also recall hearing that you may get some performance benefits from something like this if the underlying implementation wants UTF-8 anyway.

All I can find now is StringView, so perhaps I am misremembering!

2 Likes

The document about GC is really nice!

It reminds me of a neat idea that I have heard in various places.

Double-buffer the View

The view function has pretty interesting allocation behavior, and it may be possible to “double buffer” to avoid ever scanning, copying, or moving any Virtual DOM values.

Say there are two memory regions. You could have a situation like this:

  1. Calling view allocates into region 1. Do the first render.
  2. Calling view allocates into region 2. Do a diff and update DOM.

At this point, none of the virtual DOM values in region 1 are alive. You can just move the allocation pointer back to the beginning of region 1 and start writing over it!

  1. Calling view allocates into region 1. Do a diff and update DOM.
  2. Calling view allocates into region 2. Do a diff and update DOM.
  3. etc.

I think this is safe as long as the allocation regions are only used for values of type Html and Attribute. The presence of lazy means you may have values you want to keep around, so any closures from code like onInput (Changed "thing") would need to be in the normal heap and the mark phase would have to go through the two view regions.

My profiling always suggests that rendering is the expensive part, so this seems really promising to me. One neat side benefit is that you end up having fewer collections for normal heap. It should only contain values from update and subscriptions which should be a lot less.

I think the root idea here is from Mozilla people and we discussed it with Richard and Robin a long time ago. Anyway, perhaps it is interesting to you!

8 Likes

Thanks Evan! :smiley:
Hadn’t heard of StringView before but I’ve played with TextDecoder as part of this project.

you may get some performance benefits… if the underlying implementation wants UTF-8 anyway

This is what I’m concerned about! My impression is that hardly any of the underlying implementations do want UTF-8! So the performance argument goes the other way.

My understanding is that all of the string-related C++ code in all the browsers work on UTF-16. I remember reading a comment from a Servo developer on Hacker News, who said that since it was a new ground-up browser development, they wanted to go with UTF-8 for the internal string stuff. But all the W3C specs are based on UTF-16 and they kept running into problems. In the end they had to abandon that idea and switch to UTF-16. They just couldn’t take the performance hit with all the conversions. Unfortunately I haven’t been able to dig up that link again!

If that’s right and Elm uses UTF-8 as the underlying representation in the String library, then it will pay some performance penalty for it, because there will be lots of conversions to UTF-16 somewhere in order to interface with browser C++ code. How big is that penalty? I don’t know. But why pay it? I guess UTF-8 saves you memory though. Which matters most? Dunno, needs benchmarking!

I’ll provide some of the evidence that led me to this conclusion. If you look in the W3C specs you’ll find lots of references to DOMString. Like for example when you do document.createElement('div'), the bytes that representing that string 'div' must be a DOMString.

DOMString is defined to be UTF-16 here
The spec for the Document interface is here. It specifies that a Document has a method called createElement that has an argument tagName whose type is DOMString.

If you browse around those specs, hit ctrl-F, and search for DOMString, your screen will light up. It seems crazy that they don’t just leave this up to browser implementations.

All of this was a huge shock to me by the way. I thought that, since HTML documents are usually transmitted over the wire as UTF-8, then surely the DOM would also be based on UTF-8? Nope! That’s not how it works! I guess it gets converted during HTML parsing? I’m not sure. I’d love to know, but this kind of detail is pretty hard to find.

I would love to be wrong about this so if someone knows I am, please tell me!

No, Elm just uses JavaScript strings which are UTF-16. See e.g. 26.2.1 at Unicode in ES6

Strings are internally stored as UTF-16 code units.

1 Like

Thanks yes I know that’s the case for the current JavaScript implementation. I was talking about a potential future WebAssembly implementation of Elm.

For context, this project was inspired by a (now two-years old) list of potential projects, in which UTF8 is implied to be the default choice for string encoding.

That’s why I wrote up a document, posted above, about whether a future Wasm implementation of Elm should use UTF-8 or UTF-16. I sort of challenge the UTF-8 assumption and give some arguments in favour of UTF-16. Basically I believe browser internals use UTF-16 so it would be faster.

That is the context in which I said "if that’s right and Elm uses UTF-8 … "
The word “that” refers to my belief that browser internals use UTF-16

1 Like

The trick is that immutable data NEVER points to younger values… At any time we can scan the last values created and free those that are not pointed to from the same set

A while back there was a thread about improving the performance of List.append, which lead me to research methods in how to optimize (essentially) right folds. As most CS problems, this was already solved in the seventies and is called Tail recursion modulo cons (even though it’s not limited to Cons)

Here is a concrete example (in C) of what this could look like:

void List_Map (void* (*f)(void*), List *list, List *result) {
    // case list of 
    if (list->tag == 0) { // [] ->
        List_Nil(result); // []
    } else { // x :: xs ->
        void *x = list->head;
        List *xs = list->tail;
 
        // f x :: map f xs
        List *tail = malloc(sizeof(List));

        // +++ we swapped the execution order of the next 2 lines +++

        // List_Cons is a constructor that doesn't dereference its arguments,
        // and can therefore not observe that we did not fully construct the tail yet
        List_Cons(f(x), tail, result);
        // now this function is tail-recursive!
        //`clang -O` and `g++ -O2` already compile this to the "loopified" version!
        List_Map(f, xs, tail);
    }
}

I wanted to try and implement a prototype version of this based on the Elm compiler, but Evan seemed to be in the middle of refactoring stuff for 0.19.1, so I decided to wait until the release before I tried it again.

Of course, this is all extremly hypothetical at the moment, but I think it would be a shame to prevent these kinds of optimizations. But maybe it’s worth it!

If we could always reverse the allocation order like this, would it be still possible by looking the other way instead?


My idea for a garbage collector for Elm always looked like this:

  • There is a single “entry” reference, the current model. The compiler can emit special code to traverse the old model and the new one to find differences.
  • 2 memory arenas: 1 for old values, 1 temporary for new ones.
  • If we find the same reference in a field, we can skip it entirely.
    If we find a different reference in the new model, mark the old value as dead, and the new value as live. Traverse recursively.
  • Free all dead values from the old arena, copy all live values from the new arena to the old one.
  • Free the whole new arena.

I guess this is similar to what Ocaml / V8 are doing?

1 Like

Thanks @jreusch, great comment!
I had heard the term “tail call modulo cons” but had never dug in to understand what it was. I just spent a while looking at it on Wikipedia trying to get my head around it! Interesting!
I’m pretty sure it doesn’t work with the GC I’ve built, for the reason you pointed out - it creates “old values” that point to “new values”. I’ve heavily relied on the assumption that that’s impossible, and breaking that assumption messes everything up. :frowning_face:

On your idea for the garbage collector, yeah it sort of sounds like semi-space copying but with diffing instead of just wiping the old area completely clean.
I can see that it could save copying the parts that don’t change. But copying a large sequential block of memory all in one go is super-fast. CPUs can spot simple patterns like this and use hardware optimisations for them. They do stuff like branch prediction and pre-fetching. Anything that makes it more irregular might not make it faster!

The algorithm I found for marking actually takes advantage of pre-fetching. All the “mark” bits are kept in one chunk of memory rather than being embedded in the values themselves. So there’s less “hopping around” in memory. It’s pretty neat. I found it in the garbage collection handbook.

Also I think the memory might get fragmented? The scheme seems like it prevents you from ever filling in “gaps” left behind by the dead stuff. Those gaps would accumulate over time. If you compact it at any point, you lose the benefit of keeping the unchanged parts. Or I guess you could try to keep track of it all. I dunno, it’s not obvious to me that it’s a big enough benefit to be worth the complexity.

Oh wow! I had no idea! I need to go and think about how I could use this, thanks!

2 Likes

I’ve been thinking about this.
It’s pretty similar to @jreusch’s comments, but he focuses on the model and you’re focusing on the view.

One thought is what happens when the current virtual DOM region is the lower one (in terms of memory addresses) and the view suddenly gets much bigger (user expands a menu or whatever). Now the lower region needs to expand into where the upper region is, but you don’t want to overwrite that upper region. So you have to detect this condition and move the upper region further away to make room. Of course you can do it and big “proper” GCs do it. But it just seems like a fair bit of complexity and so far I’ve been avoiding that.
You’ve also got the same problem if the model, subscriptions etc are below the view regions and suddenly someone puts a massive string into the model. This is why I was trying to avoid “regions” altogether.
I think the main work this would save is moving the virtual DOM during compaction. But I’ve been thinking that the GC should happen in the idle time after rendering, perhaps triggered by window.requestIdleCallback(). Hopefully that would take it out of the critical path.
Interesting ideas for future research though!

On running out of space

My understanding of compiling C to JS is that you allocate a big ArrayBuffer and use that as your memory. So accessing a pointer is like memory[42] and writing to something out of bounds of the JS ArrayBuffer is the equivalent of a segfault.

Based on that understanding, here are two ideas of how to account for view needing more memory.

Idea 1: space at the end

Say memory is one ArrayBuffer of size 64kb. Set aside some reserved space at the end of the ArrayBuffer. Maybe 8kb for the two regions.

Say you have allocated into Region 1 successfully, but calling view with Region 2 runs out of space:

  • bail on executing view
  • bump the reserved space to 16kb

The newly acquired 8kb of space is now Region 2. The previously held 8kb is now Region 1, and it contains all the info from the old Region 1 within it.

So now you have 8kb for each view call without needing to move anything.

Note: If you cannot get 16kb because the heap is too big, do a GC and try agian. I suspect there is some protocol for detecting when the heap is totally full and more memory is needed, so this could be hooked into that as well.

Idea 2: new spaces

I don’t think you must limit yourself to one ArrayBuffer.

You could instead have Region 1 and Region 2 each be an ArrayBuffer of some default size. Let’s say they start at 4kb each. If any view ever runs out of space, you could allocate a new ArrayBuffer with twice the space and try again.

This essentially gives the bookkeeping tasks to the underlying JS VM which will make space for your three ArrayBuffer values. (One for the heap, one for Region 1, and one for Region 2.)

On motivation

The major benefit of an optimization like this would be (1) to have fewer GC pauses and (2) for each pause to be shorter since there are fewer live values in the heap. These give a separate set of benefits from doing collections at stategic times. For example:

  • Animations - That’s the primary situation you’d actually see stuttering due to GC since there is no idle time. It’s also a case where your Model is probably not changing too much and your view is generating a bunch of nodes really quickly.
  • Battery Life - By having fewer and cheaper collections, you end up doing less work overall. This could be especially nice on phones and laptops.

Doing collections during an idle time if the heap is likely to get full on the next Msg still makes sense, but it provides different benefits.


Anyway, this idea may not work for some reason or another, but it just seems promising and I figured I’d mention it!

5 Likes

Yep that all makes sense. It can definitely be done. Idea 1 is pretty much how I imagine it would work. I think I know how to implement all of it in terms of the big picture stuff. I might be just having an involuntary initial reaction to the amount of nitty gritty work that I know lies beneath the surface here! It took me over 6 months to build the simplest GC design I could come up with and get it to actually work! But sharing ideas like this, and looking at the big picture, was the whole point of starting this thread. This is a great idea that I hadn’t thought of, so thank you! Great to have your insights on where the big benefits are likely to be found.

As for idea 2, actually Wasm is currently limited to one ArrayBuffer but the spec allows for more in future. Swapping ArrayBuffers sounds fine on the JS side but I bet it would be a complicated thing for the C program to deal with. Also I like the fact that so far all of my C code can run as a native executable, without any JS involved.

I agree with the motivation too. The fastest work is work you never actually do. The idle time thing was more of a justification that the GC I already have is maybe OK for now.

I think my next steps will be about code gen and getting a proper Elm app working. Once I have that, it’ll be easier to profile things, which would inform this kind of optimisation stuff and provide a way to test it out.

3 Likes

Hey @evancz , I just spent the evening thinking about your “special GC tricks for VDOM” idea and I’m definitely warming to it!

One of the big ways I’ve kept this GC relatively simple is having only one region. The mark-compact architecture means I get a lot of the benefits that people normally get from multiple regions but without the book-keeping overhead. That was a big benefit of the design and I didn’t want to lose it.

But having thought through it a bit more, having a region for VDOM is definitely simpler than a region for anything else.

The big realisation is that in an Elm program, nothing outside the view function can have a reference to the VDOM! This is a huge deal. If I had pointers from the “normal” heap to some other heap region it would mess up a lot of the simplifying assumptions in the code. So that’s nice!

Also, the fact that nothing points at VDOM makes it easier to move around when you run out of space and have to reorganise the heap. Normally when you’re moving a value, you have to find everything that was previously pointing to it and update all those pointers to the new address. But nothing points into the VDOM! You do need to update the internal pointers (parent node to child node) but that’s just a constant offset because you moved them all by the same amount.

The one complication is that the VDOM can contain references to the model (I guess strings would be the most common case). So that’s a pointer between regions. It means that whenever you collect the normal region you have to search for pointers in the VDOM and update them. But at least I already have code to calculate where things in the normal region moved to, so that limits the extra complexity.

So I guess the main part of the implementation work is the decision logic about when to do a minor GC or a major GC, when to request more memory from the system, when to move the VDOM region around, when you can steal memory from one region and give it to the other… etc. That’s definitely more complicated with two regions than with one. But it’s not as bad as I’d feared.

Interesting stuff, thanks again for the idea!

1 Like

In elm 0.19 there is nothing actually preventing you to put some Html msg in your Model and then use those in the view function, you’re not actually forced to create nodes there unfortunately

1 Like

Ha! Very good point, I’d never thought of that!

I don’t think it breaks anything though, at least not the way I was thinking of implementing it.
The way it would work is that before calling the view function, the runtime would switch to allocating in the “VDOM” region. But when it calls update and subscriptions and init, it would go back to allocating in the “normal” region.

So what I was calling the “VDOM region” would be more accurately labelled “view region”. There would be all sorts of things in there along with the Html msg values, from whatever was used in calculating the VDOM itself.

This means the properties I mentioned still hold. Values in the view region can have references to values in the normal region, but not the other way around.

Oh… unless you create a value in your view function and put it into an onClick handler, that then gets sent back around to update… That seems to violate the assumption that anything created in view is immediately garbage after the next call.

OK it’s late where I am, I’m going to bed!

1 Like

Most well-formatted reply I’ve ever seen. Thanks for taking the thought and time to do that.

1 Like

I wonder, are all onX functions routed through on:

https://package.elm-lang.org/packages/elm/virtual-dom/latest/VirtualDom#on

In which case your compiler can have an intrinsic for on that allocates anything passed to it to the normal region - perhaps by copying-by-value at the moment of invoking on?

Yeah probably. I had the same thought afterwards.
There’s a solution for everything, but all these edge cases add to the complexity.

I agree. I would even consider a GC region for the VDOM / view garbage too specific. I’m thinking about the future applications for elm, e.g. web servers.

1 Like

Oh, I was not aware of that! I had assumed that I worked like Windows or Linux, where you can just VirtualAlloc or mmap if you ran out of memory or needed another arena.

If you do it the way Evan suggests, this should not be a problem!

The normal cycle in TEA looks like this:

view ----> update ---> view
             ^     |
             +-----+

Since Elm needs to keep the VDOM alive until the second call to view, every value that might be sent to update that was created during the first call to view is still alive.

update can however put the values it recieves into the model - effectively breaking the assumption that nothing allocated in view can be inside of the model. But I think you already handle this case as well: Values from view are definitely not part of the old generation, so the compact phase of the GC should just copy them anyways into the old generation if they survive update. This would require to always run the GC before calling view though.