WebGL and runtime performance in a functional paradigm

TL;DR: Performance is important for WebGL. Repeated allocations harm performance. Recycling variables can help.

I’ve worked as a WebGL graphics engine engineer for almost 5 years, and performance was a constant concern. Between puke-free VR, a better mobile experience, and just fancier graphics, there was always a reason to chase better performance. This is why I’ve never done much with WebGL in Elm, despite liking both Elm and graphics: a language that relies on constant allocation and garbage collection seemed like a bad start for a “serious” WebGL application. The lack of render-to-texture (RTT) would also have been a problem, but that’s a different matter.

The issue resurfaced for me a few months ago at a presentation by unsoundscapes about his physics engine, and in particular the fact that it ran faster with record-based linear algebra than with the elm-explorations/linear-algebra package.

The reason for this was obvious to me: allocation of objects and arrays is faster than allocation of typed arrays in JavaScript. I used to choose between regular and typed arrays on a case-by-case basis, depending on their longevity. A test to show the difference:

var nIterations = 5000000;

var t0 = performance.now();

for (let i = 0; i < nIterations; ++i) {
  var v3 = {x:1,y:2,z:3};
}

var t1 = performance.now();
console.log("Object creation took " + (t1 - t0) + " milliseconds.");

var t2 = performance.now();

for (let i = 0; i < nIterations; ++i) {
  var v3 = new Float32Array([1,2,3]);
}

var t3 = performance.now();
console.log("Typed array creation the Elm way took " + (t3 - t2) + " milliseconds.");

var t4 = performance.now();

for (let i = 0; i < nIterations; ++i) {
  var v3 = new Float32Array(3);
  v3[0] = 1;
  v3[1] = 2;
  v3[2] = 3;
}

var t5 = performance.now();
console.log("Typed array direct creation took " + (t5 - t4) + " milliseconds.");

var t6 = performance.now();

for (let i = 0; i < nIterations; ++i) {
  var v3 = [1,2,3];
}

var t7 = performance.now();
console.log("Regula array creation took " + (t7 - t6) + " milliseconds.");

Resulting in 1120ms, 4070ms, 5232ms and 1578ms. In other words, allocating plain objects is a lot faster. Typed arrays, on the other hand, are faster for computation, especially when SIMD instructions are available.
In Elm, every computation comes with an allocation, and generally speaking allocation is the expensive part of a program, since generic memory managers are terribly complex.

Again I did a little experiment to illustrate how much of a difference this makes. Setting the x value of 5 million vectors in an Elm program takes 3637ms. I cheated a little to see how far this can be reduced by removing the allocation:

// Original kernel code: allocates a new typed array on every call.
/*var _MJS_v3setX = F2(function(x, a) {
    return new Float64Array([x, a[1], a[2]]);
});*/

// “Cheating” version: mutates the vector in place, no allocation.
var _MJS_v3setX = F2(function(x, a) {
    a[0] = x;
    return a;
});

We get 827ms, the cost of just the computation we’re interested in (plus the surrounding loops and so on). Of course, setting the x value is one of the least computationally intensive linear algebra operations there is, but I don’t think I need to show that allocation is relatively expensive; this is a well-understood property. In fact, prominent JS linear algebra libraries use out arguments to avoid allocations. C libraries, on the other hand, tend to pass vectors by value, avoiding the heap altogether.
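To make the out-argument style concrete, here is a small sketch in the spirit of such libraries (the names are illustrative, not any library’s actual API):

// Out-argument style: the caller owns the output buffer and can reuse it,
// so the hot path performs no allocations at all.
function v3add(out, a, b) {
  out[0] = a[0] + b[0];
  out[1] = a[1] + b[1];
  out[2] = a[2] + b[2];
  return out;
}

var a = new Float32Array([1, 2, 3]);
var b = new Float32Array([4, 5, 6]);
var scratch = new Float32Array(3); // allocated once, recycled every call
v3add(scratch, a, b);              // writes into scratch, allocates nothing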

In short: I would not currently consider Elm for any “serious” WebGL work, because its immutable (const-type) implementation forces allocations that harm performance. I love Elm, but this is just not its strong point right now.

What to do?

I see a few potential solutions, there may be more.

1. Do nothing

WebGL does not have to be an important part of Elm. In that case, I think it would be fair to add a warning about this performance issue to the WebGL library documentation, to avoid disappointing people down the line.

2. Replace native linear algebra with records

This is not actually a solution, as allocation is still expensive, but it would alleviate the problem a bit.

3. Cheat with allocation

There are a bunch of optimizations you can apply to allocations. When working on WebGL professionally, I kept arrays of pre-allocated objects and pulled/pushed objects from them to avoid allocations. Elm could do this using reference counting. It might be difficult to figure out which types should get these pre-allocated pools; doing it only for small product types (such as vectors) might be a good heuristic. Profiling and annotations could further expand the use cases, but that doesn’t sound very Elm-like.
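For illustration, a minimal hand-rolled sketch of such a pool (not any particular library):

var pool = [];

// Reuse a previously released vector when one is available; once the pool
// is warm, steady-state frames perform no allocations at all.
function acquire(x, y, z) {
  var v = pool.length > 0 ? pool.pop() : { x: 0, y: 0, z: 0 };
  v.x = x; v.y = y; v.z = z;
  return v;
}

// The caller promises not to touch v after releasing it.
function release(v) {
  pool.push(v);
}

var tmp = acquire(1, 2, 3); // recycled, not allocated, once the pool is warm
release(tmp);               // returns to the pool instead of the GC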

4. Introduce linear types

Linear types are ‘consumed’ when passed to a function. They are old and well understood, but they don’t fit into many languages. Rust is built around them, Haskell gained them recently, Idris has them in the form of uniqueness types, and even C++ has something similar in unique_ptr.
I believe Elm is sufficiently clean and simple that it could leverage linear (or rather, uniqueness) types without any real issue.
Essentially, if a type is consumed, you can safely recycle its allocated memory in a function, removing the need for allocation and garbage collection.
The downside is the introduction of a new concept to the language, possibly requiring new syntax; the upside is that it would work with anything. As a bonus, this could be used for render-to-texture, safer random seeds, and similar inherently ‘consuming’ operations.

5. Pretend some const types are linear

If a value of a const type is referenced from only one place (let’s call this unique), it can safely be recycled, just like a linear type, as nothing else has access to that memory location.
Const types can be made unique by taking a deep copy whenever a second reference is required. Again, this could work well with small product types, although I’d expect programmer annotations or profiling to be more powerful. This solution can easily be combined with #3, removing the need for reference counting.
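To illustrate the recycling half of this idea, here is a kernel-style sketch; the refs bookkeeping is hypothetical, nothing like it exists in the runtime today:

// Hypothetical variant of the kernel function above: if the runtime knew
// this was the only reference (a.refs === 1), it could mutate in place.
var _MJS_v3setX = F2(function(x, a) {
    if (a.refs === 1) {
        a[0] = x;  // unique, so nothing else can observe the mutation
        return a;
    }
    return new Float64Array([x, a[1], a[2]]);  // shared: copy as usual
});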

Finally

To clarify: I’m not saying Elm should do X or that it is doing Y wrong. I simply like theory on FP and performance. Would love to hear some thoughts on this.

I have a few more ideas about cool (IMO) stuff you could do with/to Elm code, if this kind of post is undesirable please let me know and I will keep it to myself in the future.

21 Likes

@Oscar thanks for writing it down!

I totally get your point about performance. However, I would like to talk separately about WebGL and about projects that require a lot of linear algebra computations 60 frames per second (like elm-physics).

WebGL in Elm has a certain model with its own limitations, but it is still possible to make graphics performant by leveraging the fact that meshes are cached on the GPU, and the only changing data is the uniforms that parametrize the scene. This is already documented in the readme.

It is a different problem if the thing we want to animate requires a lot of linear algebra computations; in the end we may or may not choose to render the result with WebGL. That’s why I think it would make sense to focus on the performance of linear algebra separately.

While linear types are above my level of competence, I already suggested solution number 2 in my second post about elm-physics. After that post I spent some time benchmarking record-based linear algebra against the typed-array-based one in this project. Unfortunately these microbenchmarks showed weird results that I could not trust (things just cannot get 1000x faster). That’s why I think it would be better to have more projects like elm-physics that push linear algebra to its limits and can thus be used for such benchmarks.

Your main point is that it doesn’t make sense to use WebGL in Elm because it would perform badly; my point is that we should try to implement things regardless. This way we get real performance numbers that help map out the limits, or show that certain improvements are necessary. Otherwise this is a chicken-and-egg kind of problem.

9 Likes

I may have skipped a few too many reasoning steps for the sake of brevity.

Technically you are absolutely right: this proposal would affect any project where the performance of evaluating Elm code matters (rather than DOM changes), and it is separate from WebGL.

My reasoning is that websites/frontends are generally a presentation of sorts, and the only kind of presentation I can think of (that isn’t extremely niche) that requires a lot of performance is WebGL-based.

It’s a similar story for linear algebra: proposals 3-5 would be applicable to any type.
If you change an object’s position vector using const types, for example, you also need to reallocate and shallow-copy the containing object and every structure up to the scene root, not just the position. In retrospect, I probably should not have focused on linear algebra so much.
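In plain JavaScript terms, the persistent update looks roughly like this (the names are illustrative):

var scene = { player: { position: { x: 0, y: 0, z: 0 }, health: 100 } };

// Changing one coordinate with const types means shallow-copying every
// record on the path from the scene root down to the vector:
var newScene = Object.assign({}, scene, {
  player: Object.assign({}, scene.player, {
    position: { x: 5, y: scene.player.position.y, z: scene.player.position.z }
  })
});
// Three allocations (vector, player, scene) to change a single number.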

I don’t mean to say nobody should use WebGL in Elm at all, but as it stands I would advise against using WebGL in Elm as the basis of a 3D graphics project, because such projects tend to demand a lot from the CPU.

Caching meshes on the GPU often isn’t enough, although one should definitely do it.
3D applications tend to have a lot of things to update every frame: camera or shadow-casting-light movement requires recomputing model-view matrices for every object in the scene, ray casting is usually needed for interaction, and animations (even simple translations) require some work on the CPU.
Some applications also have CPU-intensive components such as physics or path planning.
I’m speaking from experience when I say that CPU-bound performance is a real issue, and in complex projects there are a lot of tasks that cannot or should not be offloaded to the GPU.

Linear types

I would like to quickly explain the basics of linear types, as far as they are relevant to the proposals.

Essentially, a linear variable can be used only once. So if you have a linear x, you can write let y = f x but not let (y, z) = (f x, g x); the compiler would complain because x is used more than once: f x has ‘consumed’ x, so there is no x left for g x. When a linear variable is ‘consumed’, the compiler knows that its memory can be freed or recycled.

4 Likes

@Oscar thanks! Since my only experience with WebGL is with Elm, I never had problems with its performance, nor anything to compare it to. But I learned a lot from this post.

Speaking of projects that push Elm forward: have you seen Herzog Drei by @xarvh? It is another example with many moving objects, rendered with WebGL. It could benefit from the proposed improvements, but maybe the performance is already good enough for this kind of project.

I don’t have enough knowledge about the internals of V8: what if the JavaScript engine is already doing some optimizations that make things run faster? How do we justify that the proposed improvements are necessary without more real examples that hit the performance limits?

2 Likes

I personally think the active exploration happening with WebGL in Elm is great, and it is strong motivation for improving performance. I really appreciate and encourage the work going on there!

Based on the original post, it sounds like OP is interested in hearing about specific performance ideas. Let’s stick to that.

I think there are some very interesting possibilities for Elm when it comes to performance. Some examples include:

  • Escape analysis to figure out when values can be stack-allocated (or broken into smaller parts to avoid allocating the outer record at all). As far as I can tell, this idea has not been fully explored in a widely used ML variant. It would mean less allocation and fewer young-to-old generation copies. (My understanding is that V8 tries to avoid object allocations in a similar way when it does its inlining.) See the sketch after this list.
  • Reference counting in the old generation of the GC. As of 0.19, there are no longer cyclic data structures, which should make this a decent bit easier to achieve.
  • Monomorphizing as in MLton. MLton had the downside of long compile times in all cases, because its FFI specifically relied on everything being monomorphized. It seems that this style of optimization could live behind a flag, so the typical case stays quick (as long as language features stay independent of the runtime representation).
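To make the escape analysis idea concrete, here is a JavaScript-level sketch of the effect (illustrative code, not actual compiler output): when a temporary record never escapes a function, it can be broken into plain locals and never allocated at all.

// Before: the temporary record `d` is heap-allocated on every call,
// even though it never escapes the function.
function distance(a, b) {
  var d = { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z };
  return Math.sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
}

// After escape analysis / scalar replacement: the record is split into
// plain local floats and nothing is allocated.
function distanceScalarized(a, b) {
  var dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return Math.sqrt(dx * dx + dy * dy + dz * dz);
}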

I think any reference counting idea that relies on checking “does this have one reference or more-than-one reference?” introduces a branch, which may mean it sounds better than it is in practice.

Anyway, the point is that I think there are a bunch of neat things that can be done without changing the language itself, and they are rather unexplored in widely used ML variants. If you can control the GC implementation as well, there are even more cool things that can be done.

Again, let’s stick to this aspect of the conversation. Maybe other people have references to other techniques and research. Exploring and highlighting practical results (not already covered) would be a great outcome of this thread.

8 Likes

FWIW, the main performance bottlenecks in Herzog Drei were:

  1. Dict.get used in the Dijkstra pathfinding algorithm.
    Even after I switched from a Dict (Int, Int) a to a Dict Int a, I had to limit the map size to keep the processing time acceptable.
    The same algorithm implemented in JavaScript was about 80 times faster, and running it through a port is still on the TODO list.
    I assume that the new Array implementation might be faster than Dict for random access, but I haven’t tried it yet.

  2. The need to re-allocate the game state for every single modification that is applied to it.
    This quickly becomes noticeable: the frame rate drops as the number of entities in the game increases.
    If I have 1000 entities, that’s more than 1000 re-allocations every 16ms.
    There are a few tricks that can be done to reduce the number of game state modifications, but they help only so much.
    While I don’t know much Haskell, I understand that it has a monad (the ST monad, I believe) that allows values to be mutated within it, reducing the need for re-allocation.

5 Likes

I’d certainly be interested in checking out a few of those. Is there any particular approach you think is best? I imagine some of these will eventually require testing; would the first step then be something like creating a collection of meaningful tests/samples, like Haskell’s nofib?

I think the reference-counting branch could be optimized away (if necessary) using conditional writeback, similar to what GPUs do. I.e. rather than having a branch in the instruction path, you always do the computations but conditionally discard the result instead of writing it to memory.

In this case it would work by keeping a large vector of memory locations to be freed, used much like a stack. When a reference count decreases, we write the pointer’s location to the stack and increment the stack’s top pointer, both conditionally on numRefs == 0. There would still be a conditional branch for when the vector becomes full, but this should provide better predictability, reducing the pipeline flushes caused by mispredictions.

If conditional writeback is absent (which I’m pretty sure is the case in JS at least, no idea about WASM yet), it can be simulated by dumping the unwanted writes in an unused location, e.g. the first element of the vector.

I’ve written a bit of code to illustrate how this would work. I’m representing booleans as ints that are either 0 or 1; in JavaScript this can be done by coercing a comparison with a bitwise op. Anyway, here it is:

var vector = new Int32Array(1024);             // free-list stack; slot 0 is the dummy write target
var topElementOffset = new Int32Array([0, 1]); // [0] = scratch slot, [1] = real top (starts past the dummy)

var canFree = (numRefs === 0) | 0;                   // 1 if the refcount hit zero, else 0
var writeLocation = canFree * topElementOffset[1];   // real stack top, or the dummy slot vector[0]
vector[writeLocation] = referenceLocation;           // the (now unconditional) write
topElementOffset[canFree] = topElementOffset[1] + 1; // bump the real top, or scribble on the scratch slot

canFree represents the condition as 0 or 1, something we can multiply by.
writeLocation is then either the real top of the stack or the dummy first element of the vector.
The write itself always happens, but only lands on the stack when canFree is 1.
The last line either increments the real top pointer (topElementOffset[1]) or scribbles on the scratch slot (topElementOffset[0]).

1 Like

I’m definitely keen to see Elm become an awesome language for working with WebGL and general 3D geometry/graphics/physics. Based on my experience optimizing geometric code in Elm, escape analysis would be huge. The most dramatic speedups I’ve achieved have all come from avoiding object allocation, such as temporary vectors used as part of a computation; often this means manually inlining code to work on individual Float values.

Take for example CubicSpline3d.derivativeMagnitude, which was a bottleneck in some graphics code I was working on. It’s now implemented in a way that makes zero allocations, which led to a dramatic speedup (at least an order of magnitude) over a previous version that allocated some temporary vectors. If the Elm compiler were able to allocate temporary vectors on the stack instead, I think you could achieve a similar speedup without doing all the manual inlining.

10 Likes

Since it looks like this topic is about to reach its natural end, I’d like to start a repository to explore some of the mentioned problems and proposed solutions more fully. First as a purely theoretical exploration, but if all goes well I’d eventually like to fork and modify the Elm compiler to see how well they work in practice. Any comments or objections?

I’m new to OSS collaboration so I don’t know what the right etiquette is.

5 Likes

It’s different in every community. I’d recommend Code is the Easy Part and this post for info on this particular community.

My technical comment would be that compiler speed is a priority, especially for projects with 100k+ lines of code. So better runtime performance for P% of the community must be balanced against the compile-time cost to the (100 - P)% of the community that does not benefit appreciably from it. I do not think that hiding things behind flags is a full solution to this. More generally, there are no guarantees when working on projects like this, so think of your work as personal exploration. The perf numbers may be good, but the compile-time cost, code quality, code complexity, working relationships, etc. may mean that the code is not right for the project overall, even though the learnings are worth sharing.

Also, I was thinking more about the “see if there is only one reference” trick. The problem is that there may be only one direct reference but many transitive references to the value, so this check is a bit trickier than I suggested before. You’d have to be sure the value was allocated locally to guarantee that there are no transitive references, which makes the trick a bit more limited.
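A small JavaScript illustration of the trap (illustrative, not runtime code):

// `position` has exactly one direct reference (the field inside `player`),
// so a naive refcount-1 check would call it recyclable. But `player` itself
// is shared, so recycling `position` in place corrupts the old snapshot:
var player   = { position: { x: 0, y: 0, z: 0 } };
var snapshot = player;            // a second, transitive path to position

player.position.x = 5;            // "recycled" in-place update...
console.log(snapshot.position.x); // 5 -- observable through snapshot too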

Anyway, check out those links. Those are the best resources we have for this sort of thing!

1 Like

Not directly applicable, because the Elm compiler isn’t written in Elm and Elm is routinely described as not targeting general-purpose programming, but a useful point to think about when weighing code optimizations against compiler complexity. (Evan may be surprised to find that I am actually on his side on something.)

Compiler Optimizations Should Pay for Themselves (M. Franz, 1994)

Mark

2 Likes

I’ve started a repo, but I’m still compiling all the knowledge from this thread and some extra literature.

2 Likes

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.