Performance Optimization


I would like to record here, while it is fresh in my mind, the saga of how @folkertdev helped me make a crucial performance optimization in the text editor project I am working on. In my case, it made the difference between an editor that was very laggy and painful to use on documents of more than about 1000 words, and one that is quite snappy.

I believe this is a topic of general interest and important for building performant apps. Many in the community have something to say on the subject; perhaps we could gather this knowledge into a single shared document that others could use.

The Saga

Here is the story. I mentioned my problem to Folkert de Vries in a private chat, describing the lag and asking for advice on what turned out to be a completely wrong way of addressing the underlying issue. A few minutes later, Folkert wrote back saying that my app was spending a lot of time in the function ensureNonBreakingSpace, which was defined this way:

    ensureNonBreakingSpace : Char -> Char
    ensureNonBreakingSpace char =
        if char == ' ' then
            nonBreakingSpace

        else
            char

He suggested using instead this function:

    ensureNonBreakingSpace : Char -> Char
    ensureNonBreakingSpace char =
        case char of
            ' ' ->
                nonBreakingSpace
    
            _ ->
                char

This small change turned the app from laggy to snappy.

The Method

One might view this saga as an instance of magic, but it is instead a combination of art and science.

Step 1: Profile

Always do this! Your hunches, if you have any, will often be wrong, even wildly wrong. Here is how to do it.

  1. Open your app in Chrome. Then open devtools to the Performance tab.

  2. Click on the record button of the Performance tab.

  3. Play around with the app.

  4. Stop recording, and view the results using the Bottom-Up tab. Here is a screenshot from devtools:

Step 2: Think about it

You will notice that the function _Utils_eqHelp is taking up 28.4% of the execution time. To find out who is calling this function, click on the triangle to the left of this function. Here is what one sees:

Scanning down, one finds $jxxcarlson$elm_text_editor followed by $Editor$View$ensureNonBreakingSpace. This is the JavaScript incarnation of the Elm function defined this way in module Editor.View:

    ensureNonBreakingSpace : Char -> Char
    ensureNonBreakingSpace char =
        if char == ' ' then
            nonBreakingSpace

        else
            char

Seems rather innocuous, no? But it is in fact the culprit. Folkert suggested using this code instead:

    ensureNonBreakingSpace : Char -> Char
    ensureNonBreakingSpace char =
        case char of
            ' ' ->
                nonBreakingSpace
    
            _ ->
                char

This does not seem like much of a change, but under the hood it is: the generic equality check is expensive. When I replaced my code with Folkert's, the change was evident: laggy had been replaced by snappy. When I profiled the code a second time, _Utils_eqHelp did not even appear on the first page. Success!

Here was my sign-off to Folkert:

Wow! I tried your version of ensureNonBreakingSpace — it results in a HUGE performance gain. The marker for that function doesn’t even show up on the first page of the performance results. Number 1 is MinorGC at 31% and number 2 is _VirtualDom_diffFacts at 7.7%. From the user perspective: snappy editor performance on the 3000 word file.

Conclusions and comments

  1. Profile to find where the real bottlenecks are. Don’t optimize before you know this!

  2. Work to understand why the bottleneck code is taking such a large share of the execution time. Understanding how the compiled JavaScript code works is a big help.

  3. Optimizations are often needed to make apps performant and pleasant to use.

Perhaps at some future date, the compiler will be able to do optimizations of this sort.


Some elaboration on why this gives such a big improvement

There are two ways that Elm compiles the equality operator ==. It can't always compile to JavaScript == because, for instance, dictionaries can have different internal structures but store the same keys and values. So for some types, Elm uses custom equality, which is defined in _Utils_eq. But calling that function is much slower than the built-in JavaScript == operator.

Additionally, the Elm compiler doesn't currently have type information available when generating code. This means that when code is generated for an expression x == y (where x and y are Elm variables), it must be assumed that x and y are of a type that uses custom equality. This is the reason you may find x == y written as (x - y) == 0 in high-performance code: when one of the arguments is an integer literal, Elm emits the JavaScript equality operator.
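The cost shape of the (x - y) == 0 trick can be sketched in JavaScript. This is a hypothetical, simplified stand-in, not actual Elm compiler output: the real _Utils_eq recurses over structured values, but the point here is only the function call versus the built-in operator.

```javascript
// Simplified stand-in for Elm's generic equality helper (the real
// _Utils_eq recurses over records, lists, custom types, ...).
function _Utils_eq(x, y) {
    return x === y;
}

// `x == y` on values of unknown type: Elm must call the generic helper.
function sameInt_generic(x, y) {
    return _Utils_eq(x, y);
}

// `(x - y) == 0`: a comparison against the integer literal 0, so Elm can
// emit the plain JavaScript operator with no call overhead.
function sameInt_trick(x, y) {
    return (x - y) === 0;
}
```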

So for some expressions, when the type is clear from the expression itself, the built-in JavaScript == is generated; otherwise the _Utils_eq function is called. There is a big speed difference between a JavaScript built-in and the function call (or several calls, since _Utils_eq calls _Utils_eqHelp).

Currently, a comparison with a character literal isn't optimized the way an integer literal is. By using an Elm case expression, we make the compiler generate a JavaScript switch statement. This is again much faster because there is no function call involved.
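Roughly, the two versions of ensureNonBreakingSpace compile to code shaped like the sketch below. This is a hypothetical simplification, not the actual Elm-generated JavaScript; the real _Utils_eq/_Utils_eqHelp pair is more involved, but the contrast (helper call per character versus a bare switch) is the same.

```javascript
// Simplified stand-in for Elm's generic structural equality helper.
function _Utils_eq(x, y) {
    if (x === y) return true;
    if (typeof x !== "object" || x === null) return false;
    // The real helper recursively compares records, lists, custom types...
    for (const key of Object.keys(x)) {
        if (!_Utils_eq(x[key], y[key])) return false;
    }
    return true;
}

// Roughly what `if char == ' ' then ...` becomes: a helper call per character.
function ensureNonBreakingSpace_if(char) {
    return _Utils_eq(char, " ") ? "\u00A0" : char;
}

// Roughly what the `case` version becomes: a plain switch, no call overhead.
function ensureNonBreakingSpace_case(char) {
    switch (char) {
        case " ":
            return "\u00A0";
        default:
            return char;
    }
}
```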

Note that in step 2, the ensureNonBreakingSpace you see in the image takes only a small percentage of the time. But when you unfold the _Utils_eq call further, you see ensureNonBreakingSpace with a much higher value. So sometimes you have to dig a little to find the function names that mean something to you.


Great write up! (x - y) == 0 is a trick that I’ve been using in elm-physics too.

case .. of works faster than chaining operations on the Maybe type. That’s why elm-physics has a bunch of nested case .. of.
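The Maybe-chaining point can be sketched in JavaScript. This is a hypothetical simplification of the shape of compiled Elm code (the real output uses different tags and helpers): pipelines through Maybe.map and Maybe.withDefault go through several curried function calls and intermediate Maybe values, while a nested case compiles down to a direct branch.

```javascript
// Simplified stand-ins for compiled Maybe helpers (hypothetical shape).
var $elm$core$Maybe$withDefault = function (fallback) {
    return function (maybe) {
        return maybe.$ === "Just" ? maybe.a : fallback;
    };
};
var $elm$core$Maybe$map = function (f) {
    return function (maybe) {
        return maybe.$ === "Just" ? { $: "Just", a: f(maybe.a) } : maybe;
    };
};

// `maybe |> Maybe.map ((+) 1) |> Maybe.withDefault 0` builds an
// intermediate Maybe and goes through several function calls...
function viaChaining(maybe) {
    return $elm$core$Maybe$withDefault(0)(
        $elm$core$Maybe$map(function (n) { return n + 1; })(maybe)
    );
}

// ...while `case maybe of Just n -> n + 1; Nothing -> 0` becomes a
// single direct conditional.
function viaCase(maybe) {
    return maybe.$ === "Just" ? maybe.a + 1 : 0;
}
```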

Another performance trick is to avoid record update syntax and manually write out all the properties, because record update has a slower JavaScript implementation that iterates over the fields and copies the values.
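The record-update point can be sketched as follows. The helper below is a hypothetical simplification of Elm's runtime record-update function (the real one is named _Utils_update but differs in detail): the update syntax pays for a call plus a field-copying loop, while writing out the fields compiles to a plain object literal that JS engines optimize well.

```javascript
// Simplified stand-in for Elm's record-update runtime helper.
function _Utils_update(oldRecord, updatedFields) {
    var newRecord = {};
    for (var key in oldRecord) {
        newRecord[key] = oldRecord[key]; // copy every field...
    }
    for (var key in updatedFields) {
        newRecord[key] = updatedFields[key]; // ...then overwrite the changed ones
    }
    return newRecord;
}

// Roughly what `{ model | x = 1 }` becomes: a call to the helper above.
function bumpViaUpdate(model) {
    return _Utils_update(model, { x: 1 });
}

// Roughly what `{ x = 1, y = model.y, z = model.z }` becomes: a direct literal.
function bumpViaLiteral(model) {
    return { x: 1, y: model.y, z: model.z };
}
```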

Also, I've been using tail-call-optimised recursion with a function that takes many arguments, as opposed to a single record argument. The compiled code is really fast: all of the arguments become local variables that are reused in the while loop of the compiled JavaScript code.
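The tail-recursion point can be sketched too. This is a hypothetical example (the function sumTo is made up for illustration): Elm compiles a tail-recursive function into a while loop where each argument is a reused local variable, so there is no stack growth and no allocation per "call".

```javascript
// Elm source (sketch):
//   sumTo : Int -> Int -> Int
//   sumTo n acc =
//       if n <= 0 then acc else sumTo (n - 1) (acc + n)
//
// Roughly the shape of the compiled output: the arguments become
// mutable locals updated inside a while loop.
function sumTo(n, acc) {
    while (true) {
        if (n <= 0) {
            return acc;
        }
        var tmpN = n - 1;
        var tmpAcc = acc + n;
        n = tmpN;
        acc = tmpAcc;
    }
}
```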


To me this looks like an abstraction leak.

I assume those tricks won’t be needed when Elm compiles to webassembly. Am I right?

Well, it’s complicated.

Because Elm is a language on the web platform, the size of a program matters a lot compared to native languages (gotta have fast page loads!). Program speed and size need to be balanced, and they can be at odds.

In the case of comparing to a Char literal specifically, it should be simple to fix this issue. Emitting the JavaScript == for types that support it (Int, Float, Bool, Char, and some others) is trickier because it requires keeping type information around until code generation. Even bigger speed improvements could be made by specializing operators like == for every type they are used with (monomorphization), but that would greatly increase code size.

WebAssembly is not a magic solution to this problem, or to some of the other cases where Elm picks small code size over best performance (notably record update syntax). Mainly, it requires compiler engineering that carefully balances compiler speed, code size, and code speed. While WASM is more compact (allowing for slightly faster but larger programs), it doesn't fundamentally change the constraints that Elm is subject to.


I don’t think WASM by itself solves those issues. These seem like compiler optimizations that can happen higher up than at the code generation stage.

I must admit that I am quite surprised that Elm does not keep the type information around until code generation. What is the reason for this?

Very interesting insights, thanks for this!

I think I also have an edge case:

  • I am generating large lists of data to create a world map for a game.
  • The generation process batch-processes at least 1000-1500 messages, but it can be way more depending on the world map size.
  • Since it would block the main thread if I did it synchronously, I subscribed to Time.every 0 TakeMessageAndProccessIt to process each message (not on animation frame; that would be too slow, 60 fps only). It takes about 6 seconds to process about 1400 messages, and I noticed that for about 4 seconds of that, it is idle! I understand why there is idle time, but why is there so much? Should I process multiple messages in each step? (A problem is also that some messages in the batch use Cmd with Random.generate.)

Here is a small snapshot of my Performance Profile:

Can you see the idle gaps? So my question is: is there a way to make this better? I also thought of using a Web Worker to process these long calculations …

These are my messages for each field in the map (except DroppedGenerationDataForPerformance, SetCurrentEcoSystemType, SetBiomeList, and EndStep):

type GenerationStepMsg
    = DroppedGenerationDataForPerformance
    | SetCurrentEcoSystemType EcoSystemType
    | SetBiomeList (List Biome)
    | RollRandomCoordinate (List WorldSpace -> Random.Generator Int) (List WorldSpace)
    | RollRandomBiome (List Biome -> Random.Generator Int)
    | RollChunkTreesSubCoordinates (List.Nonempty.Nonempty Tree -> Random.Generator (List ScreenSpace))
    | RollChunkTreeTypes (List.Nonempty.Nonempty Tree -> Random.Generator (List TreeType))
    | AddChunkToGlobalList
    | DropPickedBiomeFromBiomeList
    | CreateChunk (EcoSystemType -> Biome -> WorldSpace -> Chunk)
    | CalculatePossibleCoordinates (List Chunk -> List WorldSpace)
    | EndStep

I’m not totally sure, and I’m not the right person to answer this question.

I think this is at least related to the compiler being written in a pure functional language, while common type-inference implementations use mutation. Typically, you would use a union-find data structure to store a mapping from type variables to their contents (effectively substitutions). Elm instead uses a bunch of scattered IORefs (one per variable), which means the substitutions don't live in a central place. I assume there is a good reason for that design, but it means that preserving the relevant type variables isn't so straightforward.

Then, of course, holding on to the types is a bunch of work that isn't very visible to most end users. It absolutely seems possible, and there are some things that could be done with the type information, but it hasn't been a priority so far.

@eimfach could you open a new thread for this problem? I think it would be better to do more work per message (some kind of batching). Possibly, manually threading the randomness would be beneficial too.


Actually, you can get many of the benefits of monomorphization by making it easy for the JavaScript JIT to inline functions. This would require a change in how function calls are compiled, but would give a lot of the same benefits without increasing code size. It would also require keeping type information around until code generation, though.

More info: https://dev.to/skinney/improving-elm-s-compiler-output-5e1h

