Plots and outlier removal in elm-benchmark 2.0.3

I told a bunch of people that elm-benchmark would have these two features in version 2, but the first release was already so full that I cut them. Now they’re in!

There are two nice benefits here:

  1. You can examine the data that elm-benchmark collected for you. Humans have really good heuristics for things being off visually, and looking at a plot really kicks the funny-detector on.
  2. We can now reliably measure “heavy” (read: less than 100,000 runs/second) functions. These tend to have big spikes and dips, and chopping off outliers lets us get a better read on what’s actually going on.
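For intuition, outlier chopping of this sort can be sketched with interquartile-range fences. This is a hypothetical illustration only, not necessarily the rule elm-benchmark 2.0.3 actually applies:

```elm
-- Hypothetical sketch: drop samples outside 1.5 * IQR fences.
-- Not elm-benchmark's actual implementation.
trimOutliers : List Float -> List Float
trimOutliers samples =
    let
        sorted =
            List.sort samples

        n =
            List.length sorted

        -- value at a given index of the sorted list (0 if out of range)
        at index =
            sorted
                |> List.drop index
                |> List.head
                |> Maybe.withDefault 0

        q1 =
            at (n // 4)

        q3 =
            at (3 * n // 4)

        fence =
            1.5 * (q3 - q1)
    in
    List.filter (\x -> x >= q1 - fence && x <= q3 + fence) samples
```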

As before, you can find out how to use elm-benchmark at

Enough talk, now we plot! :nail_care:


This shows a heavy function (calling `length` on a long list). We’ve found and eliminated a bunch of outliers, so the goodness of fit has recovered nicely; before this change it was hovering around 80%.


We take the same approach when there are multiple benchmarks in a comparison. In these cases, lower means faster, but we also assign each series a color. I think these colors are reasonable for colorblind folks, but please open an issue if they’re not and I’ll try something else!

You can also see that in the case of lighter functions we have fewer outliers. The calculations here do not change nearly as much as with heavy functions, but everything gets a little more reliable. :+1:


Scale benchmarks get more colors. Right now we only have a few colors, after which they start repeating. If you run into this, you are probably running very large scale benchmarks, which the library is not really designed for. If that’s your case, please open an issue.


Great, the new plots look very nice!

Do you have a demo somewhere on ellie?
I’d love to play with it a bit without setting up my own repo.

Also a small nitpick:
There are no labels on the axes of the plots; how am I supposed to know what the plots are showing?

You can use Benchmark.Runner.program as main and it will Just Work™ in Ellie.
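For instance, a minimal program might look like this. This is a sketch: `suite` and the `benchmark`-with-a-thunk usage are illustrative and may differ slightly between versions, but `Benchmark.Runner.program` is the entry point:

```elm
module Main exposing (main)

import Benchmark exposing (Benchmark, benchmark, describe)
import Benchmark.Runner exposing (BenchmarkProgram, program)


-- a tiny illustrative suite; names here are made up for the example
suite : Benchmark
suite =
    describe "List"
        [ benchmark "length" (\_ -> List.length (List.range 1 1000)) ]


main : BenchmarkProgram
main =
    program suite
```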

:man_facepalming: I knew there was something I missed. I spent a long time getting the legend reasonable and totally missed this!

The x-axis is sample size, and the y-axis is time in milliseconds.

Is it possible to make plots optional for scales?

I am currently benchmarking multiple data structures (compared to List and Array) at exponential size scales (1 to 100K) on different operations, with one file per benchmarked operation. The plots clutter the nice quick overview of all results, which used to fit roughly on one page (for exploration I only use 3 scales, 10 to 1000).

The code looks like this:

main : BenchmarkProgram
main =
    program <|
        describe "Append" <|
            [ lists
            , hamtArrays
            , uint8Arrays
            , float64Arrays
            ]


-- the 3 exploration scales (10 to 1000)
sizes : List Int
sizes =
    [ 10, 100, 1000 ]


lists : Benchmark
lists =
    sizes
        |> List.map (\size -> ( size, List.repeat size 0 ))
        |> List.map (\( size, list ) -> ( toString size, \_ -> List.append list list ))
        |> scale "List"


hamtArrays : Benchmark
hamtArrays =
    sizes
        |> List.map (\size -> ( size, Array.repeat size 0 ))
        |> List.map (\( size, array ) -> ( toString size, \_ -> Array.append array array ))
        |> scale "Hamt Array"


uint8Arrays : Benchmark
uint8Arrays =
    sizes
        |> List.map (\size -> ( size, JsUint8Array.repeat size 0 ))
        |> List.map (\( size, array ) -> ( toString size, \_ -> JsTypedArray.append array array ))
        |> scale "Uint8 Array"
Here are visuals with the plots, and from before the plots (which took roughly the same space as previously):



I don’t get much valuable info from these plots. Please let me know if you think my request is unjustified, and thanks for the great work on this library! It really is a pleasure to benchmark with it!

PS: Ultimately, a plot would be very nice, but not the ones presented here. In my wildest dreams, for scales, elm-benchmark would automatically provide a plot, not of the samples, but of the timing performance of the compared methods at each scale (maybe a box plot to encode standard deviation or goodness of fit), with a logarithmic scale :slight_smile: :heart:

PPS EDIT: if this context of comparing more than two methods at different scales with a final plot is something you think is worth exploring, I’m willing to try to implement something.

Makes sense.

So these plots are meant to quickly spot if your benchmark behaves nicely, right?
Although quite useful, I actually expected other plots.

For comparing different implementations, I’d expect a bar plot with the average run time and standard deviation.
For comparing different implementations with varying input sizes, I’d expect a line plot with input size vs. average run time, with different lines for different implementations.

Anyway, this library is on a very good track, I love the new changes :heart_eyes:

I ended up with what we have now by going through a lot of the same things the two of you are saying. :slight_smile: TL;DR: sounds like you two have interesting ideas, let’s all collaborate to make things nicer.

Scale benchmarks are marked as experimental for some of the reasons y’all are saying. They’re a tool I wanted to get into people’s hands to see what you’re doing… it’d really help me if you could open issues on the repo with your use cases.

More inline…

I’d really rather not. If you visualize your data every time you tend to develop a heuristic for how things should look. Once you get a feel for that, you can spot weirdness pretty easily and in a way that a statistical approach couldn’t catch.

Anscombe’s quartet illustrates this in an interesting way: it contains four data sets which have identical basic statistics but are visually distinct. (A more recent, and IMO funnier, take: the Datasaurus Dozen.)

See if you can get used to it. I’d hate for you to miss out on the benefit of your brain’s excellent heuristic creation. If you’re having a really bad time, I’d welcome a PR to make the plots slightly smaller/denser or adjust their layout to be less intrusive. (Caveat: horizontal space is off-limits because of some plans I’m not ready to talk about just yet.) Shortening the plot area might be a good way to do this, but it makes the series look odd.

I’m curious about your use case for this. Why would this be more important than verifying the samples are acceptable? I’m having some trouble imagining what you want, as well. Can you link to an example of this visualization elsewhere?

I considered plots like this, but the primary motivation is to show the data collected so you can decide whether or not to trust it. Bar plots with runs/second would be a nice alternative, but I’m not sure how to fit them in visually without spreading out vertically even more.

This would be pretty cool, but depending on the visualization it would require a pretty big refactoring to keep track of the input size. If you want more visualization options it makes sense to start elsewhere.

Yep, I’ll do that then

These are really interesting demonstrations that you cannot use the same hammer (classic statistical measures like mean and standard deviation) for every problem; you must always take into account the particularities of the problem at hand.

In elm-benchmark’s case, I’ve not read the source code yet, but according to the plots I would assume that you are choosing the distribution of sample sizes uniformly, meaning that the only free variable is the time it takes (Y-axis). This rules out most of what could go wrong in distributions with 2 free variables (like the X4 plot of Anscombe’s quartet).

Still, I agree a lot of things can go wrong. For example, X1, X2, and X3 are still very different. One of the simplest yet very robust ways of fitting a regression model that would distinguish those quite easily is a RANSAC approach. Provided some threshold (an outlier distance limit), it will return the best regression (most of the time; it’s not deterministic). The distance function is quite important though, and using an L1 norm (sum of absolute distances) instead of the classic L2 norm (sum of squares) makes it even more robust to outliers.

That being said, the method isn’t deterministic, since the point pairs that seed the iterative algorithm are chosen at random. I have some ideas, though, that could make it deterministic by using a heuristic to choose the point pairs.
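As a rough illustration of such a deterministic variant (hypothetical code, not part of elm-benchmark): try every pair of points as a candidate line, count inliers within an L1 residual threshold, and keep the candidate with the most inliers:

```elm
-- Hypothetical, deterministic RANSAC-style line fit.
-- Exhaustively trying all point pairs replaces the usual random sampling.

type alias Point =
    ( Float, Float )


{-| Line through two points as ( slope, intercept ); Nothing if vertical. -}
lineThrough : Point -> Point -> Maybe ( Float, Float )
lineThrough ( x1, y1 ) ( x2, y2 ) =
    if x1 == x2 then
        Nothing
    else
        let
            slope =
                (y2 - y1) / (x2 - x1)
        in
        Just ( slope, y1 - slope * x1 )


{-| Pick the candidate line with the most inliers, using an L1 residual. -}
fitLine : Float -> List Point -> Maybe ( Float, Float )
fitLine threshold points =
    let
        candidates =
            points
                |> List.concatMap (\p -> List.map (lineThrough p) points)
                |> List.filterMap identity

        inlierCount ( slope, intercept ) =
            points
                |> List.filter
                    (\( x, y ) -> abs (y - (slope * x + intercept)) <= threshold)
                |> List.length
    in
    candidates
        |> List.sortBy inlierCount
        |> List.reverse
        |> List.head
```

Trying all pairs is O(n²) in candidates, so this is only a sketch of the idea; a real heuristic would pick far fewer seed pairs.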

Anyway, this was mainly to argue that, given the circumstances (only one free variable), there are very few things that could go wrong when using a robust regression technique, and some “quality” metrics could very quickly convey the quality of the benchmark without needing a plot.

Yep, use case detailed in this issue:
