What is the most useful report from a benchmarking run?

In planning future versions of elm-benchmark, I wonder what the most useful information from a benchmarking run is. I suspect that it may be most useful to provide one metric which says “it’s about this fast” and one which says “it will probably be about this fast in the future.”

I’m currently thinking that in future versions, we may change to:

  • it’s about this fast is the median of runs in the current population. To my mind, this is the best indicator of current performance, since the median is a value actually present in the data! (See the footnote for a caveat, though.)

  • it will be about this fast is a prediction interval. In summary: “given the current population, we are 95% confident that a new point would fall between this lower and upper bound.” (See the sketch just after this list for one way such an interval could be computed.)
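
To make that second bullet concrete, here is a minimal sketch of one standard way to compute such an interval: a t-based prediction interval, which assumes run times are roughly normally distributed. It’s written in Python purely for illustration (the data and the `report` helper are made up); it is not a description of elm-benchmark’s internals.

```python
import statistics
from math import sqrt

from scipy import stats  # external dependency, used only for the t quantile


def report(run_times_ms, confidence=0.95):
    """Summarize a sample of run times (ms): median plus a prediction interval.

    The prediction interval assumes the runs are roughly normal; it bounds
    where a *new* run would likely fall, not where the true mean lies.
    """
    n = len(run_times_ms)
    mean = statistics.mean(run_times_ms)
    sd = statistics.stdev(run_times_ms)

    # t quantile for a two-sided interval with n - 1 degrees of freedom
    t = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)

    # Width accounts for both the spread of the data and the uncertainty
    # in the estimated mean: s * sqrt(1 + 1/n).
    half_width = t * sd * sqrt(1 + 1 / n)

    return {
        "about this fast": statistics.median(run_times_ms),
        "will probably be between": (mean - half_width, mean + half_width),
    }


# Example with ten made-up run times in milliseconds
print(report([1.02, 0.98, 1.10, 1.05, 0.99, 1.03, 1.01, 1.20, 0.97, 1.04]))
```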

I’m interested in two questions:

  1. Current users of elm-benchmark: would this be intuitive and useful for you? How are you interpreting the data, currently?
  2. Current users of benchmarking tools in other languages: what stands out to you in those tools as especially helpful? Especially bad?

Footnote: What’s a run?

Runs are technically the mean time for a bunch of function executions. This is because browsers’ performance.now used to be 50µs-resolution data, but its granularity is now much coarser because of the response to the Spectre vulnerabilities. MDN has the deets, as always. By taking the mean, we can measure performance below the resolution threshold, no matter where it lies, as long as our total sampling time is significantly above the browsers’ resolution.
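
As a rough illustration of that trick (again in Python, with `time.perf_counter` standing in for the browser’s performance.now), the idea is to time one batch of many executions and divide, so the coarse clock only needs to resolve the batch total rather than a single call:

```python
import time


def time_per_call(f, batch_size=10_000):
    """Estimate the cost of one call to f by timing a whole batch.

    Even if the clock only ticks every ~100 microseconds, the batch total is
    long enough to measure, and dividing recovers sub-resolution durations.
    """
    start = time.perf_counter()
    for _ in range(batch_size):
        f()
    total = time.perf_counter() - start
    return total / batch_size  # mean seconds per execution


# Example: an operation far cheaper than a coarse clock's resolution
print(time_per_call(lambda: sum(range(100))))
```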


Maybe you could output the distribution of the durations?
That way, when comparing two sets of runs (e.g., when comparing two implementations) you compare the two distributions, which is more informative than a single number.
Disclaimer: I’ve never seen this done in a software benchmark.

Here is an illustrative example (hand-drawn no less):

The green is the distribution or histogram of timings, and the red is the empirical cumulative distribution function (ecdf in R).

The red is useful because you can read off the baseline latency from it, and also visualise the thickness of the tail at the top end. When optimising, sometimes you do things to push down the baseline latency, and sometimes to squeeze the tail, which usually means reducing memory usage to cut down on garbage collection.

I don’t think these graphs are what you would want to output directly from a benchmarking program though. It is most useful to be able to get all of the raw latency numbers for a run, and then post-process them somewhere else; I used R.
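
For what it’s worth, a rough Python equivalent of that post-processing (the reply used R’s ecdf; the numpy approach and the made-up numbers here are just one way to do it) might look like this: take the raw per-run latencies, build an empirical CDF, and read off the baseline and the tail:

```python
import numpy as np


def summarize(latencies_ms):
    """Post-process raw per-run latencies: empirical CDF plus baseline and tail."""
    xs = np.sort(np.asarray(latencies_ms))
    # Empirical CDF: fraction of runs at or below each observed latency
    cdf = np.arange(1, len(xs) + 1) / len(xs)

    return {
        "baseline (5th percentile)": np.percentile(xs, 5),
        "median": np.percentile(xs, 50),
        "tail (99th percentile)": np.percentile(xs, 99),
        "ecdf": list(zip(xs.tolist(), cdf.tolist())),
    }


# Comparing two implementations then means comparing these summaries
# (or overlaying the two ECDF curves) rather than a single number.
old_impl = [1.1, 1.2, 1.0, 1.3, 4.8, 1.1, 1.2]   # made-up numbers
new_impl = [0.9, 1.0, 0.9, 1.1, 1.6, 0.9, 1.0]
print(summarize(old_impl))
print(summarize(new_impl))
```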

