Idea: add statistical labeling to elm-test

Summary: Haskell QuickCheck provides some neat statistical tools to let people avoid bad assumptions about values in their property tests. I would like these (or something like them) in Elm! Let’s talk about how to make them idiomatic to our ecosystem.

I recently watched John Hughes’ talk Building on Developers’ Intuitions to Create Effective Property-Based Tests. Watch that video before continuing with this post if you have 50 minutes; it’s well worth it if you’re interested in testing. But if you don’t have time, please read on and I’ll try to summarize!


The talk’s first example is adding range-limited integers. We want to make some function add which does not overflow. So if our range is 0–5:

add 0 0 == Just 0
add 1 2 == Just 3
add 3 2 == Just 5
add 3 3 == Nothing -- 6 is out of the range!
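For concreteness, the function under test might look like this (a sketch; add and maxValue are the names assumed by the example above, with maxValue being 5 for the 0–5 range):

```elm
-- Sketch of the range-limited add from the example above.
add : Int -> Int -> Maybe Int
add a b =
    if a + b > maxValue then
        Nothing

    else
        Just (a + b)
```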

We begin by testing boundary conditions:

  • 0 plus any valid number is Just otherNumber
  • the max value plus any value is Nothing.

We start with unit tests, then generalize them using assumptions: for the zero case, we tell QuickCheck to assume that one of the numbers is zero. For the max value case, we tell QuickCheck that one of the numbers is the maximum value. (Note: this is not a custom generator, just an assumption about the generated values, using QuickCheck’s ==> operator.) Does that work? Well…

+++ OK, passed 100 tests; 96 discarded.

Whoops! We address this by unifying the two tests. If a + b > maxValue then we assert that add returns Nothing, otherwise Just (a + b).
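Translated into elm-test terms, the unified property might look something like this (a sketch; add and maxValue are the assumed names from the example, not a real API):

```elm
-- The two boundary tests unified into one property:
-- add overflows exactly when the true sum exceeds maxValue.
addRespectsRange : Test
addRespectsRange =
    Test.fuzz2 (Fuzz.intRange 0 maxValue)
        (Fuzz.intRange 0 maxValue)
        "add returns Nothing exactly when the sum overflows"
        (\a b ->
            if a + b > maxValue then
                add a b |> Expect.equal Nothing

            else
                add a b |> Expect.equal (Just (a + b))
        )
```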

But how do we know that we are not running into the same situation as before, where we are missing some values without meaning to? The solution: labeling! We create a labeling function that examines the result of adding the ints and adds a label to the test run:

summarize n
  | abs (n - maxCoinValue) < 3 = "boundary"
  | n <= maxCoinValue = "normal"
  | n >  maxCoinValue = "overflow"

This shows up in the QuickCheck output like this:

*Coins> quickCheck . withMaxSuccess 10000 $ prop_Add
+++ OK, passed 10000 tests:
50.61% normal
49.39% overflow

John Hughes uses this as a springboard to say that maybe we ought to only have one generator and make assumptions about its output instead of writing a generator/validator pair for each test. That sounds nice and I plan to use it regardless of whether we can add this kind of thing in Elm!

Now, I’m cutting out a lot of the talk here (you really should make time to watch it!) but we can also use these labels to generate examples. That lets us be certain that our labeling function is performing properly! In this case, for testing insertion into an ordered collection:

*** Found example of at start, update
[(0,0)] == [(0,0)]

This allows us to check our assumptions about what’s actually being generated in our tests.

Then, to finish off, we see that you can assign multiple labels per test by calling classify (a > 0) "positive". We can also assert that we have enough coverage per label with cover 5 (a == 0) "zero", where the number is the minimum percentage of cases in which you want the label to show up.

Of course, failing the coverage here produces a warning instead of a failure by default. This is important so that tests don’t randomly flake just because they started with an unlucky seed. But you can ask QuickCheck to generate enough cases to be statistically certain that your labeling is accurate!

Bringing this to Elm

To make a long story short, I want these things in Elm! Specifically, I would like:

  • the ability to make explicit assumptions about fuzzer inputs (and report that a certain number of cases have been skipped because of an assumption)
  • the ability to label test cases so as to get statistics about the values
  • the ability to assert coverage statistics

I think all three of these are important: the first two make our assumptions about code explicit, and the third lets us communicate our intent around what exactly should be tested to other developers.

We already publish some statistics to give developers an intuition about the fuzzers, but what about when they’re combined? I think it’d be useful to see if the things I’m assuming about the code are actually borne out in a typical test run. This could certainly help!

Is this a good idea?

It’s worth being cautious in cases like this… APIs infrequently transfer cleanly between languages. In this case, I think we can make it work, but I’d be really interested in hearing more about how we could make these ideas idiomatic to Elm. What do y’all think?

Specifically, I have some ideas for what the API could look like, but I’d like us to hold off until we are sure we are solving the right problem. Remember: code is the easy part.


I’m strongly in favour of adding some sort of coverage assertion to tests - I have a bunch of tests which follow the pattern:

  • Fuzz some values
  • Based on the relationship between the fuzzed values, produce a specific Expectation

As a relatively simple example, here’s a test that ensures that the intersection of two bounding boxes is either Nothing or Just a valid box:

intersectionIsValidOrNothing : Test
intersectionIsValidOrNothing =
    Test.fuzz2 Fuzz.boundingBox2d
        Fuzz.boundingBox2d
        "intersection of two boxes is either Nothing or Just a valid box"
        (\first second ->
            case BoundingBox2d.intersection first second of
                Nothing ->
                    Expect.pass

                Just result ->
                    Expect.validBoundingBox2d result
        )

It would be very useful here to make sure that we are actually hitting the Just result case a decent number of times, and not somehow always generating random input bounding boxes that never touch each other! For a more complex example, check out this line segment intersection test which uses some complex nested if and case logic to figure out which of nine different Expectations to apply.

In general, it would be great to:

  • Be able to annotate an Expectation with a tag of some sort, indicating which of several cases got hit
  • Be able to annotate a Test to assert that a particular distribution of tags were produced during fuzz testing

ah, I think that’s a great example of where tools like this could help! Thank you for coming along with a real-world motivating use-case!

Are there any more complex uses of fuzzing which we could use to figure out our motivation in Elm?

  • the ability to make explicit assumptions about fuzzer inputs (and report that a certain number of cases have been skipped because of an assumption)

Filtering fuzzed values is a very bad idea. Any filter has the potential to filter out all the values, or at least enough of them that your test runs very slowly. We’ve considered workarounds, but they were pretty ugly, and we removed them. This is true of fuzz libraries in other languages I’ve tried that offer a filter – sooner or later, you run into trouble.

  • the ability to label test cases so as to get statistics about the values

Sounds reasonable, at least in isolation. You could pass an a -> String alongside the Fuzzer a into the test.

  • the ability to assert coverage statistics

This is really just codifying what distribution of labels are acceptable. I think the progression is actually add labels => assert distribution => filter fuzzed values to make the distribution acceptable. And again, that last part isn’t viable. So maybe we should rethink the seemingly innocuous labeling feature if it’s building towards something we can’t have?

Ian’s bounding box test is a pretty weak test. intersection box1 box2 = Nothing would cause it to pass. While seeing the distribution of expectations would tell you something isn’t right, it wouldn’t tell you how to fix it. Most tests invoke every expectation every run (either there’s one expectation or there’s an Expect.all). Putting expectations in branches almost inherently means that you don’t have your test value nailed down enough.

The elm-test README offers this piece of advice (courtesy of yours truly):

If you find yourself inspecting the fuzzed input and making different expectations based on it, split each code path into its own test with a fuzzer that makes only the right kind of values.

So to use Ian’s bounding box example, don’t fuzz two bounding boxes independently. Fuzz a pair of disjoint boxes and ensure there’s no intersection, and fuzz a pair of intersecting boxes and ensure the intersection is what you expect. That’s two separate fuzz tests, so you know that you’re testing 100 of each.

Of course, this does mean that you’re putting more work into the fuzzers, but I think that pays off. The solution to “how do I filter a fuzzer” is to only generate the values you want in the first place. (And yes, fuzzers can get as complicated as you like. There’s no Fuzz.andThen, but there’s Random.andThen, and if you’re constructing something that complicated you should write your own shrinker anyway.)
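For example, a complex fuzzer along those lines can combine Random.andThen with a hand-written shrinker via Fuzz.custom. This is a sketch against the elm-test 1.x API; the generator and shrinker choices here are illustrative assumptions:

```elm
import Fuzz exposing (Fuzzer)
import Random
import Shrink

-- A fuzzer for non-empty lists of small ints, built from a Random
-- generator (using Random.andThen to pick the length first) and a
-- custom shrinker that never shrinks to the empty list.
nonEmptyInts : Fuzzer (List Int)
nonEmptyInts =
    Fuzz.custom
        (Random.int 1 10
            |> Random.andThen (\n -> Random.list n (Random.int 0 100))
        )
        (Shrink.keepIf (not << List.isEmpty) (Shrink.list Shrink.int))
```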

Oh, by the way, there are a lot of breaking changes on master with no set release date (including renaming Shrinker to Simplifier and making them much easier to write). So if we do want to add a feature, now is the time!


Thanks for your thoughts, Max! Addressing a few things:

This seems pretty central to your feedback, so I’ll get it out of the way up front: this is not what I’m suggesting!

filter-like functions do what you say, but assume-like functions elsewhere do not. Instead of sending values back to the fuzzer, assume-like functions skip the test run if the assumption is bad. It works more like a contract than a post-hoc shrinker. But you are right to point out that among the three pieces of proposed functionality, assume is the weakest. Used in isolation, your tests will be worse! OK! Let’s drop it then. :smile:

I further concede your point about filter: it usually causes more trouble than it’s worth, and I tend to avoid it too. Fortunately, the talk in question here (and this proposal following it) actually gives us tools to avoid filtering! Lemme 'splain:

Quite right! Test writers in general may need filter less if we/they could see and formalize value distributions. For example, in the talk he goes from the equivalent of a plain Fuzz.intRange 0 maxValue to the equivalent of this:

    Fuzz.oneOf
        [ Fuzz.constant 0
        , Fuzz.intRange 0 maxValue
        , Fuzz.constant maxValue
        ]

This is a sneaky difference in our case, and not an optimization you would ever make if you did not measure the distribution. The important part here is that we are changing the shape of the distribution at the level of the generator, not at the level of the test. This lets us avoid using filter at all!

For at least this reason, I think labeling is independently valuable. I hope we can agree on this part at least. :slight_smile:

I don’t think this is completely fair. Round-trip tests for decoders have similar flaws: if your property is \x -> decode (encode x) == x then decode = identity and encode = identity trivially pass, but are also incorrect. But that does not stop round-trip tests from being valuable or useful! Further, Ian has clearly put a lot of thought and work into this test suite, and it’s working well. elm-geometry is really solid!

Plus, his approach here lines up with Hughes’. Even though we have this advice to avoid branching on fuzz input, that’s exactly what he does and recommends. It’s in this segment of the talk, three minutes or so. Please watch at least that, as he explains the problems with multiple fuzzers pretty clearly! This recommendation was really surprising to me, as I’ve done the discrete-fuzzer-per-property dance many times myself and this contradicts it.

Maybe it’s time to reexamine our advice on this matter? I don’t want to turn this into “John Hughes said so” but I found this idea compelling and his background in industrializing and teaching these methods makes me think that there’s maybe something to them.

So, why do we tell people to have a discrete fuzzer per property? Is it partly because we currently have no way to measure our case coverage? Have we found it easier to learn this way? In particular, you mentioned wanting to nail test values down: why is that valuable? Isn’t part of property testing asserting that the property always holds across disparate input?

Good to know, thanks! I’m excited about the stuff coming up too!

So, I watched the talk (like I should have before replying!) and I’m a lot more sold. (Hughes knows how to present – I felt like I was thinking what he was going to say right before he said it.)

I think there’s a dependency chain here (not a tree, thankfully, just a chain):

  1. The ability to label values generated by the fuzzer
    1b. The ability to see the distribution of labels on a test
  2. The ability to see shrunken/simplified examples of each label, to qualify your label function
  3. The ability to encode advisory coverage requirements for each label
  4. The ability to enforce mandatory coverage requirements with “military grade statistics”

1 isn’t useful without 1b, hence the numbering. The part of the talk describing 2 was super fun and quite convincing. I’m thinking that invoking the label explanation functionality would skip all tests and print the explanation, and then have a “yellow” test run similar to Test.only so you don’t commit it.

Even if we don’t get to 3 and 4 immediately (4 being the severe implementation challenge), it’s worth thinking about the API. (A quick glance finds some union types that might need extra cases.) The obvious way to write a labeler is as an a -> String function with an if or case ... of expression, but that can’t tell you how many different cases to expect. Using the style of adding one case at a time is essential to 3 and 4, but also helpful for 2. One downside of this style is that the labelers might not be disjoint or complete; presumably the most recent one wins, an unclassified value shows up in its own category, and a gap fails 4?

Fuzz.intRange tests the boundaries too. (Perhaps 10% for each boundary is too low, if Hughes recommends 1/3?)
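If we did want to match the 1/3 weighting from the talk, Fuzz.frequency makes the distribution explicit (a sketch; maxValue is an assumed name):

```elm
-- Equal weight for each boundary and for the interior of the range.
boundaryHeavyInt : Fuzzer Int
boundaryHeavyInt =
    Fuzz.frequency
        [ ( 1, Fuzz.constant 0 )
        , ( 1, Fuzz.intRange 0 maxValue )
        , ( 1, Fuzz.constant maxValue )
        ]
```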

Yes, it’s pretty clear that writing fuzzers for each case of overlapping rectangles is going to balloon exactly as he’s describing. Opinion changed!

If you’re transforming between two different types, that won’t compile, but I digress.

It is, I’ve used it!


One small point here is that labelling doesn’t necessarily happen on the fuzzer output, but rather inside the test itself. For example, in Ian’s test one would naturally label both branches:

intersectionIsValidOrNothing =
    Test.fuzz2 Fuzz.boundingBox2d
        Fuzz.boundingBox2d
        "intersection of two boxes is either Nothing or Just a valid box"
        (\first second ->
            case BoundingBox2d.intersection first second of
                Nothing ->
                    Expect.pass
                        |> Expect.label "Not intersecting"

                Just result ->
                    Expect.validBoundingBox2d result
                        |> Expect.label "Intersecting"
        )

Thinking about this stuff a bit more (especially about #4), it reminds me a bit of AFL-style fuzzing, where one could use coverage information and generate exactly as many test cases as needed to make sure all branches were hit. This seems like a manual version of that.

I think you’ve got it in one! In particular, I see a couple of implementation challenges hiding here:

  1. what does a nice Elmy API for this look like?
  2. what does the output look like? Obviously we want it to be sufficiently verbose to be usable, but running the whole suite might result in logspam if each test is annotated. (One idea: export structured output that an editor or external tool could use, but to be really useful we’d probably also want to include source location information, which would be another challenge on top of this!)
  3. the statistical test for sufficient coverage, as you mentioned. I think this is work worth doing, though, as I’m pretty sure the same family of statistical tests could be used to make elm-explorations/benchmark operate more confidently with fewer test runs and maybe get rid of the JIT warmup period.

Also, a question for you @mgold… how would you anticipate something like this adding to the maintenance burden of elm-test? I think it’d be useful to have, but too many useful-to-have features can really hurt in the long term.

API Draft

Anyway, it sounds like we are more-or-less on board with the ideas here! Maybe it’s time to start sketching out an API. I’ll give these rough names just so that we can talk about them more easily without saying “the first idea, the second idea”, et cetera.

“Inline” Style

This is really similar to what @gampleman posted, but with an explicit Bool in front.

label : Bool -> String -> Expectation -> Expectation
cover : Float -> Bool -> String -> Expectation -> Expectation

(or it could be split into label : String -> … and labelIf : Bool -> String -> …, but I would rather minimize the surface area here.)

benefits here:

  • you can label one input multiple times. This would have been useful in Hughes’ example of the dictionary. One label for “at front”, another for “update”. I assume reporting would be based on sets of labels, not per label. (e.g. “at front, update” with one percentage/example, instead of a separate percentage/example for each.)
  • information about labels—and therefore the cases the test is concerned with—stays in the test
  • there is only one way to do it. Labels will be consistent throughout the codebase.

and drawbacks:

  • one might not fully cover the inputs with labels. I have to assume this would produce a special warning (i.e. if there’s any label there must be complete coverage), but what if we could just avoid that?
  • it might actually not be good to have multiple labels per input! I’m not sure yet but it seems like you’d want to be really precise about this, and your test would be worse if you were not able to.
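In use, the inline style might read like this (a sketch of the hypothetical API; checkProperty stands in for whatever real expectation the test makes):

```elm
-- label and cover are the proposed functions, not part of elm-test today.
prop =
    Test.fuzz (Fuzz.intRange 0 maxValue) "labels interesting inputs" <|
        \n ->
            checkProperty n
                |> label (n == 0) "zero"
                |> cover 5 (n == maxValue) "max"
```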

“Function” Style

We might take a cue from the classify function shown earlier and build an API around that:

label : (a -> String) -> a -> Expectation -> Expectation
cover : (a -> (Float, String)) -> a -> Expectation -> Expectation


benefits:

  • functions have to be total (or crash, I guess), so now you cannot have any gaps in coverage


drawbacks:

  • a classification function suggests having a separate top-level definition. I’m not sure that’s a good thing, as now the assumptions about the test input are stored outside the test. You could define it inline, of course, but it’d be nice if this could suggest that people do the right thing.
  • you could easily do this with the first API by saying label True (classify foo). Maybe that makes this too strict, or maybe it makes it just strict enough. I don’t know!

“Minimal” Style

What if we got rid of as many arguments as possible to maximize flexibility in calling?

label : String -> Expectation -> Expectation
cover : Float -> String -> Expectation -> Expectation


benefits:

  • using it in a case statement or conditional (as in @gampleman’s example above) is extremely straightforward
  • you can use if, case, or a function call in the expression that produces the String. You can be as complete or sloppy as you want.
  • if you do use a function, this has no opinions on the signature


drawbacks:

  • it is not obvious how to vary the output via a function call. That may be hard to learn.
  • expecting people to do whatever suits them best to vary the string may result in inconsistency across a codebase.
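For comparison, the minimal style in a case expression might look like this (hypothetical API again; checkProperty is a stand-in):

```elm
-- label and cover here take no Bool: the branch you are in decides.
prop =
    Test.fuzz (Fuzz.list Fuzz.int) "labels by emptiness" <|
        \xs ->
            case xs of
                [] ->
                    checkProperty xs |> label "empty"

                _ :: _ ->
                    checkProperty xs |> cover 90 "non-empty"
```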

Considerations With All Three

  • it is a bit awkward to annotate the expectation like this. The assumption comes after the use!
  • do these functions live in Expect because they annotate expectations, or Fuzzer because they work with fuzzers?
  • this API cannot be separated from a test case. In the examples case, this means doing the full work of the test and discarding it just to show examples. Is that OK? Maybe?

In the end, I think if we were going to go with one of these three, I would be most interested in choosing “inline”: it’s big on consistency, which makes for obviously correct code in tests. That’s super important, because who tests the tests?

Anyway I would be interested in hearing people’s thoughts about this API as well as other ideas of how to accomplish these goals!

I don’t know. I’m not too worried, but it’s something to watch for during implementation. But first to design it.

elm-test’s philosophy is that passing tests are uninteresting, beyond how many ran. One objection is that seeing the printout of test names is a useful form of documenting what the code under test does, and makes missing tests more visible. (See: RSpec’s documentation format.) Another is that now we’d have these label outputs, which actually aren’t useful for a failing test – their purpose is to give you confidence in the passing tests. One solution is to show the output doing a Test.only. Another is to turn the output into a failure with the statistical test for sufficient coverage. Speaking of which:

If someone else makes a package, we’ll use it.

As for API design, it’s not obvious whether labeling lives on the expectation (as all three of your examples do) or on the fuzzer. Also, I’d really like to see examples of these used in the pipeline style in a test. (And before we finalize anything, let’s write stub implementations and make sure it compiles.)

“Fuzzer Function” Style

label : (a -> String) -> Fuzzer a -> Fuzzer a
cover : (a -> (Float, String)) -> Fuzzer a -> Fuzzer a

classifier xs = if List.isEmpty xs then "empty" else "non-empty"

classifyingFuzzer = Fuzz.list Fuzz.int |> Fuzz.label classifier

myTest =
    fuzz classifyingFuzzer "reversing a list twice" <|
        \aList ->
            aList |> List.reverse |> List.reverse |> Expect.equalLists aList

testThatPrintsExplanation = Fuzz.explain classifyingFuzzer

“Fuzzer Conditional” Style

labelIf : (a -> Bool) -> Fuzzer a -> Fuzzer a
coverIf : (a -> Bool) -> Float -> Fuzzer a -> Fuzzer a

classifyingFuzzer = Fuzz.list Fuzz.int
  |> Fuzz.labelIf List.isEmpty "empty"
  |> Fuzz.labelIf (not << List.isEmpty) "non-empty"

-- or perhaps
classifyingFuzzer = Fuzz.list Fuzz.int
  |> Fuzz.labelIfElse List.isEmpty "empty" "non-empty"

-- myTest is the same

The benefits of doing classification on the fuzzer include that it can be reused for multiple tests, and it keeps the test cases clean.

To see examples of each label, with the fuzzer style it’s one new line. In the expectation style, you’d have to mark the test in some way to indicate you want to see that information, and you’d have to do that for each test. So that’s why I’m leaning towards the fuzzer style right now. Here’s how you would do Hughes’s tree insertion example:

classifiedTrees = Fuzz.pair fuzzElement fuzzTree
 |> Fuzz.labelIfElse (\(x, tree) -> Tree.member x tree) "update" "create"
 |> Fuzz.labelIf elementAtFront "front"
 |> Fuzz.labelIf elementInMiddle "middle"
 |> Fuzz.labelIf elementAtBack "back"

Which might result in

(update, front)  12.4%
(update, middle) 27.3%

This makes me think that maybe we should only have cover… why would you label output without saying what percentage of the time you expect it to be generated? Considering the situations:

  1. when you are working on a case whose coverage is correct.
    You want the coverage tool to stay out of your way.
  2. when you are working on a case whose coverage is incorrect.
    You want to see the incorrect values until they work across all tests.
  3. when you are working on another test using the same one-ring-to-rule-them-all fuzzer.
    You want to see if your changes break another test’s coverage as soon as possible so you can consider the larger effects of your work.
  4. when you are working on a case whose percentages you have not determined yet.
    You are forced to guess… this is maybe the worst of the three, but I think it’s an acceptable test writing strategy to write down a guess and have the testing system check out on it!

As to your API sketches… I wonder if these annotations should live on the test or the fuzzer. (Note I’m concerned with the test, not the expectation. I only grabbed the expectation because each test must have one.)

Say you have two properties for your tree: items are inserted at the right place, and new values always replace old. If you put the labeling in the test somehow, you can indicate that the former only cares about keys, and the latter only cares about values. But is that good?

  • Test
    Pros: specify only, and exactly, what you want.
    Cons: you must re-specify every time (e.g. the value test probably does care about the keys, but you’ve gotta say that again).
  • Fuzzer
    Pros: specify the exact labels once.
    Cons: tests almost certainly will have different requirements based on their semantics. You also have to duplicate fuzzers to get different semantics, avoiding which is a major point of Building on Developers’ Intuitions.

Here’s a meta-property: maybe it’s true that you want a failure to shrink to only one label? That would be another point in favor of locating the labels on the Fuzzer.

As to the rest of it, we have kind of similar looking designs… the thing they seem to come back to is: should labels be allowed to overlap? And a second question: should labels be allowed to have gaps?

My take is: yes, and yes. We can work with both, and it opens up nice possibilities like labelIfElse above!

Sure! You could always give a zero percent minimum coverage to opt-out (although maybe that should be an error?).

Sounds fine. Each fuzzed input maps to a (possibly empty) set of labels. (“Set” implies no ordering, but labelIfElse would allow us to be smart about printing opposite labels.) Any kind of map, composition, or transformation discards the label functions.

Nope, what if you use labelIfElse twice? Then you’ll have two labels on each value. But there are other useful interactions between labels and shrinking. Rarely, shrinking returns multiple values, and we would prune down to one value of each label.

The expectation and the test are closely linked, because the expectation is the return value of the test’s function. So the label has to be part of the expectation (or the return value becomes a tuple). The test is now doing two things: it needs to determine the correctness of the code under test, and to classify the inputs it has been given. Philosophically, classifying inputs fits better with the fuzzer that creates and shrinks them.

Pragmatically, if you write a classifier for each test, you have to go through the “satan reading the Bible” debugging process for each test. It adds verbosity to the tests; we want tests to be as quick and easy to write as possible to encourage people to write them. And often, tests only have one expectation; how would you label the empty and non-empty cases for reversing a list twice?

I’m not sure that’s true. If you have a test like that, you can write a fuzzer with a custom label function and pass it to that test. (Important: you only have to write the label function, not the generator or the shrinker.) But if you have several tests that could reuse labeling, reuse is basically impossible if labeling is done once per test.

I also think that coverage requirements are a property of the values being generated, and the requirements don’t change much from test to test. If each test had its own labeler, you’d have to run coverage checks on each test. If the labeling is on the fuzzer, you have fewer things to check and can also only run the generator and not the test, making the statistical checking faster.

Re only having cover: an important thing to note is that determining whether or not the coverage criteria have been met may require an enormous number of test runs. While that’s probably fine for a CI run, it’s most certainly not fine for TDD-with-watch-mode style development.

Now one solution to that would be to only check coverage numbers in a particular runner mode, but I think we should think a bit about the UX of that.

I even wonder if a really nice way to go about some of these things wouldn’t involve launching a little web app to show graphical visualizations of test result distributions? Or perhaps that’s a step too far. But a fuzzer debug mode that would print out distributions and shrunken examples for each test case could be useful.

Re labelling on fuzzer/expectation: personally, I would strongly advocate for labeling on the expectation, since that’s where, in practice, the case logic occurs. It also cleanly solves issues like what happens with fuzz3 etc.

Why? What about fuzz2 and so on? There’s composition there: would it discard the labels? If not, why? If you want a string and an int, and labels live on the fuzzer, you’d have to make a fuzzer like Fuzz.map2 Tuple.pair Fuzz.string Fuzz.int with new labeling and validation per test. This indicates to me that the labels, at least in the composition case, are better as a test responsibility.

I don’t understand how you got from the first sentence to the second. Could you help me by saying more about the link here? What requires this, philosophically? What philosophy are we talking about here?

That said, I think there are tradeoffs either way. Based on your understanding of the world you say that they should live on the fuzzer. Based on mine, I say they should live on the test. I don’t see anything that prevents either from being successful. I started from the test side because that’s what the Haskell QuickCheck does, but from what I understand they have less control over fuzz input than we do. If that creates some bad assumptions for us, OK! I’ll drop 'em. :slight_smile:

Verifying coverage targets have been met will have to be a special mode for exactly this reason (like seeing examples.) We can warn if coverage is not met during a regular test run, though.

I don’t think that would be very valuable, since whether or not one gets the warning is pure luck, as is whether the warning is a false positive.

Have a look at the talk starting about here:

He justifies why this must be the case. :slight_smile:

Kind of a half-baked idea on implementation. If there’s a sensible way for composition to work, let’s do it.

Philosophically = intuiting single responsibility and cohesion and other OO words for code structure. Contrast with pragmatism, which is a more concrete objection.

When Richard and I did a big overhaul of elm-test, we made a new repo with a silly name, iterated quickly, and released a bunch of major versions. I think it’s time we do that and see what ideas actually pan out. (It would also be fine to stub the statistics part and fail if you hit the unlucky case – and fix it before release, of course.)


Sounds like a plan. I’m happy to take on this work (though it will be a couple weeks before I have anything to show for it, as I’m moving this weekend.) Anything I should know about before getting started?

Fork elm-test from master, which has some changes from the latest release. I’d like to see labelling on both fuzzers and tests/expectations implemented so we can figure out what’s best. And can you give elm-test regulars a commit bit please? And take your time, and good luck with your move!

My brain is pretty mired in the “testing effects” API right now (currently on its ~fourth major overhaul), and I don’t have concentration bandwidth to think deeply about both that and this, but a few thoughts:

  1. I watched this talk some time ago, and I remember thinking that I agreed with the problem statement, but wasn’t totally sold on labeling as the best way to address it. It seemed like the solution space deserved further exploration. Since we’re in no rush, I’d encourage exploring this with an eye towards trying radically different things to see what we can learn!
  2. In particular, it seems worth exploring designs which might address these use cases while also addressing others. An example:

Emphasis mine. I’m not sure if it is or isn’t, but it seems worth exploring!

Right now in Elm we use fuzz testing almost exclusively for regression tests and TDD. We’ve never really designed for the use case of “run this test for a really long time - like, overnight at a minimum - and see what it reveals.” This seems like a good opportunity to explore that.

For example, if we had some notion of “exploratory testing,” where the goal was not to detect regressions (so they wouldn’t be run on CI on every branch or PR) but rather to reveal deep edge cases, what would that look like? Supposing we had that on the table, and we revisited the problems mentioned by Hughes and by @ianmackenzie and so on, what new parts of the solution space does that open up?

Anyway, just some things to think about!

I would love for elm-test to support guided coverage-based fuzzing. The technique can find a lot of really weird/buggy inputs, especially crashers in handwritten parsers! I’ve had good luck with doing exactly that in some other open source endeavors!

Because of that experience, I have a lot of thoughts here which basically come down to this: guided coverage fuzzing is amazing for finding the state space your program actually covers, while labels are great for expressing what you intend to cover. I think both are useful!

Imagine this: you work with labels to make sure that your inputs are shaped roughly like you want, and then turn on an exploratory mode which takes and combines labeled examples in order to find really large or weird edge cases. Part of guided coverage fuzzing is starting off with a good corpus: for example, if you’re writing a PNG parser you probably want a lot of your examples to start with the PNG header: 137 80 78 71 13 10 26 10. But, you don’t want all of them to do that or you’ll miss branches! Working with labeling functions makes this initial corpus generation a lot nicer. Plus, you can now use the fuzzers in your TDD-style tests.
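A labeled fuzzer could encode exactly that bias: for example, a byte-list fuzzer that usually, but not always, starts with the PNG header (a sketch; the weights and the name pngishBytes are illustrative assumptions):

```elm
-- Three quarters of generated inputs start with the PNG magic bytes;
-- the rest are arbitrary, so header-handling branches still get exercised.
pngishBytes : Fuzzer (List Int)
pngishBytes =
    Fuzz.frequency
        [ ( 3
          , Fuzz.list (Fuzz.intRange 0 255)
                |> Fuzz.map (\rest -> [ 137, 80, 78, 71, 13, 10, 26, 10 ] ++ rest)
          )
        , ( 1, Fuzz.list (Fuzz.intRange 0 255) )
        ]
```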

As to your general point, Richard, I’ll keep an eye out for smashing these things together to make something even better. :stuck_out_tongue: