Idea: add statistical labeling to elm-test

  • the ability to make explicit assumptions about fuzzer inputs (and report that a certain number of cases have been skipped because of an assumption)

Filtering fuzzed values is a very bad idea. Any filter has the potential to filter out all the values, or even enough of them that your test runs very slowly. We’ve considered workarounds but it’s been pretty ugly, and we removed them. This is true of other fuzz libraries in other languages I’ve tried that offer a filter – sooner or later, you run in to trouble.

  • the ability to label test cases so as to get statistics about the values

Sounds reasonable, at least in isolation. You could pass an a -> String alongside the Fuzzer a into the test.

  • the ability to assert coverage statistics

This is really just codifying what distribution of labels are acceptable. I think the progression is actually add labels => assert distribution => filter fuzzed values to make the distribution acceptable. And again, that last part isn’t viable. So maybe we should rethink the seemingly innocuous labeling feature if it’s building towards something we can’t have?

Ian’s bounding box test is a pretty weak test. intersection box1 box2 = Nothing would cause it to pass. While seeing the distribution of expectations would tell you something isn’t right, it wouldn’t tell you how to fix it. Most tests invoke every expectation every run (either there’s one expectation or there’s an Expect.all). Putting expectations in branches almost inherently means that you don’t have your test value nailed down enough.

The elm-test README offers this piece of advice (courtesy of yours truly):

If you find yourself inspecting the fuzzed input and making different expectations based on it, split each code path into its own test with a fuzzer that makes only the right kind of values.

So to use Ian’s bounding box example, don’t fuzz two bounding boxes independently. Fuzz a pair of disjoint boxes and ensure there’s no intersection, and fuzz a pair of intersecting boxes and ensure the intersection is what you expect. That’s two separate fuzz tests, so you know that you’re testing 100 of each.

Of course, this does mean that you’re putting more work into the fuzzers, but I think that pays off. The solution to “how do I filter a fuzzer” is to only generate the values you want in the first place. (And yes, fuzzers can get as complicated as you like. There’s no Fuzz.andThen, but there’s Random.andThen, and if you’re constructing something that complicated you should write your own shrinker anyway.)

Oh by the way, there’s a lot of breaking changes on master with no set release date (including renaming Shrinker to Simplifier and making them much easier to write). So if we do want to add a feature, now is the time!

3 Likes

Thanks for your thoughts, Max! Addressing a few things:

This seems pretty central to your feedback, so I’ll get it out of the way up front: this is not what I’m suggesting!

filter-like functions do what you say, but assume-like functions elsewhere do not. Instead of sending values back to the fuzzer, assume-like functions skip the test run if the assumption is bad. It works more like a contract than a post-hoc shrinker. But you are right to point out that among the three pieces of proposed functionality, assume is the weakest. Used in isolation, your tests will be worse! OK! Let’s drop it then. :smile:

I further concede your point about filter: it usually causes more trouble that it’s worth and I tend to avoid it too. Fortunately the talk in question here—and this proposal following—actually give us tools to avoid filtering! Lemme 'splain:

Quite right! Test writers in general may need filter less if we/they could see and formalize value distributions. For example, in the talk he goes from the equivalent of Fuzz.int to the equivalent of this:

Fuzz.oneOf
    [ Fuzz.constant 0
    , Fuzz.intRange 0 maxValue
    , Fuzz.constant maxValue
    ]

This is a sneaky difference in our case, and not an optimization you would ever make if you did not measure the distribution. The important part here is that we are changing the shape of the distribution at the level of the generator, not at the level of the test. This lets us avoid using filter at all!

For at least this reason, I think labeling is independently valuable. I hope we can agree on this part at least. :slight_smile:

I don’t think this is completely fair. Round-trip tests for decoders have similar flaws: if your property is \x -> decode (encode x) == x then deocde = identity and encode = identity trivially pass, but are also incorrect. But that does not stop round-trip tests from being valuable or useful! Further, Ian has clearly put a lot of thought and work into this test suite, and it’s working well. elm-geometry is really solid!

Plus, his approach here lines up with Hughes’. Even though we have this advice to avoid branching on fuzz input, that’s exactly what he does and recommends. It’s in this segment of the talk, three minutes or so. Please watch at least that, as he explains the problems with multiple fuzzers pretty clearly! This recommendation was really surprising to me, as I’ve done the discrete-fuzzer-per-property dance many times myself and this contradicts it.

Maybe it’s time to reexamine our advice on this matter? I don’t want to turn this into “John Hughes said so” but I found this idea compelling and his background in industrializing and teaching these methods makes me think that there’s maybe something to them.

So, why do we have tell people to have a discrete fuzzer per property? Is it partly because we do not have a way to measure our case coverage now? Have we found easier to learn this way? In particular you mentioned wanting to nail test values down—why is that valuable? Isn’t part of property testing asserting that the property always holds across disparate input?

Good to know, thanks! I’m excited about the stuff coming up too!

So, I watched the talk (like I should have before replying!) and I’m a lot more sold. (Hughes knows how to present – I felt like I was thinking what he was going to say right before he said it.)

I think there’s a dependency chain here (not a tree, thankfully, just a chain):

  1. The ability to label values generated by the fuzzer
    1b. The ability to see the distribution of labels on a test
  2. The ability to see shrunken/simplified examples of each label, to qualify your label function
  3. The ability to encode advisory coverage requirements for each label
  4. The ability to enforce mandatory coverage requirements with “military grade statistics”

1 isn’t useful without 1b, hence the numbering. The part of the talk describing 2 was super fun and quite convincing. I’m thinking that invoking the label explanation functionality would skip all tests and print the explanation, and then have a “yellow” test run similar to Test.only so you don’t commit it.

Even if we don’t get to 3 and 4 immediately (4 being the severe implementation challenge), it’s worth thinking about the API. (A quick glance finds some union types that might need extra cases.) The obvious way to write a labeler an a -> String function with an if or case ... of expression, but that can’t tell you how many different cases to expect. Using the style of adding one case at a time is essential to 3 and 4, but also helpful for 2. One downside of this style is that the labelers might not be disjoint or complete; presumably the most recent one wins, and an unclassified value shows up in its own category, and fails 4?

Fuzz.intRange tests the boundaries too. (Perhaps 10% for each boundary is too low, if Hughes recommends 1/3?)

Yes, it’s pretty clear that writing fuzzers for each case of overlapping rectangles is going to balloon exactly as he’s describing. Opinion changed!

If you’re transforming between two different types, that won’t compile, but I digress.

It is, I’ve used it!

3 Likes

One small point here is that labelling happens not necessarily on the fuzzer output, but more inside the test itself. So for example Ian’s test, one would naturally label both branches:

intersectionIsValidOrNothing =
    Test.fuzz2 Fuzz.boundingBox2d
        Fuzz.boundingBox2d
        "intersection of two boxes is either Nothing or Just a valid box"
        (\first second ->
            case BoundingBox2d.intersection first second of
                Nothing ->
                    Expect.pass 
                        |> Expect.label "Not intersecting"

                Just result ->
                    Expect.validBoundingBox2d result 
                        |> Expect.label "Intersecting"
        )

Thinking about this stuff a bit more (especially about #4), it reminds me a bit of AFL style fuzzing where one could use coverage information and generate exactly as many test cases to make sure all branches where hit. This seems like a manual version of that.

I think you’ve got it in one! In particular, I see a couple of implementation challenges hiding here:

  1. what does a nice Elmy API for this look like?
  2. what does the output look like? Obviously we want it to be sufficient verbose to be usable, but running the whole suite might result in logspam if each test is annotated. (One idea: export structured output that an editor or external tool could use, but to be really useful we’d also probably want to print include source location information which would be another challenge on top of this!)
  3. the statistical test for sufficient coverage, as you mentioned. I think this is work worth doing, though, as I’m pretty sure the same family of statistical tests could be used to make elm-explorations/benchmark operate more confidently with fewer test runs and maybe get rid of the JIT warmup period.

Also, a question for you @mgold… how would you anticipate something like this adding to the maintenance burden of elm-test? I think it’d be useful to have, but too many useful-to-have features can really hurt in the long term.

API Draft

Anyway, it sounds like we are more-or-less on board with the ideas here! Maybe it’s time to start sketching out an API. I’ll give these rough names just so that we can talk about them more easily without saying “the first idea, the second idea”, et cetera.

“Inline” Style

This is really similar to what @gampleman posted, but with an explicit Bool in front.

label : Bool -> String -> Expectation -> Expectation
cover : Float -> Bool -> String -> Expectation -> Expectation

(or it could vary label : String -> …, labelIf : Bool -> String -> … but I would rather minimize the surface area here.)

benefits here:

  • you can label one input multiple times. This would have been useful in Hughes’ example of the dictionary. One label for “at front”, another for “update”. I assume reporting would be based on sets of labels, not per label. (e.g. “at front, update” with one percentage/example, instead of a separate percentage/example for each.)
  • information about labels—and therefore the cases the test is concerned with—stays in the test
  • there is only one way to do it. Labels will be consistent throughout the codebase.

and drawbacks:

  • one might not fully cover the inputs with labels. I have to assume this would produce a special warning (i.e. if there’s any label there must be complete coverage), but what if we could just avoid that?
  • it might actually not be good to have multiple labels per input! I’m not sure yet but it seems like you’d want to be really precise about this, and your test would be worse if you were not able to.

“Function” Style

We might take a cue from the classify function shown earlier and build an API around that:

label : (a -> String) -> a -> Expectation -> Expectation
cover : (a -> (Float, String)) -> a -> Expectation -> Expectation

benefits:

  • functions have to be total (or crash, I guess) so now you cannot have any gaps in coverage

drawbacks:

  • a classification function suggests having a separate top-level definition. I’m not sure that’s a good thing, as now the assumptions about the test input are stored outside the test. You could define it inline, of course, but it’d be nice if this could suggest that people do the right thing.
  • you could easily do this with the first API by saying label True (classify foo). Maybe that makes this too strict, or maybe it makes it just strict enough. I don’t know!

“Minimal” Style

What if we got rid of as many arguments as possible to maximize flexibility in calling?

label : String -> Expectation -> Expectation
cover : Float -> String -> Expectation -> Expectation

Benefits:

  • using in a case statement or conditional (as in @gampleman’s example above) is extremely straightforward
  • you can use if, case, or a function call in the expression that produces the String. You can be as complete or sloppy as you want.
  • if you do use a function, this has no opinions on the signature

Drawbacks:

  • it is not obvious how to vary output (function call). That may be hard to learn.
  • expecting people to do whatever suits them best to vary the string may result in inconsistency across a codebase.

Considerations With Both

  • it is a bit awkward to annotate the expectation like this. The assumption comes after the use!
  • do these functions live in Expect because the annotate expectations, or Fuzzer because they work with fuzzers?
  • this API cannot be separated from a test case. In the examples case, this means doing the full work of the test and discarding it just to show examples. Is that OK? Maybe?

In the end, I think if we were going to go with one of these three I would most interested in choosing “inline”—it’s big on consistency, which makes for obviously correct code in tests. That’s super important, because who tests the tests?

Anyway I would be interested in hearing people’s thoughts about this API as well as other ideas of how to accomplish these goals!

I don’t know. I’m not too worried, but it’s something to watch for during implementation. But first to design it.

elm-test’s philosophy is that passing tests are uninteresting, beyond how many ran. One objection is that seeing the printout of test names is a useful form of documenting what the code under test does, and makes missing tests more visible. (See: RSpec’s documentation format.) Another is that now we’d have these label outputs, which actually aren’t useful for a failing test – their purpose is to give you confidence in the passing tests. One solution is to show the output doing a Test.only. Another is to turn the output into a failure with the statistical test for sufficient coverage. Speaking of which:

If someone else makes a package, we’ll use it.

As for API design, it’s not obvious whether labeling lives on the expectation (as all three of your examples do) or on the fuzzer. Also, I’d really like to see examples of these used in the pipeline style in a test. (And before we finalize anything, let’s write stub implementations and make sure it compiles.)

“Fuzzer Function” Style

label : (a -> String) -> Fuzzer a -> Fuzzer a
cover : (a -> (Float, String)) -> Fuzzer a -> Fuzzer a

classifier xs = if List.isEmpty xs then "empty" else "non-empty"
classifyingFuzzer = Fuzz.list Fuzz.int |> Fuzz.label classifier
myTest = fuzz classifyingFuzzer "reversing a list twice" \aList ->
 aList |> List.reverse |> List.reverse |> Expect.equalLists aList

testThatPrintsExplanation = Fuzz.explain classifyingFuzzer

“Fuzzer Conditional” Style

labelIf : (a -> Boolean) -> Fuzzer a -> Fuzzer a
coverIf : (a -> Boolean) -> Float -> Fuzzer a -> Fuzzer a

classifyingFuzzer = Fuzz.list Fuzz.int
  |> Fuzz.labelIf List.isEmpty "empty"
  |> Fuzz.labelIf (not<<List.isEmpty) "non-empty"

-- or perhaps
classifyingFuzzer = Fuzz.list Fuzz.int
  |> Fuzz.labelIfElse List.isEmpty "empty" "non-empty"

-- myTest is the same

The benefits of doing classification on the fuzzer include that it can be reused for multiple tests, and it keeps the test cases clean.

To see examples of each label, with a fuzz style it’s one new line. In the expectation style, you’d have to mark the test in some way to indicate you want to see that information. You’d have to do that for each test. So that’s why I’m leaning towards the fuzzer style right now. Here’s how you would do Hughs’s tree insertion example:

classifiedTrees = Fuzz.pair fuzzElement fuzzTree
 |> Fuzz.labelIfElse (\(x, tree) -> Tree.member x tree) "update" "create"
 |> Fuzz.labelIf elementAtFront "front"
 |> Fuzz.labelIf elementInMiddle "middle"
 |> Fuzz.labelIf elementAtBack "back"

Which might result in

(update, front)  12.4%
(update, middle) 27.3%
...etc

This makes me think that maybe we should only have cover… why would you label output without saying what percentage you expect it to be generated? Considering the situations:

  1. when you are working on a case whose coverage is correct.
    You want the coverage tool to stay out of your way.
  2. when you are working on a case whose coverage is incorrect.
    You want to see the incorrect values until they work across all tests.
  3. when you are working on another test using the same one-ring-to-rule-them-all fuzzer.
    You want to see if your changes break another test’s coverage as soon as possible so you can consider the larger effects of your work.
  4. when you are working on a case whose percentages you have not determined yet.
    You are forced to guess… this is maybe the worst of the three, but I think it’s an acceptable test writing strategy to write down a guess and have the testing system check out on it!

As to your API sketches… I wonder if these annotations should live on the test or the fuzzer. (Note I’m concerned with the test, not the expectation. I only grabbed the expectation because each test must have one.)

Say you have two properties for your tree: items are inserted at the right place, and new values always replace old. If you put the labeling in the test somehow, you can indicate that the former only cares about keys, and the latter only cares about values. But is that good?

Location Pros Cons
Test Specify only, and exactly what you want. Re-specify every time (e.g. the value test probably does care about the keys, but you’ve gotta say that again.)
Fuzzer Specify the exact labels once. Tests almost certainly will have different requirements based on their semantics. You also have to duplicate fuzzers to get different semantics, avoiding which is a major point of Building on Developers Intuitions.

Here’s a meta-property: maybe it’s true that you want a failure to shrink to only one label? That would be another point in favor of locating the labels on the Fuzzer.

As to the rest of it, we have kind of similar looking designs… the thing they seem to come back to is: should labels be allowed to overlap? And a second question: should labels be allowed to have gaps?

My take is: yes, and yes. We can work with both, and it opens up nice possibilities like labelIfElse above!

Sure! You could always give a zero percent minimum coverage to opt-out (although maybe that should be an error?).

Sounds fine. Each fuzzed input maps to a (possibly empty) set of labels. (“Set” implies no ordering, but labelIfElse would allow us to be smart about printing opposite labels.) Any kind of map, composition, or transformation discards the label functions.

Nope, what if you use labelIfElse twice? Then you’ll have two labels on each value. But there are other useful interactions between labels and shrinking. Rarely, shrinking returns multiple values, and we would prune down to one value of each label.

The expectation and the test are closely linked, because the expectation is the return value of the test’s function. So the label has to be part of the expectation (or the return value becomes a tuple). The test is now doing two things: it needs to determine the correctness of the code under test, and classify the inputs it have been given. Philosophically, classifying inputs fits better with the fuzzer that creates and shrinks them.

Pragmatically, if you write a classifier for each test, you have to go through the “satan reading the Bible” debugging process for each test. It adds verbosity to the tests; we want tests to be as quick and easy to write as possible to encourage people to write them. And often, tests only have one expectation; how would you label the empty and non-empty cases for reversing a list twice?

I’m not sure that’s true. If you have a test like that, you can write a fuzzer with a custom label function and pass it to that test. (Important: you only have the write the label function, not the generator nor the shrinker.) But if you have several tests that could reuse labeling, it’s basically impossible if labeling is done once per test.

I also think that coverage requirements are a property of the values being generated, and the requirements don’t change much from test to test. If each test had its own labeler, you’d have to run coverage checks on each test. If the labeling is on the fuzzer, you have fewer things to check and can also only run the generator and not the test, making the statistical checking faster.

Re only having cover: an important thing to note is to determine whether or not the coverage criteria have been met may require running an enormous number of test runs. While that’s probably fine for a CI run, it’s most certainly not fine for TDD with watch mode style development.

Now one solution to that would be to only check coverage numbers in a particular runner mode, but I think we should think a bit about the UX of that.

I even wonder if a really nice way to go about some of these things wouldn’t involve launching a little web app to show graphical visualizations of test result distributions? Or perhaps that’s a step too far. But potentially have some fuzzer debug mode that would printout distributions and shrunk examples for each test case could be useful.


Re labelling on fuzzer/expectation: Personally I would strongly advocate for labeling on expectation, since that’s where practically case-logic occurs. It also cleanly solves issues like what happens with fuzz3 etc.

Why? What about fuzz2 and so on? There’s composition there, would it discard the labels? If no, why? If you want a string and an int, and labels live on the fuzzer, you’d have to make a fuzzer like Fuzz.map2 Tuple.pair Fuzz.int Fuzz.string with new labeling and validation per test. This indicates to me that the labels, at least in the composition case, are better as a test responsibility.

I don’t understand how you got from the first sentence to the second. Could you help me by saying more about the link here? What requires this, philosophically? What philosophy are we talking about here?

That said, I think there are tradeoffs either way. Based on your understanding of the world you say that they should live on the fuzzer. Based on mine, I say they should live on the test. I don’t see anything that prevents either from being successful. I started from the test side because that’s what the Haskell QuickCheck does, but from what I understand they have less control over fuzz input than we do. If that creates some bad assumptions for us, OK! I’ll drop 'em. :slight_smile:

Verifying coverage targets have been met will have to be a special mode for exactly this reason (like seeing examples.) We can warn if coverage is not met during a regular test run, though.

Don’t think that would be very valuable, since whether or not one gets the warning is pure luck as is whether the warning is a false positive.

Have a look at the talk starting about here: https://youtu.be/NcJOiQlzlXQ?t=2402

He justifies why this must be the case. :slight_smile:

Kind of a half-baked idea on implementation. If there’s a sensible way for composition to work, let’s do it.

Philosophically = intuiting single responsibility and cohesion and other OO words for code structure. Contrast with pragmatism, with is a more concrete objection.

When Richard and I did a big overhaul of elm-test, we made a new repo with a silly name, iterated quickly, and released a bunch of major versions. I think it’s time we do that and see what ideas actually pan out. (It would also be fine to stub the statistics part and fail if you hit the unlucky case – and fix it before release, of course.)

1 Like

Sounds like a plan. I’m happy to take on this work (though it will be a couple weeks before I have anything to show for it, as I’m moving this weekend.) Anything I should know about before getting started?

Fork elm-test from master, which has some changes from the latest release. I’d like to see labelling on both fuzzers and tests/expectations implemented so we can figure out what’s best. And can you give elm-test regulars a commit bit please? And take your time, and good luck with your move!

My brain is pretty mired in the “testing effects” API right now (currently on its ~fourth major overhaul), and I don’t have concentration bandwidth to think deeply about both that and this, but a few thoughts:

  1. I watched this talk some time ago, and I remember thinking that I agreed with the problem statement, but wasn’t totally sold on labeling as the best way to address it. It seemed like the solution space deserved further exploration. Since we’re in no rush, I’d encourage exploring this with an eye towards trying radically different things to see what we can learn!
  2. In particular, it seems worth exploring designs which might address these use cases while also addressing others. An example:

Emphasis mine. I’m not sure if it is or isn’t, but it seems worth exploring!

Right now in Elm we use fuzz testing almost exclusively for regression tests and TDD. We’ve never really designed for the use case of “run this test for a really long time - like, overnight at a minimum - and see what it reveals.” This seems like a good opportunity to explore that.

For example, if we had some notion of “exploratory testing,” where the goal was not to detect regresssions (so they wouldn’t be run on CI on every branch or PR) but rather to reveal deep edge cases, what would that look like? Supposing we had that on the table, and we revisited the problems mentioned by Hughes and by @ianmackenzie and so on, what new parts of the solution space does that open up?

Anyway, just some things to think about!

I would love for elm-test to support guided coverage-based fuzzing. The technique can find a lot of really weird/buggy inputs, especially crashers in handwritten parsers! I’ve had good luck with doing exactly that in some other open source endeavors!

Because of that experience, I have a lot of thoughts here which basically come down to this: guided coverage fuzzing is amazing for finding the state space your program actually covers, while labels are great for expressing what you intend to cover. I think both are useful!

Imagine this: you work with labels to make sure that your inputs are shaped roughly like you want, and then turn on an exploratory mode which takes and combines labeled examples in order to find really large or weird edge cases. Part of guided coverage fuzzing is starting off with a good corpus: for example, if you’re writing a PNG parser you probably want a lot of your examples to start with the PNG header: 137 80 78 71 13 10 26 10. But, you don’t want all of them to do that or you’ll miss branches! Working with labeling functions makes this initial corpus generation a lot nicer. Plus, you can now use the fuzzers in your TDD-style tests.

As to your general point, Richard, I’ll keep an eye out for smashing these things together to make something even better. :stuck_out_tongue:

2 Likes

This thread triggered my “coverage + AFL + fuzzing” buzzword bingo sense, so I’ll leave this here in case some folks don’t know about elm-coverage by @ilias :upside_down_face: https://github.com/zwilias/elm-coverage

1 Like

Speaking of AFL, I wonder how feasible it would be to start thinking about this in terms of symbolic execution, i.e. move away (at least partially) from the random aspect of property testing.

I think the advantage here is that Elm is quite a minimalist language, so perhaps this could be a fruitful area of exploration. In particular purity seems to be a huge boon for symbolic execution as does the very limited FFI (i.e. kernel code would have to probably be reimplemented in some logic language to correctly suggest branches, but there is not that much of it).

1 Like

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.