JSON Decoders - did you get all the fields?

When writing a decoder for a JSON model that you are reverse engineering (that is, one you did not design yourself and have no JSON schema for), you typically start from some example JSON. You may have a large number of examples, or they may be very long. So how do you make sure the model you are decoding into captures all of the fields? Manually working through all of the example JSON you have is potentially a long and tedious job.

At the moment, I am looking at the AWS service definitions. I wrote an initial decoder based on the contents of one of them, but then wondered: does that decoder capture all of the fields in all of them? It didn’t, and I had to add at least 20 more fields to complete the job.

To find out, I needed some kind of JSON diffing tool. But a diff against just one hand-written model is no good, as that model is incomplete; I needed one generic enough to compare any JSON. So I wrote one based on a generic JSON decoder.

Here is what I came up with as a first pass at this:
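
In outline, the generic side decodes any JSON into a tree and diffs two such trees. A minimal sketch of the decoder half might look like this (illustrative naming; the real Generic module may differ):

    import Dict exposing (Dict)
    import Json.Decode as Decode exposing (Decoder)

    -- A generic JSON tree that can represent any JSON value.
    type Json
        = JString String
        | JNumber Float
        | JBool Bool
        | JNull
        | JArray (List Json)
        | JObject (Dict String Json)

    -- Decode arbitrary JSON into the generic tree, imposing no model.
    json : Decoder Json
    json =
        Decode.oneOf
            [ Decode.map JString Decode.string
            , Decode.map JNumber Decode.float
            , Decode.map JBool Decode.bool
            , Decode.null JNull
            , Decode.map JArray (Decode.list (Decode.lazy (\_ -> json)))
            , Decode.map JObject (Decode.dict (Decode.lazy (\_ -> json)))
            ]

The Diff module then compares two of these trees, and diffsToString renders the differences for logging.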

And here is the code that compares my hand-built decoder with the actual contents of one of the JSON files:

    let
        example =
            Codec.decodeString AWSService.awsServiceCodec val
    in
    case example of
        Ok service ->
            let
                -- The original JSON, decoded generically.
                original =
                    Decode.decodeString Generic.json val

                -- The JSON round-tripped through the hand-written codec,
                -- then decoded generically. Only fields the codec knows
                -- about survive the round trip.
                parsed =
                    Decode.decodeString Generic.json
                        (Codec.encodeToString 0 AWSService.awsServiceCodec service)
            in
            case ( original, parsed ) of
                ( Ok jsonl, Ok jsonr ) ->
                    Diff.diff jsonl jsonr |> Diff.diffsToString |> logIfVal "Diffs"

                ( _, _ ) ->
                    "Failed to generic decode" |> Debug.log "Error"

        Err _ ->
            "Failed to decode with the hand-written codec" |> Debug.log "Error"

In particular this bit:

    parsed =
        Decode.decodeString Generic.json (Codec.encodeToString 0 AWSService.awsServiceCodec service)

This takes the JSON, decodes it with my hand-written codec, re-encodes the result back to a JSON string, and decodes that string with the generic decoder. The round trip means the generic JSON now contains only the fields that my hand-written decoder picked out.

Diffing that against the original file decoded into the generic JSON reveals which fields I missed.

This then leads me to think of a way of using this to automatically infer models and decoders from JSON: start with an empty model, diff it against the actual JSON, add the missing fields, and repeat until there are no more diffs. The process extends easily to multiple JSON inputs, iterating until a decoder is constructed that is rich enough to handle all of them.
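
One building block for such an algorithm would be extracting the set of field paths from a generic Json tree like the one sketched above; diffing the path sets of the original and the round-tripped JSON then yields the fields to add on each iteration. A possible sketch:

    -- Collect every field path occurring in a generic JSON value.
    -- Array elements are merged rather than indexed, which is enough
    -- for spotting missing record fields.
    fieldPaths : Json -> List (List String)
    fieldPaths value =
        case value of
            JObject fields ->
                Dict.toList fields
                    |> List.concatMap
                        (\( name, sub ) ->
                            [ name ] :: List.map ((::) name) (fieldPaths sub)
                        )

            JArray items ->
                List.concatMap fieldPaths items

            _ ->
                []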

Tricky bits in writing an algorithm to do this are:

  • Discriminator tags. Some JSON models have special fields, or even groups of fields, that discriminate between different structures. For example, in the AWS JSON I am using here, the “shape” model has a “type” field which can be “int”, “bool”, “string”, and so on. When it is “int”, there can be max and min fields giving upper and lower bounds on the allowed values. Automatically recognising discriminator tags sounds possible based on some heuristics and stats (a decoding sketch for this case appears a little further below).

  • Dict structures. Sometimes JSON contains non-fixed-name fields in records (example below, followed by a decoding sketch). Here the keys ‘AddTagsToCertificate’ and ‘Whatever’ are not really fields of a record; they are keys in a Dict, because they do not form a fixed set of field names but vary from one JSON instance of the same model to the next. Again, some kind of heuristic might help here.

"operations": {
    "AddTagsToCertificate": { ... },
    "Whatever": { ... }
}
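
When a structure like this is recognised (or pointed out explicitly), the natural decoding uses Decode.dict rather than a record decoder. A minimal sketch, with Decode.Value standing in for a real operation decoder (imports as in the earlier sketch):

    -- Treat the children of "operations" as Dict keys, not record fields.
    operationsDecoder : Decoder (Dict String Decode.Value)
    operationsDecoder =
        Decode.field "operations" (Decode.dict Decode.value)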

Heuristics, or some way of giving explicit instructions on how to handle these cases when the heuristics don’t work out, would be needed.
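
For the discriminator-tag case in the first bullet, the hand-written equivalent is a Decode.andThen on the tag field, and this is the structure an inference algorithm would have to discover. A sketch based on the “shape” example, with the Shape type and its fields being guesses rather than the real AWS definitions:

    type Shape
        = IntShape { min : Maybe Int, max : Maybe Int }
        | BoolShape
        | StringShape

    shapeDecoder : Decoder Shape
    shapeDecoder =
        -- Read the discriminator first, then decide how to decode the rest.
        Decode.field "type" Decode.string
            |> Decode.andThen
                (\tag ->
                    case tag of
                        "int" ->
                            Decode.map2
                                (\lo hi -> IntShape { min = lo, max = hi })
                                (Decode.maybe (Decode.field "min" Decode.int))
                                (Decode.maybe (Decode.field "max" Decode.int))

                        "bool" ->
                            Decode.succeed BoolShape

                        "string" ->
                            Decode.succeed StringShape

                        other ->
                            Decode.fail ("Unknown shape type: " ++ other)
                )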

A last thought: such an algorithm sounds closely related to unification, or is a variant of it. Unification takes two structures with variables in them and finds a variable binding that makes the two structures equal; it is used, amongst other things, in the type checking and inference algorithms in Elm.
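
To make the connection concrete, here is a toy sketch of unification over a small term type (illustrative only, and skipping the occurs check):

    import Dict exposing (Dict)

    type Term
        = Var String
        | Leaf String
        | Node Term Term

    -- Try to unify two terms, extending the variable binding as we go.
    unify : Term -> Term -> Dict String Term -> Maybe (Dict String Term)
    unify left right binding =
        case ( walk left binding, walk right binding ) of
            ( Var x, t ) ->
                if t == Var x then
                    Just binding

                else
                    Just (Dict.insert x t binding)

            ( t, Var x ) ->
                Just (Dict.insert x t binding)

            ( Leaf a, Leaf b ) ->
                if a == b then
                    Just binding

                else
                    Nothing

            ( Node l1 r1, Node l2 r2 ) ->
                unify l1 l2 binding
                    |> Maybe.andThen (unify r1 r2)

            _ ->
                Nothing

    -- Chase a variable through the binding to its current value.
    walk : Term -> Dict String Term -> Term
    walk term binding =
        case term of
            Var x ->
                case Dict.get x binding of
                    Just bound ->
                        walk bound binding

                    Nothing ->
                        term

            _ ->
                term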


Sorry, bit of a rambling collection of thoughts… just sounding out some ideas to see if it chimes with anyone’s thinking around this. Any links or papers on similar type/structure inference efforts would also be welcome.

There is https://noredink.github.io/json-to-elm/, which has a go at solving this problem, but it is easily caught out by the tricky cases. Its UI also has no mechanism for the user to provide guidance when it doesn’t get things right.

Some more tricky issues in solving this problem:

  • Naming. What if two parts of the model appear to have the same name but are different? Like an “options” field in two different places that denotes an object with a different set of fields in each case? Maybe one should really be EndpointOptions and the other ModelOptions.

  • Identifying the same record structure when it appears in multiple different positions. This is made harder by optional fields: two records may share some fields in common, so at what point do you decide they are actually the same record type? Some kind of measure of the fit amongst a group of records is needed. Another clue might be records appearing together in an array, which kind-of implies they have the same type, though JSON entries in an array are not restricted in this way like they are in Elm.

I’ve had good luck with a combination of diff and a tool called gron: https://github.com/tomnomnom/gron

Basically, it turns a possibly-nested JSON object into a flat list of assignments, to make better use of grep and friends.

e.g.

{ "id": 1, "name": "Foo", "price": 123, "tags": [ "Bar", "Eek" ], "stock": { "warehouse": 300, "retail": 20 } }
becomes

json = {};
json.id = 1;
json.name = "Foo";
json.price = 123;
json.stock = {};
json.stock.retail = 20;
json.stock.warehouse = 300;
json.tags = [];
json.tags[0] = "Bar";
json.tags[1] = "Eek";

Maybe I just like command line tools though 🙂

Nothing wrong with simple solutions, and of course the ability to pipe through other Unix tools is always helpful.

One way this could be used would be to pipe the output through sed to trim off the values coming after the =, like:

json.contact.email = "mail@tomnomnom.com";
json.contact.twitter = "@TomNomNom";
json.github = "https://github.com/tomnomnom/";

to

json.contact.email
json.contact.twitter
json.github

That would allow diffing against another JSON for the same model, but a different instance with different values.

This sed command is a quick attempt at doing that: sed 's/\(.*\)=.*/\1/g'


I had a similar problem recently, so I decided to create a small package for the problem this weekend: eike/json-decode-complete. It allows you to write object decoders that fail if you don’t handle all fields:

import DecodeComplete exposing (..)
import Json.Decode as Decode exposing (Decoder)

type alias User =
    { name : String
    , age : Int
    }

userDecoder : Decoder User
userDecoder =
    object User
        |> required "name" Decode.string
        |> required "age" Decode.int
        |> discard "email"
        |> complete

This decoder will fail if the provided JSON has fields other than name, age and email.

Unfortunately, my approach requires first decoding into a Dict String Decode.Value and then decoding the elements, which degrades the error messages because I have to re-throw them with fail. Maybe there is a better way to implement this?


Like this you mean?

D.fail (D.errorToString err)

I guess it’s a bit ugly to do this, but it still preserves the error message, so I think it is ok.

I think you could possibly work with signatures on your functions like this:

rest : Decoder (Result Error a) -> ObjectDecoder (Dict String a -> b) -> Decoder (Result Error b)

and then locally redefine Decoder:

type alias CompleteDecoder a = Decoder (Result Error a)
rest : CompleteDecoder a -> ObjectDecoder (Dict String a -> b) -> CompleteDecoder b

Would something along these lines work? The idea is that by using Result you have somewhere to keep the Errors without toStringing them. Maybe there is some reason this isn’t going to work out.

Yes, that’s what I mean. The problem is that I first have to decode to Values so I can track which ones have been used, and that you cannot “inject” a Value back into the decoding context, which means that every Error gets turned into a Failure.

My problem with that is that these decoders don’t combine as nicely with regular decoders because you now have to handle Results.

But I just had an idea: It might be possible to just keep track of the decoded field names, but to call field again in the original context. This way, their errors would come up in the original context. I will play around with that some more later (or on the weekend). Thanks for starting to discuss it: that helped!

And now that’s done. The new version just tracks which fields remain unhandled, without also keeping them as Values. When decoding fields, they are taken from the original JSON and their name is removed from the set. This way, no errors need to be re-thrown (and the code gets nicer in a couple of places).

The API didn’t change.
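
In outline, the idea might look something like this (a sketch of the approach just described, not the package’s actual source):

import Json.Decode as Decode exposing (Decoder)
import Set exposing (Set)

-- An ObjectDecoder pairs the value built so far with the set of field
-- names in the object that have not been handled yet.
type ObjectDecoder a
    = ObjectDecoder (Decoder ( Set String, a ))

object : a -> ObjectDecoder a
object constructor =
    ObjectDecoder
        (Decode.keyValuePairs Decode.value
            |> Decode.map
                (\pairs -> ( Set.fromList (List.map Tuple.first pairs), constructor ))
        )

required : String -> Decoder a -> ObjectDecoder (a -> b) -> ObjectDecoder b
required name decoder (ObjectDecoder objDec) =
    ObjectDecoder
        (Decode.map2
            (\( remaining, f ) value -> ( Set.remove name remaining, f value ))
            objDec
            -- Decoded straight from the original JSON, so failures keep
            -- their normal error messages and context.
            (Decode.field name decoder)
        )

discard : String -> ObjectDecoder a -> ObjectDecoder a
discard name (ObjectDecoder objDec) =
    ObjectDecoder
        (Decode.map (\( remaining, a ) -> ( Set.remove name remaining, a )) objDec)

complete : ObjectDecoder a -> Decoder a
complete (ObjectDecoder objDec) =
    objDec
        |> Decode.andThen
            (\( remaining, a ) ->
                if Set.isEmpty remaining then
                    Decode.succeed a

                else
                    Decode.fail
                        ("Unhandled fields: " ++ String.join ", " (Set.toList remaining))
            )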


For reference, I have a similar thingymajob here: https://package.elm-lang.org/packages/zwilias/json-decode-exploration/latest/Json-Decode-Exploration

It offers overlapping functionality, in that it also allows checking whether all values in the JSON were actually read (or marked as read) by the decoder.

Oh, I must be blind. I actually looked at your library before I wrote mine but somehow decided that yours did something else. Had I noticed, I wouldn’t have implemented it.

But maybe that’s okay because in the end, some of our decisions differ: Your library tracks decoding into subfields and elements of arrays (and has cool result types), which mine doesn’t—if you do anything with a field, I consider that field “handled”. (Of course, you can decode a subfield with a tracking decoder again, but things like requiredAt don’t exist.) In exchange, my library returns standard Decoders (so you can use them at ports, for example) and decodes fields with standard Decoders (so you can use decoders provided by other libraries). I think at least the second part would be difficult with your approach.

