When writing a decoder for a JSON model that you are reverse engineering (that is, one you did not design yourself and have no JSON schema for), you typically start from some example JSON files. There may be a large number of them, or they may be very long. So how do you make sure the model you are decoding into captures all of the fields? Manually working through every example JSON you have to work with is potentially a long and tedious job.
At the moment, I am looking at the AWS service definitions. I wrote an initial decoder based on the contents of one of them, but then wondered: does that decoder capture all of the fields in all of them? It didn’t, and I had to add at least 20 more fields to complete the job.
In order to find out, I need some kind of JSON diffing tool. It is no good diffing against just my one hand-written model, as that model is incomplete; I need a tool generic enough to compare any JSON. So I wrote one based on a generic JSON decoder.
Here is what I came up with as a first pass at this:
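The heart of it is a decoder that turns any JSON document into a single recursive value type, which the diff can then walk structurally. Roughly, the decoder side looks like this sketch; the module layout and constructor names here are illustrative rather than the exact ones in my code:

```elm
import Dict exposing (Dict)
import Json.Decode as Decode exposing (Decoder)


-- A generic JSON value: any JSON document can be decoded into this one type.
type Json
    = JString String
    | JBool Bool
    | JNumber Float
    | JNull
    | JArray (List Json)
    | JObject (Dict String Json)


-- Decoder from arbitrary JSON into the generic type. Decode.lazy is needed
-- because the decoder refers to itself for arrays and objects.
json : Decoder Json
json =
    Decode.oneOf
        [ Decode.map JString Decode.string
        , Decode.map JBool Decode.bool
        , Decode.map JNumber Decode.float
        , Decode.null JNull
        , Decode.map JArray (Decode.list (Decode.lazy (\_ -> json)))
        , Decode.map JObject (Decode.dict (Decode.lazy (\_ -> json)))
        ]
```

Given two `Json` values decoded this way, the diff is then a structural comparison of them, reporting where they differ.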
And here is the code that compares my hand-built decoder with the actual contents of one of the JSON files:
```elm
let
    example =
        Codec.decodeString AWSService.awsServiceCodec val
in
case example of
    Ok service ->
        let
            -- The original JSON file, decoded with the generic decoder.
            original =
                Decode.decodeString Generic.json val

            -- The same JSON round-tripped through the hand-written codec,
            -- then decoded with the generic decoder.
            parsed =
                Decode.decodeString Generic.json
                    (Codec.encodeToString 0 AWSService.awsServiceCodec service)

            diffs =
                case ( original, parsed ) of
                    ( Ok jsonl, Ok jsonr ) ->
                        Diff.diff jsonl jsonr |> Diff.diffsToString |> logIfVal "Diffs"

                    ( _, _ ) ->
                        "Failed to generic decode" |> Debug.log "Error"
        in
        diffs

    Err _ ->
        "Failed to decode with the hand-written codec" |> Debug.log "Error"
```
In particular this bit:
```elm
parsed =
    Decode.decodeString Generic.json
        (Codec.encodeToString 0 AWSService.awsServiceCodec service)
```
This takes the JSON, decodes it with my hand-written codec, re-encodes it back to a JSON string, and then decodes that string with the generic decoder. The resulting generic JSON value contains only the fields that my hand-written codec picked out. Diffing it against the original file, decoded into the same generic representation, reveals which fields I missed.
This then leads me to think of a way of using this to automatically infer models and decoders from JSON: start with an empty model, diff it against the actual JSON, add the missing fields, and repeat until there are no more diffs. The process extends naturally to multiple JSON inputs, iterating until a decoder is constructed that is rich enough to handle all of them.
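As a deliberately tiny sketch of what that loop could look like, suppose the “model” is nothing more than the set of field paths the decoder knows about. None of this code exists in my package; the names `Model`, `paths` and `refine` are made up here, and a real implementation would build an actual decoder or codec rather than a set of paths, but the fixpoint structure would be the same:

```elm
import Json.Decode as Decode exposing (Value)
import Set exposing (Set)


-- A deliberately simplified "model": just the set of field paths covered.
type alias Model =
    Set (List String)


-- Every field path present in a JSON value, prefixed by the path so far.
paths : List String -> Value -> Set (List String)
paths prefix value =
    case Decode.decodeValue (Decode.keyValuePairs Decode.value) value of
        Ok fields ->
            List.foldl
                (\( key, child ) acc ->
                    Set.insert (prefix ++ [ key ]) acc
                        |> Set.union (paths (prefix ++ [ key ]) child)
                )
                Set.empty
                fields

        Err _ ->
            -- Not an object with named fields; nothing to record at this level.
            Set.empty


-- Diff the model against the examples, add any missing fields, and repeat
-- until there is nothing left to add.
refine : List Value -> Model -> Model
refine examples model =
    let
        missing =
            List.foldl
                (\example acc -> Set.union acc (Set.diff (paths [] example) model))
                Set.empty
                examples
    in
    if Set.isEmpty missing then
        model

    else
        refine examples (Set.union model missing)
```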
Tricky bits in writing an algorithm to do this are:
- Discriminator tags. Some JSON models have special fields, or even groups of fields, that discriminate between different structures. For example, in the AWS JSON I am using here, the “shape” model has a “type” field which can be “int”, “bool”, “string” and so on. When it is “int”, there can be max and min fields giving upper and lower bounds on the allowed values. Automatically recognising discriminator tags sounds possible based on some heuristics and stats. (A decoder for this case is sketched after the list.)
- Dict structures. Sometimes JSON contains non-fixed-name fields in records (example below). Here the names “AddTagsToCertificate” and “Whatever” are not really fields of a record; they are keys in a `Dict`, because they do not represent a fixed set of field names but things that can vary from one JSON instance of the same model to another. Again, some kind of heuristics might help here. (This case is also covered in the sketch after the list.)

```json
"operations": {
    "AddTagsToCertificate": { ... },
    "Whatever": { ... }
}
```
Heuristics, plus some way of giving explicit instructions for the cases where the heuristics don’t work out, would be needed.
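To make those two cases concrete, here is roughly what hand-written Elm decoders for them look like. This is a simplified sketch, not the real AWSService code: the `Shape` type and its fields are cut down for illustration, the discriminator values follow the “int”/“bool”/“string” examples above, and `operationsDecoder` leaves each operation as a raw value rather than decoding it further.

```elm
import Dict exposing (Dict)
import Json.Decode as Decode exposing (Decoder)


-- Discriminator tags: a cut-down "shape", where the "type" field selects
-- the variant and determines which other fields may be present.
type Shape
    = IntShape { min : Maybe Int, max : Maybe Int }
    | BoolShape
    | StringShape


shapeDecoder : Decoder Shape
shapeDecoder =
    Decode.field "type" Decode.string
        |> Decode.andThen
            (\tag ->
                case tag of
                    "int" ->
                        Decode.map2 (\lo hi -> IntShape { min = lo, max = hi })
                            (Decode.maybe (Decode.field "min" Decode.int))
                            (Decode.maybe (Decode.field "max" Decode.int))

                    "bool" ->
                        Decode.succeed BoolShape

                    "string" ->
                        Decode.succeed StringShape

                    _ ->
                        Decode.fail ("Unknown shape type: " ++ tag)
            )


-- Dict structures: the keys under "operations" are not fixed field names,
-- so decode them into a Dict rather than a record.
operationsDecoder : Decoder (Dict String Decode.Value)
operationsDecoder =
    Decode.field "operations" (Decode.dict Decode.value)
```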
A last thought: such an algorithm sounds closely related to unification, or is a variant of it. Unification takes two structures containing variables and finds a variable binding that makes the two structures equal; it is used, amongst other things, in the type checking and inference algorithms in Elm.
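For the flavour of it, here is a toy unification sketch over a small term type, unrelated to the JSON code above. It is naive: there is no occurs check and no proper substitution application, both of which a real unifier (such as the one inside a type checker) would need.

```elm
import Dict exposing (Dict)


-- Terms are variables or constructors applied to argument terms.
type Term
    = Var String
    | Node String (List Term)


type alias Subst =
    Dict String Term


-- Find a substitution making the two terms equal, if one exists.
unify : Term -> Term -> Subst -> Maybe Subst
unify left right subst =
    case ( left, right ) of
        ( Var name, term ) ->
            bind name term subst

        ( term, Var name ) ->
            bind name term subst

        ( Node f argsF, Node g argsG ) ->
            if f == g && List.length argsF == List.length argsG then
                List.foldl
                    (\( a, b ) acc -> Maybe.andThen (unify a b) acc)
                    (Just subst)
                    (List.map2 Tuple.pair argsF argsG)

            else
                Nothing


-- Bind a variable, or unify with its existing binding if it already has one.
bind : String -> Term -> Subst -> Maybe Subst
bind name term subst =
    case Dict.get name subst of
        Just existing ->
            unify existing term subst

        Nothing ->
            Just (Dict.insert name term subst)
```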