Mapping Tagged Union types to other languages?


#1

Hi,

Tagged union types (now called ‘custom types’) are a great feature of Elm. Coming from languages that do not have them, we generally get a nice warm fuzzy aha moment when we first grok them.

I like how they fill out a missing dimension in data modelling, compared with languages that only support records.

  • A record type is a ‘product’; an instance of a record is a member of the set defined by the cross product of the sets defined by each of its fields.

  • A union type is a ‘sum’; an instance of a union type is a member of the set defined by taking the union of the sets defined by each of its constructors. That gives you a new angle on data modelling that is not so well expressed in some other languages and better enables choices to be encoded.

‘sum’ and ‘product’ form a natural pair of operators to work with; sometimes we combine things together into groupings that naturally align together as a record; sometimes we make choices between models to represent different possible beliefs about the state of the world that our application represents. Sum and product also follow similar rules to their mathematical counterparts, with multiplication distributing over addition and so on.

(there are other set operators that can be interesting to consider too - many of them find a natural home in SQL)

Often we need to interact with languages that do not directly support tagged union types. But the concept is so powerful, I am interested to see how they look when mapped to some frequently used languages that we might interact with when writing Elm applications.

Here is a simple example in Elm, that is not parameterized, has no functions in it, and is not recursive. Its from a content model I am working with that is fetched from a back-end:

type Content
  = Text { uuid : String , title : String }
  | Markdown { uuid : String , markdown : String }
  | Figure { uuid : String , markdown : String , title : String , imageId : Maybe String }

#2

JSON

JSON is an obvious one that Elm frequently works with, as it is very commonly used to talk to server side APIs.

For eached constructor, we can use a value from the enum of all the possible constructors. This goes in a discriminator field, I use one called ‘@type’, with the inital @ signifiying it as meta-data that helps describe the choice structure of the data.

{
   '@type': 'Text',
   uuid: '123e4567-e89b-12d3-a456-426655440000',
   title: 'Hello'   
}

JSON-schema for this might look like:

{
   title: 'Content',
   type: object,
   required: ['@type', 'uuid'],
   properties: {
     '@type': { enum: [ 'Text', 'Markdown', 'Figure' ],
     uuid: { type: string },
     title : { type: string },
     markdown : { type: string }, 
     imageId : { type: string }     
   }
}

A weakness of this is that I was only able to mark the ‘uuid’ field as required, since it is common to all the constructors. I did not state that some of the other fields must be present, depending on the constructor type. I also just let ‘imageId’ be nullable with the obvious mapping of that to Maybe.

I could define one schema per constructor:

{
   title: 'Text',
   type: object,
   required: ['uuid', 'title']
   properties: ...
}

and so on. And then in the ‘Content’ schema use oneOf to reference them, something like:

{
   title: 'Content',
   type: object,
   required: ['@type', '@fields'],
   properties: {
     '@type': { enum: [ 'Text', 'Markdown', 'Figure' ],
     '@fields': { oneOf: [
       { '$ref': '#/definitions/Text' },
       { '$ref': '#/definitions/Markdown' },
       { '$ref': '#/definitions/Figure' }
     ]
   }
}

Although I did not do a great job there of ensuring the @type matches the correct set of @fields. Can it be improved?


#3

I’ve posted an example for JSON, please feel free to contribute examples from other languages.

I am currently typing up some notes on typical OO languages defaulting to using Java as an example, and also SQL. Will post them up as I complete them.

Would be interested to see some examples in Elixir, Python, C#, F#, or whatever other languages you use in conjunction with Elm.


#4

Object Oriented Languages.

An OO language like Java, or C#, C++, or perhaps Python but I don’t know it. I’ll give an example in Java, as that is what I do know.

Could do something a bit hacky with using enums to enumerate the possible constructors:

class Content {
   enum Type = {
      Text;
      Markdown;
      Figure;
   }

   class TextFields {
      private String uuid;
      private String title;
   }

   class MarkdownFields {
      private String uuid;
      private String markdown;
   }

   class FigureFields {
      private String uuid;
      private String title;
      private String markdown;
      private Long imageId;   
   }

   private Type type;
   private TextFields text; // May be null
   private MarkdownFields markdown; // May be null
   private FigureFields figure; // May be null
}

The more idiomatic way is with sub-classing and inheritence. This differs with a tagged union type in two ways.

  • The first is that its type is evaluated at runtime with a cast that can fail with a runtime exception. In Elm the compiler ensures all branches must be covered so there can be no runtime exception.

  • The second is that the OO inheritence is open-ended and new members of the union can be added later, even at runtime with dynamic class loading as per Java. In Elm the full type is always known at compile time which enhances the effectiveness of type-checking by the compiler.

This might look something like:

abstract class Content {
   protected String uuid;
}

class Figure extends Content {
   private String title;
   private String markdown;
   private Long imageId;
}

and so on for the other constructors. This is probably the best way to map the concept onto an OO language with inheritance.

To write a case statement on such a union, you can use instanceof:

if (val instanceof Text)
  doA();
else if (val instanceof Markdown)
  doB();
else if (val instanceof Figure)
  doC();

with no exhaustiveness checking. Or make use of virtual methods as a more efficient way of encoding choices:

class Text extends Content { void do() { doA(); } ... }

#5

Union types were a revelation to me when I learned about them in Elm. They’re so incredibly useful :smiley:

I’ve been experimenting with them in JavaScript+Flow recently. To model your example, I’d define a content.js module along these lines:

// @flow

export type Text = {
    type: 'text',
    uuid: string,
    title: string,
};

export type Markdown = {
    type: 'markdown',
    uuid: string,
    markdown: string,
};

export type Figure = {
    type: 'figure',
    uuid: string,
    markdown: string,
    title: string,
    imageId: string | null,
}

export type Content = Text | Markdown | Figure;

type Handlers<T> = {
    text: Text => T,
    markdown: Markdown => T,
    figure: Figure => T,
}

export const match = <T>(content: Content, handlers: Handlers<T>): T => {
    switch (content.type) {
        case 'text':
            return handlers.text(content);
        case 'markdown':
            return handlers.markdown(content);
        case 'figure':
            return handlers.figure(content);
        default:
            (content.type: empty);
            throw new Error(`Unrecognised content type: ${content.type}`);
    }
}

And here’s an example React component that renders Content:

type Props = {
    content: Content,
};

export const Content = ({ content }: Props) =>
    match(content, {
        text: textContent => <Text content={textContent} />,
        markdown: markdownContent => <Markdown content={markdownContent} />,
        figure: figureContent => <Figure content={figureContent} />,
    });

We get all the usual benefits we expect from union types:

  • If we add a new case to the union, we’ll immediately get a type error in the match definition, which will force us to add another case. Adding the new case will give us type errors in all places where match is used, forcing us to handle the new case throughout the codebase.

  • Within each handler, Flow’s type refinement means we can safely access the relevant properties. E.g. in the text handler Flow knows that textContent.title is valid, but would give us an error if we tried to access textContent.imageId.

(And slightly tangential, but Flow also supports opaque types!)

I’m not 100% sure if this is a good pattern to follow in JavaScript though. I’m concerned by the amount of boilerplate, and that it’ll be seen as confusing/non-idiomatic by most JavaScript programmers.

Has anyone else been using these kinds of patterns in JavaScript/Flow/TypeScript? Is there a more concise way of doing it?


#6

Would be interested to see some examples in Elixir, Python, C#, F#, or whatever other languages you use in conjunction with Elm.

I’d also be very interested in this, particularly Elixir. I had a fairly horrible time dealing with encoding and decoding tagged unions between Elm <-> JSON <-> Elixir a couple of weeks ago, and I’m sure there’s a better way if you know what you’re doing with Elixir.

For what it’s worth, my approach was to consider each branch of a tagged union as a tuple, where the first element is an Elixir atom representing the name of the Elm type constructor, and the rest of the elements are the contents.

So:

type Content
  = Text { uuid : String , title : String }
  | Markdown { uuid : String , markdown : String }
  | Figure { uuid : String , markdown : String , title : String , imageId : Maybe String }

Would become something like:

@type content ::
  {:text, %{ uuid : String.t(), title : String.t() } }
  | {:markdown, %{ uuid : String.t(), title : String.t() } }
  | {:figure, %{ uuid : String.t(), markdown : String.t(), title : String.t(), image_id : String.t() | nil } }

But it feels very clunky compared to Elm.


#7

Both TypeScript and Flow have “Discriminated unions”/“Disjoint unions”, and opt-in exhaustiveness checking using never/empty (usually in a switch).

http://www.typescriptlang.org/docs/handbook/advanced-types.html#discriminated-unions


#8

Never used myself but for elixir there is a library called algae https://github.com/expede/algae#sum-builder that represents Custom types as Structs:

defmodule Pet do
  defsum do
    defdata Cat do
      name :: String.t()
      claw_sharpness :: String.t()
    end

    defdata Dog do
      name :: String.t()
      bark_loudness :: non_neg_integer()
    end
  end
end

Will create structs and functions so your data is represented as %Pet.Dog{name: "bud", bark_loudness: 5}


#9

This pattern for supplying a set of Handlers would work in Java too. By implementing Handlers as an anonymous inner class, the code for each option could at least be kept together, unlike using methods on each sub-class - potentially aiding readability.


#10

Yes. Not tried Elixir yet, but the syntax does not seem as elegant as Elm.


#11

Databases

A relational database.

I think the choices here are the same as taking the class model above (in the OO example) and looking at the ways that inheritence is mapped to a SQL database by an Object Relational Mapping tool.

Single-Table

Put all the fields for all the tags in one big table.

create table content (
    uuid char(32) not null,
    title varchar(255),
    markdown text, 
    imageId int8, -- Foreign key on images table
    primary key uuid
);   

Joined-Table

Put the fields for each tag in a separate table, and have a table that brings them together by joining onto them. Only one join can be non-null, and that also acts as the discriminator. If there are common fields amongst all the constructors they might get pulled up into the join table, and I have done that here with the ‘uuid’ field.

create table content (
    uuid char(32) not null,
    textId int8, -- FK on text table.
    markdownId int8, -- FK on markdown table.
    figureId int8, -- FK on figure table
    primary key uuid
);

create table markdown (
    id int8 not null,
    markdown text,
    primary key id
);

and so on...

Table-Per-Class

Do away with the join table, and just have one table for each constructor. This can be inconvenient to work with though, as every other table that reference the type must have one join column for each constructor, and ensure the only one is non-null constraint is upheld.

create table figure (
    uuid char(32) not null,
    title varchar(255),
    markdown text, 
    imageId int8, -- Foreign key on images table
    primary key uuid
);

and so on...

A no-sql database:

Just store the JSON document.

Is there more to do here? For example, do some JSON structures query more easily than others?

Hybrid Approach

A database like Postgres supports storing document models as JSON. So can pull up just the ‘uuid’ field, but store the rest of the model as JSON and use the discriminator field in the JSON.

create table content (
    uuid char(32) not null,
    content json, -- Holds the model as JSON.
    primary key uuid
);

I like the flexibility of Postgres in this regard and have often made use of this approach.

Which is best?

I think a no-sql database can be very convenient as well as performant.

A relational database can be useful when you want to normalize data models to help ensure the quality and consistency of data over time.

It can often be going too far to fully expand something like this into a relational model, so the hybrid approach can have a lot of appeal.

Depends on the application really.


#12

From a F# perspective - it’s similar to Elm (since both are ML based languages).
The big difference is F# doesn’t have in-line type aliases in its Discriminated Union. So you have to define your record types and use them:

type UUID = UUID of string
type TextType = { uuid : UUID; title : string }
type MarkdownType = { uuid : UUID; markdown : string }
type FigureType = { uuid : UUID; markdown : string; title : string; imageId : string option }

type Content
  = Text of TextType
  | Markdown of MarkdownType
  | Figure of FigureType

#13

Does F# have a similar notion to JSON Encoders/Decoders as Elm? Or can it automatically serialize data model to JSON/XML/Whatever?

There is something to be said for having to explicitly name the in-line type aliases if automatic generation is involved, as the names you give can then be used when creating database tables, XML elements, and so on.


#14

As Math

type Content
  = Text { uuid : String , title : String }
  | Markdown { uuid : String , markdown : String }
  | Figure { uuid : String , markdown : String , title : String , imageId : Maybe String }

is equivalent to the expression:

uuid * title + uuid * markdown + uuid * markdown * title * imageId

taking the common factor

=  uuid * (title + markdown + markdown * title * imageId)

Which converted back into Elm gives me an equivalent data model:

type alias Content =
  { uuid : String
  , contentType : ContentType
  }

type ContentType
  = Text { title : String }
  | Markdown { markdown : String }
  | Figure { markdown : String, title : String, imageId : String }

I could also have taken the common factor markdown out of Markdown and Figure, but that seems to be going too far.

I just like that the normal rules of math let you manipulate data models as refactoring rules.

Given that a type can be written out as a sum-of-products by applying the multiplications over all the bracketed sums, it seems likely that it would be possible to prove that any type can always be made only 2 levels deep, as a simple sum-of-products. That is, so long as the type does not recurse. I can apply this ‘theorem’ to my Content type by explicitly multiplying out the Maybe in imageId, like this:

    uuid * title + uuid * markdown + uuid * markdown * title * imageId
  = uuid * title + uuid * markdown + uuid * markdown * title * (int + empty)
  = uuid * title + uuid * markdown + uuid * markdown * title * int + uuid * markdown * title * empty

  and taking empty = 0
  
  = uuid * title + uuid * markdown + uuid * markdown * title * int + uuid * markdown * title

Which suggests that a simpler way of building my Content model might be:

type Content
  = Text { uuid : String , title : String }
  | Markdown { uuid : String , markdown : String }
  | Figure { uuid : String , markdown : String , title : String , imageId : String }
  | Caption { uuid : String , markdown : String , title : String }

#15

F# uses a mechanism called “Type Providers” which allow it to inter-operate with many kinds of structured data. There are type providers for all of the typical things such as JSON, HTML, XML, SQL, CSV, SOAP - or even other languages, like R (so it can inter-operate with it and its libraries).

Essentially you point it at a sample set of data - and the compiler will infer the record types and functions to interact with it. Even before compile time you’ll have access to the data (as if it’s just another library in your code).

To give you an example, here’s how you’d chart data from the World Bank:
http://fsharp.github.io/FSharp.Data/library/WorldBank.html

It’s not a perfect system - it’s only as good as your sample data. So curved balls in real-world data can cause issues.

Here’s the research paper it’s based on:


#16

Rust is pretty much the same as Elm

enum Content {
      NoContent,
      Text { uuid: String, title: String },
      Markdown { uuid : String , markdown : String },
      Other(String),
}

There are a few ways that these can be serialized/deserialized automatically, see https://serde.rs/enum-representations.html


#17

Thanks for all the examples.

Anyone know Python? I think Python perhaps works with JSON directly, so its just the same as for JSON?


#18

In a current project, I have the following in Elm:

type TemporalAspect 
    = NowAspect
    | AtAspect Date
    | BetweenAspect Date Date

In the backend, written in C#, I do something like the following (similar to your Java example):

public enum TemporalAspectType
{
    Now,
    At,
    Between
}

public abstract class TemporalAspect
{
    public abstract TemporalAspectType Type { get; }
}

public class NowAspect : TemporalAspect
{
    public override TemporalAspectType Type { get => TemporalAspectType.Now; }
}

public class AtAspect : TemporalAspect
{
    public override TemporalAspectType Type { get => TemporalAspectType.At; }
    public DateTime At { get; set; }
}

public class BetweenAspect : TemporalAspect
{
    public override TemporalAspectType Type { get => TemporalAspectType.Between; }
    public DateTime From { get; set; }
    public DateTime To { get; set; }
}

I also declare abstract methods in the base class and then override them as appropriate in the derived concrete classes.

The only practical reason for the TemporalAspectType enum is that it simplifies implementation of switch statements where needed, although whenever possible I avoid these by using the aforementioned abstract methods.


#19

For the people wondering on what most Elixir libraries and applications use: They encode tagged unions as tagged tuples, where the first element is a symbol representing the current case.

This means that there is no possibility to do case exhaustiveness checking etc. but Elixir is a dynamic, unityped language anyway.

The most common approaches occurrence of this are {:ok, some_data} | {:error, some_error}-tuples. Many libraries (an example: exceptional exist to make these tuples behave more like the ‘Result’-structure they represent, although most applications/libraries still chug forward without depending on these.

The Algae Sum-Builder that was linked to earlier in this topic is mathematically nice, but probably a lot slower than working on tuples, since Elixir structs compile down to hashmaps. Also, unfortunately many Elixir-programmers are new to these functional concepts, so in practice they do not see a lot of use.


For Python, I came across the sumtypes library, but I have not used it myself yet.


#20

Thank you for this, @Qqwy! (and thanks to @danmarcab for the algae suggestion too, it looks interesting).

Out of interest, how do Elixir people typically encode their tagged tuples in JSON? I’ve thought of two approaches:

As an array:
{:ok, 1} -> [":ok", 1]

Or as some kind of object:
{:ok, 1} -> {"tag": ":ok", "payload": 1}

I’m not sure which is best. It doesn’t really matter from the Elm side, because I have to write a decoder anyway. So if there’s an easy/idiomatic option on the Elixir end, I’ll use that.