Considering Creating a DataFrame Package

I’m pretty new to the Elm world, but at first glance it seems there isn’t a strong set of tools for manipulating tabular / relational data in a declarative way. I remember a talk by Evan where he mentioned that one value proposition Elm could strive for might be in-browser analytics / scientific computing, specifically data visualization (see edited note). I work as a data scientist who primarily writes internal packages for other data scientists, and it seems like a DataFrame construct with declarative grammars for data manipulation and graphics (like Python’s pandas or the tidyverse in R) would be a good foundation for data science and analytics work. I wouldn’t start on this right away, but it has been swirling around in my head and I wanted to see what others thought.

First question: Is it true that this is a use-case that doesn’t have a strong toolset? Does this even seem valuable for the users of Elm?

I can imagine implementing a DataFrame either as a list of records or a list of dicts. There are upsides and downsides to either approach.

Using records, there is native support for varying column types, and it becomes easy to pass around accessors to row attributes using the .x syntax. But joining two tables made of records would be difficult, since to merge the records I would need to know all of their fields in advance. This fits in well with type safety but does not scale: if I have two tables with 20 fields each and wish to join on a single key field, listing the other 19 fields during the join is tedious. I can think of some ways around this, such as treating joins like a linked list of records, but that seems like a difficult mental model to reason about.
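To make the tedium concrete, here is roughly what a single join looks like with a couple of made-up record types; with 20 fields per table, the combined alias and the field-by-field copying get out of hand quickly.

type alias Person =
  { name : String, x : Int, y : Int }

type alias Detail =
  { name : String, age : Int }

-- Every field of the combined shape has to be spelled out by hand
type alias PersonWithDetail =
  { name : String, x : Int, y : Int, age : Int }

joinOne : Person -> Detail -> PersonWithDetail
joinOne p d =
  { name = p.name, x = p.x, y = p.y, age = d.age }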

Using dicts, the big upside is that the schema is much more flexible and could even be decided at runtime. This obviously isn’t the most idiomatic approach in a language with a type system like Elm’s, but I think careful wrapping in Maybe / Result or some other custom monadic type could handle this and still force the user to handle undefined operations, invalid schemas, etc. The biggest complication is that to put values of different types in a dict, they would need to be wrapped in some other type. Because of this, it would probably be limited to a pre-defined set of types, most likely just numbers, strings, chars, and dates.
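A rough sketch of the kind of wrapping I mean (all the names here are made up):

import Dict exposing (Dict)

-- Values of different types get wrapped in one union type
type Cell
  = NumberCell Float
  | StringCell String

-- Typed access forces the caller to handle a missing column
-- or a column of the wrong type
getNumber : String -> Dict String Cell -> Maybe Float
getNumber name row =
  case Dict.get name row of
    Just (NumberCell n) ->
      Just n

    _ ->
      Nothing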

Second question: Does either of these approaches seem reasonable? My gut says that the flexibility of a DataFrame and the need for SQL-like operations lend themselves to the dict approach, but I know I come from languages with a very different tradition of dynamic typing compared to Elm. I would want a solution to feel as natural and idiomatic to Elm as possible given the use cases.


Edited Note

I found the talk I was referring to: it was “What is Success” at Elm Europe 2018. Specifically, he was referring to data visualization, but I think the way most data scientists / analysts are introduced to code-driven data visualization is through tabular data with declarative semantics for data wrangling and plotting.


You’re right that there isn’t any standard way of doing data science in Elm. Part of that is just because the data science community within the Elm community is small or nonexistent. Some parts of the data science field though, in particular visualization, have a good intersection with other needs in web development, so there have been a few interesting approaches on the matter. I would suggest you have a look at elm-charts, elm-visualization, and elm-vega, which each take a different approach.

There hasn’t been a standard way to deal with the associated data, however: data frames. I suppose one reason is that Elm is not well suited to manipulating large amounts of data. That’s partly due to the immutability constraint, and partly because doing advanced math and linear algebra in Elm is neither ergonomic (no operator creation possible) nor fast (you cannot efficiently handle operations like matrix multiplication on big matrices, or algorithms requiring mutation).

One thing that Elm is good at, though, is encoding and decoding data safely. And since Elm needs a way to gather data without computing it (not its strength), I think I would start with standard interop formats for data science such as HDF5 and others, then find good ways to interop between decoded versions of those formats and the inputs of the visualization tools already in the ecosystem. Maybe a good design for data frames in Elm will emerge from those needs.
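For example, decoding tabular data into typed rows is already pleasant with elm/json (a made-up row type here, and plain JSON rather than HDF5, since JSON is what the core libraries handle today):

import Json.Decode as Decode exposing (Decoder)

-- A made-up row type, just to show what the safe boundary looks like
type alias Observation =
  { name : String
  , x : Float
  , y : Float
  }

observationDecoder : Decoder Observation
observationDecoder =
  Decode.map3 Observation
    (Decode.field "name" Decode.string)
    (Decode.field "x" Decode.float)
    (Decode.field "y" Decode.float)

-- A table is then just a list of decoded rows
tableDecoder : Decoder (List Observation)
tableDecoder =
  Decode.list observationDecoder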

That’s my 2 cents. I hope it helps.


Hello @AustinNar, exciting to see discussion of this topic!

On the join question, at some point I explored some different options, and I personally like the path of using tuples to combine the records.

So joining { name = "tom", x = 3, y = 4 } and { name = "tom", age = 52 } on the name field would result in:

("tom", { name = "tom", x = 3, y = 4 }, { name = "tom", age = 52 })

Still pretty easy to work with, no sacrifice on type safety, and easy to access the join key for further joins. If you had a bunch of joins that resulted in a structure like ((a,b),(c,d)) I might do a map at the end to turn it into a nicer data structure. Maybe extracting the specific data that I want, or just flattening that outer layer into a single record.
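Here is a quick sketch of how the join itself could be written in that style (innerJoinOn is just a name made up for this example): index the right-hand rows by their key, then keep each left-hand row that finds a match.

import Dict

-- Note: Dict.fromList keeps only one right-hand row per key, so this
-- sketch assumes the join key is unique on the right-hand side.
innerJoinOn :
  (a -> comparable)
  -> (b -> comparable)
  -> List a
  -> List b
  -> List ( comparable, a, b )
innerJoinOn keyA keyB lefts rights =
  let
    rightIndex =
      Dict.fromList (List.map (\right -> ( keyB right, right )) rights)
  in
  List.filterMap
    (\left ->
      Dict.get (keyA left) rightIndex
        |> Maybe.map (\right -> ( keyA left, left, right ))
    )
    lefts

Calling innerJoinOn .name .name on lists containing the two example records above would give exactly the tuple shown.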

Generally speaking, my feeling is that merging records is always equivalent to nesting records from a practical perspective, and luckily nesting is not such a hard type theory problem! :sweat_smile:

Anyway, does the tuple idea sound like a plausible route for you?

I am one of a (small?) number of people who routinely use Elm for ‘data science’ type work. I am the author of the two visualization packages elm-vegalite and elm-vega (where elm-vegalite is probably the more useful package for standard data visualization) and the tabular data package Tidy. I am also one of the authors of litvis, which provides a literate Elm environment orientated around visualization and data science (think Jupyter or Observable notebooks, but with Elm as the primary programming language).

It may be that Tidy fulfils many of the use cases you have in mind. For example, relational joining (left, right, inner, outer and difference); gathering (pivot long) and spreading (pivot wide); table input/output (JSON, CSV, markdown).

Currently I simplify the API and type conversion issues by representing all values as Strings, but I would be keen to know whether more flexible typing would be required/useful, or whether there are other use cases that such a package should accommodate.


Thank you all for the feedback!

@jwoLondon, I had not seen Tidy, but yes, that is very close to what I was imagining! At first glance the API design is pretty close to what I had in mind for the list-of-dicts implementation I mentioned. As for your question about more flexible typing, one thing I was considering for an implementation like that would be something similar to what PySpark does for its DataFrame implementation: it exposes a set of standard functions for referring to columns and manipulating them. I would imagine having a Field union type to describe the different types a field could hold, with an UndefinedField variant for when the wrong types are used in a calculation. For example:

import Dict exposing (Dict)

-- A wrapped scalar value. Like a "cell"
type Field
  = IntField Int
  | StringField String 
  | UndefinedField 

-- Rows map strings (column names) to fields. DataFrames are a list of rows
type alias Row = Dict String Field
type alias DataFrame = List Row

-- Calcs are functions that return a field from a given row
type alias Calc = Row -> Field

-- Simplest calc, col, pulls the field value by name from a row
col : String -> Calc
col name row = 
  Dict.get name row
    |> Maybe.withDefault UndefinedField

-- Add takes two other calcs, and if they are ints, adds the results. 
-- Otherwise Undefined
add : Calc -> Calc -> Calc
add x y row = 
  let
    xField = x row
    yField = y row
  in
    case ( xField, yField ) of
      ( IntField xInt, IntField yInt ) ->
        IntField ( xInt + yInt )

      _ ->
        UndefinedField

-- For each row, apply the calc, and assign to `name`
mapCol : Calc -> String -> DataFrame -> DataFrame
mapCol calc name df = 
  List.map (\row -> Dict.insert name (calc row) row) df

...

-- Add the x and y columns and assign to z
df
  |> mapCol ( add ( col "x" ) ( col "y" ) ) "z"

If the x and y columns are both Ints, then z is an Int; otherwise it is undefined.

Things like the add function would mostly be for inline math, to make the code more expressive and concise, but if you needed something more complex, all the ingredients for defining user-defined functions would be exposed in the same style as the add function.
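For instance, a hypothetical concat calc for string columns would look just like add:

-- Concatenate two string columns; any other types are Undefined
concat : Calc -> Calc -> Calc
concat x y row =
  case ( x row, y row ) of
    ( StringField xStr, StringField yStr ) ->
      StringField (xStr ++ yStr)

    _ ->
      UndefinedField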


@evancz, I like that approach! That’s similar to what my comment about a ‘linked list of records’ was going for, but it’s more concrete, and I can see how lookup could be defined recursively, since with multiple nestings this is just a binary tree! :slight_smile:

There is still some discussion / thought to be had, I think, around whether records or dicts are a better model from an API design perspective (records are more explicit and familiar, but probably require more boilerplate when being modified, since fields can’t be iterated over or referred to by string name). But I’m glad to see that it’s actually a matter of design trade-offs, and not that the choice is pre-made due to something like joins being impossible.

As a small aside, are there non-dependent type systems that allow you to express a join the “natural way”? That is

joinOn : 
  ({fn | ?x : val} -> val) -> 
  List { a | ?x : val } -> 
  List { b | ?x : val } -> 
  List { a ∪ b }

This would require, I suppose, three orthogonal features:

  1. ?x, a way to assert that all the records share the same field, but the code is generic as to which field.
  2. { a ∪ b }, a way to merge records.
  3. I suppose the signature above would need the fn type variable to be existentially quantified, as it would be applied to records of different types, though one could easily work around that by passing the function twice in this case.

Are there languages that implement 1 and/or 2? (I’m mostly interested in languages without dependent typing, since with dependent types one could probably implement this quite easily.)

