Binary and file handling in Elm


#1

Elm does not have native support for binary data. The most obvious use case is file handling, but there are many areas in which binary data is used.

I’ve tried to wrap my head around all the use cases, attempts and discussions. It seems that many people have tried to implement solutions for these problems.

Below I have a tentative suggestion for a Bytes type, file reading, and HTTP PUT/POST/GET.

Points focused on in this document

  • efficient binary representation
  • reading binary files
  • binary http GET/POST/PUT
  • insertion/deletion/concatenation
  • canvas

Not focused on, but interesting

  • binary websockets
  • protobuf/other binary protocols
  • binary encoding
  • binary decoding
  • binary manipulation/operations
  • web audio
  • streams

Properties of a binary type that we would like

  • Efficient memory usage, for instance avoiding repeated copies when sending a file over HTTP
  • Fast random access to bytes and slices
  • Fast concatenation/inserting/removing/modifying bytes
  • Fast transfer to and from arrays and lists of bytes

A Bytes API

First, we need the basic unit of data, a Byte.

type Byte = Byte Int -- opaque
fromInt : Int -> Byte -- clamps between 0 and 255
toInt : Byte -> Int
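
A minimal sketch of how the clamping could look in plain Elm. The module name Byte is an assumption made for this example; only the three signatures come from the proposal above.

```elm
module Byte exposing (Byte, fromInt, toInt)

-- Sketch only: the module name is an assumption, not part of the proposal.
-- The constructor is not exposed, so every Byte has to go through fromInt.


type Byte
    = Byte Int


-- Clamp to the range 0..255, so -5 becomes 0 and 300 becomes 255.
fromInt : Int -> Byte
fromInt n =
    Byte (clamp 0 255 n)


toInt : Byte -> Int
toInt (Byte n) =
    n
```

Because the type is opaque, toInt can never return a value outside 0..255, at the price of boxing every byte.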

Since we are working in the browser, we need to be able to use a FileReader to get the contents of a file (in chunks for large files). For that we use readAsArrayBuffer(), which gives us an ArrayBuffer that can be viewed as a Uint8ClampedArray.

We wrap these as ByteArray in Elm. This type will not be visible to users, since the Bytes type can handle everything you want. It exposes the basic operations of Uint8ClampedArray, such as length, from(), indexed access, and slice()/subarray().

type ByteArray = ByteArray -- opaque, backed by a kernel Uint8ClampedArray

Both Haskell and Elixir/Erlang support a kind of “chunked” byte sequence, that is, when you concatenate them you don’t make a copy but keep a list or tree of chunks. In Elixir this is just for IO operations, but I think it makes sense to have this as the only binary data type.

Access in the chunked data type will be slower, but an algorithm that needs fast access can call the pack function, which creates a Single ByteArray at the cost of the time and memory needed for the copy. pack can also be used to release memory that slice may have kept alive (if slice is implemented using subarray(), which I think it should be).

We need to have fast access to the correct ByteArray inside the Bytes. Inspired by Elixir’s IOList, here is a suggested binary tree structure where each Concat node contains the total length of the bytes below it.

type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes
pack : Bytes -> Bytes
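
To make the tree idea concrete, here is a rough pure-Elm sketch. It is not a proposed implementation: the module name ChunkedBytes and the functions fromList, length, append, get and toList are assumptions for illustration, and an Array Int stands in for the kernel ByteArray.

```elm
module ChunkedBytes exposing (Bytes, append, fromList, get, length, pack, toList)

import Array exposing (Array)


-- Stand-in for the kernel ByteArray: a plain Elm Array of Ints (assumption).
type alias ByteArray =
    Array Int


type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes


fromList : List Int -> Bytes
fromList ints =
    Single (Array.fromList (List.map (clamp 0 255) ints))


-- Constant time: a Concat node caches the total length of its subtree.
length : Bytes -> Int
length bytes =
    case bytes of
        Single chunk ->
            Array.length chunk

        Concat total _ _ ->
            total


-- Concatenation allocates one node and copies no bytes.
append : Bytes -> Bytes -> Bytes
append left right =
    Concat (length left + length right) left right


-- Random access walks down the tree, using the cached lengths to pick a side.
get : Int -> Bytes -> Maybe Int
get index bytes =
    case bytes of
        Single chunk ->
            Array.get index chunk

        Concat _ left right ->
            if index < length left then
                get index left

            else
                get (index - length left) right


toList : Bytes -> List Int
toList bytes =
    case bytes of
        Single chunk ->
            Array.toList chunk

        Concat _ left right ->
            toList left ++ toList right


-- Flatten into one chunk: costs time and memory once, fast access afterwards.
pack : Bytes -> Bytes
pack bytes =
    Single (Array.fromList (toList bytes))
```

Nothing here keeps the tree balanced, so get degrades towards linear time after many appends on one side. That is the situation where calling pack pays off.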

This type will support relevant operations from Elm’s Array. Exactly which functions to include is something to discuss.

We will also have functions to convert between Bytes and Array Byte, List Byte etc., all backed by this data structure.

Handling files

JavaScript has a File type. You can get files from an <input type="file"> and from the drop event on DOM elements. In both cases you get access to one or more File objects. You can also get a File from an HTTP GET request, but in that case I don’t see why you wouldn’t want Bytes directly instead, see below.

type File = File -- opaque
lastModified : File -> Time
name : File -> String
size : File -> Int
mediaType : File -> String
-- shallow slice for reading parts of a file
slice : Int -> Int -> String -> File -> File
-- Decoder for target.files in the change event
filesDecoder : Decoder (List File)
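
For comparison, here is roughly how far plain Json.Decode gets without a kernel-backed File: it can read the metadata but not hand out the file itself. FileMeta, metaDecoder and targetFilesMeta are hypothetical names for this sketch, and it assumes that Decode.list accepts the array-like FileList found in target.files.

```elm
module FileMeta exposing (FileMeta, metaDecoder, targetFilesMeta)

import Json.Decode as Decode exposing (Decoder)


-- Hypothetical record for this sketch; the proposal keeps File opaque.
type alias FileMeta =
    { name : String
    , size : Int
    , mediaType : String
    , lastModified : Int -- milliseconds since the epoch
    }


-- Reads the standard properties of a JavaScript File object.
metaDecoder : Decoder FileMeta
metaDecoder =
    Decode.map4 FileMeta
        (Decode.field "name" Decode.string)
        (Decode.field "size" Decode.int)
        (Decode.field "type" Decode.string)
        (Decode.field "lastModified" Decode.int)


-- For use with the change event of <input type="file">.
targetFilesMeta : Decoder (List FileMeta)
targetFilesMeta =
    Decode.at [ "target", "files" ] (Decode.list metaDecoder)
```

A real filesDecoder would have to return the File handle itself, which is why it needs kernel support.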

Another common use of a file is to read its contents into the Elm application itself. For this we have the FileReader type in JavaScript. This lets the application read the contents of files asynchronously.

The FileReader lets you start reading asynchronously, and provide a callback for when the file is read in.

The file can be read in three formats:

  • binary - readAsArrayBuffer()
  • data URL / base64 encoded - readAsDataURL()
  • text with provided encoding - readAsText()

We only need the binary format, since both base64 encoding and text decoding can be done in Elm.

type alias Error = { name : String, message : String } -- DOMException
read : File -> Task Error Bytes

The most common use of files in Elm would be file upload and download. Since XHR allows you to pass a File, or a slice of it, as the request body, it makes sense to have direct access to at least fileBody, so that you don’t have to read the file into memory yourself.

And then we can PUT/POST/GET either our file or our bytes via HTTP.

fileBody : File -> Http.Body
bytesBody : Bytes -> Http.Body
expectBytes : String -> Http.Expect Bytes
expectFile : String -> Http.Expect File

Future: using Bytes as the new String?

In Erlang/Elixir, strings are a subset of binaries that are valid UTF-8.

Right now Elm uses the JavaScript String type, which makes sense since a lot of interop with JavaScript uses strings. However, this also has some downsides: JavaScript strings are sequences of UTF-16 code units, so String.length and String.substring are slightly wrong for code points outside the Basic Multilingual Plane and for combining characters.
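
A small illustration of the length problem, using only elm/core. The module and value names are made up for this example.

```elm
module StringLengthDemo exposing (codePoints, codeUnits)

-- "🙈" is a single code point outside the Basic Multilingual Plane.


-- String.length counts UTF-16 code units, so this is 2.
codeUnits : Int
codeUnits =
    String.length "🙈"


-- String.toList yields one Char per code point, so this is 1.
codePoints : Int
codePoints =
    List.length (String.toList "🙈")
```

Combining characters are a separate issue: a letter followed by a combining accent is two code points however the string is stored.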

It may be worth exploring using Bytes for String, or using a similar approach, solving this like it’s done in Elixir.

Canvas

You can get the image data from your CanvasRenderingContext2D with getImageData(); the pixels are in the data property of the returned ImageData, which is a Uint8ClampedArray.

Implementations in other languages

JavaScript

  • ArrayBuffer - a generic, fixed-length raw binary data buffer
  • DataView - a low-level interface for reading and writing multiple number types in an ArrayBuffer
  • TypedArray - an array-like view of an underlying ArrayBuffer, types are Int8Array, Uint8Array, Uint8ClampedArray, Int16Array etc.
  • Blob - a file-like object of immutable, raw data
  • File - a specific kind of a Blob representing a file
  • FileReader - lets web applications asynchronously read the contents of files
  • XMLHttpRequest with binary data

Haskell

  • bytestring - An immutable array of bytes (both strict and lazy/chunks)
  • binary - serialization of values to and from ByteString
  • utf8-string - converting bytestrings to and from strings using UTF-8

Erlang/Elixir

Various related Elm implementations of things

Discussions


#2

This makes sense to me. I have mixed feelings about whether to actually have a tagged Byte type, that is, a boxed Int between 0 and 255. Pretty much all the sequence types are going to want functions that take/return Int or List Int or Array Int. I really don’t want to convert Bytes into Array Byte, and then Array.map (\(Byte int) -> int) on it. It might be better to just forgo the Byte type and use Int everywhere.

In any case, this is all doable in pure Elm to get it working, and then as an elm-explorations module to kernelize the parts that need to be faster/smaller. Having a real implementation in hand would help in analyzing its merits.


#3

Sure, that can certainly be discussed. The point was to have some kind of forced clamping, but the JavaScript typed array functions do that anyway, so it may not be needed. We can certainly use

type alias Byte = Int

instead.


#4

Yes, I have already started (using Array Byte as the type) but I was hoping to get some feedback before coding everything :)


#5

Early this year, up to February, I took a stab at a thin “native” (0.18 semantics) package to enable usage of JS Typed Arrays. Originally, I wanted to build a linear algebra library (elm-tensor) to manipulate matrices and interact with Web APIs needing typed arrays without paying the price of conversion of (huge) matrices through ports.

I wanted to have a very flexible and robust underlying typed array interface and spent a lot of (too much) time on it. I wrote up an exhaustive report of my experiments on Discourse. I think my main mistake was trying to cover everything; I didn’t have the time or energy to do so. As Brian mentioned in that post, the most important thing is to focus on a concrete use case. I see that you have quite a big use case for sending and receiving files/blobs over the network and file system. I’d focus on that and leave aside strings / canvas / etc.

I also like your idea of implementing a pure Elm proof of concept (slowness doesn’t matter) to propose an API.


#6

Yes, that’s sort of what I’m doing. I’m just mentioning them in order to show that they are probably not incompatible.


#7

Yes, I think so too. It’s not the Elm way to just cover an existing API. I see very little use for the full flora of Typed Arrays, one good Bytes type should do the trick.


When do we need binary decoder and encoder?
#8

I’m very pleased to see the new elm/bytes repository. I think that covers the basic byte sequence type I describe here, and if people want to make this kind of tree-like structure for efficient concatenation and modification (instead of using the encoders), it can be implemented on top of it.

One thing is missing though, and that is shallow slicing. For our purposes we want to divide a file into slices without making new copies, and the decoder can’t do that. Perhaps a decodeShallow could be made that uses shallow slicing for bytes decoders.

In fact, it looks like you can’t slice at all without copying the bytes before the slice and then throwing them away. But that can easily be fixed by a skip decoder or similar.
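
For reference, slicing through the decoder API might look like the sketch below. It uses Bytes.Decode.bytes, map, andThen and decode from elm/bytes; the helpers skip and slice are hypothetical names and not part of the package. Whether the skipped prefix gets copied depends on how Decode.bytes is implemented in the kernel, which this sketch does not settle.

```elm
module BytesSlice exposing (skip, slice)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode exposing (Decoder)


-- Consume and discard n bytes.
skip : Int -> Decoder ()
skip n =
    Decode.map (\_ -> ()) (Decode.bytes n)


-- Take `count` bytes starting at `offset`; Nothing if the input is too short.
slice : Int -> Int -> Bytes -> Maybe Bytes
slice offset count input =
    Decode.decode
        (skip offset |> Decode.andThen (\_ -> Decode.bytes count))
        input
```

A built-in skip decoder could advance the read offset without producing the intermediate Bytes value at all, which is the fix suggested above.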

I know, I don’t have a very convincing use case. I’m going to write that up in a new post when I find the time.