Binary and file handling in Elm
Elm does not have native support for binary data. The most obvious use case is file handling, but there are many areas in which binary data is used.
I’ve tried to wrap my head around all the use cases, attempts and discussions. It seems that many people have tried to implement solutions for these problems.
Below I have a tentative suggestion for a Bytes type, file reading, and HTTP PUT/POST/GET.
Points focused on in this document
- efficient binary representation
- reading binary files
- binary http GET/POST/PUT
- insertion/deletion/concatenation
- canvas
Not focused on, but interesting
- binary websockets
- protobuf/other binary protocols
- binary encoding
- binary decoding
- binary manipulation/operations
- web audio
- streams
Properties of a binary type that we would like
- Efficient memory usage, for instance avoiding multiple copies when sending a file using HTTP
- Fast random access to bytes and slices
- Fast concatenation/inserting/removing/modifying bytes
- Fast transfer to and from arrays and lists of bytes
A Bytes API
First, we need the basic unit of data, a Byte.
type Byte = Byte Int -- opaque
fromInt : Int -> Byte -- clamps between 0 and 255
toInt : Byte -> Int
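As a minimal sketch of the clamping behaviour described in the comments above, the Byte type could be built on Elm's built-in clamp. The module name is made up for illustration, and a real implementation might well store the value differently.

```elm
module Byte exposing (Byte, fromInt, toInt)

-- The constructor is not exposed, so the only way to build a Byte
-- is through fromInt, which keeps the value in range.


type Byte
    = Byte Int


{-| Clamp the given Int into the range 0..255 and wrap it.
-}
fromInt : Int -> Byte
fromInt n =
    Byte (clamp 0 255 n)


{-| Unwrap the stored value, which is always between 0 and 255.
-}
toInt : Byte -> Int
toInt (Byte n) =
    n
```

Clamping (rather than wrapping modulo 256) matches what Uint8ClampedArray does for out-of-range integers.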
Since we are to work in the browser, we need to be able to use a FileReader to get the contents of a file (in chunks for large files), and for that we use readAsArrayBuffer(), which gets us ArrayBuffer/Uint8ClampedArray objects.
We wrap these as ByteArray in Elm. This type will not be visible to users, since the Bytes type can handle everything you want. This type exposes basic methods and properties of Uint8ClampedArray like length, from(), indexed access, and slice()/subarray().
type ByteArray = -- kernel type for Uint8ClampedArray
Both Haskell and Elixir/Erlang support a kind of “chunked” byte sequence, that is, when you concatenate them you don’t make a copy but keep a list or tree of chunks. In Elixir this is just for IO operations, but I think it makes sense to have this as the only binary data type.
Access in the chunked data type will be slower, but if an algorithm wants fast access it can call the pack function, which would create a Single ByteArray, at the cost of the memory and time needed for the copy. pack can also be used to release memory that slice may be holding on to (if slice is implemented using subarray(), which I think it should be).
We need to have fast access to the correct ByteArray inside the Bytes. Inspired by Elixir’s IOList, here is a suggested binary tree structure where the Concat contains the total length of the bytes.
type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes
pack : Bytes -> Bytes
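To make the trade-off concrete, here is a rough sketch of the tree in plain Elm. It assumes Array Int as a stand-in for the kernel ByteArray and uses Int instead of Byte to stay self-contained. The function names length, append, get, fromList and toList are only suggestions, not a settled API.

```elm
module BytesSketch exposing (Bytes, append, fromList, get, length, pack, toList)

import Array exposing (Array)


-- Assumption: a plain Array Int stands in for the kernel ByteArray.


type alias ByteArray =
    Array Int


type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes


{-| Build a single chunk, clamping every value to 0..255.
-}
fromList : List Int -> Bytes
fromList ints =
    Single (Array.fromList (List.map (clamp 0 255) ints))


{-| The total length is cached in every Concat node.
-}
length : Bytes -> Int
length bytes =
    case bytes of
        Single chunk ->
            Array.length chunk

        Concat total _ _ ->
            total


{-| Concatenation only allocates a new node, no bytes are copied.
-}
append : Bytes -> Bytes -> Bytes
append left right =
    Concat (length left + length right) left right


{-| Walk down the tree, using the cached lengths to pick a side.
-}
get : Int -> Bytes -> Maybe Int
get index bytes =
    case bytes of
        Single chunk ->
            Array.get index chunk

        Concat _ left right ->
            let
                leftLength =
                    length left
            in
            if index < leftLength then
                get index left

            else
                get (index - leftLength) right


{-| Copy all chunks into one array, left to right.
-}
toArray : Bytes -> ByteArray
toArray bytes =
    case bytes of
        Single chunk ->
            chunk

        Concat _ left right ->
            Array.append (toArray left) (toArray right)


{-| Flatten the tree into a single chunk for fast access.
-}
pack : Bytes -> Bytes
pack bytes =
    Single (toArray bytes)


toList : Bytes -> List Int
toList bytes =
    Array.toList (toArray bytes)
```

In this sketch append never copies bytes, get costs one step per level of the tree, and pack pays for a full copy once so that later lookups hit a single chunk.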
This type will support relevant operations from Elm’s Array. Exactly which functions to include is something to discuss.
We will also have functions to convert to and from Bytes, Array Byte, List Byte etc., all backed by this data structure.
Handling files
JavaScript has a File type. You can get files from the <input type="file"> and from the drop event on DOM elements. In both cases you get access to one or more File objects. You can also get a File from an HTTP GET request, but in that case I don’t see why you wouldn’t want Bytes directly instead, see below.
type File = -- opaque
lastModified : File -> Time
name : File -> String
size : File -> Int
mediaType : File -> String
-- shallow slice for reading parts of a file
slice : Int -> Int -> String -> File -> File
-- Decoder for target.files in the change event
filesDecoder : Decoder (List File)
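Until an opaque File type exists, the shape of such a decoder can be sketched by reading only the metadata of each file from the change event. FileInfo and filesInfoDecoder are hypothetical names used for illustration. Whether Decode.list accepts the browser's array-like FileList directly depends on the version of the JSON package, so treat that part as an assumption to verify.

```elm
module FileInfo exposing (FileInfo, filesInfoDecoder)

import Json.Decode as Decode exposing (Decoder)


-- Hypothetical record holding the metadata fields of a JavaScript File.


type alias FileInfo =
    { name : String
    , size : Int
    , mediaType : String
    }


{-| Decode one File object; "type" is the JavaScript property
holding the media type.
-}
fileInfoDecoder : Decoder FileInfo
fileInfoDecoder =
    Decode.map3 FileInfo
        (Decode.field "name" Decode.string)
        (Decode.field "size" Decode.int)
        (Decode.field "type" Decode.string)


{-| Decode event.target.files from a change event.
-}
filesInfoDecoder : Decoder (List FileInfo)
filesInfoDecoder =
    Decode.at [ "target", "files" ] (Decode.list fileInfoDecoder)
```

The proposed filesDecoder would have the same structure, but would hand back opaque File values instead of copied metadata.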
Another common use of the file is to read its contents into the Elm application itself. For this we have the FileReader type in JavaScript. This lets the application read the contents of files asynchronously.
The FileReader lets you start reading asynchronously, and provide a callback for when the file is read in.
The file can be read in three formats:
- binary - readAsArrayBuffer()
- data URL / base64 encoded - readAsDataURL()
- text with provided encoding - readAsText()
We only need the binary format, since both base64 encoding and text decoding can be done in Elm.
type alias Error = { name : String, message : String } -- DOMException
read : File -> Task Error Bytes
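Reading a large file in chunks would combine slice and read. The offset arithmetic can be sketched on its own. chunkRanges is a hypothetical helper, not part of the proposed API, and it assumes that slice takes a start offset (inclusive) and an end offset (exclusive), in the style of the JavaScript Blob slice() method.

```elm
module ChunkRanges exposing (chunkRanges)

{-| Split a file of the given size into (start, end) byte ranges of at
most chunkSize bytes each. The end offset is exclusive.
-}
chunkRanges : Int -> Int -> List ( Int, Int )
chunkRanges chunkSize fileSize =
    if chunkSize <= 0 || fileSize <= 0 then
        []

    else
        -- The last chunk index is the index of the chunk holding the last byte.
        List.range 0 ((fileSize - 1) // chunkSize)
            |> List.map
                (\i ->
                    ( i * chunkSize
                    , min fileSize ((i + 1) * chunkSize)
                    )
                )
```

Each pair could then be passed to slice and the resulting File to read, with the tasks run in sequence so that only one chunk is in memory at a time.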
The most common use of files in Elm would be file upload and download. Since XHR allows you to pass a File or a slice of it as the body, it makes sense to have direct access to at least fileBody so that you don’t have to read the file into memory yourself.
And then we can PUT/POST/GET either our file or our bytes via HTTP.
fileBody : File -> Http.Body
bytesBody : Bytes -> Http.Body
expectBytes : String -> Http.Expect Bytes
expectFile : String -> Http.Expect File
Future: using Bytes as the new String?
In Erlang/Elixir, strings are a subset of binaries that are valid UTF-8.
Right now Elm is using the JavaScript String type, which does make sense since a lot of interop with JavaScript is using strings. However, this also has some downsides, like String.length and String.substring being slightly wrong for large code points (those that need two UTF-16 code units) and combining characters.
It may be worth exploring using Bytes for String, or using a similar approach, solving this like it’s done in Elixir.
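The length mismatch is easy to demonstrate. The sketch below assumes the core String module of Elm 0.19, where String.length reports UTF-16 code units while String.toList yields one Char per code point. The helper names are made up for illustration.

```elm
module StringLength exposing (codePoints, codeUnits)

{-| Number of UTF-16 code units, which is what String.length reports.
-}
codeUnits : String -> Int
codeUnits =
    String.length


{-| Number of Unicode code points, counted by converting to a list of Char.
-}
codePoints : String -> Int
codePoints text =
    List.length (String.toList text)
```

For a string holding a single emoji such as "🙈", codeUnits gives 2 and codePoints gives 1 under that assumption. Combining characters are a separate problem that neither count addresses.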
Canvas
You can get the image data as a Uint8ClampedArray (the data property of the returned ImageData) with getImageData() from your CanvasRenderingContext2D.
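The canvas image data is laid out as four bytes per pixel (red, green, blue, alpha), row by row. As a small sketch of how Elm code could locate a pixel in such a byte sequence, pixelOffset below is a hypothetical helper and not part of the proposed API.

```elm
module PixelOffset exposing (pixelOffset)

{-| Index of the red byte of the pixel at (x, y) in RGBA image data
of the given width. The green, blue and alpha bytes follow at +1, +2, +3.
-}
pixelOffset : Int -> Int -> Int -> Maybe Int
pixelOffset width x y =
    if width <= 0 || x < 0 || x >= width || y < 0 then
        Nothing

    else
        Just ((y * width + x) * 4)
```

The helper does not know the image height, so the caller still has to check that the offset lies within the data.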
Implementations in other languages
JavaScript
- ArrayBuffer - a generic, fixed-length raw binary data buffer
- DataView - a low-level interface for reading and writing multiple number types in an ArrayBuffer
- TypedArray - an array-like view of an underlying ArrayBuffer, types are Int8Array, Uint8Array, Uint8ClampedArray, Int16Array etc.
- Blob - a file-like object of immutable, raw data
- File - a specific kind of a Blob representing a file
- FileReader - lets web applications asynchronously read the contents of files
- XMLHttpRequest with binary data
Haskell
- bytestring - An immutable array of bytes (both strict and lazy/chunks)
- binary - serialization of values to and from ByteString
- utf8-string - converting bytestrings to and from strings using UTF-8
Erlang/Elixir
- Bitstring, binary, string - all the same type, binary is a bitstring in even 8 bit chunks, string is a binary that has valid UTF-8 data
- “iolist” or “chardata” - you can use lists of (lists of) binaries as a binary in many cases
- see video about string/binary/iolist in Elixir and the blog post about unicode and the blog post about iolist.
- << >> operator
Various related Elm implementations of things
- dividat/elm-binary - basic wrapper for JavaScript TypedArray
- mpizenberg/elm-js-typed-array - another wrapper for JavaScript TypedArray
- tiziano88/elm-protobuf - google protobuf proto3/JSON encoder/decoder
- jinjor/elm-binary-decoder - binary file decoder in elm-parser style
- truqu/elm-base64 - base64 encoder/decoder (using strings for binary)
- newlandsvalley/elm-binary-base64 - another base64 encoder/decoder
- norpan/elm-file-reader - A file reader input component/drop zone (using base64)
- simonh1000/file-reader - Another file reader wrapper
- simonh1000/elm-s3-example - S3 file upload client/server