Binary and file handling in Elm

norpan · September 9, 2018, 12:05pm

Elm does not have native support for binary data. The most obvious use case is file handling, but there are many areas in which binary data is used.

I’ve tried to wrap my head around all the use cases, attempts and discussions. It seems that many people have tried to implement solutions for these problems.

Below I have a tentative suggestion for a Bytes type, file reading, and HTTP PUT/POST/GET.

Points focused on in this document

efficient binary representation
reading binary files
binary http GET/POST/PUT
insertion/deletion/concatenation
canvas

Not focused on, but interesting

binary websockets
protobuf/other binary protocols
binary encoding
binary decoding
binary manipulation/operations
web audio
streams

Properties of a binary type that we would like

Efficient memory usage, for instance avoiding repeated copying when sending a file over HTTP
Fast random access to bytes and slices
Fast concatenation/inserting/removing/modifying bytes
Fast transfer to and from arrays and lists of bytes

A `Bytes` API

First, we need the basic unit of data, a Byte.

type Byte = Byte Int -- opaque
fromInt : Int -> Byte -- clamps between 0 and 255
toInt : Byte -> Int
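
The signatures above could be filled in roughly as follows. This is only a sketch: the module name `Byte` is an assumption, and clamping uses `clamp` from `Basics`.

```elm
module Byte exposing (Byte, fromInt, toInt)

-- The constructor is not exposed, so every Byte has gone through
-- `fromInt` and is guaranteed to hold a value between 0 and 255.


type Byte
    = Byte Int


-- Values below 0 become 0, values above 255 become 255.
fromInt : Int -> Byte
fromInt n =
    Byte (clamp 0 255 n)


toInt : Byte -> Int
toInt (Byte n) =
    n
```

With this, `toInt (fromInt 300)` gives 255 and `toInt (fromInt -5)` gives 0.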

Since we are working in the browser, we need to be able to use a FileReader to get the contents of a file (in chunks for large files). For that we use readAsArrayBuffer(), which gives us an ArrayBuffer that can be viewed as a Uint8ClampedArray.

We wrap these as ByteArray in Elm. This type will not be visible to users, since the Bytes type can handle everything you want. It exposes the basic properties and methods of Uint8ClampedArray, such as length, from(), indexed access and slice()/subarray().

type ByteArray = ByteArray -- kernel type for Uint8ClampedArray

Both Haskell and Elixir/Erlang support a kind of “chunked” byte sequences, that is, when you concatenate them you don’t make a copy but keep a list or tree of chunks. In Elixir this is just for IO operations, but I think it makes sense to have this as the only binary data type.

Access in the chunked data type will be slower, but an algorithm that needs fast access can call the pack function, which creates a Single ByteArray at the cost of the memory and time needed for the copy. pack can also be used to release memory that slice may have kept alive (if slice is implemented using subarray(), which I think it should be).

We need to have fast access to the correct ByteArray inside the Bytes. Inspired by Elixir’s IOList, here is a suggested binary tree structure where the Concat contains the total length of the bytes.

type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes
pack : Bytes -> Bytes
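
To make the tree idea concrete, here is a pure Elm sketch. It is not the proposed kernel implementation: `ByteArray` is replaced by a plain `Array Int` as a stand-in, and the module name and the helpers `fromList`, `length`, `append`, `get` and `toList` are assumptions added for illustration.

```elm
module ChunkedBytes exposing (Bytes, append, fromList, get, length, pack, toList)

import Array exposing (Array)


-- Stand-in for the kernel-backed ByteArray: a plain Array of Ints.
type alias ByteArray =
    Array Int


type Bytes
    = Single ByteArray
    | Concat Int Bytes Bytes


fromList : List Int -> Bytes
fromList ints =
    Single (Array.fromList (List.map (clamp 0 255) ints))


-- A Concat caches its total length, so no traversal is needed.
length : Bytes -> Int
length bytes =
    case bytes of
        Single array ->
            Array.length array

        Concat total _ _ ->
            total


-- Concatenation copies no bytes, it only allocates one tree node.
append : Bytes -> Bytes -> Bytes
append left right =
    Concat (length left + length right) left right


-- Random access walks down the tree, using the left length to pick a side.
get : Int -> Bytes -> Maybe Int
get index bytes =
    case bytes of
        Single array ->
            Array.get index array

        Concat _ left right ->
            let
                leftLength =
                    length left
            in
            if index < leftLength then
                get index left

            else
                get (index - leftLength) right


toList : Bytes -> List Int
toList bytes =
    case bytes of
        Single array ->
            Array.toList array

        Concat _ left right ->
            toList left ++ toList right


-- Flatten into one chunk: pays for a copy, gives fast access afterwards.
pack : Bytes -> Bytes
pack bytes =
    Single (Array.fromList (toList bytes))
```

For example, `get 3 (append (fromList [ 1, 2, 3 ]) (fromList [ 4, 5 ]))` skips the three bytes of the left chunk and returns `Just 4` from the right one.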

This type will support relevant operations from Elm’s Array. Exactly which functions to include is something to discuss.

We will also have functions to convert between Bytes and Array Byte, List Byte etc., all backed by this data structure.

Handling files

JavaScript has a File type. You can get files from the <input type="file"> and from the drop event on DOM elements. In both cases you get access to one or more File objects. You can also get a File from an HTTP GET request, but in that case I don’t see why you wouldn’t want Bytes directly instead (see below).

type File = File -- opaque
lastModified : File -> Time
name : File -> String
size : File -> Int
mediaType : File -> String
-- shallow slice for reading parts of a file
-- (arguments presumably mirror Blob.slice(): start, end, media type)
slice : Int -> Int -> String -> File -> File
-- Decoder for target.files in the change event
filesDecoder : Decoder (List File)

Another common use of the file is to read its contents into the Elm application itself. For this we have the FileReader type in JavaScript. This lets the application read the contents of files asynchronously.

The FileReader lets you start reading asynchronously and provides a callback for when the file has been read in.

The file can be read in three formats:

binary - readAsArrayBuffer()
data URL / base64 encoded - readAsDataURL()
text with provided encoding - readAsText()

We only need the binary format, since both base64 encoding and text decoding can be done in Elm.

type alias Error = { name : String, message : String } -- DOMException
read : File -> Task Error Bytes
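
For comparison, the `elm/file` package, which was published after this proposal was written, offers close counterparts to `filesDecoder` and `read`. The sketch below uses that package rather than the API proposed here; the module name, the `Msg` type and `readFile` are assumptions for illustration.

```elm
module FileSketch exposing (Msg(..), filesDecoder, readFile)

import Bytes exposing (Bytes)
import File exposing (File)
import Json.Decode as Decode exposing (Decoder)
import Task


type Msg
    = GotBytes Bytes


-- Decodes `event.target.files` from the change event of <input type="file">.
filesDecoder : Decoder (List File)
filesDecoder =
    Decode.at [ "target", "files" ] (Decode.list File.decoder)


-- Reads the whole file into memory and reports the result as a message.
readFile : File -> Cmd Msg
readFile file =
    Task.perform GotBytes (File.toBytes file)
```

Note that `File.toBytes` has no error case in its type, unlike the `Task Error Bytes` suggested above.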

The most common use of files in Elm would be file upload and download. Since XHR allows you to pass a File, or a slice of one, as the request body, it makes sense to have at least fileBody, so that you don’t have to read the file into memory yourself.

And then we can PUT/POST/GET either our file or our bytes via HTTP.

fileBody : File -> Http.Body
bytesBody : Bytes -> Http.Body
expectBytes : String -> Http.Expect Bytes
expectFile : String -> Http.Expect File
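
As a point of reference, `elm/http` 2.0 (released after this post) has `Http.fileBody`, `Http.bytesBody` and `Http.expectBytesResponse`, with slightly different signatures from the ones proposed above. The sketch below uses those; the module name, `Msg` type, function names and the `application/octet-stream` media type are assumptions for illustration.

```elm
module UploadSketch exposing (Msg(..), download, uploadBytes, uploadFile)

import Bytes exposing (Bytes)
import File exposing (File)
import Http


type Msg
    = Uploaded (Result Http.Error ())
    | Downloaded (Result Http.Error Bytes)


-- PUT a File directly, without reading it into Elm first.
uploadFile : String -> File -> Cmd Msg
uploadFile url file =
    Http.request
        { method = "PUT"
        , headers = []
        , url = url
        , body = Http.fileBody file
        , expect = Http.expectWhatever Uploaded
        , timeout = Nothing
        , tracker = Nothing
        }


-- POST bytes that are already in memory.
uploadBytes : String -> Bytes -> Cmd Msg
uploadBytes url bytes =
    Http.post
        { url = url
        , body = Http.bytesBody "application/octet-stream" bytes
        , expect = Http.expectWhatever Uploaded
        }


-- GET a response and keep the raw body as Bytes.
download : String -> Cmd Msg
download url =
    Http.get
        { url = url
        , expect = Http.expectBytesResponse Downloaded wholeBody
        }


-- Turn the low-level response into a Result, passing the body through.
wholeBody : Http.Response Bytes -> Result Http.Error Bytes
wholeBody response =
    case response of
        Http.BadUrl_ badUrl ->
            Err (Http.BadUrl badUrl)

        Http.Timeout_ ->
            Err Http.Timeout

        Http.NetworkError_ ->
            Err Http.NetworkError

        Http.BadStatus_ metadata _ ->
            Err (Http.BadStatus metadata.statusCode)

        Http.GoodStatus_ _ body ->
            Ok body
```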

Future: using `Bytes` as the new `String`?

In Erlang/Elixir, strings are a subset of binaries that are valid UTF-8.

Right now Elm is using the JavaScript String type, which does make sense since a lot of interop with JavaScript uses strings. However, this also has some downsides: JavaScript strings are sequences of UTF-16 code units, so String.length and String.substring give slightly wrong results for code points outside the Basic Multilingual Plane and for combining characters.
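
A small sketch of the length problem, using only `elm/core`. The module and function names are assumptions; the point is only that counting code units and counting characters can disagree.

```elm
module StringLengthSketch exposing (characters, codeUnits)

-- String.length counts UTF-16 code units, so a character outside the
-- Basic Multilingual Plane, such as "🙈", is counted as 2.


codeUnits : String -> Int
codeUnits text =
    String.length text


-- Going through a List Char should count such a character once,
-- although combining characters are still counted separately.
characters : String -> Int
characters text =
    List.length (String.toList text)
```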

It may be worth exploring using Bytes for String, or using a similar approach, solving this like it’s done in Elixir.

Canvas

You can get the image data as a Uint8ClampedArray with getImageData() from your CanvasRenderingContext2D.

Implementations in other languages

JavaScript

ArrayBuffer - a generic, fixed-length raw binary data buffer
DataView - a low-level interface for reading and writing multiple number types in an ArrayBuffer
TypedArray - an array-like view of an underlying ArrayBuffer, types are Int8Array, Uint8Array, Uint8ClampedArray, Int16Array etc.
Blob - a file-like object of immutable, raw data
File - a specific kind of a Blob representing a file
FileReader - lets web applications asynchronously read the contents of files
XMLHttpRequest with binary data

Haskell

bytestring - An immutable array of bytes (both strict and lazy/chunks)
binary - serialization of values to and from ByteString
utf8-string - converting bytestrings to and from strings using UTF-8

Erlang/Elixir

Bitstring, binary, string - all the same type: a binary is a bitstring whose length is a multiple of 8 bits, and a string is a binary that contains valid UTF-8 data
“iolist” or “chardata” - you can use lists of (lists of) binaries as a binary in many cases
see video about string/binary/iolist in Elixir and the blog post about unicode and the blog post about iolist.
<< >> operator

Various related Elm implementations of things

dividat/elm-binary - basic wrapper for JavaScript TypedArray
mpizenberg/elm-js-typed-array - another wrapper for JavaScript TypedArray
tiziano88/elm-protobuf - google protobuf proto3/JSON encoder/decoder
jinjor/elm-binary-decoder - binary file decoder in elm-parser style
truqu/elm-base64 - base64 encoder/decoder (using strings for binary)
newlandsvalley/elm-binary-base64 - another base64 encoder/decoder
norpan/elm-file-reader - A file reader input component/drop zone (using base64)
simonh1000/file-reader - Another file reader wrapper
simonh1000/elm-s3-example - S3 file upload client/server

Discussions

billstclair · September 9, 2018, 12:52pm

This makes sense to me. I’m mixed about whether to actually have a tagged Byte type, that is a boxed Int between 0 and 255. Pretty much all the sequence types are going to want functions that take/return Int or List Int or Array Int. I really don’t want to convert Bytes into Array Byte, and then Array.map (\(Byte int) -> int) on it. It might be better to just forego the Byte type and use Int everywhere.

In any case, this is all doable as pure Elm to get it working, and then as an elm-explorations module to kernelize the parts that need to be faster/smaller. Having a real implementation in hand would help in analyzing its merits.

norpan · September 9, 2018, 1:36pm

Sure, that can certainly be discussed. The point was to have some kind of forced clamping, but the JavaScript typed array functions do that anyway, so it may not be needed. We can certainly use

type alias Byte = Int

instead.

norpan · September 9, 2018, 1:37pm

Yes, I have already started (using Array Byte as the type), but I was hoping to get some feedback before coding everything.

mattpiz · September 9, 2018, 4:11pm

Early this year, up to February, I took a stab at a thin “native” (0.18 semantics) package to enable usage of JS Typed Arrays. Originally, I wanted to build a linear algebra library (elm-tensor) to manipulate matrices and interact with Web APIs needing typed arrays without paying the price of conversion of (huge) matrices through ports.

I wanted to have a very flexible and robust underlying typed array interface and spent a lot of (too much) time on it. I did an exhaustive report of my experiments in discourse. I think my main mistake was trying to cover everything, I didn’t have the time or energy to do so. As Brian mentioned in that post, the most important thing is to focus on a concrete use case. I see that you have quite a big use case for files/blob sending and receiving over network/file system. I’d focus on that and leave aside strings / canvas / etc.

I also like your idea of trying to implement a pure Elm (slow doesn’t matter) proof of concept to propose an API.

norpan · September 9, 2018, 4:37pm

Yes, that’s sort of what I’m doing. I’m just mentioning them in order to show that they are probably not incompatible.

norpan · September 9, 2018, 4:50pm

Yes, I think so too. It’s not the Elm way to just cover an existing API. I see very little use for the full flora of Typed Arrays, one good Bytes type should do the trick.

norpan · September 15, 2018, 6:09am

I’m very pleased to see the new elm/bytes repository. I think that covers the basic byte sequence type I describe here, and if people want to make this kind of tree-like structure for efficient concatenation and modification (instead of using the encoders), it can be implemented on top of it.

One thing is missing though, and that is shallow slicing. For our purposes we want to divide up a file into slices but avoid making new copies, which the decoder can’t do. Perhaps a decodeShallow could be made that uses shallow slicing for bytes decoders.

In fact, it looks like you can’t slice at all without copying the bytes before the slice and then throwing them away. But that can easily be fixed by a skip decoder or similar.
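
A sketch of what slicing through the decoder looks like with the `elm/bytes` API, using `Bytes.Decode.bytes` both to pass over the prefix and to take the slice. The module name and the helpers `sliceDecoder` and `slice` are hypothetical, and whether the discarded prefix is actually copied depends on how `Bytes.Decode.bytes` is implemented in the kernel, which this sketch does not assume either way.

```elm
module SliceSketch exposing (slice)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode exposing (Decoder)


-- Hypothetical helper: decode `offset` bytes and discard them,
-- then decode the `count` bytes that make up the slice.
sliceDecoder : Int -> Int -> Decoder Bytes
sliceDecoder offset count =
    Decode.map2 (\_ chunk -> chunk)
        (Decode.bytes offset)
        (Decode.bytes count)


-- Returns Nothing when the input has fewer than `offset + count` bytes.
slice : Int -> Int -> Bytes -> Maybe Bytes
slice offset count bytes =
    Decode.decode (sliceDecoder offset count) bytes
```

A dedicated skip decoder, as suggested above, would replace the first `Decode.bytes offset` with something that only advances the read position.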

I know, I don’t have a very convincing use case. I’m going to write that up in a new post when I find the time.

system · September 25, 2018, 6:09am

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.
