Recently I’ve been working on elm-brotli, a brotli decoder, and elm-flate, an implementation of DEFLATE compression.
Brotli is used in woff2 font files; DEFLATE is used in, among others, the gzip, png and woff file formats. Font parsing has some cool uses (static layout, automatic text breaking in svg), and generating png images is useful for webgl/games. These packages are low-level building blocks that can be used in many creative ways.
Unfortunately, they are quite slow. Performance is not horrible, but it’s not good either. Most of the issues arise because the current `elm/bytes` package does not support arrays well. The algorithms used in these packages assume that the data is represented in array form. They use and encode indices into arrays in ways that are hard and inefficient to do with just the current `Bytes.Decode.Decoder` type. The `elm/bytes` docs ask for use cases where the current api doesn’t quite work, so here goes.
Problem 1: Array Decoding/Encoding
The use of arrays means that the data has to actually be converted from `Bytes` to `Array Int`, then re-encoded again after decompression. The current `elm/bytes` implementation does not do that efficiently. For instance, the “natural” way to encode an array of bytes is slower than turning the whole thing into a string, and then encoding the string:
```elm
Benchmark.compare "encode array of bytes"
    "string"
    (\_ ->
        data
            |> Array.foldl (\e a -> a ++ String.fromChar (Char.fromCode e)) ""
            |> Encode.string
            |> Encode.encode
    )
    "bytes"
    (\_ ->
        data
            |> Array.foldr (\e a -> Encode.unsignedInt8 e :: a) []
            |> Encode.sequence
            |> Encode.encode
    )
```
With `--optimize`, "bytes" is usually about 15% slower than "string". Weird, right? The culprit is `Encode.sequence`, which internally needs to check the width of all elements of its argument. That makes sense in general, but is very wasteful when you know up-front that all elements have the same width, like with an array of bytes.
I actually use a more optimized encoder that groups bytes into sets of 4, then encodes them with `unsignedInt32`. That means only a quarter of the number of encoders, and therefore a quarter of the number of width checks. Fewer list elements also help with garbage collection: since all JS numbers use 8 bytes anyway, it’s a good idea to use them as efficiently as we can, which here means 4 bytes per number (`unsignedInt32` is the widest integer encoder available).
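A minimal sketch of that idea, assuming the input bytes come in as a `List Int` (the helper names are mine, not the actual code in elm-flate):

```elm
import Bitwise
import Bytes exposing (Endianness(..))
import Bytes.Encode as Encode exposing (Encoder)


-- pack groups of 4 bytes into a single 32-bit encoder: a quarter of the
-- encoders, so a quarter of the width checks in Encode.sequence
encodeByteList : List Int -> Encoder
encodeByteList bytes =
    Encode.sequence (encodeByteListHelp bytes [])


encodeByteListHelp : List Int -> List Encoder -> List Encoder
encodeByteListHelp bytes accum =
    case bytes of
        b1 :: b2 :: b3 :: b4 :: rest ->
            let
                -- combine 4 bytes into one integer; big-endian keeps the
                -- original byte order in the encoded output
                word =
                    Bitwise.or
                        (Bitwise.or (Bitwise.shiftLeftBy 24 b1) (Bitwise.shiftLeftBy 16 b2))
                        (Bitwise.or (Bitwise.shiftLeftBy 8 b3) b4)
            in
            encodeByteListHelp rest (Encode.unsignedInt32 BE word :: accum)

        b :: rest ->
            -- fewer than 4 bytes left: fall back to single-byte encoders
            encodeByteListHelp rest (Encode.unsignedInt8 b :: accum)

        [] ->
            List.reverse accum
```

In practice the data starts out in an `Array Int`, so the bytes would first be folded into a list (as in the "bytes" benchmark above) or walked by index, but the packing step is the same.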
This problem could be solved without large changes to the package. For instance, an `unsignedInt8Array : Array Int -> Encoder` (and a corresponding decoder) would help a lot.
Problem 2: Large constants
These packages use some rather large constants (brotli uses a dictionary array of over 200K elements), and initializing them is pretty slow. Currently such an array is initialized as `elmArrayFromList (ListFromJSArray jsArray)`, which (seems to) allocate the data 3 times: the JS array literal, the intermediate `List Int`, and the final `Array Int`. At least the `List Int -> Array Int` conversion could be removed with a rewrite rule in the compiler.
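For reference, these constants are plain `Array.fromList` calls over large literal lists; a tiny stand-in looks like this:

```elm
import Array exposing (Array)


-- tiny stand-in for the 200K+ element dictionary constant in elm-brotli;
-- the shape is what matters: a literal list is built first, and only then
-- converted into an Array
dictionary : Array Int
dictionary =
    Array.fromList [ 116, 104, 101, 32, 111, 102, 32, 97, 110, 100 ]
```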
Problem 3: Array Usage
Many of the algorithms expect fast array access, for instance run-length encoding and the corresponding decoding. This algorithm encodes e.g. “aaaaa” as “a<distance=1, length=4>”. To do that, it needs to find common substrings in the byte array, so we can store them once and then reference them from the other locations where they occur. In an imperative language, that would look like this:
```
maxLength = 258
length = 0
while buffer[i] == buffer[j] && length < maxLength:
    length += 1
    i += 1
    j += 1
```
The elm version is currently:
```elm
longestCommonPrefixLoop : Int -> Int -> Int -> Int -> Array Int -> Int
longestCommonPrefixLoop i j limit length buffer =
    -- `limit` is the index at which to stop scanning (start index + maxLength)
    if i < limit then
        case Array.get i buffer of
            Nothing ->
                length

            Just value1 ->
                case Array.get j buffer of
                    Nothing ->
                        length

                    Just value2 ->
                        -- use arithmetic to force inline the compare
                        if (value1 - value2) == 0 then
                            longestCommonPrefixLoop (i + 1) (j + 1) limit (length + 1) buffer

                        else
                            length

    else
        length
```
Because the definition is tail-recursive, it gets compiled to javascript that looks reasonably similar to the imperative code, but there are some important differences:
- the `buffer` is an elm `Array`. Its performance is generally good, but the byte arrays can become so large that the sheer number of array operations becomes a problem. The array operations also seem to cause a lot of GC time. Additionally, in contrast to c/rust, every element takes 8 bytes (JS numbers use 64 bits) while only 1 byte is really needed for our data. So just loading a 100kb file as an array uses 800kb of memory.
- I believe we don’t benefit (as much) from caching on the cpu because we’re using a high-level language. In c and similar languages, accessing adjacent elements is typically faster than random access because the relevant section of the array is cached and does not need to be retrieved from RAM.
- and the big one: the `Maybe` wrapping. The extra allocation, case matches, and garbage collection time for every array access are a killer.
Solutions
I think a dedicated `ByteArray` type makes sense, but there are some challenges in functional languages because of immutability, and many other details to think about.
- storage: a `ByteArray` should use only one byte of storage per byte of data. This could be built on top of `Uint8Array`.
- conversion: Conversion between `ByteArray` and `Array Int`/`List Int` should be fast, not only in code but also when large literals are used.
- maybes: should be avoided. I think this is possible with good design, focussing on slices and folds instead of individual accesses. Folds have the nice property that they don’t need to wrap every element in a maybe, and they can efficiently go through adjacent elements. A problem is that currently folds can only start at the very start or end of an array. Starting anywhere else requires an `Array.slice`, which will allocate. So repeated `Array.get` is the most efficient right now (see the comparison right after this list).
- immutability: The tricky thing about c-style arrays (big chunks of contiguous memory) in functional languages is that whenever an element is updated, we’d have to copy the whole thing to guarantee immutability. But we don’t want to copy our whole byte array whenever we make a change; that would defeat the purpose of a dedicated data structure (speed and memory efficiency).
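To illustrate the maybes point, here are the two styles side by side as a sketch with made-up names: the slice version reads better but allocates, while the repeated-get version is what I end up writing for speed, at the cost of a `Maybe` case per element.

```elm
import Array exposing (Array)


-- sum a region [from, to) by slicing first: clearer, but Array.slice allocates
sumRegionWithSlice : Int -> Int -> Array Int -> Int
sumRegionWithSlice from to buffer =
    Array.foldl (+) 0 (Array.slice from to buffer)


-- sum the same region with repeated Array.get: usually faster in practice,
-- but every access goes through a Maybe
sumRegionWithGet : Int -> Int -> Array Int -> Int
sumRegionWithGet from to buffer =
    sumRegionWithGetHelp from to 0 buffer


sumRegionWithGetHelp : Int -> Int -> Int -> Array Int -> Int
sumRegionWithGetHelp i to accum buffer =
    if i < to then
        case Array.get i buffer of
            Just value ->
                sumRegionWithGetHelp (i + 1) to (accum + value) buffer

            Nothing ->
                accum

    else
        accum
```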
But so far I haven’t really needed random write access into these large arrays. The main operations I’ve used are:

- `Array.get` an element somewhere
- working with a subsection: often a combination of `Array.slice` and `Array.foldl` better expresses what’s going on, but because of the cost of `Array.slice`, a loop of repeated `Array.get` is usually faster
- `Array.push` onto the end

So once data is in the array, it (almost always) stays there unchanged until the whole array is encoded into `Bytes` again.
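Putting those operations together, a hypothetical `ByteArray` API could look something like the sketch below. The module and function names are my own suggestions, not an existing package, and the bodies are stubbed out since a real implementation would need kernel code.

```elm
module ByteArray exposing (ByteArray, foldlRange, fromBytes, get, push, toBytes)

import Bytes exposing (Bytes)


-- hypothetical opaque type, meant to store one byte per element (for example
-- on top of a Uint8Array); the constructor is only a placeholder so this
-- sketch is a valid module
type ByteArray
    = ByteArray Bytes


-- cheap conversions to and from elm/bytes
fromBytes : Bytes -> ByteArray
fromBytes _ =
    Debug.todo "kernel code"


toBytes : ByteArray -> Bytes
toBytes _ =
    Debug.todo "kernel code"


-- individual access would still return a Maybe...
get : Int -> ByteArray -> Maybe Int
get _ _ =
    Debug.todo "kernel code"


-- append a byte at the end
push : Int -> ByteArray -> ByteArray
push _ _ =
    Debug.todo "kernel code"


-- ...but folding over the range [from, to) visits adjacent bytes without
-- allocating a slice and without wrapping every element in a Maybe
foldlRange : Int -> Int -> (Int -> a -> a) -> a -> ByteArray -> a
foldlRange _ _ _ _ _ =
    Debug.todo "kernel code"
```

Something like `foldlRange` would cover the slice-then-fold pattern above without the slice allocation, and the `Bytes` conversions would cover problem 1.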
Conclusion
There are probably improvements that can be made to the code/algorithms with the current primitives (help welcome!). But especially for elm-flate, the profiling is so dominated by data structure operations (array, list, dict), byte array en/decoding, and GC that it’s hard to see what other bottlenecks there are.
I hope that with these packages (and some real examples that use them) we can find more efficient primitives for working with binary data in elm.
If this fits into your projects or you run into (performance) trouble, let me know. Always happy to help.