Brotli is used in woff2 font files; DEFLATE is used in, among others, the gzip, png, and woff file formats. Font parsing has some cool uses (static layout, automatic text breaking in svg), and generating png images is useful for webgl/games. These packages are low-level building blocks that can be used in many creative ways.
Unfortunately, they are quite slow. Performance is not horrible, but it's not good either. Most of the issues arise because the current elm/bytes package does not support arrays well. The algorithms used in these packages assume that the data is represented in array form. They use and encode indices into arrays in ways that are hard and inefficient to express with just the current `Bytes.Decode.Decoder` type. The elm/bytes docs ask for use cases where the current API doesn't quite work, so here goes.
Problem 1: Array Decoding/Encoding
The use of arrays means that the data actually has to be decoded from `Bytes` into an `Array Int`, then re-encoded into `Bytes` again after decompression. The current elm/bytes implementation does not do that efficiently. For instance, the "natural" way to encode an array of bytes is slower than turning the whole thing into a string, and then encoding the string:
```elm
Benchmark.compare "encode array of bytes"
    "string"
    (\_ ->
        data
            |> Array.foldl (\e a -> a ++ String.fromChar (Char.fromCode e)) ""
            |> Encode.string
            |> Encode.encode
    )
    "bytes"
    (\_ ->
        data
            |> Array.foldr (\e a -> Encode.unsignedInt8 e :: a) []
            |> Encode.sequence
            |> Encode.encode
    )
```
"bytes" is usually about 15% slower than "string". Weird, right? The culprit is `Encode.sequence`, which internally needs to check the width of every element of its argument. That makes sense in general, but is very wasteful when you know up-front that all elements have the same width, like with an array of bytes.
I actually use a more optimized encoder that groups bytes into sets of 4, then encodes each group as one `unsignedInt32`. That means only a quarter of the encoders, and therefore a quarter of the width checks. Fewer list elements also help with garbage collection: since all JS numbers take 8 bytes anyway, it's a good idea to use them as efficiently as we can (which is 4 bytes per number, because bitwise operations in JS work on 32 bits).
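Such a grouping encoder can be sketched like this (the helper name is mine, not part of any package, and handling of leftover bytes when the length is not a multiple of 4 is left out):

```elm
import Bitwise
import Bytes exposing (Endianness(..))
import Bytes.Encode as Encode

-- sketch: pack four bytes into a single 32-bit encoder, so that
-- Encode.sequence has 4x fewer elements to width-check
encodeChunk : Int -> Int -> Int -> Int -> Encode.Encoder
encodeChunk b1 b2 b3 b4 =
    Encode.unsignedInt32 BE
        (Bitwise.shiftLeftBy 24 b1
            |> Bitwise.or (Bitwise.shiftLeftBy 16 b2)
            |> Bitwise.or (Bitwise.shiftLeftBy 8 b3)
            |> Bitwise.or b4
        )
```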
This problem could be solved without large changes to the package. For instance, an `unsignedInt8Array : Array Int -> Encoder` (and a corresponding decoder) would help a lot.
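For comparison, this is roughly what decoding n bytes into an `Array Int` looks like with the current API — one `unsignedInt8` decoder and one `Array.push` per byte (a sketch; `byteArray` and `step` are my names, not part of elm/bytes):

```elm
import Array exposing (Array)
import Bytes.Decode as Decode

-- sketch: decode n bytes into an Array Int, one byte at a time
byteArray : Int -> Decode.Decoder (Array Int)
byteArray n =
    Decode.loop ( n, Array.empty ) step

step : ( Int, Array Int ) -> Decode.Decoder (Decode.Step ( Int, Array Int ) (Array Int))
step ( remaining, accum ) =
    if remaining <= 0 then
        Decode.succeed (Decode.Done accum)

    else
        Decode.map
            (\byte -> Decode.Loop ( remaining - 1, Array.push byte accum ))
            Decode.unsignedInt8
```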
Problem 2: Large constants
These packages use some rather large constants (brotli uses a dictionary array of over 200K elements), and initializing them is pretty slow. Currently such arrays are initialized as `elmArrayFromList (ListFromJSArray jsArray)`, which (seems to) allocate the data 3 times. At least the `List Int -> Array Int` step could be removed with a rewrite rule in the compiler.
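Concretely, such a constant is written as a literal (heavily elided sketch below; the real dictionary has over 200K elements), and every element passes through a JS array, then an elm `List`, and finally an elm `Array`:

```elm
import Array exposing (Array)

-- sketch of a large constant: the literal is compiled to a JS array,
-- converted to a List, then converted again to an Array
dictionary : Array Int
dictionary =
    Array.fromList
        [ 116, 104, 101, 32, 111, 102 {- …200K+ more elements -} ]
```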
Problem 3: Array Usage
Many of the algorithms expect fast array access, for instance run-length encoding and the corresponding decoding. This algorithm encodes e.g. "aaaaa" as "a<distance=1, length=4>". To do that, it needs to find common substrings in the byte array, so that we can store them once and then reference them from the other locations where the substring occurs. In an imperative language, that would look like this:
```
maxLength = 258
length = 0

while buffer[i] == buffer[j] && length < maxLength:
    length += 1
    i += 1
    j += 1
```
The elm version is currently:
```elm
longestCommonPrefixLoop : Int -> Int -> Int -> Int -> Array Int -> Int
longestCommonPrefixLoop i j limit length buffer =
    if i < limit then
        case Array.get i buffer of
            Nothing ->
                length

            Just value1 ->
                case Array.get j buffer of
                    Nothing ->
                        length

                    Just value2 ->
                        -- use arithmetic to force inline the compare
                        if (value1 - value2) == 0 then
                            longestCommonPrefixLoop (i + 1) (j + 1) limit (length + 1) buffer

                        else
                            length

    else
        length
```
`buffer` is an elm `Array`. Its performance is generally good, but the byte arrays can become so large that the sheer number of array operations becomes a problem. The array operations also seem to cause a lot of GC time. Additionally, in contrast to c/rust, every element takes 8 bytes (JS numbers use 64 bits) while only 1 byte is really needed for our data. So just loading a 100kb file as an array uses 800kb of memory.
- I believe we don't benefit (as much) from caching on the cpu because we're using a high-level language. In c and similar languages, accessing adjacent elements is typically faster than random access, because the relevant section of the array is cached and does not need to be retrieved from RAM.
- and the big one: the `Maybe` wrapping. The extra allocation, case matches, and garbage collection time for every array access are a killer.
I think a dedicated `ByteArray` type makes sense, but there are some challenges in functional languages because of immutability, and many other details to think about.
A `ByteArray` should use only one byte of storage per byte of data. This could be built on top of …

- conversion: conversion between `List Int` and `ByteArray` should be fast, not only in code but also when large literals are used.
- maybes: should be avoided. I think this is possible with good design, focusing on slices and folds instead of individual accesses. Folds have the nice property that they don't need to wrap every element in a `Maybe`, and they can efficiently go through adjacent elements. A problem is that currently folds can only start at the very beginning or end of an array; starting anywhere else requires an `Array.slice`, which will allocate. So repeated `Array.get` is the most efficient option right now.
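As an illustration, a fold over a sub-range of today's `Array` still has to go through `Array.get` (and thus a `Maybe`) for every element; the helper below is hypothetical, a `ByteArray` could offer something like it natively:

```elm
import Array exposing (Array)

-- hypothetical: fold over buffer[start..end) without allocating a slice,
-- built from repeated Array.get; note the Maybe handling on every access
foldlRange : Int -> Int -> (Int -> b -> b) -> b -> Array Int -> b
foldlRange start end f accum buffer =
    if start >= end then
        accum

    else
        case Array.get start buffer of
            Nothing ->
                accum

            Just value ->
                foldlRange (start + 1) end f (f value accum) buffer
```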
immutability: the tricky thing about c-style arrays (big chunks of contiguous memory) in functional languages is that whenever an element is updated, we'd have to copy the whole thing to guarantee immutability. But we don't want to copy our whole byte array whenever we make a change; that would defeat the purpose of a dedicated data structure (speed and memory efficiency).
But so far I haven't really needed random write access into these large arrays. The main operations I've used are:

- `Array.get` an element somewhere
- working with a subsection: a combination of `Array.slice` and `Array.foldl` often better expresses what's going on, but because of the cost of `Array.slice`, a loop of repeated `Array.get` is usually faster.
- `Array.push` onto the end

So once data is in the array, it (almost always) stays there unchanged until the whole array is encoded into `Bytes`.
There are probably improvements that can be made to the code/algorithms with the current primitives (help welcome!). But especially for elm-flate, the profiling is so dominated by data structure operations (array, list, dict), byte array en/decoding, and GC that it's hard to see what other bottlenecks there are.
I hope that with these packages (and some real examples that use them) we can find more efficient primitives for working with binary data in elm.
If this fits into your projects or you run into (performance) trouble, let me know. Always happy to help.