Elm-flate & elm-brotli: (de)compression in elm

Recently I’ve been working on elm-brotli, a brotli decoder, and elm-flate, an implementation of DEFLATE compression.

Brotli is used in woff2 font files; DEFLATE is used in, among others, the gzip, png, and woff file formats. Font parsing has some cool uses (static layout, automatic text breaking in svg), and generating png images is useful for webgl/games. These packages are low-level building blocks that can be used in many creative ways.

Unfortunately, they are quite slow. Performance is not horrible, but it’s not good either. Most of the issues arise because the current elm/bytes package does not support arrays well. The algorithms used in these packages assume that the data is represented in array form. They use and encode indices into arrays in ways that are hard and inefficient to express with just the current Bytes.Decode.Decoder type. The elm/bytes docs ask for use cases where the current API doesn’t quite work, so here goes.

Problem 1: Array Decoding/Encoding

The use of arrays means that the data has to be converted from Bytes to Array Int, then encoded again after decompression. The current elm/bytes implementation does not do that efficiently. For instance, the “natural” way to encode an array of bytes is slower than turning the whole thing into a string and then encoding the string:

Benchmark.compare "encode array of bytes"
    "string"
    (\_ ->
        data
            |> Array.foldl (\e a -> a ++ String.fromChar (Char.fromCode e)) ""
            |> Encode.string
            |> Encode.encode
    )
    "bytes"
    (\_ ->
        data
            |> Array.foldr (\e a -> Encode.unsignedInt8 e :: a) []
            |> Encode.sequence
            |> Encode.encode
    )

With --optimize, "bytes" is usually about 15% slower than "string". Weird, right? The culprit is Encode.sequence, which internally needs to check the width of all elements of its argument. That makes sense in general, but is very wasteful when you know up-front that all the elements have the same width, like with an array of bytes.

I actually use a more optimized encoder that groups bytes into sets of 4, then encodes each group as an unsignedInt32. That means only a quarter the number of encoders, and therefore a quarter the number of width checks. Fewer list elements also help with garbage collection: since every JS number takes 8 bytes anyway, it’s a good idea to use that space as efficiently as we can (4 bytes per number, since bitwise operations in JS are limited to 32 bits).
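
For illustration, a minimal sketch of that grouping (hypothetical helper names, operating on a List Int rather than the package’s actual internals):

import Bitwise
import Bytes exposing (Bytes, Endianness(..))
import Bytes.Encode as Encode exposing (Encoder)

-- pack groups of 4 bytes into one 32-bit encoder each; big-endian
-- keeps the bytes in their original order. The final 0-3 leftover
-- bytes fall back to single-byte encoders.
toWordEncoders : List Int -> List Encoder -> List Encoder
toWordEncoders bytes accum =
    case bytes of
        b1 :: b2 :: b3 :: b4 :: rest ->
            let
                word =
                    Bitwise.or
                        (Bitwise.or (Bitwise.shiftLeftBy 24 b1) (Bitwise.shiftLeftBy 16 b2))
                        (Bitwise.or (Bitwise.shiftLeftBy 8 b3) b4)
            in
            toWordEncoders rest (Encode.unsignedInt32 BE word :: accum)

        remainder ->
            -- encoders were accumulated in reverse; restore the order
            List.foldl (\b acc -> Encode.unsignedInt8 b :: acc) accum remainder
                |> List.reverse

encodeByteList : List Int -> Bytes
encodeByteList bytes =
    Encode.encode (Encode.sequence (toWordEncoders bytes []))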

This problem could be solved without large changes to the package. For instance, an unsignedInt8Array : Array Int -> Encoder (and a corresponding decoder) would help a lot.

Problem 2: Large constants

These packages use some rather large constants (brotli uses a dictionary array of over 200K elements), and initializing them is pretty slow. Currently such arrays are initialized as elmArrayFromList (ListFromJSArray jsArray), which seems to allocate the data three times. At least the List Int -> Array Int conversion could be removed with a rewrite rule in the compiler.
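
For context, this is the shape such a constant has in Elm source (the values below are placeholders):

import Array exposing (Array)

-- a large constant (the real brotli dictionary has over 200K elements;
-- elided here). The compiled JS first builds a JS array literal,
-- converts it to a List, and then builds the Array from that List.
dictionary : Array Int
dictionary =
    Array.fromList [ 116, 104, 101 {- , ... -} ]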

Problem 3: Array Usage

Many of the algorithms expect fast array access, for instance run-length encoding and the corresponding decoding. This algorithm encodes e.g. “aaaaa” as “a<distance=1, length=4>”. To do that, it needs to find common substrings in the byte array, so we can store them once and then reference them from other locations where the substring occurs. In an imperative language, that would look like this:

maxLength = 258
length = 0
while buffer[i] == buffer[j] && length < maxLength:
    length += 1
    i += 1
    j += 1

The elm version is currently:

longestCommonPrefixLoop : Int -> Int -> Int -> Int -> Array Int -> Int
longestCommonPrefixLoop i j maxLength length buffer =
    if length < maxLength then
        case Array.get i buffer of
            Nothing ->
                length

            Just value1 ->
                case Array.get j buffer of
                    Nothing ->
                        length

                    Just value2 ->
                        -- use arithmetic to force inline the compare
                        if (value1 - value2) == 0 then
                            longestCommonPrefixLoop (i + 1) (j + 1) maxLength (length + 1) buffer

                        else
                            length

    else
        length

Because the definition is tail-recursive, it gets compiled to JavaScript that looks reasonably similar to the imperative code, but there are some important differences:

  • the buffer is an elm Array. Its performance is generally good, but the byte arrays can become so large that the sheer number of array operations becomes a problem. The array operations also seem to cause a lot of GC time. Additionally, in contrast to C/Rust, every element takes 8 bytes (JS numbers use 64 bits) while only 1 byte is really needed for our data. So just loading a 100KB file as an array uses 800KB of memory.
  • I believe we don’t benefit (as much) from CPU caching because we’re using a high-level language. In C and similar languages, accessing adjacent elements is typically faster than random access because the relevant section of the array is cached and does not need to be retrieved from RAM.
  • and the big one: the Maybe wrapping. The extra allocation, case matching, and garbage collection time for every array access are a killer.

Solutions

I think a dedicated ByteArray type makes sense, but there are some challenges in functional languages because of immutability, and many other details to think about.

  1. storage: a ByteArray should use only one byte of storage per byte of data. This could be built on top of Uint8Array.

  2. conversion: Conversion between ByteArray and Array Int/List Int should be fast, both at runtime and when large literals are used.

  3. maybes: should be avoided. I think this is possible with good design, focusing on slices and folds instead of individual accesses. Folds have the nice property that they don’t need to wrap every element in a Maybe, and they can efficiently go through adjacent elements. A problem is that currently folds can only start at the very start or end of an array. Starting anywhere else requires an Array.slice, which allocates. So repeated Array.get is the most efficient option right now (see the sketch after this list).

  4. immutability: The tricky thing about C-style arrays (big chunks of contiguous memory) in functional languages is that whenever an element is updated, we’d have to copy the whole thing to guarantee immutability. But we don’t want to copy the whole byte array on every change; that would defeat the purpose of a dedicated data structure (speed and memory efficiency).

    But so far I haven’t really needed random write access into these large arrays; the main operations I’ve used are:

    • Array.get an element somewhere
    • working with a subsection: often a combination of Array.slice and Array.foldl better expresses what’s going on, but because of the cost of Array.slice, a loop of repeated Array.get is usually faster.
    • Array.push onto the end

    So once data is in the array, it (almost always) stays there unchanged until the whole array is encoded into Bytes again.
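
A sketch of the workaround mentioned in point 3, a fold over an index range built from repeated Array.get (hypothetical name; a built-in ranged fold could skip the Maybe per element entirely):

import Array exposing (Array)

-- fold over the elements at indices [ start, end ) with repeated
-- Array.get, avoiding the allocation that Array.slice would cause
foldlRange : Int -> Int -> (Int -> b -> b) -> b -> Array Int -> b
foldlRange start end step accum buffer =
    if start < end then
        case Array.get start buffer of
            Just value ->
                foldlRange (start + 1) end step (step value accum) buffer

            Nothing ->
                -- out of bounds: stop early
                accum

    else
        accum

-- e.g. sum the bytes at indices 10..19: foldlRange 10 20 (+) 0 buffer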

Conclusion

There are probably improvements that can be made to the code/algorithms with the current primitives (help welcome!). But especially for elm-flate, the profiling is so dominated by data structure operations (array, list, dict), byte array encoding/decoding, and GC that it’s hard to see what other bottlenecks there are.

I hope that with these packages (and some real examples that use them) we can find more efficient primitives for working with binary data in elm.

If this fits into your projects or you run into (performance) trouble, let me know. Always happy to help.


I get the impression that an immutable language ought to be able to slice a large, incremental data structure like an array or even a dict or a list without having to copy the payload data. Just allocate a new, smaller structure for the metadata describing which and how much of the payload belongs to the smaller slice.

Or even re-use some structure from the calling parameters, since the caller is already specifying the begin/end indices or keys they want included in their slice.

In the circumstance where “a large number of concrete use cases will require handling large structures and processing those in arbitrary chunks”, I think this sounds like an ideal way to meet such a need without any complication or negative performance impact for more mundane uses. I’ll be interested to see what direction things are taken in, though. 🙂
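
A minimal sketch of that idea (hypothetical types and names): the slice shares the payload and only stores offsets.

import Array exposing (Array)

-- a slice is just metadata over a shared, unchanged payload
type Slice a
    = Slice { start : Int, length : Int, payload : Array a }

slice : Int -> Int -> Array a -> Slice a
slice from to arr =
    Slice { start = from, length = max 0 (to - from), payload = arr }

get : Int -> Slice a -> Maybe a
get index (Slice s) =
    if index >= 0 && index < s.length then
        Array.get (s.start + index) s.payload

    else
        Nothing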


From talking to Robin (who implemented the current Array) I understood that this Array implementation has an O(n) worst case for slice, but there are some alternative approaches that can guarantee O(1) slicing.

In general, Array is really a greatest-common-denominator data structure. For instance, the Array can’t really know whether I want to just iterate over a slice, or also set/push. It tries to make all of these efficient. So if a slice were really just a start and end point, a push onto the sliced part would be expensive, because then copying has to be done. The same goes for Array.set.

I think a fold that can start and stop at an arbitrary position in the array would be very helpful and could be implemented efficiently. Currently Array also doesn’t have an update function, so you pay the cost of “finding” an element twice if you want to modify a specific value (get, then set), as in the sketch below. But there’s a PR for that.
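
What you have to write today looks like this (a sketch: get, then set, paying the search cost twice):

import Array exposing (Array)

-- find the element with get, then find the same position again with set
update : Int -> (a -> a) -> Array a -> Array a
update index f arr =
    case Array.get index arr of
        Just value ->
            Array.set index (f value) arr

        Nothing ->
            arr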

As an aside, I’ve been getting some good speed improvements by building more specific data structures: a ByteArray that can store 4 bytes in one integer (fewer elements, many operations much faster) and a BitSet that speeds up some Dict operations where the domain is known and small.
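
To illustrate the BitSet idea (illustrative code, not the actual implementation): for a domain of small integers (here 0-31), membership fits in the bits of a single Int, so insert and member are a couple of bitwise operations with no allocation.

import Bitwise

type BitSet
    = BitSet Int

empty : BitSet
empty =
    BitSet 0

insert : Int -> BitSet -> BitSet
insert n (BitSet bits) =
    BitSet (Bitwise.or bits (Bitwise.shiftLeftBy n 1))

member : Int -> BitSet -> Bool
member n (BitSet bits) =
    Bitwise.and bits (Bitwise.shiftLeftBy n 1) /= 0

-- larger (but still small, known) domains would use an Array of words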

super aside: To really optimize data structure usage, I think you’d need uniqueness types: the assurance that there are no other references to a piece of data. In that case in-place updates are safe (because you know nobody will notice: there are no other references), which would really speed up array code and record updates. From a technical standpoint that makes sense for elm (models are unique, lots of record usage), but I’m not sure the idea is fully battle-tested, or whether it would fit with other aspects of elm (beginner friendliness, simple syntax, etc.)


I’ve published a 2.0.0 with some big performance improvements. The big one is a ByteArray type that stores 4 bytes in a single integer. That means arrays are now a quarter of the size, and lookup of consecutive elements is faster in some cases. In particular, copying segments (a common operation in inflate) only requires one lookup per 4 elements. The GC times are also much better, as are conversions from/to Bytes.
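
To illustrate (a sketch of the idea, not the package’s actual code): byte i lives in word i // 4, and a shift plus a mask extract it.

import Array exposing (Array)
import Bitwise

type ByteArray
    = ByteArray Int (Array Int) -- length in bytes, words of 4 packed bytes

get : Int -> ByteArray -> Maybe Int
get index (ByteArray length words) =
    if index < 0 || index >= length then
        Nothing

    else
        Array.get (index // 4) words
            |> Maybe.map
                (\word ->
                    -- byte 0 sits in the most significant position
                    Bitwise.and 0xFF (Bitwise.shiftRightZfBy (8 * (3 - modBy 4 index)) word)
                )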

This type/module isn’t exposed yet, but I could split it out if there is interest. The full implementation is here.

The version change is major because I swapped List Code for Array Code in LZ77. Because of how the data is produced and consumed, arrays are significantly faster.

