Optimizing Base64 encoding/decoding

#1

Triggered by this thread I’ve been working on an optimized elm/bytes-based Base64 encoder/decoder.

Here are some notes. I hope this is helpful if you are writing and optimizing bytes encoders and decoders. As always, premature optimization is the root of all evil. It is also really important to actually run a benchmark. I still often have wrong intuitions about how fast something is: the elm => js => jit => hardware pipeline has some surprises. But when using bytes, we really want all the performance we can get, so here goes.

I picked @dan s danfishgold/base64-bytes as a starting point. The code is very readable, it alone was enough to understand what Base64 encoding is and how it works. The package also has a complete set of fuzz tests which were very helpful during the optimization process.

Removing intermediate data structures

Or more generally, minimize allocation. @ianmackenzie pointed this out to me when working on Voronoi tesselation in elm-geometry. The jit optimizer is quite good at fast computation, but you pay a heavy price for allocation.

For instance, the original code turns a string into a list of characters, then into a list of integers, and finally a Bytes sequence. On its own that is good practice: it splits the process into self-contained pieces. But in tight loops the cost of allocating the intermediate lists is a problem. The phases can be fused together to eliminate materializing the intermediate data structures.

Another optimization I found amusing is that in

a = 34
b = 42
c = 140
-- in practice calculating `combined` is cheap (a bitshift)
-- if we had to calculate it like below the uint16 case is slower
combined = Bitwise.or (Bitwise.shiftLeftBy 8 a) b

encodingInts =
    Benchmark.compare "byte encoding performance"
        "with uint16"
        (\_ ->
            Encode.encode <|
                Encode.sequence
                    [ Encode.unsignedInt16 BE combined
                    , Encode.unsignedInt8 c
                    ]
        )
        "only uint8"
        (\_ ->
            Encode.encode <|
                Encode.sequence
                    [ Encode.unsignedInt8 a
                    , Encode.unsignedInt8 b
                    , Encode.unsignedInt8 c
                    ]
        )

The uint16 case is faster. This is not because unsignedInt16 is faster than two unsignedInt8s! The elm/bytes package uses a javascript DataView which internally implements writing u16 as writing two u8s. In Chrome writing two u8s is actually much faster than one u16 (in FF they are comparable). The actual reason is that one fewer list element is allocated and there is one fewer list element to go through for each 24-bit chunk, which is significant when decoding KBs of data.

Bitwise operations

Bitwise operations are intimidating and cryptic: they require a lot of practice to (quickly) understand and effectively use. But in this case they are convenient and fast: the original code used integer division, modulo and exponentiation by a power of 2 to achieve what is effectively a bitshift. Using actual bit shifts is much faster and more clearly communicates the intent.

Safety checks

Both encoding and decoding need to perform checks. Are characters within the base64 range, and do the bits actually represent a base64 character. It is good practice to use types like Maybe (make impossible states unrepresentable), but in tight loops there is a real cost to using such types. For instance wrapping an integer in Just allocates and is a function call.

Results

base64-result

On an input of 1KB, decoding is ~4x faster, encoding ~3.5x. this means roughly 13Mb/s decoding and 9MB/s encoding. (In chrome with --optimize, in FF the number of MB/s is lower but the speedup is still there). Not shocking considering it is possible to parse a GB of JSON in a second, but a significant improvement nonetheless.

In general it is nice to have a reference point of decoding speed with elm/bytes in MB/s. The current code can be found at base64-benchmarks.

PS. if you have further optimization suggestions let me know!

14 Likes
#2

This is awesome and super interesting! Thank you for doing this :heart:

I’d love to add your improvements. Do you want to send a pull request or should I add them myself?

#3

@folkertdev great write up! I wondered if there was a way to improve loopHelp in Decode.elm, because it allocated a tuple on each step.

I tried to change it to:

loopHelp : Int -> Int -> String -> Decode.Decoder (Decode.Step String String)
loopHelp remaining maxLength string =
    if maxLength - String.length string > 0 then
        Decode.map3 (\a b c -> Decode.Loop (string ++ bitsToChars (Bitwise.shiftLeftBy 16 a + Bitwise.shiftLeftBy 8 b + c) 0))
            Decode.unsignedInt8
            Decode.unsignedInt8
            Decode.unsignedInt8

    else if remaining == 0 then
        Decode.succeed (Decode.Done string)

    else if remaining == 2 then
        Decode.map2 (\a b -> Decode.Done (string ++ bitsToChars (Bitwise.shiftLeftBy 16 a + Bitwise.shiftLeftBy 8 b) 1))
            Decode.unsignedInt8
            Decode.unsignedInt8

    else
        -- remaining == 1
        Decode.map (\a -> Decode.Done (string ++ bitsToChars (Bitwise.shiftLeftBy 16 a) 2))
            Decode.unsignedInt8

and then partially applied loopHelp (modBy 3 width) (width // 3 * 4)

Unfortunately this didn’t improve anything.

1 Like
#4

On its own that is good practice: it splits the process into self-contained pieces. … It is good practice to use types like Maybe (make impossible states unrepresentable)

One of the challenges in optimizing code is making sure you don’t sacrifice correctness. (It’s easy to be fast if you can be wrong!) One approach is to have an idiomatic but slow “reference” implementation that you can easily test and reason about. Then use fuzz testing to ensure that the reference and optimized implementations are equivalent.

5 Likes
#5

The code presented here has now been merged into master and will be part of the next release of base64-bytes.

@unsoundscapes interesting. I know that haskell under some circumstances will realize that it creates a tuple, only to immediately pattern match on it again (in the next loop iteration).
It then unboxes the tuple saving some indirection. Perhaps the jit does something similar? It would be convenient if elm could do that kind of optimization for this kind of work.

@mgold absolutely, which is why it was so nice the package already had a solid set of fuzz tests.
Also i remember some 1000000x speedups because a function just immediately returned Nothing.

2 Likes
closed #6

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.