Triggered by this thread, I’ve been working on an optimized `elm/bytes`-based Base64 encoder/decoder.
Here are some notes. I hope they are helpful if you are writing and optimizing bytes encoders and decoders. As always, premature optimization is the root of all evil, and it is really important to actually run a benchmark: I still often have wrong intuitions about how fast something is, because the elm => js => jit => hardware pipeline has some surprises. But when using bytes we really want all the performance we can get, so here goes.
I picked @danfishgold’s danfishgold/base64-bytes as a starting point. The code is very readable; it alone was enough to understand what Base64 encoding is and how it works. The package also has a complete set of fuzz tests, which were very helpful during the optimization process.
Removing intermediate data structures
Or more generally, minimize allocation. @ianmackenzie pointed this out to me when working on Voronoi tessellation in elm-geometry. The JIT optimizer is quite good at fast computation, but you pay a heavy price for allocation.
For instance, the original code turns a string into a list of characters, then into a list of integers, and finally a `Bytes` sequence. On its own that is good practice: it splits the process into self-contained pieces. But in tight loops the cost of allocating the intermediate lists is a problem. The phases can be fused together to eliminate materializing the intermediate data structures.
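As a sketch of the idea (the names here are mine, not the package's actual functions), the staged pipeline materializes two intermediate lists, while the fused version folds character conversion directly into building the encoders:

```elm
import Bytes exposing (Bytes)
import Bytes.Encode as Encode

-- Staged: each step materializes a full intermediate list.
encodeStaged : String -> Bytes
encodeStaged input =
    input
        |> String.toList                -- List Char
        |> List.map Char.toCode         -- List Int
        |> List.map Encode.unsignedInt8 -- List Encoder
        |> Encode.sequence
        |> Encode.encode

-- Fused: one fold over the string, so the intermediate
-- List Char and List Int are never allocated.
encodeFused : String -> Bytes
encodeFused input =
    String.foldr
        (\char encoders -> Encode.unsignedInt8 (Char.toCode char) :: encoders)
        []
        input
        |> Encode.sequence
        |> Encode.encode
```

The fused version still builds one list of encoders (that is how `Encode.sequence` works), but it drops two of the three intermediate structures.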
Another optimization I found amusing is in the following benchmark:

```elm
a = 34
b = 42
c = 140

-- in practice calculating `combined` is cheap (a bitshift)
-- if we had to calculate it like below the uint16 case is slower
combined =
    Bitwise.or (Bitwise.shiftLeftBy 8 a) b

encodingInts =
    Benchmark.compare "byte encoding performance"
        "with uint16"
        (\_ ->
            Encode.encode <|
                Encode.sequence
                    [ Encode.unsignedInt16 BE combined
                    , Encode.unsignedInt8 c
                    ]
        )
        "only uint8"
        (\_ ->
            Encode.encode <|
                Encode.sequence
                    [ Encode.unsignedInt8 a
                    , Encode.unsignedInt8 b
                    , Encode.unsignedInt8 c
                    ]
        )
```
The uint16 case is faster. This is not because `unsignedInt16` is faster than two `unsignedInt8` writes: elm/bytes writes to a `DataView`, which internally implements writing a u16 as writing two u8s. In Chrome writing two u8s is actually much faster than one u16 (in FF they are comparable). The actual reason is that one fewer list element is allocated and there is one fewer list element to go through for each 24-bit chunk, which is significant when encoding KBs of data.
Bitwise operations are intimidating and cryptic: they require a lot of practice to (quickly) understand and use effectively. But in this case they are convenient and fast: the original code used integer division, modulo, and exponentiation by a power of 2 to achieve what is effectively a bit shift. Using actual bit shifts is much faster and communicates the intent more clearly.
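For example, extracting the top byte of a 24-bit chunk can be written either way; this is a sketch (the function names are mine), both assuming `bits` holds a 24-bit value:

```elm
import Bitwise

-- Arithmetic version: integer division and modulo by powers of 2.
firstByteSlow : Int -> Int
firstByteSlow bits =
    (bits // (2 ^ 16)) |> modBy 256

-- Bitwise version: shift right by 16 bits, mask the low 8 bits.
-- Same result, but communicates "take byte 3" directly.
firstByteFast : Int -> Int
firstByteFast bits =
    Bitwise.and 0xFF (Bitwise.shiftRightZfBy 16 bits)
```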
Both encoding and decoding need to perform checks: are characters within the Base64 range, and do the bits actually represent a Base64 character? It is good practice to use types like `Maybe` (make impossible states unrepresentable), but in tight loops there is a real cost to using such types. For instance, wrapping an integer in `Just` allocates and is a function call.
On an input of 1KB, decoding is ~4x faster and encoding ~3.5x faster. That means roughly 13MB/s decoding and 9MB/s encoding (in Chrome with `--optimize`; in FF the number of MB/s is lower but the speedup is still there). Not shocking considering it is possible to parse a GB of JSON in a second, but a significant improvement nonetheless.
In general it is nice to have a reference point for decoding speed with `elm/bytes` in MB/s. The current code can be found at base64-benchmarks.
PS: If you have further optimization suggestions, let me know!