elm-sha2 is a pure elm implementation of the sha-2 algorithms (sha224, sha256, sha384, and sha512). The implementation is based on
elm/bytes, which means it is faster (especially for larger inputs) and more memory-efficient than current solutions (which all had to work around the lack of byte array support). I have also opened a PR for a faster sha1 on the TSFoster/elm-sha1 repo, which will hopefully be released in the coming weeks. I’ve written down some thoughts on going from a non-elm/bytes to an elm/bytes implementation, and some other noteworthy optimizations.
The story here is that I started on the PR first, based on this thread. That thread also convinced me to “just” do sha2 as well. The sha algorithms are very similar in structure, and the existing sha2 implementations had the same problem as sha1: their performance scales poorly for large inputs. It’s not their fault: they are all good packages, with documentation, examples and tests. But with
elm/bytes we can go back and see how that changes things. Just to be clear, I do this kind of experimentation out of personal interest, and I don’t want to criticize anyone’s hard work.
What is SHA1?
sha1 is a hashing algorithm: given an input sequence of bytes, it produces a 160-bit integer that identifies the input. Changing even one bit in the input will lead to a wildly different output. The output of sha1 is called a ‘digest’. sha1 processes the input in chunks of 512 bits (64 bytes) at a time.
Bytes as the default.
The current code uses a
List Int that stores byte values. Once again, this made perfect sense before
elm/bytes, but it uses a lot of memory, and encourages less-efficient code than a Bytes-based implementation.
As an overarching theme, the current code makes many small modifications to potentially long data structures. For instance, split a chunk into 32-bit words, then map a function over every word, then convert to an array.
words = chunk |> groupsOf 4 |> List.map wordFromInts |> Array.fromList
This style is widely used, especially in FP teaching material based on Haskell. In a strict language like elm though,
groupsOf will allocate its result list, then
List.map does the same, then
Array.fromList again. So a data structure roughly the size of the input is allocated and traversed 3 times. In Haskell, a combination of laziness and compiler optimizations (the foldr/build rule) generates code that does only one allocation and one traversal.
So in elm, for performance-sensitive code, read/decode as little as possible, then push each value through your whole processing pipeline.
List.map f << List.map g is equivalent to
List.map (f << g), but the latter is much faster because no intermediate list is produced and the list is traversed only once.
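The same fusion can be written out by hand in JavaScript (which is what compiled elm executes); a minimal sketch with made-up functions f and g:

```javascript
const xs = [1, 2, 3, 4];
const g = (x) => x + 1;
const f = (x) => x * 2;

// Two passes: an intermediate array is allocated for g's results,
// then a second array for f's results.
const twoPasses = xs.map(g).map(f);

// One pass: composing the functions avoids the intermediate array
// and traverses the list only once.
const onePass = xs.map((x) => f(g(x)));

// Both produce [4, 6, 8, 10].
```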
Splitting into chunks
My definition of the chunk processing function looks like this:
u32 : Decoder Int
u32 =
    Decode.unsignedInt32 BE

reduceBytesMessage : State -> Decoder State
reduceBytesMessage state =
    map16 (reduceMessage state) u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
Because in this case the 512-bit chunk is the smallest piece of information the algorithm consumes. By batching its decoding with a custom
map16, the decoding cost is minimized. The algorithm needs 32-bit words, and with
elm/bytes those can be decoded immediately, rather than decoding bytes (8-bit integers) and then stitching them back into 32-bit words.
The current code instead splits the whole input
List Int into groups of 64 bytes up front. The new code again follows the “only read/decode the smallest piece you need” theme: iterating the above decoder
numberOfChunks times never allocates a list of chunks.
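In JavaScript terms, this corresponds roughly to reading 16 big-endian 32-bit words per chunk straight out of a DataView, calling a step function per word instead of building a list of chunks first. A hypothetical sketch, not the package’s actual kernel code:

```javascript
// Visit one 64-byte chunk as 16 big-endian 32-bit words, calling
// `step` once per word; no intermediate arrays are allocated.
function forEachWordInChunk(buffer, chunkIndex, step) {
  const view = new DataView(buffer);
  const base = chunkIndex * 64;
  for (let i = 0; i < 16; i++) {
    step(view.getUint32(base + 4 * i, false)); // false = big-endian
  }
}
```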
Processing a chunk
To process a chunk, the current code uses an array to store the words of the chunk, then iterates this function 64 times. Each call looks up previous elements, combines them, and pushes the result onto the array.
reduceWords : Int -> Array Int -> Array Int
reduceWords index words =
    let
        v i =
            Array.get (index - i) words

        val =
            [ v 3, v 8, v 14, v 16 ]
                |> List.filterMap identity
                |> List.foldl Bitwise.xor 0
                |> rotateLeftBy 1
    in
    Array.push val words
In the Chrome devtools, I saw that most of the time of the whole hashing process was spent in
Array.get, which is weird for a function that should do (bitwise) arithmetic. The performance bottleneck was in the implementation – the
Array – not the algorithm. My version spends most of its time on bitwise operations, as expected.
I noticed that only the 16 right-most elements of the array were ever used, and decided to try inlining the array elements as function arguments. That removes the overhead of the array. The resulting code looks weird, but is much faster (leading to a ~5X overall speedup):
reduceWordsHelp i deltaState b16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 =
    if (i - blockSize) < 0 then
        let
            value =
                b3
                    |> Bitwise.xor b8
                    |> Bitwise.xor b14
                    |> Bitwise.xor b16
                    |> rotateLeftBy 1
        in
        reduceWordsHelp (i + 1)
            (calculateDigestDeltas (i + numberOfWords) value deltaState)
            b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 value

    else
        deltaState
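For reference, one message-schedule step (xor four earlier words, rotate left by one) can be expressed directly in JavaScript, since elm’s Bitwise functions compile to the corresponding JS operators. rotateLeftBy and scheduleStep here are hypothetical helpers for illustration:

```javascript
// Rotate a 32-bit word left by `amount` bits; `>>> 0` keeps the
// result in the unsigned 32-bit range.
const rotateLeftBy = (amount, word) =>
  ((word << amount) | (word >>> (32 - amount))) >>> 0;

// One sha1 schedule step: xor words 3, 8, 14, and 16 positions back,
// then rotate left by 1 — the same computation as `value` above.
const scheduleStep = (b16, b14, b8, b3) =>
  rotateLeftBy(1, b3 ^ b8 ^ b14 ^ b16);
```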
SHA1 mixes 32-bit unsigned integer addition with bitwise operators. In JavaScript (and therefore in compiled elm), bitwise operators work on signed 32-bit integers, while addition works on 64-bit floats: sums can grow beyond 32 bits, and bitwise results can come out negative.
There are two fixes:
- Bitwise.and 0xFFFFFFFF discards the bits above the lowest 32, giving the wrapping overflow that is desired here, and
- Bitwise.shiftRightZfBy 0, which also wraps, but additionally gives an unsigned result:
(Bitwise.complement 6 |> Bitwise.shiftRightZfBy 0) == 4294967289
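Both fixes translate one-to-one to JavaScript, where the difference between them is easy to see (a sketch; variable names are mine):

```javascript
// Addition is done on 64-bit floats, so a sum can exceed 32 bits.
const sum = 0xFFFFFFFF + 1; // 4294967296, no longer fits in 32 bits

// Fix 1: `& 0xFFFFFFFF` (Bitwise.and) wraps to 32 bits, but the
// result is interpreted as a *signed* 32-bit integer.
const wrappedSigned = sum & 0xFFFFFFFF; // 0 here; can be negative in general
const signedExample = 0x80000000 & 0xFFFFFFFF; // -2147483648

// Fix 2: `>>> 0` (Bitwise.shiftRightZfBy 0) also wraps to 32 bits,
// but reinterprets the result as *unsigned*.
const unsignedExample = ~6 >>> 0; // 4294967289, matching the elm snippet
```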
By carefully checking where overflow and signed values would actually be a problem, the number of int-32 corrections can be minimized. Here it helps to have a large set of test cases, because it is easy to make mistakes.
Viewing a digest: Base64 & Hex
A convenient way to view the digest is its base64 representation. Turns out that
danfishgold/base64-bytes just does the right thing when a digest is encoded as 5
unsignedInt32s, replacing a bunch of tricky bitwise logic.
Looking at a digest as a sequence of hex digits is also common. In
wordToHex, the code worked around an issue with possibly negative integers in the digest, which required splitting the number in two.
Bitwise.shiftRightZfBy 0 gives the unsigned representation, so some logic could be removed here too.
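The simplification looks roughly like this in JavaScript (a hypothetical wordToHex, not the package’s code): reinterpreting the word as unsigned first means the whole 32-bit value can be formatted in one go, with no splitting into halves.

```javascript
// Render one 32-bit word as 8 hex digits. `>>> 0` turns a possibly
// negative signed word into its unsigned representation first.
function wordToHex(word) {
  return (word >>> 0).toString(16).padStart(8, "0");
}

console.log(wordToHex(-7)); // "fffffff9"
```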
For sha1, a 10X performance gain was achieved for an input of 1kb. Time grows linearly with the input size (double the input takes twice the time). The code can be found here.
For sha2 the optimization steps taken are pretty much the same. I’ve been testing hashing of a 3Mb file, where
- sha1 takes ~300ms
- sha256 takes ~400ms
- sha512 takes ~1500ms
This is better than the existing implementations, but objectively still quite slow. A big part of this is lack of inlining by the elm compiler. For instance the
A5 wrappers take 20% of total execution time for sha256. Also an extension of Kernel-defined map functions (e.g. to map9) would help.
sha512 is that much slower because it is based on unsigned 64-bit integers, which I simulate with two 32-bit integers. Wrapping arithmetic (to get “correct” overflow) slows it down. This too should improve if the compiler starts to inline more. So hopefully this package will just become faster over time without code changes.
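The general idea of simulating 64-bit arithmetic with two 32-bit halves can be sketched in JavaScript like this (an illustration of the technique, not the package’s actual representation; names are mine):

```javascript
// Add two unsigned 64-bit values, each represented as [high, low]
// 32-bit halves. The low sum stays exact as a float (< 2^53), so
// anything above 32 bits is the carry into the high word.
function add64(aHigh, aLow, bHigh, bLow) {
  const low = (aLow >>> 0) + (bLow >>> 0);
  const carry = Math.floor(low / 0x100000000);
  // High words wrap at 32 bits, giving "correct" 64-bit overflow.
  const high = (aHigh + bHigh + carry) >>> 0;
  return [high, low >>> 0];
}
```

Every 64-bit addition costs two 32-bit additions plus carry handling, which is a large part of why sha512 is slower than sha256 here.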
Thanks to @TSFoster for writing such a well-documented package. It was a lot of fun to hack on.
Edit: false alarm.
I tried to write a custom
iterate with less allocation. It looked tail-recursive, but in practice wasn’t, and it was also slower in the end. So instead
Decode.loop is now used. Here is the ellie with the weird behavior.
For context, when a decoder goes “out of range” – trying to decode more values than the ArrayBuffer really contains – a RangeError is raised. The exception is caught, and the decode as a whole returns Nothing.