elm-sha2 is a pure elm implementation of the sha-2 algorithms (sha224, sha256, sha384, and sha512). The implementation is based on elm/bytes, which means it is faster (especially for larger inputs) and more memory-efficient than current solutions (which all had to work around the lack of byte array support). I have also opened a PR for a faster sha1 on the TSFoster/elm-sha1 repo, which will hopefully be released in the coming weeks. Below I’ve written down some thoughts on going from a non-elm/bytes to an elm/bytes implementation, and some other noteworthy optimizations.
Optimizing SHA1
The story here is that I started on the sha1 PR first, based on this thread. That discussion convinced me to “just” do sha2 as well. The sha algorithms are very similar in structure, and the existing sha2 implementations had the same problem as sha1: their performance scales poorly for large inputs. It’s not their fault: they are all good packages, with documentation, examples and tests. But with elm/bytes we can go back and see how that changes things. Just to be clear, I do this kind of experimentation out of personal interest, and I don’t want to criticize anyone’s hard work.
What is SHA1?
sha1 is a hashing algorithm: given an input sequence of bytes, it produces a 160-bit integer that identifies the input. Changing even one bit of the input leads to a wildly different output. The output of sha1 is called a ‘digest’. sha1 processes the input in chunks of 512 bits (64 bytes) at a time.
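For instance, computing and viewing a digest with the TSFoster/elm-sha1 package looks something like this (assuming its `fromString` and `toHex` functions):

```elm
import SHA1

-- hash a string and view the resulting digest as hex digits
helloDigest : String
helloDigest =
    SHA1.fromString "hello" |> SHA1.toHex
```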
Using Bytes as the default
The current code uses a List Int that stores byte values. Once again, this made perfect sense before elm/bytes. But it uses a lot of memory, and encourages less efficient code than elm/bytes does.
As an overarching theme, the current code makes many small modifications to potentially long data structures. For instance, split a chunk into 32-bit words, then map a function over every word, then convert to an array.
```elm
words =
    chunk
        |> groupsOf 4
        |> List.map wordFromInts
        |> Array.fromList
```
This style is widely used, especially in FP teaching material based on Haskell. In a strict language like elm, though, groupsOf will allocate its result list, then List.map does the same, and then Array.fromList allocates once more. So a data structure roughly the size of the input is allocated and traversed 3 times. In Haskell, a combination of laziness and compiler optimizations (the foldr/build rule) will generate code that only does one allocation and traversal.

So in elm, for performance-sensitive code, read/decode as little as possible, then push it through your whole processing pipeline. List.map f << List.map g is equivalent to List.map (f << g), but the latter is much faster because no intermediate list is produced and the list is traversed only once.
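As a toy demonstration of that fusion (f and g are hypothetical stand-ins):

```elm
f : Int -> Int
f x =
    x + 1


g : Int -> Int
g x =
    x * 2


slower : List Int -> List Int
slower =
    -- two traversals, one intermediate list
    List.map f << List.map g


faster : List Int -> List Int
faster =
    -- one traversal, no intermediate list
    List.map (f << g)
```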
Splitting into chunks
My definition of the chunk processing function looks like this:
```elm
u32 : Decoder Int
u32 =
    Decode.unsignedInt32 BE


reduceBytesMessage : State -> Decoder State
reduceBytesMessage state =
    map16 (reduceMessage state) u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
```
In this case the 512-bit chunk is the smallest piece of information the algorithm works with. By batching its decoding with a custom map16, the decoding cost is minimized. The algorithm needs 32-bit words, and with elm/bytes those can be decoded immediately, rather than decoding individual bytes (8-bit integers) and then stitching them back into 32-bit words.

The current code instead splits the whole input List Int into groups of 64 bytes, allocating a list of chunks. The new code again follows the “only read/decode the smallest piece you need” theme: iterating the above decoder numberOfChunks times never allocates a list of chunks.
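Iterating the chunk decoder might look like this sketch (not the package’s exact code; `hashChunks` is a hypothetical name), using Decode.loop to thread the state through:

```elm
import Bytes.Decode as Decode exposing (Decoder, Step(..))


-- run the chunk decoder numberOfChunks times, carrying the state
-- along, so no list of chunks is ever allocated
hashChunks : State -> Int -> Decoder State
hashChunks initialState numberOfChunks =
    Decode.loop ( numberOfChunks, initialState )
        (\( remaining, state ) ->
            if remaining > 0 then
                Decode.map
                    (\newState -> Loop ( remaining - 1, newState ))
                    (reduceBytesMessage state)

            else
                Decode.succeed (Done state)
        )
```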
Processing a chunk
To process a chunk, the current code uses an array to store the words of the chunk, then iterates this function 64 times. Each call looks up previous elements, combines them, then pushes the result onto the array.
```elm
reduceWords : Int -> Array Int -> Array Int
reduceWords index words =
    let
        v i =
            Array.get (index - i) words

        val =
            [ v 3, v 8, v 14, v 16 ]
                |> List.filterMap identity
                |> List.foldl Bitwise.xor 0
                |> rotateLeftBy 1
    in
    Array.push val words
```
In the chrome devtools, I saw that most of the time of the whole hashing process was spent in Array.push and Array.get, which is weird for a function that should be doing (bitwise) arithmetic. The performance bottleneck was in the implementation (the Array), not the algorithm. My version spends most of its time on bitwise operations, as expected.
I noticed that only the 16 right-most elements of the array were ever used, and decided to try inlining the array elements as function arguments. That removes the overhead of the array. The resulting code looks weird, but is much faster (leading to a ~5X overall speedup):
```elm
reduceWordsHelp i deltaState b16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 =
    if (i - blockSize) < 0 then
        let
            value =
                b3
                    |> Bitwise.xor b8
                    |> Bitwise.xor b14
                    |> Bitwise.xor b16
                    |> rotateLeftBy 1
        in
        reduceWordsHelp (i + 1) (calculateDigestDeltas (i + numberOfWords) value deltaState) b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 value

    else
        deltaState
```
JavaScript integers
Some tricks are needed to get javascript numbers to behave like proper unsigned 32-bit integers.
SHA1 mixes 32-bit unsigned integer addition with bitwise operators. For instance, Bitwise.complement 6 == -7. -7 is not unsigned (it’s negative), and that will wreak havoc when added, though it’s fine for bitwise operations. Also, JavaScript will happily let you add large integers, producing values above 2^32. Large numbers in the final digest could then be formatted incorrectly, leading to incorrect results.
There are two fixes: Bitwise.and 0xFFFFFFFF discards the bits beyond the 32nd, resulting in overflow, which is desired here; and Bitwise.shiftRightZfBy 0, which also overflows, but additionally gives an unsigned result: (Bitwise.complement 6 |> Bitwise.shiftRightZfBy 0) == 4294967289. By looking carefully at when overflow and signed values would actually be a problem, the number of int-32 corrections can be minimized. Here it helps to have a large set of test cases, because it is easy to make mistakes.
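Both fixes combine into a small helper like this (a hypothetical sketch, not the package’s exact code):

```elm
import Bitwise


-- 32-bit addition that wraps on overflow and always yields an
-- unsigned result; Bitwise.and 0xFFFFFFFF alone can still return a
-- negative (signed) number, so shiftRightZfBy 0 makes it unsigned
addU32 : Int -> Int -> Int
addU32 x y =
    (x + y)
        |> Bitwise.and 0xFFFFFFFF
        |> Bitwise.shiftRightZfBy 0
```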
Viewing a digest: Base64 & Hex
A convenient way to view the digest is its base64 representation. It turns out that danfishgold/base64-bytes just does the right thing when the digest is encoded as 5 unsignedInt32s, replacing a bunch of tricky bitwise logic.
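Roughly like so (a sketch, assuming the sha1 digest is stored as five 32-bit words a..e):

```elm
import Base64 -- from danfishgold/base64-bytes
import Bytes exposing (Endianness(..))
import Bytes.Encode as Encode


-- encode the five digest words big-endian, then base64 them
toBase64 : { a : Int, b : Int, c : Int, d : Int, e : Int } -> Maybe String
toBase64 { a, b, c, d, e } =
    [ a, b, c, d, e ]
        |> List.map (Encode.unsignedInt32 BE)
        |> Encode.sequence
        |> Encode.encode
        |> Base64.fromBytes
```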
Looking at a digest as a sequence of hex digits is also common. In wordToHex the code worked around an issue with possibly negative integers in the digest, which required splitting the number into two parts. Bitwise.shiftRightZfBy 0 gives the unsigned representation directly, so some logic was removed here too.
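A version of wordToHex along those lines might look like this (a hypothetical sketch, not the package’s exact code):

```elm
import Bitwise


-- format a 32-bit word as eight hex digits; making the word unsigned
-- first means no special-casing of negative values is needed
wordToHex : Int -> String
wordToHex word =
    let
        unsigned =
            Bitwise.shiftRightZfBy 0 word

        hexDigit n =
            String.slice n (n + 1) "0123456789abcdef"
    in
    List.range 0 7
        |> List.map (\i -> hexDigit (Bitwise.and 0x0F (Bitwise.shiftRightZfBy ((7 - i) * 4) unsigned)))
        |> String.concat



-- e.g. wordToHex -7 == "fffffff9"
```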
Conclusion
For sha1, a 10X performance gain was achieved for an input of 1kb. Time grows linearly with the input size (doubling the input size takes twice the time). The code can be found here.
For sha2 the optimization steps taken are pretty much the same. I’ve been testing hashing of a 3Mb file, where
- sha1 takes ~300ms
- sha256 takes ~400ms
- sha512 takes ~1500ms
This is better than the existing implementations, but objectively still quite slow. A big part of this is the lack of inlining by the elm compiler. For instance, the A2 to A5 wrappers take 20% of total execution time for sha256. An extension of the Kernel-defined map functions (e.g. to map9) would also help.
sha512 is that much slower because it is based on unsigned 64-bit integers, which I simulate with two 32-bit integers. Wrapping arithmetic (to get “correct” overflow) slows it down. This too should improve if the compiler starts to inline more. So hopefully this package will just become faster over time without code changes.
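The 64-bit simulation looks roughly like this (a sketch with hypothetical names, assuming both halves are stored unsigned):

```elm
import Bitwise


type alias Int64 =
    { high : Int, low : Int }


-- unsigned 64-bit addition built from two 32-bit halves: add the low
-- words, and if that sum wrapped around 2^32, carry a 1 into the
-- high word
add : Int64 -> Int64 -> Int64
add x y =
    let
        low =
            (x.low + y.low) |> Bitwise.shiftRightZfBy 0

        carry =
            if low < x.low then
                1

            else
                0

        high =
            (x.high + y.high + carry) |> Bitwise.shiftRightZfBy 0
    in
    Int64 high low
```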
Thanks to @TSFoster for writing such a well-documented package. It was a lot of fun to hack on.
P.S. bugs!
Edit: false alarm.
I tried to write a custom iterate with less allocation. It looked tail-recursive, but in practice wasn’t. It was also slower in the end. So instead Decode.loop is now used. Here is the ellie with the weird behavior.
For context: when a decoder goes “out of range”, trying to decode more values than the ArrayBuffer really contains, a RangeError is raised. The exception is caught, and Decode.decode returns Nothing. But the net is cast too wide: javascript uses RangeError in other cases too, in particular for stack overflows. So if a decoder causes a stack overflow, the result is indistinguishable (on the elm side) from a decoder that tries to decode more values than are available. There is a way to fix this, but it increases the code size somewhat. Hopefully this will be fixed in the future.