Fast pure elm SHA2 (and soon SHA1)

elm-sha2 is a pure elm implementation of the sha-2 algorithms (sha224, sha256, sha384, and sha512). The implementation is based on elm/bytes, which makes it faster (especially for larger inputs) and more memory-efficient than current solutions, which all had to work around the lack of byte array support. I have also opened a PR for a faster sha1 on the TSFoster/elm-sha1 repo, which will hopefully be released in the coming weeks. Below are some thoughts on going from a non-elm/bytes to an elm/bytes implementation, and some other noteworthy optimizations.

Optimizing SHA1

The story here is that I started on the sha1 PR first, prompted by this thread, and then decided to “just” do sha2 as well. The sha algorithms are very similar in structure, and the existing sha2 implementations had the same problem as sha1: their performance scales poorly for large inputs. It’s not their fault: they are all good packages, with documentation, examples and tests. But with elm/bytes we can go back and see how that changes things. Just to be clear, I do this kind of experimentation out of personal interest, and I don’t want to criticize anyone’s hard work.

What is SHA1?

sha1 is a hashing algorithm: given an input sequence of bytes, it produces a 160-bit integer that identifies the input. Changing even one bit in the input will lead to a wildly different output. The output of sha1 is called a ‘digest’. sha1 processes the input in chunks of 512 bits (64 bytes) at a time.

Using Bytes as the default

The current code uses a List Int that stores byte values. This made perfect sense before elm/bytes, but it uses a lot of memory and encourages less efficient code than elm/bytes allows.

As an overarching theme, the current code makes many small modifications to potentially long data structures. For instance, split a chunk into 32-bit words, then map a function over every word, then convert to an array.

words =
    chunk
        |> groupsOf 4
        |> List.map wordFromInts
        |> Array.fromList

This style is widely used, especially in FP teaching material based on Haskell. In a strict language like elm though, groupsOf will allocate its result list, then List.map does the same, then Array.fromList. So, a data structure roughly the size of the input is allocated and traversed 3 times. In Haskell, a combination of laziness and compiler optimizations (foldr/build rule) will generate code that only does one allocation and traversal.

So in elm, for performance-sensitive code, read/decode as little as possible, then push it through your whole processing pipeline. List.map f << List.map g is equivalent to List.map (f << g), but the latter is much faster because no intermediate list is produced and the list is traversed only once.
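The same effect is easy to observe in plain JavaScript, which is also strict (f and g here are arbitrary placeholder functions):

```javascript
const f = (x) => x + 1;
const g = (x) => x * 2;
const input = [1, 2, 3, 4];

// Two chained maps: an intermediate array is allocated,
// and the data is traversed twice.
const twoPasses = input.map(g).map(f);

// Fused version: one allocation, one traversal, same result.
const onePass = input.map((x) => f(g(x)));
```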

Splitting into chunks

My definition of the chunk processing function looks like this:

u32 : Decoder Int
u32 =
    Decode.unsignedInt32 BE

reduceBytesMessage : State -> Decoder State
reduceBytesMessage state =
    map16 (reduceMessage state) u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32

Because in this case the 512-bit chunk is the smallest piece of information needed. By batching its decoding with a custom map16, the decoding cost is minimized. The algorithm needs 32-bit words, and with elm/bytes those can be decoded immediately, rather than decoding individual bytes (8-bit integers) and then stitching them back into 32-bit words.

The current code instead splits the whole input List Int into groups of 64 bytes, allocating a list of chunks up front. The decoder above again follows the “only read/decode the smallest piece you need” theme: iterating it numberOfChunks times never allocates a list of chunks.
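In JavaScript terms, this decoding strategy amounts to something like the following sketch (readChunk and forEachChunk are hypothetical names for illustration, not the package’s code):

```javascript
// Read one 512-bit chunk (16 big-endian 32-bit words) starting at `offset`.
function readChunk(view, offset) {
  const words = new Array(16);
  for (let i = 0; i < 16; i++) {
    words[i] = view.getUint32(offset + 4 * i, false); // false = big-endian
  }
  return words;
}

// Walk the buffer chunk by chunk; no list of chunks is ever allocated.
function forEachChunk(buffer, processChunk) {
  const view = new DataView(buffer);
  for (let offset = 0; offset + 64 <= buffer.byteLength; offset += 64) {
    processChunk(readChunk(view, offset));
  }
}
```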

Processing a chunk

To process a chunk, the current code uses an array to store the words of the chunk, then iterates this function 48 times. Each call looks up previous elements, combines them, then pushes the result onto the array.

reduceWords : Int -> Array Int -> Array Int
reduceWords index words =
    let
        v i =
            Array.get (index - i) words

        val =
            [ v 3, v 8, v 14, v 16 ]
                |> List.filterMap identity
                |> List.foldl Bitwise.xor 0
                |> rotateLeftBy 1
    in
    Array.push val words

In the Chrome devtools, I saw that most of the time of the whole hashing process was spent on Array.push and Array.get, which is odd for a function that should be doing (bitwise) arithmetic. The performance bottleneck was in the implementation – the Array – not the algorithm. My version spends most of its time on bitwise operations, as expected.

I noticed that only the 16 right-most elements of the array were used, and decided to try inlining the array elements as function arguments. That removes the overhead of the array. The resulting code looks weird, but is much faster (leading to a ~5X overall speedup).

reduceWordsHelp i deltaState b16 b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 =
    if (i - blockSize) < 0 then
        let
            value =
                b3
                    |> Bitwise.xor b8
                    |> Bitwise.xor b14
                    |> Bitwise.xor b16
                    |> rotateLeftBy 1
        in
        reduceWordsHelp (i + 1) (calculateDigestDeltas (i + numberOfWords) value deltaState) b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 value

    else
        deltaState

JavaScript integers

Some tricks are needed to make JavaScript numbers behave like proper unsigned 32-bit integers.

SHA1 mixes 32-bit unsigned integer addition with bitwise operators. For instance, Bitwise.complement 6 == -7. -7 is not unsigned (it’s negative), and that will wreak havoc when added, though it’s fine for bitwise operations. Also, JavaScript will happily let you add large integers, producing values above 2^32. Large numbers in the final digest could be formatted incorrectly, leading to incorrect results.

There are two fixes: Bitwise.and 0xFFFFFFFF discards the bits above the 32nd, giving the wrapping overflow that is desired here, and Bitwise.shiftRightZfBy 0, which also wraps but additionally gives an unsigned result: (Bitwise.complement 6 |> Bitwise.shiftRightZfBy 0) == 4294967289.
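Since elm’s Bitwise module compiles down to plain JavaScript operators, both fixes can be checked directly in JavaScript:

```javascript
// Bitwise.complement 6 in elm is ~6 in JavaScript: a negative signed value.
const complemented = ~6; // -7

// Fix 1: masking with 0xFFFFFFFF discards bits above the 32nd (wrapping
// overflow), but the result is still interpreted as signed.
const wrapped = (0xffffffff + 7) & 0xffffffff; // wraps back down to 6

// Fix 2: >>> 0 (shiftRightZfBy 0) also wraps, and additionally
// reinterprets the value as unsigned.
const unsigned = ~6 >>> 0; // 4294967289
```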

By carefully looking at when overflow and signed values would actually be a problem, the number of int32 corrections can be minimized. Here it helps to have a large set of test cases, because it is easy to make mistakes.

Viewing a digest: Base64 & Hex

A convenient way to view the digest is its base64 representation. It turns out that danfishgold/base64-bytes just does the right thing when a digest is encoded as 5 unsignedInt32s, replacing a bunch of tricky bitwise logic.

Looking at a digest as a sequence of hex digits is also common. In wordToHex, the code worked around an issue with possibly negative integers in the digest, which required splitting the number in two. Bitwise.shiftRightZfBy 0 gives the unsigned representation directly, so some logic could be removed here too.
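In JavaScript, the simplified logic boils down to something like this (a sketch of the idea, not the package’s actual wordToHex):

```javascript
// Format one 32-bit word of a digest as 8 hex digits.
// `word >>> 0` gives the unsigned value, so a word stored as a negative
// signed integer (e.g. -1) still formats correctly as "ffffffff",
// with no need to split the number in two.
function wordToHex(word) {
  return (word >>> 0).toString(16).padStart(8, "0");
}
```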

Conclusion

For sha1, a 10X performance gain was achieved for an input of 1kB. Time grows linearly with the input size (double the input size takes twice the time). The code can be found here.

For sha2 the optimization steps taken are pretty much the same. I’ve been testing hashing of a 3MB file, where

  • sha1 takes ~300ms
  • sha256 takes ~400ms
  • sha512 takes ~1500ms

This is better than the existing implementations, but objectively still quite slow. A big part of this is the lack of inlining by the elm compiler; for instance, the A2 to A5 wrappers take 20% of total execution time for sha256. An extension of the Kernel-defined map functions (e.g. to map9) would also help.

sha512 is that much slower because it is based on unsigned 64-bit integers, which I simulate with two 32-bit integers. Wrapping arithmetic (to get “correct” overflow) slows it down. This too should improve if the compiler starts to inline more, so hopefully this package will just become faster over time without code changes.
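To give an idea of what the simulation involves, here is a sketch (in JavaScript, mirroring the idea rather than the actual elm code) of 64-bit addition with two unsigned 32-bit halves:

```javascript
// A 64-bit unsigned integer as { high, low }, both unsigned 32-bit values.
// Addition computes the low halves first, then carries into the high half.
function add64(a, b) {
  const low = (a.low + b.low) >>> 0;
  // A carry occurred iff the 32-bit sum wrapped around.
  const carry = low < a.low ? 1 : 0;
  const high = (a.high + b.high + carry) >>> 0;
  return { high, low };
}
```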

Thanks to @TSFoster for writing such a well-documented package. It was a lot of fun to hack on.

P.S. bugs!

Edit: false alarm.

I tried to write a custom iterate with less allocation. It looked tail-recursive, but in practice wasn’t, and it ended up slower anyway. So Decode.loop is now used instead. Here is the ellie with the weird behavior.

For context, when a decoder goes “out of range”, trying to decode more values than the ArrayBuffer actually contains, a RangeError is raised. The exception is caught, and Decode.decode returns Nothing. But the net is cast too wide: JavaScript uses RangeError in other cases too, in particular for a stack overflow. So if a decoder causes a stack overflow, the result is indistinguishable (on the elm side) from a decoder that tries to read more values than are available. There is a way to fix this, but it increases the code size somewhat. Hopefully this will be fixed in the future though.
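Both failure modes are easy to reproduce in a few lines of plain JavaScript:

```javascript
// Reading past the end of an ArrayBuffer raises a RangeError...
function outOfRange() {
  new DataView(new ArrayBuffer(4)).getUint32(4);
}

// ...and so does exhausting the call stack.
function overflow() {
  return overflow();
}

// From a catch-all handler, the two are indistinguishable by type.
function errorName(f) {
  try {
    f();
    return "no error";
  } catch (e) {
    return e.constructor.name;
  }
}
```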


This is not my area of expertise at all, so I didn’t expect to get a whole lot out of this post, but it was very interesting to find out more about the low-level optimisation that you performed to get this impressive performance improvement. Good job!


Excellent writeup!

I’m back from vacation and have published TSFoster/elm-sha1 v2.0.0 now, as well as patch and minor updates for version 1, so anyone using the package directly or indirectly can benefit from the improvements.


This project has had some nice consequences, SHA1 is now updated, there is a PR for md5 and work happening in elm-hmac-sha.

I’ve just published an offshoot called elm-int64, a simulation of 64-bit integer arithmetic. It’s used in sha512 and I thought it might be generally useful.

