Getting the SHA-256 hash of a file for upload purposes

How can one get the SHA-256 hash of a file?

What I am currently trying to do is the following:

import Bytes
import Bytes.Decode exposing (Decoder)
import File exposing (File)
import File.Select as Select
import Sha256

  1. Select.file [ "image/jpeg" ] ImageLoaded
  2. File.toBytes file to get the bytes of the file
  3. str_dec = Bytes.Decode.string (Bytes.width bytes)
  4. str_hash = Bytes.Decode.decode str_dec bytes

But the result of step 4 is always Nothing, although Bytes.width returns a positive Int.

Does anybody have experience with this? How would you approach it?

A problem I can see here is that you’re decoding a bunch of arbitrary bytes into a UTF-8 String, but not all byte sequences are valid UTF-8 strings, hence the Nothing. Can you try selecting a text file and see if that works?

Edit: I created an Ellie that demonstrates the problem. UTF-8 text files work fine, but some binary files do not. https://ellie-app.com/6W2G972xfwTa1
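
To make the difference concrete, here is a minimal sketch that treats the same Bytes value first as text and then as plain numbers. The module and function names are made up for illustration; only the elm/bytes calls are real. For a JPEG I would expect decodeAsText to be the one that fails, while firstByte still works, since it never interprets the data as text.

```elm
module Utf8Check exposing (decodeAsText, firstByte)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode


-- Try to read the whole buffer as UTF-8 text (steps 3 and 4 of the question).
-- The result is Nothing whenever the decoder cannot make sense of the input.
decodeAsText : Bytes -> Maybe String
decodeAsText bytes =
    Decode.decode (Decode.string (Bytes.width bytes)) bytes


-- Read only the first byte as an unsigned integer (0 to 255).
-- No text decoding is involved, so this works for binary files too.
-- A JPEG file starts with the bytes 0xFF 0xD8.
firstByte : Bytes -> Maybe Int
firstByte bytes =
    Decode.decode Decode.unsignedInt8 bytes
```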

Hmm, you are right, thank you.
Indeed, when I try that with a .txt file it works perfectly.
But do you have any pointers on how to approach this, since I specifically want the SHA-256 of JPEG files?

I’d consider this to be a bug in Sha256 since it only accepts String as input and not raw Bytes.

You could try the elm-sha package which I gather should accept arbitrary byte sequences with a round trip through a hex string (the intermediate representations required might be pretty memory-intensive for a large file, but images or most documents should be okay).

Another option might be to send the file through a port (maybe base64 encode it?) and use the browser subtle crypto API. It’s much less clean though.

Yes, the icidasset/elm-sha package actually works, unlike billstclair/elm-sha256, which gives incorrect results for some inputs: https://github.com/billstclair/elm-sha256/issues/7

Huh. I wonder if it trusts 32-bit math (skimming the code, it looks like maybe). I found out trying to implement chacha20 (which I might get back to some day) that 32-bit integers sort of work but act strangely because javascript. I think the Basics module warns about this.

The roll-your-own approach of the icidasset packages seems probably less efficient but it’s guaranteed to work.

Well, I tried working with elm-sha, first converting to hex with jxxcarlson/hex, then going to binary and getting the SHA-256 of the binary, but something goes awfully wrong: the window goes blank, the console too, and I still can’t close that window :sweat_smile:

It is probably the memory issue you described; my test JPEG is 955.8 kB.

UPDATE:
I tried to get the hash of the hex string with elm crypto, but I think I am still having memory issues, because I get a "not responding" alert from Chrome!

Furthermore, when I get the bytes of the file, turn them into a hex string, and use an online tool to calculate the hash of that hex string, the result differs from the hash calculated directly from the file.
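
One likely reason for the mismatch: a hex string is itself text, so hashing it hashes the characters, not the bytes they describe. A small sketch (module and value names are made up for illustration) shows that the two inputs are not even the same length:

```elm
module HexVsBytes exposing (hexTextWidth, rawWidth)

import Bytes
import Bytes.Encode as Encode


-- The two raw bytes 0xFF 0xD8 (the start of a JPEG file).
-- rawWidth == 2
rawWidth : Int
rawWidth =
    Bytes.width
        (Encode.encode
            (Encode.sequence
                [ Encode.unsignedInt8 0xFF
                , Encode.unsignedInt8 0xD8
                ]
            )
        )


-- The same two bytes written out as the hex text "ffd8".
-- Each character becomes its own byte, so hexTextWidth == 4.
hexTextWidth : Int
hexTextWidth =
    Bytes.width (Encode.encode (Encode.string "ffd8"))
```

A hash function sees two different inputs here, so the digests will differ unless the tool decodes the hex back into bytes first.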

Using elm/bytes directly should be the way to go. Anything else will be orders of magnitude slower and will suffer from memory problems (e.g. an 8 million items long List Bool…).

I’m working on a PR for sha1 that is 10 times faster than the current non-bytes implementation. As mentioned by @keisisqrl you have to work around JavaScript numbers being weird, but with proper testing that should be all right.

It looks like there is currently no package that can really do this though. I’m happy to help out if someone wants to give it a go.

That’s pretty slick. I wouldn’t have thought of putting the logic in a Decoder, but it makes sense considering it’s a transform from input to a SHA state. I’m still new to the FP mindset!

I think that’s the purpose of the masks in calculateDigestDeltas you (?) remarked on here. As far as I can tell all arithmetic will be, functionally, mod 2^32, but above 2^31 it gets weird. Ints behave as signed with (most?) arithmetic operations and wrap, but bitwise operations (at least some - it varies with bit shifts, per the JavaScript docs) seem to coerce the value to unsigned without changing the bit representation before using it.

The main advantage of moving the logic into the Decoder is that only one pass is made over the input data, and there is little allocation. Using lists of items, you often get pipelines like byteValues |> groupsOf n |> List.map g |> etc. that look nice and simple, but traverse and allocate effectively the whole input again for each |>.
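
As a rough illustration of the single-pass idea, here is a sketch that folds over the input inside a Decoder without ever building a list. It is not a SHA implementation: the running sum is just a toy stand-in for a real hash state, and the module and function names are made up.

```elm
module OnePass exposing (byteSum)

import Bytes exposing (Bytes)
import Bytes.Decode as Decode exposing (Decoder, Step(..))


-- Sum all bytes in one pass. The state is ( bytes left, accumulator ).
byteSum : Bytes -> Maybe Int
byteSum bytes =
    Decode.decode (Decode.loop ( Bytes.width bytes, 0 ) step) bytes


-- Consume one byte per iteration and update the accumulator in place of
-- allocating an intermediate List Int.
step : ( Int, Int ) -> Decoder (Step ( Int, Int ) Int)
step ( remaining, acc ) =
    if remaining <= 0 then
        Decode.succeed (Done acc)

    else
        Decode.unsignedInt8
            |> Decode.map (\byte -> Loop ( remaining - 1, acc + byte ))
```

A real hash would read a whole block per step and carry the digest state in the accumulator, but the shape of the loop stays the same.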

I was able to get rid of those masks on this branch.

A tricky aspect of sha1 is that it mixes unsigned 32-bit integer addition with bitwise operators. Some bitwise operators can flip the sign (e.g. Bitwise.complement 6 == -7) and clearly that will be a problem doing addition (which as you say is signed by default). In this case it is enough to add a |> Bitwise.shiftRightZfBy 0 (built into rotateLeftBy) before addition to force the number to be unsigned and overflow. Because the starting numbers are relatively small (in the order 2^31 at most) there is no risk of javascript number overflow by just adding three of them (integer addition works till 2^53 - 1) so the intermediate Bitwise.and 0xFFFFFFFF could be removed.
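
For anyone following along, the sign issue can be sketched with plain Bitwise calls. The helper names below are made up for illustration and are not taken from the PR:

```elm
module Unsigned32 exposing (addUnsigned32, toUnsigned32)

import Bitwise


-- Reinterpret an Int as an unsigned 32-bit value.
-- Bitwise.complement 6 == -7
-- toUnsigned32 -7 == 4294967289
toUnsigned32 : Int -> Int
toUnsigned32 n =
    Bitwise.shiftRightZfBy 0 n


-- Add two 32-bit values and wrap around on overflow.
-- The plain sum stays far below 2^53, so it is still exact
-- before the final shift truncates it to 32 bits.
-- addUnsigned32 0xFFFFFFFF 1 == 0
addUnsigned32 : Int -> Int -> Int
addUnsigned32 a b =
    toUnsigned32 (toUnsigned32 a + toUnsigned32 b)
```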

I’m working on a longer post about that PR. Working with JS numbers in this case requires a bit of experimentation, but once you understand where problems can occur, strategically placing a few bit shifts makes it work reliably. And the performance is a lot better, actually making it possible to hash files of 1 MB and more. So an elm/bytes based sha should be the standard.

I ended up just uploading the file to the server and having the server return the hash.

Did you want to avoid ports entirely, or did you find similar issues on the JS side as well? I am asking because at this point I am considering either giving the JS side a try or following your path.

The problem was that it was not possible to get the bytes via ports, so you’d have to do the whole thing in javascript. And since the file most of the time was going to be uploaded to the server anyway, I found it to be the easiest way.

Yeah, but in my case the end destination is S3, so I probably have to invest a bit more time in that.

What did the trick for me was:

  1. Define a hidden input of type file in Elm (hidden, so the user cannot interact with it directly).
  2. Define an onchange handler on the Elm side as well, which lets me know which file was selected, if any.
  3. Register a port function that, when invoked, triggers a click on the hidden input.
  4. Register an onchange callback on the JavaScript side, on the same input element defined on the Elm side.
  5. When the file changes, calculate the hash of the file on the JavaScript side using crypto-js/sha256.
  6. Send the value calculated on the JavaScript side back through a port.
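
The Elm side of the steps above could look roughly like the sketch below. This is not my exact code: the port names, the element id and the Msg constructors are made up for illustration, and the JavaScript part (steps 4 and 5) still has to subscribe to clickFileInput, call click() on the element, and send the digest to hashComputed.

```elm
port module HashViaPort exposing (Msg(..), fileInput, hashes, pickFile)

import File exposing (File)
import Html exposing (Html, input)
import Html.Attributes exposing (hidden, id, type_)
import Html.Events exposing (on)
import Json.Decode as Decode


-- Step 3: outgoing port. JavaScript is expected to click the element
-- with the id it receives.
port clickFileInput : String -> Cmd msg


-- Step 6: incoming port. JavaScript sends back the digest it calculated.
port hashComputed : (String -> msg) -> Sub msg


type Msg
    = FilesChosen (List File)
    | GotHash String


-- Hypothetical id shared between Elm and JavaScript.
fileInputId : String
fileInputId =
    "hidden-file-input"


-- Steps 1 and 2: a hidden file input with a change handler on the Elm side.
fileInput : Html Msg
fileInput =
    input
        [ type_ "file"
        , id fileInputId
        , hidden True
        , on "change"
            (Decode.map FilesChosen
                (Decode.at [ "target", "files" ] (Decode.list File.decoder))
            )
        ]
        []


-- Command to run when the user asks to choose a file.
pickFile : Cmd msg
pickFile =
    clickFileInput fileInputId


-- Subscription that turns the incoming digest into a message.
hashes : Sub Msg
hashes =
    hashComputed GotHash
```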