Uncompressed unencrypted zip encoder


#1

Hi,

I implemented a basic zip file generator. It does no compression or encryption, and it currently fails on strings containing end-of-line characters. Most of the time was spent on the crc32, until I discovered I needed Bitwise.shiftRightZfBy and not Bitwise.shiftRightBy. It is mostly a clean-room implementation; when I got stumped on the crc I took a look at the zlib implementation, and also at the Rosetta Code Julia implementation (just because it’s another language I’m interested in; the Haskell implementation was too hard to understand, I’m still not sure what !! does).
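Since the shift distinction was the main stumbling block, here is a small Python sketch (not part of the Elm code, just an illustration) of why CRC-32 needs the zero-fill variant. In JavaScript, Elm's compile target, bitwise operations work on signed 32-bit values, so any value with the top bit set (like the CRC polynomial) is negative, and an arithmetic shift drags the sign bit in:

```python
def shift_right_by(n, k):
    """Arithmetic shift: the sign bit is copied in, like JS `>>` / Elm's shiftRightBy."""
    n &= 0xFFFFFFFF
    if n & 0x80000000:
        n -= 1 << 32          # reinterpret as a signed 32-bit value
    return (n >> k) & 0xFFFFFFFF

def shift_right_zf_by(n, k):
    """Logical shift: zeros come in from the left, like JS `>>>` / shiftRightZfBy."""
    return (n & 0xFFFFFFFF) >> k

poly = 0xEDB88320  # CRC-32 polynomial; its top bit is set
# The two shifts disagree whenever the sign bit is set, which corrupts
# every CRC table entry computed with the arithmetic variant:
assert shift_right_zf_by(poly, 1) == 0x76DC4190
assert shift_right_by(poly, 1) == 0xF6DC4190
```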

I don’t have an uploaded repo yet (it was developed with git as part of another repo), so here is a snapshot (I hope it isn’t too long).

I tested mainly with the elm repl, i.e. “elm repl < test_crc32.elm”, which did “import CRC32 exposing (..); crc32 example”. I used the same example string as Rosetta Code because I was comparing against it.

Finally here is the code for the two modules: Zip.elm and CRC32.elm

-- Zip.elm
module Zip exposing (AFile, zip)
import Bytes exposing (Bytes, Endianness(..))
import Bytes.Encode as Encode exposing (Encoder, encode, string)
import CRC32 exposing (calcCrc32, CRC32)


type alias AFile = { filename : String, content : String }

type alias AFileZipData = { filename : String, content : String, crc32 : Int, header_and_content_width : Int, relative_offset_of_local_header : Int }


-- This is basically a zip encoder; it should be split into its own module.
-- Only implements no compression / no encryption - the minimum to get something working.
zip : CRC32 -> List AFile -> Bytes
zip crcobj files =
  files |> zipEncoder crcobj |> encode


computeZipData crcobj offset afile =
  {
    filename = afile.filename,
    content = afile.content,
    crc32 = afile.content |> String.toList |> List.map Char.toCode |> calcCrc32 crcobj,
    header_and_content_width = (zipLocalFileHeaderSize afile.filename) + (Encode.getStringWidth afile.content),
    relative_offset_of_local_header = offset
  }


g_version_made_by = u16 10 -- 1.0, MSDOS
g_version_needed_to_extract = u16 10 -- 1.0


zipCentralDirectoryFileHeaderSize filename =
  -- 46 fixed bytes: signature (4) + six 2-byte fields + three 4-byte fields
  -- + five 2-byte fields + external attributes (4) + offset (4), then the name
  (4 + 2 * 6 + 4 * 3 + 2 * 5 + 4 + 4 + (Encode.getStringWidth filename))

zipCentralDirectoryFileHeader data =
  let
    version_made_by = g_version_made_by
    version_needed_to_extract = g_version_needed_to_extract
    general_purpose_bit_flag = z16 -- TODO
    compression_method = z16 -- store, i.e. uncompressed
    last_mod_file_time = z16 -- TODO 2 second units
    last_mod_file_date = z16 -- TODO
    crc_32 = u32 data.crc32
    compressed_size = u32 <| Encode.getStringWidth data.content -- UTF-8 byte length, not the character count; TODO: switch AFile.content to Bytes
    uncompressed_size = compressed_size
    file_name_length = u16 <| Encode.getStringWidth data.filename
    extra_field_length = z16
    file_comment_length = z16
    disk_number_start = z16 -- TODO
    internal_file_attributes = z16 -- TODO
    external_file_attributes = z32 -- TODO
    relative_offset_of_local_header = u32 data.relative_offset_of_local_header
    file_name = string data.filename
    extra_field = string ""
    file_comment = string ""
  in
    Encode.sequence [
        u32 0x02014b50, -- central file header signature
        version_made_by,
        version_needed_to_extract,
        general_purpose_bit_flag,
        compression_method,
        last_mod_file_time,
        last_mod_file_date,
        crc_32,
        compressed_size,
        uncompressed_size,
        file_name_length,
        extra_field_length,
        file_comment_length,
        disk_number_start,
        internal_file_attributes,
        external_file_attributes,
        relative_offset_of_local_header,
        file_name,
        extra_field,
        file_comment
      ]


zipCentralDirectory files_data =
  let
    headers = List.map zipCentralDirectoryFileHeader files_data
    digital_signature =
      Encode.sequence [
        u32 0x05054b50,
        z16, -- size_of_data TODO wrong
        string ""
      ]
    zip64_end_of_central_directory_record = string ""
    zip64_end_of_central_directory_locator = string ""
    end_of_central_directory_record = zipEndCentralDirectory files_data
  in
    -- the optional digital_signature is omitted; the zip64 records encode as zero bytes
    Encode.sequence
      (headers
        ++ [ zip64_end_of_central_directory_record
           , zip64_end_of_central_directory_locator
           , end_of_central_directory_record
           ]
      )


listSum l =
  List.sum l


zipCentralDirectorySize files_data =
  -- per the spec this counts the central file headers only; it must not
  -- include the end of central directory record itself
  files_data |> List.map .filename |> List.map zipCentralDirectoryFileHeaderSize |> listSum


zipEndCentralDirectorySize =
  (4 + 2 + 2 + 2 + 2 + 4 + 4 + 2)


zipEndCentralDirectory files_data =
  let
    end_of_central_dir_signature = u32 0x06054b50
    number_of_this_disk = z16
    number_of_the_disk_with_the_start_of_the_central_directory = z16
    central_directory_on_this_disk = u16 <| List.length files_data -- entries on this disk; we write a single-disk archive
    total_number_of_entries_in_the_central_directory = u16 <| List.length files_data
    size_of_the_central_directory = u32 (zipCentralDirectorySize files_data)
    -- todo: cache this, computed twice
    start_offset = files_data |> List.map .header_and_content_width |> listSum
    offset_of_start_of_central_directory_with_respect_to_the_starting_disk_number = u32 start_offset
    dotzip_file_comment_length = z16
    dotzip_file_comment = string ""
  in
    Encode.sequence [
        end_of_central_dir_signature,
        number_of_this_disk,
        number_of_the_disk_with_the_start_of_the_central_directory,
        central_directory_on_this_disk,
        total_number_of_entries_in_the_central_directory,
        size_of_the_central_directory,
        offset_of_start_of_central_directory_with_respect_to_the_starting_disk_number,
        dotzip_file_comment_length,
        dotzip_file_comment
      ]

-- Per ZIP File Format Specification.TXT version 6.3.2 September 28, 2007
-- Using jxxcarlson/elm-tar/2.2.2/src/Tar.elm as reference for elm / Bytes.Encoder usage
zipEncoder : CRC32 -> List AFile -> Encoder
zipEncoder crcobj files =
  let
      helper file (offset, retl) =
        let
            augfile = computeZipData crcobj offset file
        in
            (offset + augfile.header_and_content_width, augfile :: retl)
      -- foldl so the first file in the list gets offset 0; the accumulator
      -- prepends, so reverse to restore the input order
      (last_offset, files_data_reversed) = List.foldl helper (0, []) files
      files_data = List.reverse files_data_reversed
      files_encoders = List.map zipFileEncoder files_data
      central_directory = zipCentralDirectory files_data

      endheaders = [
        central_directory
        ]
  in
      Encode.sequence (files_encoders ++ endheaders)


zipFileEncoder afile =
  let
    local_file_header = zipLocalFileHeaderEncoder afile
    file_data = string afile.content
    -- This descriptor only appears if bit 3 of the general purpose bit flag is set (6.3.2 V.C)
    -- data_descriptor = string "todo"
  in
    Encode.sequence [local_file_header, file_data]


u32 = Encode.unsignedInt32 LE
u16 = Encode.unsignedInt16 LE
z32 = u32 0
z16 = u16 0


zipLocalFileHeaderSize filename =
  4 + 2 + 2 + 2 + 2 + 2 + 4 + 4 + 4 + 2 + 2 + (Encode.getStringWidth filename)

zipLocalFileHeaderEncoder afile =
  let
    local_file_header_signature = u32 0x04034b50
    version_needed_to_extract = g_version_needed_to_extract
    general_purpose_bit_flag = z16 -- TODO
    compression_method = z16 -- TODO
    last_mod_file_time = z16 -- TODO should be either now or from file name or get from headers?
    last_mod_file_date = z16 -- TODO -"-
    crc_32 = u32 <| afile.crc32
    compressed_size = u32 <| Encode.getStringWidth afile.content -- UTF-8 byte length, not the character count
    uncompressed_size = compressed_size
    file_name_length = u16 <| Encode.getStringWidth afile.filename
    extra_field_length = z16
    file_name = string afile.filename
    extra_field = string "" -- TODO - another way to encode zero bytes?
  in
  Encode.sequence [
      local_file_header_signature,
      version_needed_to_extract,
      general_purpose_bit_flag,
      compression_method,
      last_mod_file_time,
      last_mod_file_date,
      crc_32,
      compressed_size,
      uncompressed_size,
      file_name_length,
      extra_field_length,
      file_name,
      extra_field
    ]
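To sanity-check the record layout the encoder above produces (local file header, central directory file header, end-of-central-directory record), here is a minimal stored-entry zip writer in Python with the same fixed-size fields; the helper names are my own, and the stdlib zipfile module is used only to verify that the output round-trips:

```python
import io
import struct
import zipfile
import zlib

def local_file_header(name: bytes, data: bytes) -> bytes:
    # signature, version needed, flags, method=0 (store), time, date,
    # crc-32, compressed size, uncompressed size, name len, extra len
    return struct.pack(
        "<IHHHHHIIIHH",
        0x04034B50, 10, 0, 0, 0, 0,
        zlib.crc32(data), len(data), len(data), len(name), 0,
    ) + name

def central_header(name: bytes, data: bytes, offset: int) -> bytes:
    # same fields plus comment len, disk start, internal/external
    # attributes, and the offset of the local header
    return struct.pack(
        "<IHHHHHHIIIHHHHHII",
        0x02014B50, 10, 10, 0, 0, 0, 0,
        zlib.crc32(data), len(data), len(data),
        len(name), 0, 0, 0, 0, 0, offset,
    ) + name

def make_zip(name: bytes, data: bytes) -> bytes:
    local = local_file_header(name, data) + data
    cd = central_header(name, data, 0)
    eocd = struct.pack(
        "<IHHHHIIH",
        0x06054B50, 0, 0, 1, 1, len(cd), len(local), 0,
    )  # note: the central directory size excludes the 22-byte EOCD itself
    return local + cd + eocd

blob = make_zip(b"hello.txt", b"hello, zip")
assert len(local_file_header(b"hello.txt", b"")) == 30 + 9  # 30-byte fixed part
assert len(central_header(b"hello.txt", b"", 0)) == 46 + 9  # 46-byte fixed part
with zipfile.ZipFile(io.BytesIO(blob)) as z:
    assert z.read("hello.txt") == b"hello, zip"
```

The two asserts on the header sizes correspond to zipLocalFileHeaderSize (30 + name) and zipCentralDirectoryFileHeaderSize (46 + name) above.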


-- CRC32.elm

module CRC32 exposing (crc32, calcCrc32, polyRemainder, reduce, crcTable, example, CRC32)

import Array
import Bitwise exposing (..)

{-
The CRC-32 algorithm was generously contributed by
David Schwaderer and can be found in his excellent
book "C Programmers Guide to NetBIOS" published by
Howard W. Sams & Co. Inc. The 'magic number' for
the CRC is 0xdebb20e3. The proper CRC pre and post
conditioning is used, meaning that the CRC register
is pre-conditioned with all ones (a starting value
of 0xffffffff) and the value is post-conditioned by
taking the one's complement of the CRC residual.
If bit 3 of the general purpose flag is set, this
field is set to zero in the local header and the correct
value is put in the data descriptor and in the central
directory. When encrypting the central directory, if the
local header is not in ZIP64 format and general purpose
bit flag 13 is set indicating masking, the value stored
in the Local Header will be zero
 -}
-- poly = 0xdebb20e3
poly = 0xedb88320 -- taken from Rosetta Code after verifying that the result for the Julia code equals the value unzip expects


example = "The quick brown fox jumps over the lazy dog" |> String.toList |> List.map Char.toCode

reduce : (Int -> Int) -> Int -> Int -> Int
reduce step n rounds =
  if rounds <= 0 then
    n
  else
    reduce step (step n) (rounds - 1)


polyRemainder pol num =
  let
    step n =
      if (Bitwise.and 1 n) == 1
      then
        Bitwise.xor (Bitwise.shiftRightZfBy 1 n) pol
      else
        Bitwise.shiftRightZfBy 1 n
  in
    reduce step num 8


crcTable =
  let
    numbers = List.range 0 255
    table = List.map (polyRemainder poly) numbers
  in
    Array.fromList table


type alias CRC32 = { table : Array.Array Int }

crc32 = { table = crcTable }

{-

From wikipedia:

Function CRC32
   Input:
      data:  Bytes     //Array of bytes
   Output:
      crc32: UInt32    //32-bit unsigned crc-32 value

//Initialize crc-32 to starting value
crc32 ← 0xFFFFFFFF

for each byte in data do
   nLookupIndex ← (crc32 xor byte) and 0xFF;
   crc32 ← (crc32 shr 8) xor CRCTable[nLookupIndex] //CRCTable is an array of 256 32-bit constants

//Finalize the CRC-32 value by inverting all the bits
crc32 ← crc32 xor 0xFFFFFFFF
return crc32
-}
calcCrc32 crcdata data =
  let
      helper : Int -> Int -> Int
      helper byte last =
        let
            nLookupIndex = last |> Bitwise.and 0xff |> Bitwise.xor byte

            -- TODO: how do I get rid of this withDefault - i.e. prove it cannot happen and have a fast lookup without bounds check?
            lookedup = Array.get nLookupIndex crcdata.table |> Maybe.withDefault 0
        in
            last |> Bitwise.shiftRightZfBy 8 |> Bitwise.xor lookedup
      crc = List.foldl helper 0xffffffff data -- foldl walks the bytes in order; 0xffffffff is the preconditioning value
  in
      Bitwise.xor crc 0xffffffff -- this is called postconditioning
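For reference, here is a direct Python transcription of the same table-driven algorithm (a sketch, mirroring polyRemainder and calcCrc32 above), checked against the well-known test vector and against zlib:

```python
import zlib

POLY = 0xEDB88320

def make_table():
    """crcTable: one 8-step polynomial reduction per byte value."""
    table = []
    for n in range(256):
        for _ in range(8):                     # polyRemainder: 8 reduction steps
            n = (n >> 1) ^ POLY if n & 1 else n >> 1
        table.append(n)
    return table

TABLE = make_table()

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF                           # preconditioning
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF                    # postconditioning

msg = b"The quick brown fox jumps over the lazy dog"
assert crc32(msg) == 0x414FA339                # the standard test vector
assert crc32(msg) == zlib.crc32(msg)
```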



#2

@jxxcarlson - I used your tar module, even opened one issue (and found another I didn’t report yet: I had to use Binary and not String because I got corrupted files), so thanks! I saw on another thread in “Request Feedback” that you were asking about a zip implementation; maybe this will be of interest to you. I would love feedback on the code, as I’m a newbie.

Unrelated: my main pain point so far is the slowness of the crc32 calculation, which is noticeable on a 200 KiB file (about 1 second). I did not do any real benchmarking. Any suggestions for optimizing that function are welcome.


#3

Forgot to mention: this produces files that can be extracted using unzip-6.0-42.fc29.x86_64 (Fedora 29). unzip prints some warnings, but the content decompresses correctly; tested on a single file so far:

$ unzip -t file.zip 
Archive:  /tmp/file.zip
error [/tmp/file.zip]:  missing 18 bytes in zipfile
  (attempting to process anyway)
error [/tmp/file.zip]:  reported length of central directory is
  18 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  zipfile?).  Compensating...
    testing: filename.ext             OK
No errors detected in compressed data of /tmp/file.zip.

#4

Hello @alon, I made some changes that resolve at least part of the problem. I’ve pasted my notes from my comments on the GitHub issue you raised on this. Thank you so much for bringing this to my attention.

— Jim

@alon, I pushed version 3.0.8 of elm-tar to the package manager, which may provide an interim fix for your problem. Please let me know what happens. I ran your example successfully, but with a name change to accommodate some changes in the API (defaultMetadata):

arx =
    createArchive
        [ ( { defaultMetadata | filename = "test1.csv" }, StringData "" )
        , ( { defaultMetadata | filename = "test2.csv" }, StringData "" )
        ]

The problem you were experiencing was due to the fact that I was putting a 512-byte block of zeros for the data block in the case of an empty string. That block should be nonexistent in that case. Also, please note the following about which file types are treated as text:

Metadata notes

There is an unresolved problem of how to properly encode the metadata of text files so that extractArchive knows to decode the content as a string. At the moment, archived files whose content is to be decoded as a string are recognized by their file extension. The admissible text files have an extension in the list

[ "text", "txt", "tex", "csv" ]

This is an unsatisfactory solution, but at the moment I don’t know how to get around it.

Then:

> arx
<3072 bytes> : Bytes.Bytes

> exx = extractArchive arx
[({ fileNamePrefix = "", fileSize = 0, filename = "test1.csv", groupID = 123, groupName = "", lastModificationTime = 0, linkIndicator = NormalFile, linkedFileName = "foo.txt", mode = { group = [Write,Read], other = [Read], system = [], user = [Execute,Write,Read] }, ownerID = 501, userName = "" },StringData ""),({ fileNamePrefix = "", fileSize = 0, filename = "test2.csv", groupID = 123, groupName = "", lastModificationTime = 0, linkIndicator = NormalFile, linkedFileName = "foo.txt", mode = { group = [Write,Read], other = [Read], system = [], user = [Execute,Write,Read] }, ownerID = 501, userName = "" },StringData "")]

> List.length exx
2 : Int


#5

Updates to the original post:
Source viewable here: Zip.elm
It works on any 7-bit-clean input. What I thought was an EOL problem is actually a code point that is > 255, so I updated the code to drop those. The underlying problem is not being able to turn a String into a List UInt8, where UInt8 is a hypothetical type identical to Int but restricted to unsigned byte values (even if that is highly space-redundant in JavaScript), or just an Array UInt8.

So right now I do:

  s |> String.toList |> List.map Char.toCode |> List.filter (\x -> x <= 255)

and crc32 & encode that, which passes (tested on multiple files, 200 K each). The crc32 speed is actually good enough for this right now. But it is still better to just use @jxxcarlson’s tar.
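The "> 255" problem can be illustrated outside Elm (a Python sketch; the string is just a made-up example): String.toList yields Unicode code points, not bytes, so any non-Latin-1 character produces a value that cannot be treated as a byte, whereas the UTF-8 encoding of the same string is a clean byte sequence:

```python
s = "naïve €"
points = [ord(c) for c in s]           # like String.toList |> List.map Char.toCode
assert max(points) == 8364             # '€' is code point U+20AC, not a byte
assert any(p > 255 for p in points)    # so the list is not usable as bytes directly

raw = s.encode("utf-8")                # the byte-level view of the same string
assert all(b <= 255 for b in raw)      # every UTF-8 code unit is a byte
assert "€".encode("utf-8") == b"\xe2\x82\xac"
```

Filtering out the large code points makes the CRC computable but silently drops characters; CRC-ing the UTF-8 bytes would keep them.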


#6

Interesting. Right now the tar package is working perfectly for me. Apparently Chrome on Windows gives some complaint, but it otherwise works, so my users are happy, and I didn’t have to write anything on the decoding side. The zip exercise is also good enough, but apart from the security issue it doesn’t give any benefit.


#7

@jxxcarlson also, I can’t figure out: how do I get an Array, i.e. a List UInt8? I can’t, can I? The closest I can get is:

Http.expectBytes State (Decoder.bytes 100)

which would work only if I always got exactly 100 bytes.
I could get the file’s size via a side channel, but that’s silly: of course we already know the file size at this point, the Elm API is just hiding it from us.


#8

Hi @alon, so sorry to be slow replying. I’ll be home tomorrow and am really anxious to try out your code and study it. I had originally thought of doing zip, or maybe gzip, but someone on Slack suggested tar as a simpler alternative to implement, which it certainly is.

Are you going to try to do some kind of compression / decompression?

Do you have any benchmarks on your zip implementation? This is something I want to do for tar as well.


#9

Hey @jxxcarlson, looking forward to your comments! I have since updated the code. There is no 8-bit issue; I was reading the original via Http.get with expectString. There is another problem there, but it is unrelated to the Zip implementation. The most recent version is at https://git.sr.ht/~alon/studer-widget/tree/master/src/Zip.elm

Implementing zip decoding (a better name, imo, than decompression, because the latter implies producing fewer bytes while the former implies only a different representation) is much harder than encoding. Because of the way zip works, you have to start from the end of the file to find the central directory, which can be anywhere in the file. It has offsets to the file headers, which are immediately followed by the file contents (compressed / encrypted - more details in the spec).
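That first step of decoding can be sketched like this (Python, with the stdlib zipfile module used only to manufacture a test archive; the helper name is my own): the end-of-central-directory record sits at the very end of the file, possibly followed by a comment, so readers scan backwards for its signature and then follow its offset to the central directory:

```python
import io
import struct
import zipfile

def find_eocd(data: bytes) -> int:
    """Return the offset of the end-of-central-directory record."""
    sig = b"PK\x05\x06"
    # the trailing comment is at most 65535 bytes, so only the last
    # 65535 + 22 bytes of the file ever need scanning
    start = max(0, len(data) - 65557)
    pos = data.rfind(sig, start)
    if pos < 0:
        raise ValueError("not a zip file: EOCD signature not found")
    return pos

# build a small archive just to have test input
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "hello")
data = buf.getvalue()

eocd = find_eocd(data)
# central directory size and offset live at bytes 12..20 of the record
cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)
assert data[cd_offset : cd_offset + 4] == b"PK\x01\x02"  # central header signature
assert cd_offset + cd_size == eocd                       # directory ends at the EOCD
```

(A real reader also has to handle the zip64 variants of these records; this sketch ignores them.)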

I have not grokked the parser library yet; that is probably the right direction. Having an Array would be easiest for someone like me coming from a non-FP background.

I don’t have any benchmarks. I would probably use the elm repl for those, with an external script. Maybe there is support in elm-test for it? Anyway, I expect it is slow but not too slow right now, for some uses. The crc-32 is the longest part, I think. But I didn’t measure; numbers welcome.




closed #11

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.