EuCrypt addition: Keccak File Hashing

July 8th, 2020

Since the client data model includes Keccak hashes for files received from Eulora's server and the files themselves may be of any size whatsoever, it follows that both client and server have to be able to use EuCrypt's Keccak sponge sequentially too - basically feeding it data in chunks as opposed to all in one go, as a single input (be it bitstream or bytestream).1 So I got my Ada reference book out on the desk again and it turns out that it's not even all that difficult really - even though this addition has to be yet another package on top of the existing ones in EuCrypt, mainly because this type of use breaks by definition the stateless guarantee provided by the "single input" use and this would in turn propagate to all code that uses the Keccak package (mainly the encryption scheme). Therefore, rather than forcing stateful code everywhere just because there's a need to calculate hashes for files on disk, I simply provide this file-hashing as a separate package using the same underlying sponge. Quite as it should be, I would say: there is only one sponge implementation but there are now two options for using it for hashing, namely a stateless one for data that is held entirely in memory (the one that existed already) and a stateful one for data that is fed sequentially (my new code). As this is now implemented, tested and integrated into the respective parts on both client and server, I'd rather take the time and write it down as well, to unload it and have it all in one place.

The approach here is very straightforward: keep a sponge's state locally, read from the input file blocks as big as the sponge can handle in one go (adding padding to the last block if needed), pass each block on to the sponge as soon as it is read, scramble the state and repeat until the whole file has been processed; then squeeze one block out of the sponge and return its first 8 octets, given that the hash is meant to be that size. Ada's Sequential_IO package simply needs to be provided with the type of element to read and then it works like any other file input/output, without any trouble. In principle, the most efficient implementation would be to read a whole block (ie as many octets as the sponge can absorb in one go) each time but this means that one has to handle at file reading time the special case of the last block that may be incomplete. For now at least I preferred to sidestep this and went instead for the cheap and angry solution that seems nevertheless perfectly adequate for current needs: simply read the file octet by octet, so that there is no special case at all (for the curious, there's a sketch of the block-at-a-time alternative right after the code). Here's the code that does it all:

  -- for reading files one Octet at a time
  package Octet_IO is new Ada.Sequential_IO(Element_Type =>
                                              Interfaces.Unsigned_8);

  function Hash_File(Filename: in String; Hash: out Raw_Types.Octets_8)
      return Boolean is
    F: Octet_IO.File_Type;
    S: Keccak.State := (others => (others => 0));
    Block_Len: Keccak.Keccak_Rate := Keccak.Default_Byterate;
    Block: Keccak.Bytestream(1..Block_Len);
    Pos: Keccak.Keccak_Rate;
  begin
    Octet_IO.Open(F, Octet_IO.In_File, Filename);
    -- check that this is not an empty file as hashing of that is nonsense
    if Octet_IO.End_Of_File(F) then
      Octet_IO.Close(F); -- close it before returning!
      return False;
    end if;

    -- read from file and absorb into the sponge
    while not Octet_IO.End_Of_File(F) loop
      -- read block by block
      Pos := 1;
      while Pos <= Block_Len and (not Octet_IO.End_Of_File(F)) loop
        Octet_IO.Read(F, Block(Pos));
        Pos := Pos + 1;
      end loop; -- single block loop
      -- if it's an incomplete block, it needs padding
      if Pos <= Block_Len then
        -- pad it with 10*1
        Block(Pos..Block'Last) := (others => 0);
        Block(Pos) := 1;
        Block(Block'Last) := Block(Block'Last) + 16#80#;
      end if;
      -- here the block is complete, padded if needed.
      -- absorb it into the state
      Keccak.AbsorbBlock( Block, S);
      -- scramble state
      S := Keccak.Keccak_Function( S );
    end loop; -- full file loop
    Octet_IO.Close(F);

    -- now squeeze a block and get the 8 octets required
    Keccak.SqueezeBlock( Block, S);
    Hash := Block(1..8);

    -- if it got here, all is well, return true.
    return True;
  exception
    when others =>
      -- close the file only if it was actually opened (Open itself
      -- can fail, e.g. on a missing file, in which case a blind
      -- Close here would raise Status_Error instead of returning)
      if Octet_IO.Is_Open(F) then
        Octet_IO.Close(F);
      end if;
      return False;
  end Hash_File;
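
As an aside, should the block-at-a-time reading mentioned earlier ever become worth the trouble, Ada's standard Ada.Streams.Stream_IO can fill a whole buffer per call and report through its Last parameter how much it actually read, which pinpoints exactly the final, incomplete block. A rough sketch of just the reading loop follows (this is not EuCrypt code: the procedure name is a placeholder and the conversion from Stream_Element_Array to the sponge's Bytestream is left out):

  with Ada.Streams.Stream_IO;
  with Keccak; -- EuCrypt's Keccak package, as above
  use Ada.Streams;

  procedure Absorb_By_Blocks(Filename: in String) is
    F    : Stream_IO.File_Type;
    Buf  : Stream_Element_Array
             (1 .. Stream_Element_Offset(Keccak.Default_Byterate));
    Last : Stream_Element_Offset;
  begin
    Stream_IO.Open(F, Stream_IO.In_File, Filename);
    while not Stream_IO.End_Of_File(F) loop
      -- one call reads up to a full block; Last < Buf'Last can happen
      -- only on the final, incomplete block, which then gets the
      -- 10*1 padding exactly as in Hash_File above
      Stream_IO.Read(F, Buf, Last);
      -- ... pad Buf(Last+1 .. Buf'Last) if needed, convert to
      -- Keccak.Bytestream, then absorb and scramble as above ...
    end loop;
    Stream_IO.Close(F);
  end Absorb_By_Blocks;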

Note that the above returns the *raw* Keccak hash, meaning the direct output of the sponge, as a set of octets. This is normally all well and good but if one specifies the hash as "unsigned 64", it means that the above set of octets has to be interpreted as a number - and in turn, this means that byte/bit order matters. Since this can and does create confusion quite easily, I'll state it here plainly again: the raw output of the Keccak sponge is MSB/b, meaning that on a little endian machine, you'll need to flip both bytes and bits if you want to get the exact same number as you would on a big endian machine! Since this is however something that I already sorted out before, I added convenient wrappers to the above that do this properly, so that the whole code should work seamlessly on both big and little endian computers anyway:

  function Hash_File(Filename: in String; Hash: out Interfaces.Unsigned_64)
      return Boolean is
    Raw: Raw_Types.Octets_8;
  begin
    -- calculate the raw hash and then convert it
    if Hash_File(Filename, Raw) then
      Hash := Hash2Val(Raw);
      return True;
    else
      return False;
    end if;
  end Hash_File;

  function Hash2Val( Raw: in Raw_Types.Octets_8 )
      return Interfaces.Unsigned_64 is
    B8: Raw_Types.Octets_8;
    U64: Interfaces.Unsigned_64;
  begin
    -- convert to U64 (NB: no need to squeeze etc, as block_len has to be > 8)
    -- for this to remain consistent on both little and big endian machines:
    -- on little endian, octets and bits need to be flipped before conversion
    if Default_Bit_Order = Low_Order_First then
      for BI in 1..8 loop
        B8(BI) := Keccak.Reverse_Table(Natural(Raw(Raw'First+8-BI)));
      end loop;
    else --simply copy as it is
      B8 := Raw(1..8);
    end if;
    -- convert and return
    U64 := Raw_Types.Cast(B8);
    return U64;
  end Hash2Val;

Using the above, one can now test Keccak hashes of files both raw and in the more usual numerical format (e.g. hex, as given for instance by the keksum implementation). More importantly for me, Eulora's client can now check the hash of a received file when it gets the last chunk of it and therefore it can decide whether the whole thing is any good or not! I'm quite happy to report as well that initial tests are running fine - the data acquisition part of the client successfully requests files and receives them apparently unmolested (even when there are quite a few chunks; so far I tested with several thousand chunks, meaning files of up to 10MB), writing them neatly to disk exactly as and where intended. Hooray!
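
For those who want to poke at it, a minimal test along these lines should do (a sketch only: "Hashes" is a placeholder here for whichever package one put Hash_File in):

  with Ada.Text_IO;
  with Interfaces;
  with Hashes; -- placeholder for the package holding Hash_File above

  procedure Test_File_Hash is
    package U64_IO is new Ada.Text_IO.Modular_IO(Interfaces.Unsigned_64);
    H: Interfaces.Unsigned_64;
  begin
    if Hashes.Hash_File("somefile.bin", H) then
      -- prints the hash in Ada's based notation, e.g. 16#0123ABCD...#
      U64_IO.Put(H, Base => 16);
      Ada.Text_IO.New_Line;
    else
      Ada.Text_IO.Put_Line("Failed to hash somefile.bin (missing or empty?)");
    end if;
  end Test_File_Hash;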

The above out of the way, the uglier next step is to get that hideous GUI code to actually *use* those files properly, too!


  1. This is not something entirely new, since vtools for instance has previously adapted my Keccak implementation to a similar use, in order to calculate the hashes for vpatches. However, the approach taken there (by phf) apparently aimed to wrap the Ada implementation for C/CPP use and I don't want this at all. First, I'd much rather move C/CPP code to Ada if/when possible than the other way around. Second, there is absolutely no good reason in this case to force any C/CPP code into the mix, since Ada actually provides all that is needed to interact with files on disk without any trouble whatsoever.