Calculate an SHA-1 Hash on Files

Article ID: 57451

Q: I read with interest your article "Calculate an SHA-1 Hash" (October 23, 2008, article ID 57351). We use mirroring software to transfer data from one System i to another and have done so for many years. Is there a way of calculating "hash" values for individual mirrored files to determine, with reasonable certainty, that the files are the same? Some of these files are extremely large in terms of the number of records.

A: The Calculate Hash (Qc3CalculateHash) API that I used to calculate an SHA-1 hash can be called with format ALGD0100. This format lets you call the API repeatedly, each time passing more data to include in the hash. In ALGD0100 there's a flag you can pass to tell it when you're done, at which point it returns the hash. This article demonstrates how to use that support.

Using the ALGD0100 format means that I first need to create an algorithm context token. This sounds complicated, but it's really quite easy to do. All I'm doing is telling the operating system to create a space in memory where it keeps track of variables related to the running of a particular cryptographic algorithm, in this case the SHA-1 hash algorithm.

I do that by running code such as this:

     D Qc3CreateAlgorithmContext...
     D                 PR                  ExtProc('Qc3CreateAlgorithm-
     D                                     Context')
     D   AlgDesc                     64A   const options(*varsize)
     D   Format                       8a   const
     D   token                        8a
     D   ErrorCode                32767a   options(*varsize)

     D ALGD0500_t      ds                  qualified
     D                                     based(Template)
     D   HashAlg                     10i 0

     D HASH_SHA1       c                   2

     D alg1            ds                  likeds(ALGD0500_t)
     D token           s              8a
         .
         .
         alg1.HashAlg = HASH_SHA1;

         Qc3CreateAlgorithmContext( alg1
                                  : 'ALGD0500'
                                  : token
                                  : ErrorNull );

The alg1 data structure tells the API which cryptographic algorithm I want to work with. In this case, I'm using format ALGD0500, which is a data structure that consists of nothing but a single numeric field to identify which hash algorithm I want to use. SHA-1 happens to be algorithm 2, so I've set my data structure accordingly. I then call Qc3CreateAlgorithmContext(), and the OS reserves a place in memory for processing the SHA-1 algorithm on my set of data. The "token" variable is returned by the API and contains a handle to that space in memory.

Now that I have that set up, I can call Qc3CalculateHash() to calculate the SHA-1, just as I did in the previous article, except that I use format ALGD0100 instead of ALGD0500.

     D ALGD0100_t      ds                  qualified
     D                                     based(Template)
     D   Token                        8a
     D   Final                        1n

     D alg2            ds                  likeds(ALGD0100_t)
         .
         .
         alg2.Token   = token;
         alg2.Final   = *OFF;

            Qc3CalculateHash( %addr(buf)
                            : len
                            : 'DATA0100'
                            : alg2
                            : 'ALGD0100'
                            : '0'
                            : *OMIT
                            : *OMIT
                            : ErrorNull );

In the previous article, I used ALGD0500 with Qc3CalculateHash. The problem with that solution is that it doesn't let me keep adding more data to the hash; I have to provide all the needed data in one fell swoop. With this solution, the system maintains variables keeping track of the state of my SHA-1. ALGD0100 specifies the token that I got earlier when I called Qc3CreateAlgorithmContext(), and it also specifies a flag named Final. When Final is set to *OFF, the input data to the API is added to the internal work variables for the context token. No hash is returned, so I've passed *OMIT for the hash parameter.

The advantage of this approach is that I can call the API repeatedly, each time adding more data to the hash. In your example, you want to read a large file and calculate an SHA-1 hash on it. This is a perfect example of where you'd want to use the context token. Read each record of your file and pass that record to Qc3CalculateHash, one record at a time. The system keeps track of things and calculates the hash without needing all the records loaded into memory at once.
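To see why feeding the data in pieces is safe, it can help to try the same pattern off the IBM i. The following is only a minimal sketch in Python, using the standard hashlib module rather than the Qc3 APIs; the sample records and the function name sha1_incremental are made up for illustration. It shows that a hash built up one piece at a time equals the hash of all the data passed at once.

```python
import hashlib

def sha1_incremental(chunks):
    """Feed each chunk into one running SHA-1 state, then finalize."""
    ctx = hashlib.sha1()       # plays the role of the algorithm context token
    for chunk in chunks:
        ctx.update(chunk)      # comparable to a call with Final = *OFF
    return ctx.digest()        # comparable to the last call with Final = *ON

# Made-up sample data standing in for three records.
records = [b"record one", b"record two", b"record three"]

# Hashing everything in one call gives the same 20-byte result.
one_shot = hashlib.sha1(b"".join(records)).digest()
assert sha1_incremental(records) == one_shot
assert len(one_shot) == 20
```

Because the running state only ever sees a stream of bytes, how the data is split into calls does not matter, but the order of the bytes does.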

Here's an example of calculating the hash for all records in the CUSTMAS file, which has record format CUSTMASF:

     FCUSTMAS   IF   E             DISK    BLOCK(*YES)

     D CustData        ds                  likerec(CUSTMASF:*INPUT)
         .
         .
         setll *start CUSTMAS;
         read CUSTMASF CUSTDATA;
         dow not %eof(CUSTMAS);
            Qc3CalculateHash( %addr(CUSTDATA)
                            : %size(CUSTDATA)
                            : 'DATA0100'
                            : alg2
                            : 'ALGD0100'
                            : '0'
                            : *OMIT
                            : *OMIT
                            : ErrorNull );
            read CUSTMASF CUSTDATA;
         enddo;

This code reads one record of CUSTMAS at a time. The record is loaded into a data structure named CUSTDATA. It then passes that record into Qc3CalculateHash to update the hash with the new data. I repeat this process for each record in the file.
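The same read loop can be sketched outside RPG. This Python version is an illustration only, not the SHA1CUST program: it assumes a plain stream file made of fixed-length records, and the names sha1_of_records, path, and record_len are hypothetical. It reads one record at a time and feeds it to the running hash, so the whole file never has to be in memory.

```python
import hashlib
import os
import tempfile

def sha1_of_records(path, record_len):
    """Hash a file of fixed-length records, one record per update call."""
    ctx = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            record = f.read(record_len)   # comparable to READ into CustData
            if not record:                # comparable to %eof(CUSTMAS)
                break
            ctx.update(record)            # add this record to the hash
    return ctx.digest()                   # finalize and return 20 bytes

if __name__ == "__main__":
    # Build a small throwaway file of 16-byte records for the demo.
    data = b"".join(b"%-16d" % n for n in range(1000))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        assert sha1_of_records(tmp.name, 16) == hashlib.sha1(data).digest()
    finally:
        os.remove(tmp.name)
```

One consequence worth keeping in mind: the result depends on the order in which the records are fed in, so both systems need to read the file in the same sequence for the hashes to be comparable.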

At the end, I change the Final flag to *ON to indicate that I'm done reading the data and am ready to receive the hash. I code that like this:

         alg2.final = *ON;
         Qc3CalculateHash( *NULL
                         : 0
                         : 'DATA0100'
                         : alg2
                         : 'ALGD0100'
                         : '0'
                         : *OMIT
                         : binhash
                         : ErrorNull );

As you can see, I passed *NULL for the data and 0 for the length of the data. That's because I don't want to add any new data to the hash--I've already fed all the records into the API. But I do specify that this is the final call by setting the Final subfield of the data structure to *ON. The API stores the hash in the binhash parameter.

After calling the API with final=*ON, the system knows that I'm done with that particular hash. If I use the same context token again, it starts calculating a new hash instead of adding data to the end of the old one.

When I'm done using the token, I can call the Qc3DestroyAlgorithmContext() API to destroy the context token and free up the memory it used:

         Qc3DestroyAlgorithmContext( token: ErrorNull );
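Back to the original question: once each system has produced its hash for a mirrored file, the two values still have to be compared. How the hashes travel between systems is up to you. The sketch below assumes they have been written out as hexadecimal text, and the helper name hashes_match is hypothetical. It is plain Python, not part of the downloadable samples.

```python
import hashlib

def hashes_match(hex_a, hex_b):
    """Compare two SHA-1 hashes given as hex text, ignoring letter case."""
    a = bytes.fromhex(hex_a.strip())
    b = bytes.fromhex(hex_b.strip())
    if len(a) != 20 or len(b) != 20:
        raise ValueError("an SHA-1 hash is exactly 20 bytes (40 hex digits)")
    return a == b

if __name__ == "__main__":
    # Made-up data standing in for the source and target copies of a file.
    source_hash = hashlib.sha1(b"same bytes").hexdigest().upper()
    target_hash = hashlib.sha1(b"same bytes").hexdigest()
    changed_hash = hashlib.sha1(b"same bytez").hexdigest()

    assert hashes_match(source_hash, target_hash)
    assert not hashes_match(source_hash, changed_hash)
```

Converting back to bytes before comparing avoids false mismatches caused by upper- versus lowercase hex digits or stray whitespace.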

Code Download

I've provided two sample programs that use the concepts outlined in this article. The first one is called SHA1CUST, and it calculates the SHA-1 hash based on the contents of the records in the CUSTMAS file. The second one is called SHA1IFS and calculates an SHA-1 hash on any file in the IFS. You can download the sample code right here.
