**This is an old revision of the document!**
SHA-256 HASH-generator
The SHA-256 hash is widely used as the basis for computer security. Generating a SHA-256 hash is not totally trivial though. This version closely follows the original description, by using and naming all the variables and functions as put down in the documentation. This helps in understanding the actual SHA-hash algorithm. More efficient implementation are easily possible, but tend to be more or less system specific.
This program generates SHA-256 hashes on files with a maximum length of 2^32 bytes. The hash is created in the 8 values called H0-H7, and can be printed using the .HASH word.
This version is runs on 32 bit Forths with little-endian storage in memory
In this setup it is suitable for messages upto 4GB (ie 32Gb)
TAKE CARE: the invert of 'we' in function CH is MISSING in the draft spec of SHA-256!!
\ ===== support words - use primitives if available in your system : D8* ( n -- d_n*8 ) \ takes a single as input, multiplies with 8 into a double dup 29 rshift swap 3 lshift swap ; : ROTR ( n rot -- rotated_n ) \ rotate right n with rot bits 31 and 2dup 32 swap - lshift >r rshift r> or ; hex : BYTES>< ( m -- w ) \ reverse cell bytes: 0x12345678 <-> 0x78563412 dup >r 18 lshift r@ ff00 and 8 lshift or r@ ff0000 and 8 rshift or r> 18 rshift or ; decimal \ ===== values, arrays and related helper-functions \ h0-h7 form the actual hash 0 value h0 0 value h1 0 value h2 0 value h3 0 value h4 0 value h5 0 value h6 0 value h7 \ wa-wg are working variables 0 value wa 0 value wb 0 value wc 0 value wd 0 value we 0 value wf 0 value wg 0 value wh : HASH->WVAR h0 to wa h1 to wb h2 to wc h3 to wd h4 to we h5 to wf h6 to wg h7 to wh ; : WVAR->HASH wa +to h0 wb +to h1 wc +to h2 wd +to h3 we +to h4 wf +to h5 wg +to h6 wh +to h7 ; \ temp values 0 value t1 0 value t2 0 value lenmess 0 value addrmess \ message block = 16 words for message and... \ ...48 words for temp storage of Wi for last 48 rounds create MBLCK 64 cells allot : MESS@ ( n -- mess ) 4 * mblck + @ ; : MESS! ( m n -- ) 4 * mblck + ! ; : CLEARMESS ( -- ) 64 0 do 0 i mess! loop ; \ clear message-block hex create KI[] \ 64 SHA-256 constants, 1 for each subloop - checked 428a2f98 , 71374491 , b5c0fbcf , e9b5dba5 , 3956c25b , 59f111f1 , 923f82a4 , ab1c5ed5 , d807aa98 , 12835b01 , 243185be , 550c7dc3 , 72be5d74 , 80deb1fe , 9bdc06a7 , c19bf174 , e49b69c1 , efbe4786 , 0fc19dc6 , 240ca1cc , 2de92c6f , 4a7484aa , 5cb0a9dc , 76f988da , 983e5152 , a831c66d , b00327c8 , bf597fc7 , c6e00bf3 , d5a79147 , 06ca6351 , 14292967 , 27b70a85 , 2e1b2138 , 4d2c6dfc , 53380d13 , 650a7354 , 766a0abb , 81c2c92e , 92722c85 , a2bfe8a1 , a81a664b , c24b8b70 , c76c51a3 , d192e819 , d6990624 , f40e3585 , 106aa070 , 19a4c116 , 1e376c08 , 2748774c , 34b0bcb5 , 391c0cb3 , 4ed8aa4a , 5b9cca4f , 682e6ff3 , 748f82ee , 78a5636f , 84c87814 , 8cc70208 , 90befffa , a4506ceb , bef9a3f7 , c67178f2 , decimal : KI@ ( i -- KI ) 4* ki[] + @ ; \ ===== the 6 basic logic functions of SHA-256 - all 6 functions mix and combine data : SIGMA0 ( x -- xn ) >r r@ 2 rotr r@ 13 rotr xor r> 22 rotr xor ; : SIGMA1 ( x -- xn ) >r r@ 6 rotr r@ 11 rotr xor r> 25 rotr xor ; : SIG0 ( x -- n ) >r r@ 7 rotr r@ 18 rotr xor r> 3 rshift xor ; : SIG1 ( x -- n ) >r r@ 17 rotr r@ 19 rotr xor r> 10 rshift xor ; : CH ( -- an ) we wf and we invert wg and xor ; : MAJ ( -- an ) wa wb and wa wc and xor wb wc and xor ; \ ===== the HASH functions hex : INITHASH \ fill first hash with intial values 6a09e667 to h0 bb67ae85 to h1 3c6ef372 to h2 a54ff53a to h3 510e527f to h4 9b05688c to h5 1f83d9ab to h6 5be0cd19 to h7 ; decimal \ do 1 of the 64 rounds per block : SUBLOOP ( message+ki -- ) wh + CH + we sigma1 + to t1 \ t1 filled with temp value MAJ wa sigma0 + to t2 \ t2 filled with temp value wg to wh wf to wg we to wf wd t1 + to we wc to wd wb to wc wa to wb t1 t2 + to wa ; : HASH1BLOCK \ 27us -> MACRO based=12us hash->wvar \ copy hash-variables h0-h7 to temp-variables wa-wh \ first 16 rounds digest message-data from message-block & does initial scrambling 16 0 do i mess@ \ get data from message i ki@ + subloop \ message+Ki to subloop loop \ the next 48 rounds scramble the data from the first 16 rounds by combining \ data from 4 earlier messages into a new 'message', as input to the subloop 64 16 do i 2 - mess@ sig1 i 7 - mess@ + i 15 - mess@ sig0 + i 16 - mess@ + \ ( Wi ) dup i mess! \ store Wi for following rounds i ki@ + subloop \ message+Ki to subloop loop wvar->hash ; \ this is where the hash is actually created : FILLMESS ( addr -- ) \ fills message block from message with reversed words 16 0 do dup i 4* + @ bytes>< i mess! loop drop ; : MBLCKREVERSE ( -- ) \ reverse bytes words in message-block - only for last block! 16 0 do i mess@ bytes>< i mess! loop ; : PUTLEN ( len -- ) d8* 14 mess! 15 mess! ; : GENSHA256 ( addr len -- ) to lenmess to addrmess \ addr en len point to message-array \ first hash all full blocks lenmess 6 rshift dup >r \ number of full blocks - also in r: 0 ?do addrmess i 64 * + fillmess \ i = blocknumber hash1block loop \ than hash the last block << including 1 extra block if needed r> 64 * addrmess + to addrmess \ addrmess now address of last block of data lenmess 63 and >r \ bytes_left in r: clearmess \ clear mblck addrmess mblck r@ cmove \ move last bytes to mblck mblck r@ + 128 swap c! \ add bit after last byte mblckreverse \ block now definitely filled with message r> 55 > if hash1block \ mblck is already filled and zeroed till the end clearmess \ clear mblck again then lenmess putlen \ put len at correct place hash1block ; \ and last block : sha256 ( addr len -- ) \ create a hash in hash-variables H0-H7 for the data starting at addr \ with length len inithash gensha256 ; \ ===== printing : .HAHDR ." ---h0--- ---h1--- ---h2--- ---h3--- ---h4--- ---h5--- ---h6--- ---h7---" ; : .32HEX base @ hex >r 0 <# # # # # # # # # #> type r> base ! ; : .HASH h0 .32hex h1 .32hex h2 .32hex h3 .32hex h4 .32hex h5 .32hex h6 .32hex h7 .32hex ; create ABC 97 c, 98 c, 99 c, : proof ( -- ) ABC 3 sha256 cr 9 spaces .hahdr cr ." must be: " ." ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad" cr ." SHA-256: " .hash ; \ ===== the END
Optimising performance of the SHA-256 hash generator
Creating SHA-256 hashes is an area where time of execution is usually seen as relevant. The above version, using generic Forth, will run on most Forth systems. But is not the fastest version possible. Running on wabiForth in tests it achieves a throughput of around 2.7 MB/s.
The first step to make the algorithm faster is check if BYTES>< and ROTR are available as faster primitives on your forth-system. Especially the ROTR is common as most processors have a specific opcode for that function.
The next step is to use a data-array for the hash-variables H0-H7 and the temp variables. On systems with a memory-cache, this saves time by reducing writes to and reads from memory.
A next step would be to program the six logical functions in assembly. Tested on wabiForth, combining these optimisation steps makes the program run about 5 times faster, with a throughput of around 12.9 MB/s
The nest step would be to program the subloop as a whole in assembly. This is a surprisingly short assembly routine of only 37 opcodes for the whole in ARM32 assembly. The throughput is now around 26.4 MB/s
The last step tested by the author is to also program the HASH1BLOCK word in assembly. The final throughput achieved is 48.3 MB/s. Around 17 times faster than using generic Forth.
\ **SHA-256 assembly version - for Project Forth Works - wabiPi4 specific** \ system specific: \ wabiForth measures time-intervals with T[ and ]T. \ integrated wabi Armv8-assembler \ CAVE: the invert of we in function CH is MISSING in the draft spec of SHA-256!! unused \ 4620 bytes \ ===== variables, arrays and helper-functions \ h0-h7 form the actual hash 0 value h0 0 value h1 0 value h2 0 value h3 0 value h4 0 value h5 0 value h6 0 value h7 \ other values 0 value lenmess 0 value addrmess \ message block = 16 words for message and... \ ...48 words for temp storage of Wi for last 48 rounds 256 allocate drop constant MBLCK \ wabiForth specific data-oriented memory 32 allocate drop constant WBLCK \ storage for 8 32b woring variables wa-wh \ no need to store T1 and T2 : MESS! ( m n -- ) 4* mblck + ! ; : CLEARMESS 16 0 do 0 i mess! loop ; \ clears message-block - FILL is slower : HASH->WVAR \ move hash to working variables - checked - 102c! h0 WBLCK 0 + ! h1 WBLCK 4 + ! h2 WBLCK 8 + ! h3 WBLCK 12 + ! h4 WBLCK 16 + ! h5 WBLCK 20 + ! h6 WBLCK 24 + ! h7 WBLCK 28 + ! ; : WVAR->HASH \ add working variables to hash - checked - 42c! WBLCK 0 + @ +to h0 WBLCK 4 + @ +to h1 WBLCK 8 + @ +to h2 WBLCK 12 + @ +to h3 WBLCK 16 + @ +to h4 WBLCK 20 + @ +to h5 WBLCK 24 + @ +to h6 WBLCK 28 + @ +to h7 ; hex create KI[] \ 64 SHA-256 constants - could be in data-mem 428a2f98 , 71374491 , b5c0fbcf , e9b5dba5 , 3956c25b , 59f111f1 , 923f82a4 , ab1c5ed5 , d807aa98 , 12835b01 , 243185be , 550c7dc3 , 72be5d74 , 80deb1fe , 9bdc06a7 , c19bf174 , e49b69c1 , efbe4786 , 0fc19dc6 , 240ca1cc , 2de92c6f , 4a7484aa , 5cb0a9dc , 76f988da , 983e5152 , a831c66d , b00327c8 , bf597fc7 , c6e00bf3 , d5a79147 , 06ca6351 , 14292967 , 27b70a85 , 2e1b2138 , 4d2c6dfc , 53380d13 , 650a7354 , 766a0abb , 81c2c92e , 92722c85 , a2bfe8a1 , a81a664b , c24b8b70 , c76c51a3 , d192e819 , d6990624 , f40e3585 , 106aa070 , 19a4c116 , 1e376c08 , 2748774c , 34b0bcb5 , 391c0cb3 , 4ed8aa4a , 5b9cca4f , 682e6ff3 , 748f82ee , 78a5636f , 84c87814 , 8cc70208 , 90befffa , a4506ceb , bef9a3f7 , c67178f2 , decimal KI[] constant KITBL \ using KITBL is faster than using the name KI[] \ ===== the functions MAJ, CH, SIGMA0 and SIGMA1 are integrated into SUBLOOP \ the functions SIG0 and SIG1 are integrated into HASH1BLOCK \ ===== assembly macros : ldv32, ( reg n -- ) \ load value n into reg - creates 2 opcodes! 2dup 16 lshift 16 rshift movw, 16 rshift movt, ; : prologcust r13, {, r0, r4, r-r, r6, r8, r-r, v, w, }!, stmfd, ; \ saving v and w critical : nextcust r13, {, r0, r4, r-r, r6, r8, r-r, v, w, }!, ldmfd, ; \ restoring v and w critical \ ===== The actual HASH-routiunes hex : INITHASH \ fill first hash with intial values 6a09e667 to h0 bb67ae85 to h1 3c6ef372 to h2 a54ff53a to h3 510e527f to h4 9b05688c to h5 1f83d9ab to h6 5be0cd19 to h7 ; decimal \ wa=r0 wb=r1 wc=r2 wd=r3 we=r4 wf=r6 \ wg=r7 wh=r8 t1=v t2=w top=scratch code: SUBLOOP ( message+ki -- ) \ 25c - 37 opcodes [ prologcust \ get working variables into registers - w=T2=> here scratch w, WBLCK ldv32, \ get address of work-regs in w w, {, r0, r4, r-r, r6, r8, r-r, }, ldmia, \ move work-regs into cpu-regs \ start of T1 v, top, r8, add, \ r8=wh - top=message+ki - v=wh+message+ki=t1_step1 \ now CH => ( we wf and ) ( we invert wg and ) xor ; top, r4, r6, and, \ =we wf and - top=scratch w, r7, r4, bic, \ only faster in full assembly version w, w, top, eor, \ w=eor result v, v, w, add, \ v=t1_step2 - add result to t1 - top and w free \ now SIGMA1 ( x -- xn ) >r r@ 6 rotr r@ 11 rotr xor r> 25 rotr xor w, r4, 6 ror#, mov, \ r4=we w, w, r4, 11 ror#, eor, w, w, r4, 25 ror#, eor, v, v, w, add, \ v=t1_final \ start of T2 r8, r0, r1, and, \ r8=wa and wb top, r0, r2, and, \ top=wa and wc r8, r8, top, eor, top, r1, r2, and, \ top=wb and wc w, top, r8, eor, \ w=t2_step1 \ now SIGMA0 as part of T2 with wa=r0 top, r0, 2 ror#, mov, top, top, r0, 13 ror#, eor, top, top, r0, 22 ror#, eor, w, w, top, add, \ w=t2_final \ actual update of the 8 working regs r8, r7, mov, \ wg to wh r7, r6, mov, \ wf to wg r6, r4, mov, \ cave: we=r4 - we to wf r4, v, r3, add, \ wd t1 + to we r3, r2, mov, \ wc to wd r2, r1, mov, \ wb to wc r1, r0, mov, \ wa to wb r0, v, w, add, \ t1 t2 + to wa \ write wregs back to work-block w, wblck ldv32, \ get address of work-regs in w w, {, r0, r4, r-r, r6, r8, r-r, }, stmia, top, dts, 4 ]i+!, ldr, \ =drop -> stack neutral nextcust ] ; : HASH1BLOCK \ 3-4us -> MACRO based=12us [ r13, {, r0, r3, r-r, }!, stmfd, ] \ unsure if necessary hash->wvar \ << to assembly! \ first digest 16*4 bytes message-data from message-block & scrambles a bit 16 0 do \ the next part saves 1c from 67.5 to 66.5c/subloop [ top, dts, 4 i-]!, str, \ dup w, mblck ldv32, \ get address of message-regs in w w, w, i, 2 lsl#, add, top, w, ldr, v, kitbl ldv32, w, v, i, 2 lsl#, add, v, w, ldr, top, top, v, add, ] subloop loop \ the data from the first 16 rounds is scrambled by mixing the data from \ 4 earlier messages into a new 'message', as input to the 48 subloops 64 16 do \ the following saves ~20c/subloop [ top, dts, 4 i-]!, str, \ dup \ ===== line 1 r0, mblck ldv32, \ get address of message-block in r0 - keep r2, i, 0 2 i#, sub, \ r2=i-2 - r2=temp top, r0, r2, 2 lsl#, +], ldr, \ now 'i 4* [ 2 cells ] literal - mblck + @' r3, top, 17 ror#, mov, r3, r3, top, 19 ror#, eor, top, r3, top, 10 lsr#, eor, \ ===== line 2 r2, i, 0 7 i#, sub, \ r2=i-7 - temp r2, r0, r2, 2 lsl#, +], ldr, \ r2 top, top, r2, add, \ keep top \ ===== line 3 r2, i, 0 15 i#, sub, \ r2=i-15 - temp r2, r0, r2, 2 lsl#, +], ldr, r3, r2, 7 ror#, mov, r3, r3, r2, 18 ror#, eor, r3, r3, r2, 3 lsr#, eor, top, top, r3, add, \ ===== line 4 r2, i, 0 16 i#, sub, \ r2=i-16 - temp r2, r0, r2, 2 lsl#, +], ldr, \ r2 top, top, r2, add, \ dup i 4* mblck + ! \ store Wi for following rounds w, mblck ldv32, \ get address of message-regs in w w, w, i, 2 lsl#, add, top, w, str, \ i 4* kitbl + @ + subloop \ ( message+ki ) v, kitbl ldv32, w, v, i, 2 lsl#, add, v, w, ldr, top, top, v, add, ] subloop loop wvar->hash [ r13, {, r0, r3, r-r, }!, ldmfd, ] \ unsure if necessary ; : FILLMESS ( addr -- ) \ ( 54c wabi:165c ) fills message block with reversed words [ r13, {, r0, r3, r-r, }!, stmfd, ] \ costs 4ms in BM - not sure if needed 16 0 do [ r0, top, i, 2 lsl#, +], ldr, \ r0=[top + i*4] r1, mblck ldv32, \ get address of message-block in r1 - keep r0, r0, rev, \ reverse bytes in word r0, r1, i, 2 lsl#, +], str, ] \ store in message-block loop drop [ r13, {, r0, r3, r-r, }!, ldmfd, ] \ not sure if needed ; : MBLCKREVERSE ( -- ) \ reverse bytes in words in message-block 16 0 do mblck i 4* + dup @ bytes>< swap ! loop ; : PUTLEN ( len -- ) \ len*8 for number of bits 0 3 dlshift 14 mess! 15 mess! ; : GENSHA256 ( addr len -- ) \ addr en len point to message-array to lenmess to addrmess inithash \ do hashing of full block lenmess 6 rshift dup >r \ number of full blocks - also in r: 0 ?do addrmess i 64 * + fillmess \ i = blocknumber hash1block loop \ do hashing of last block r> 64 * addrmess + to addrmess \ addrmess now address of last block of data lenmess 63 and >r \ bytes_left in r: clearmess \ clear mblck addrmess mblck r@ cmove \ move last bytes to mblck mblck r@ + 128 swap c! \ add bit after last byte mblckreverse \ block now definitely filled with message r> 55 > if hash1block \ mblck is already filled and zeroed till the end clearmess \ clear mblck again then lenmess putlen \ put len at correct place hash1block ; \ and last block : sha256 ( addr len -- ) inithash gensha256 ;
Please note that this is al done using normal ARM32 assembly, no SHA-256 specific opcodes were used. These would increase the performance even more.