====== Marsaglia's XORshift routine on the ARM processor ======

===== Raspberry Pi 3b+ with wabiForth =====


The Forth version of the randomisation routines is the same on any processor, as it uses only standard Forth words. But the ARM processor can do a neat
trick: it can perform one complete xorshift step (dup, shift and xor) in a single instruction, because a data-processing instruction such as EOR can shift its second operand before using it. As
most Forths include an assembler, it is an interesting exercise to see how
much faster the routine is when coded in assembly.
This example is coded using wabiForth on a Raspberry Pi 3b+, but the principle is the same for any ARMv8 AArch32 processor.

The routine uses three registers named top, v and w. Top contains the top of the stack, v and w are scratch registers.


==== XORshift in ARM AArch32 assembly ====
<code>
variable seed
2345 seed !

code: ASMRANDOM ( address_seed -- rndm_val )
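  [ v, top, ldr,       \ get value in seed in v (v keeps the old seed value)
  
  w, v, v, 13 lsl#, eor,   \ w = v xor (v shifted left by 13)
  w, w, w, 17 lsr#, eor,   \ w = w xor (w shifted right by 17)
  w, w, w,  5 lsl#, eor,   \ w = w xor (w shifted left by 5)
  
  v, v, w, eor,        \ xor old seed value with generated random number
  v, top, str,         \ save xor'd value in seed
  top, w, mov,         \ return the generated random number on top of the stack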
  
  ] ; 7 inlinable

</code>
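
For comparison, the three shift-and-xor steps can be written with standard Forth words only. The following is a minimal sketch and not necessarily the exact routine that was benchmarked: the name ''xorshift32'' is made up for this example, the word fetches ''seed'' directly instead of taking its address from the stack, it stores the generated number itself as the new seed, and the masking with ''$FFFFFFFF'' is only there so that the result stays within 32 bits on a Forth with larger cells.

<code>
variable seed
2345 seed !

\ one xorshift round with the same 13, 17, 5 shift triple as above
: xorshift32 ( -- u )
  seed @
  dup 13 lshift xor  $FFFFFFFF and   \ left shift: mask back to 32 bits
  dup 17 rshift xor                  \ right shift: no mask needed
  dup  5 lshift xor  $FFFFFFFF and   \ left shift: mask back to 32 bits
  dup seed ! ;                       \ store as new seed, leave a copy on the stack
</code>

Each ''dup ... lshift xor'' (or ''rshift'') group corresponds to one ''eor,'' line with a shifted operand in the assembly version.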

===== Comparison of Forth vs assembly =====

Tested with wabiForth on a Raspberry Pi 3b+ @ 1.5 GHz.
Here are some simple benchmarks comparing the 1-seed and 2-seed
versions coded in Forth with the 1-seed version in assembly, just
to get an idea of the execution speeds.

<code>
    ---------------------------
    1 seed 32bit Forth:     40c
    2 seed 32bit Forth:     60c
    1 seed 32bit assembly:  13c
    ---------------------------
</code>

The time measured is the number of CPU cycles (the "c" in the table) required to put a
random number on the stack with a given method. The assembly routine
takes 13 cycles against 40 for the corresponding routine in Forth, which makes it
about 3 times as fast: a decent speed-up.