Merge 2d0a761232a1f0e616fbacb8169fed554b973c59 into ce15e75bceb372867daf6b8e81918ab6978686eb
ec51ce56
pull/12/merge
1/17 ++ 109 --
The original goal of the DIT-DIF transformation, which this implementation also uses, was to allow for the core butterflys to be parallized with SIMD. Unfortunately, this idea seems to have been mostly lost to time. By having the omega (or s as it's called in this codebase) one loop higher, we're able to perform the entire inner loop as a single chain of SIMD instructions, specifically for the cases when t = 8, t = 4, and even t = 16 for AVX512. This gives us a very large speedup, bringing the total verification time down to around 7.6us (on Zen 5), from 11.6us. This is when combined with the other optimizations introduced in this branch.
See the docstrings placed around this commit for a better understanding of the strategy used.
During verification (when we use a variable-time H2P), Falcon samples the SHAKE state in a loop until it gets enough elements to full the polynomial. The criteria for a valid element is that it is smaller than (12289). The loop samples 2 bytes (16-bits), intepreters them in big-endian, and then checks the criteria. The naive method used by the reference implementation has it sample 2 bytes at a time and check each one. Instead, we can trivially optimize this by simply extracting an amount of bytes close to the Keccak state size (136), and re-use the bytes until we run out again. This gives us an approximate 10% speedup for verification (saving 1.2us on Zen 5).