RFR: 8294073: Performance improvement for message digest implementations

Wed Sep 21 12:10:38 UTC 2022

On Tue, 20 Sep 2022 22:12:04 GMT, Xue-Lei Andrew Fan <xuelei at openjdk.org> wrote:

> Hi,
> 
> In the message digest implementation, for example SHA256, in JDK, two bitwise operations could be improved with equivalent arithmetic, and then the number bitwise operations could be reduced accordingly.  Specifically
> "(x and y) xor ((complement x) and z)" could be replaced with the equivalent "z xor (x and (y xor z))", and "(x and y) xor (x and z) xor (y and z)" could be replaced with the equivalent "(x and y) xor ((x xor y) and z)".  Each replacement reduces one bitwise operation, and thus improve the performance.
> 
> Per my testing on my MacOS laptop, the update on SHA256 improves the message digest throughput by 0.5%-0.8%.  The improvement is not significant, but might be worthy of it as the update is pretty simple and trivial, for those platforms that do not support CPU intrinsic for a certain hash algorithm.
> 
> This patch update SHA2 implementation only.  Please let me know what do you think.  If you are good  with this little bit performance, I will update more message digest implementations.  If no one interested in these little benefits, I will close this PR later.
> 
> Thanks,
> Xuelei

(Thinking out loud, because I am on the fence with this change)

First (negative), it departs from SHA function codes as stated in most papers. This is a minor thing if we can prove the equivalence, but it still adds to maintenance overhead. While the translation of `maj` (boolean majority) function is more or less straightforward, using the `xor`/`and` distributivity, the translation for `ch` is less so. I had to scribble down things to reveal that new expressions' DNF is the old one.

Second (negative), I suspect current code is optimized more for the instruction-level parallelism than for the number of operations. (This might be the reason why SHA papers do this, to optimize latency in hardware implementations?) In case of `ch`, we have one operation less, but it is probably a wash, since in the original expression that spare operation can run in parallel with the rest of the code. In `maj` case it is worse: we now have three operations on critical path (expression tree is one level deeper) instead of two, which inhibits the ILP.

Third (positive), I think the real target are VM modes which cannot exploit any ILP, so reduction in op count / bytecode size is sensible. For example, Zero that runs only in interpreter, and where bytecode-level ILP cannot be exploited as every bytecode instruction is handled by large chunk of code. The improvements/regressions on this code are usually visible when hashing the JMODs or jlinking the image. I see about 0.3% faster linux-x86_64-zero-release builds with this patch.

Fourth (positive), it is not unprecedented to optimize for interpreters here. I did https://bugs.openjdk.org/browse/JDK-8256523 the last time in the same code.

Are you getting +0.8% throughput on microbenchmarks?

-------------

PR: https://git.openjdk.org/jdk/pull/10365