RFR: 8341013: Optimize x86/aarch64 MD5 intrinsics by reducing data dependency [v2]

Fri Sep 27 10:46:36 UTC 2024

On Thu, 26 Sep 2024 14:58:49 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:

>> As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'b' by recognizing that ((d & b) | (~d & c)) is equivalent to ((d & b) + (~d & c)) in this scenario, because the two terms never have a set bit in common. We can then perform those additions independently, leaving the dependency on b to the final addition. This speeds up MD5 by around 5%.
>> 
>> Benchmark results on my two hosts:
>> 
>> 
>> Benchmark                  (algorithm)  (dataSize)  (provider)   Mode  Cnt    Score   Error  Units
>> 
>> x86 Before:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  636.389 ± 0.240  ops/s
>> 
>> x86 After:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  671.611 ± 0.226  ops/s (+5.5%)
>> 
>> 
>> aarch64 Before:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  498.613 ± 0.359  ops/s
>> 
>> aarch64 After:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  526.008 ± 0.491  ops/s (+5.6%)
>
> Oli Gillespie has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix aarch64 bug

Overall it's a nice optimization! One minor comment about the aarch64 part.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 3422:

> 3420:     reg_cache.extract_u32(rscratch1, k);
> 3421:     __ movw(rscratch2, t);
> 3422:     __ addw(rscratch4, r1, rscratch2);

Can you try replacing these 2 lines (3421-3422) with the following?

    __ movw(rscratch4, t);
    __ addw(rscratch4, r1, rscratch4);

I expect it could bring a further performance gain, but I'm not sure.
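
For reference, the identity the patch relies on can be checked outside the stub generator. The following is only a minimal standalone C++ sketch, not the intrinsic code: the function names (rotl32, gg_reference, gg_shortcut, steps_agree) are hypothetical, and the xorshift generator is just an assumed source of test inputs. Since (d & b) and (~d & c) select disjoint bits, adding them never carries, so '|' and '+' give the same result, and every term that does not involve b can be summed before b is available.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by s bits; s is expected to be in 1..31.
static inline uint32_t rotl32(uint32_t v, unsigned s) {
    return (v << s) | (v >> (32 - s));
}

// Textbook MD5 round-2 step: G(b, c, d) is formed first, so the whole
// chain of additions has to wait until b is known.
uint32_t gg_reference(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                      uint32_t x, uint32_t t, unsigned s) {
    uint32_t g = (d & b) | (~d & c);
    return b + rotl32(a + g + x + t, s);
}

// Reordered step: (d & b) and (~d & c) never share a set bit, so '|' can
// be replaced by '+'. Everything that does not depend on b is summed
// first, and the b-dependent term is added last.
uint32_t gg_shortcut(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                     uint32_t x, uint32_t t, unsigned s) {
    uint32_t without_b = a + x + t + (~d & c);  // independent of b
    uint32_t sum = without_b + (d & b);         // only this add needs b
    return b + rotl32(sum, s);
}

// Compare both variants on pseudo-random inputs produced by a simple
// xorshift32 generator (hypothetical helper; seed must be non-zero).
bool steps_agree(uint32_t seed, int iterations) {
    uint32_t state = seed;
    auto next = [&state]() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    };
    const unsigned shifts[4] = {5, 9, 14, 20};  // MD5 round-2 shift amounts
    for (int i = 0; i < iterations; i++) {
        uint32_t a = next(), b = next(), c = next(), d = next();
        uint32_t x = next(), t = next();
        unsigned s = shifts[i % 4];
        if (gg_reference(a, b, c, d, x, t, s) != gg_shortcut(a, b, c, d, x, t, s)) {
            return false;
        }
    }
    return true;
}

// Example use: assert(steps_agree(0x12345678u, 100000));
```

This only demonstrates that the two forms compute the same value; whether the reordering pays off depends on how the additions are scheduled on the target CPU, which is what the benchmark numbers above measure.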

-------------

PR Review: https://git.openjdk.org/jdk/pull/21203#pullrequestreview-2333399088
PR Review Comment: https://git.openjdk.org/jdk/pull/21203#discussion_r1778419407