RFR: 8341013: Optimize x86/aarch64 MD5 intrinsics by reducing data dependency
Oli Gillespie
ogillespie at openjdk.org
Thu Sep 26 14:30:35 UTC 2024
On Thu, 26 Sep 2024 14:11:42 GMT, Hamlin Li <mli at openjdk.org> wrote:
>> As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'b' by recognizing that ((d & b) | (~d & c)) is equivalent to ((d & b) + (~d & c)) in this scenario: the two terms can never have the same bit set, so the addition produces no carries. We can then perform those additions independently, leaving our dependency on b to the final addition. This improves MD5 digest throughput by around 5%.
>>
>> Benchmark results on my two hosts:
>>
>>
>> Benchmark                  (algorithm)  (dataSize)  (provider)   Mode  Cnt    Score   Error  Units
>>
>> x86 Before:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  636.389 ± 0.240  ops/s
>>
>> x86 After:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  671.611 ± 0.226  ops/s  (+5.5%)
>>
>>
>> aarch64 Before:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  498.613 ± 0.359  ops/s
>>
>> aarch64 After:
>> MessageDigestBench.digest          MD5     1048576              thrpt   10  526.008 ± 0.491  ops/s  (+5.6%)
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 3428:
>
>> 3426: __ andw(rscratch2, r2, r4);
>> 3427: __ addw(rscratch2, rscratch2, rscratch3);
>> 3428: __ rorw(rscratch2, rscratch3, 32 - s);
>
> Does this mean that the value computed into `rscratch2` at lines 3426-3427 is discarded at line 3428?
Yes! Well spotted, my complete mistake. I'm kinda shocked it passed so many tests 😕. I will fix it.
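
For readers following the thread, here is a minimal sketch in plain C++ of the identity from the quoted description and of a round-2 step that defers its use of b. It is not the HotSpot stub code: the names g_or, g_add, rotl32, md5_g_step and md5_g_step_ref are hypothetical and exist only for illustration, and the sample inputs are arbitrary apart from 0xf61e2562, which is the first round-2 constant in MD5. The sketch also shows that the rotate has to take the accumulated sum as its input, which is the value the review comment found was being dropped.

```cpp
#include <cstdint>

// Textbook MD5 G function: take bits of b where d is set, bits of c where d is clear.
constexpr std::uint32_t g_or(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return (d & b) | (~d & c);
}

// Same value computed with '+': (d & b) and (~d & c) never share a set bit,
// so the addition cannot carry and behaves exactly like the OR.
constexpr std::uint32_t g_add(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return (d & b) + (~d & c);
}

// 32-bit rotate left, valid for 0 < s < 32 (MD5 only uses such shifts).
constexpr std::uint32_t rotl32(std::uint32_t v, unsigned s) {
    return (v << s) | (v >> (32 - s));
}

// Reference form of one round-2 step (hypothetical helper).
constexpr std::uint32_t md5_g_step_ref(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                                       std::uint32_t d, std::uint32_t x, std::uint32_t k,
                                       unsigned s) {
    return b + rotl32(a + g_or(b, c, d) + x + k, s);
}

// Same step with the additions reordered so that b, the most recently
// produced value, is not needed until the last addition before the rotate.
constexpr std::uint32_t md5_g_step(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                                   std::uint32_t d, std::uint32_t x, std::uint32_t k,
                                   unsigned s) {
    std::uint32_t acc = a + x + k;   // independent of b
    acc += (~d & c);                 // independent of b
    acc += (d & b);                  // first use of b
    return b + rotl32(acc, s);       // rotate the accumulated sum, not a stale temporary
}

// Compile-time spot checks on arbitrary sample values.
static_assert(g_or(0x89abcdefu, 0xfedcba98u, 0x76543210u) ==
              g_add(0x89abcdefu, 0xfedcba98u, 0x76543210u),
              "OR and ADD forms of G agree");
static_assert(md5_g_step(0x67452301u, 0xefcdab89u, 0x98badcfeu, 0x10325476u,
                         0x00000001u, 0xf61e2562u, 5) ==
              md5_g_step_ref(0x67452301u, 0xefcdab89u, 0x98badcfeu, 0x10325476u,
                             0x00000001u, 0xf61e2562u, 5),
              "reordered step matches the reference step");
```

Unsigned 32-bit addition wraps, so reordering the additions does not change the result. The benefit is that the first two additions can be issued before the previous step has produced b.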
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21203#discussion_r1777187116