RFR[M]: Adding MD5 Intrinsic on x86-64
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Aug 4 17:19:56 UTC 2020
Hi Ludovic,
On 8/3/20 9:07 PM, Ludovic Henry wrote:
> Updated webrev: http://cr.openjdk.java.net/~luhenry/8250902/webrev.02
>
>> The following code in inline_digestBase_implCompressMB should be reversed (get_long_*() should be called for long_state):
>>
>> if (long_state) {
>> state = get_state_from_digestBase_object(digestBase_obj);
>> } else {
>> state = get_long_state_from_digestBase_object(digestBase_obj);
>> }
>
> Thanks for pointing that out. I tested everything with `hotspot:tier1` and `jdk:tier1` in fastdebug on Windows-x86, Windows-x64 and Linux-x64.
Code in library_call.cpp is good now.
>
>> It seems that the algorithm can be optimized further using SSE/AVX instructions. I am not aware of any specific SSE/AVX implementation which leverages those instructions in the best possible way. Sandhya can chime in more on that.
>
> I did some research before implementing this intrinsic, and the only pointers I could find to vectorized MD5 are about computing _multiple_ MD5 hashes in parallel, not a _single_ MD5 hash. Using vectors effectively parallelizes the computation of many MD5 hashes, but it does not accelerate the computation of a single MD5 hash. And looking at the algorithm, every step depends on the previous step's result, which makes it particularly hard to parallelize/vectorize.
>
>> I did come across this, which points to an MD5 SSE/AVX implementation: https://software.intel.com/content/www/us/en/develop/articles/intel-isa-l-cryptographic-hashes-for-cloud-storage.html
>
> That library points to computing many MD5 hashes in parallel. Quoting: "Intel® ISA-L uses a novel technique called multi-buffer hashing, which [...] compute several hashes at once within a single core." That is similar to what I found in researching how to vectorize MD5. I also did not find any reference to an ISA-level implementation of MD5, either on x86 or on ARM.
>
> If you can point me to a document describing how to vectorize MD5, I would be more than happy to take a look and implement the algorithm. However, my understanding is that MD5 is not vectorizable by design.
I would leave this investigation to Intel's Java group. They are experts in this area!
For now, let's put the current implementation into the JDK.
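To illustrate the dependency between steps that Ludovic describes, here is a plain Java rendering of the MD5 block function. It is only a minimal sketch following RFC 1321, not the intrinsic from the webrev; the class name `Md5Sketch` and its method names are hypothetical and chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of the JDK or of the webrev.
public class Md5Sketch {
    // Rotation amounts from RFC 1321: four values per round, each group reused four times.
    private static final int[] SHIFTS = {
        7, 12, 17, 22,   5, 9, 14, 20,   4, 11, 16, 23,   6, 10, 15, 21
    };

    // Processes one 64-byte block starting at 'offset', updating 'state' in place.
    static void compress(int[] state, byte[] data, int offset) {
        int[] m = new int[16];
        ByteBuffer.wrap(data, offset, 64).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer().get(m);
        int a = state[0], b = state[1], c = state[2], d = state[3];
        for (int i = 0; i < 64; i++) {
            int round = i >>> 4;
            int f, g;
            switch (round) {
                case 0:  f = (b & c) | (~b & d); g = i;                break;
                case 1:  f = (d & b) | (~d & c); g = (5 * i + 1) & 15; break;
                case 2:  f = b ^ c ^ d;          g = (3 * i + 5) & 15; break;
                default: f = c ^ (b | ~d);       g = (7 * i) & 15;     break;
            }
            // Round constant: floor(2^32 * |sin(i + 1)|), as defined in RFC 1321.
            int k = (int) (long) (Math.abs(Math.sin(i + 1)) * 4294967296.0);
            int sum = a + f + k + m[g];
            // newB depends on the b produced by the immediately preceding step
            // (through f and the final addition), so step i + 1 cannot start
            // before step i has finished.
            int newB = b + Integer.rotateLeft(sum, SHIFTS[round * 4 + (i & 3)]);
            a = d; d = c; c = b; b = newB;
        }
        state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    }

    // Pads the message as RFC 1321 requires and runs compress() over each block.
    static byte[] digest(byte[] msg) {
        int paddedLen = ((msg.length + 8) / 64 + 1) * 64;
        byte[] padded = Arrays.copyOf(msg, paddedLen);
        padded[msg.length] = (byte) 0x80;
        ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN)
                  .putLong(paddedLen - 8, (long) msg.length * 8);
        int[] state = {0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476};
        for (int off = 0; off < paddedLen; off += 64) {
            compress(state, padded, off);
        }
        ByteBuffer out = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int word : state) out.putInt(word);
        return out.array();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) sb.append(String.format("%02x", x & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(digest("abc".getBytes(StandardCharsets.UTF_8))));
    }
}
```

The 64 steps form one serial chain per block, and each block starts from the state left by the previous one. That is why the vectorized approaches mentioned above work across independent messages rather than within a single one.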
>
>> Add tests to verify intrinsic implementation. You can use test/hotspot/jtreg/compiler/intrinsics/sha/ as examples.
>
> I looked at these tests and they already cover MD5. I am not sure what's the best way to add tests here: 1. should I rename `compiler/intrinsics/sha` to `compiler/intrinsics/digest` and add the md5 tests there, 2. should I just add `compiler/intrinsics/md5`, or 3. the name doesn't matter and I can just add it in `compiler/intrinsics/sha`?
3. Just add MD5 tests into existing SHA directory.
Note, compiler/intrinsics/sha testing is done in tier2. I ran it and it passed, but as I understand it, it does not test MD5 much.
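As a rough idea of what an additional MD5 check could verify, here is a minimal stand-alone sketch that compares `MessageDigest` output against the RFC 1321 test vectors. The class `Md5KnownAnswerCheck` and its methods are hypothetical; this is not a jtreg test and does not follow the structure of the existing tests in compiler/intrinsics/sha.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-alone check; a real test would follow the conventions
// of the existing tests under test/hotspot/jtreg/compiler/intrinsics/sha/.
public class Md5KnownAnswerCheck {
    // Test vectors from the RFC 1321 test suite: {input, expected MD5 in hex}.
    static final String[][] VECTORS = {
        {"", "d41d8cd98f00b204e9800998ecf8427e"},
        {"a", "0cc175b9c0f1b6a831c399e269772661"},
        {"abc", "900150983cd24fb0d6963f7d28e17f72"},
        {"message digest", "f96b697d7cb7938d525a2f31aaf161d0"},
    };

    static String md5Hex(byte[] input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(input);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Repeats the comparison so the JIT has a chance to compile the digest code.
    // Whether an intrinsic is actually used depends on the VM and its flags.
    static int countMismatches(int iterations) {
        int mismatches = 0;
        for (int i = 0; i < iterations; i++) {
            for (String[] v : VECTORS) {
                byte[] input = v[0].getBytes(StandardCharsets.UTF_8);
                if (!md5Hex(input).equals(v[1])) mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        int mismatches = countMismatches(20_000);
        if (mismatches != 0) throw new AssertionError(mismatches + " MD5 mismatches");
        System.out.println("all MD5 vectors matched");
    }
}
```

Running the same class once with -XX:-UseMD5Intrinsics and once with -XX:+UseMD5Intrinsics would compare the Java and intrinsic paths against the same expected values.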
>
>> In vm_version_x86.cpp move UseMD5Intrinsics flag setting near UseSHA flag setting.
>
> Fixed.
It is not moved in webrev.02.
>
>> In the new file macroAssembler_x86_md5.cpp there is no need for an empty line after the copyright line. There is also a typo, 'rrdistribute':
>>
>> * This code is free software; you can rrdistribute it and/or modify it
>>
>> Our validate-headers check failed. See GPL header template: ./make/templates/gpl-header
>
> I updated the header, and added the license for the original code for the MD5 core algorithm.
You don't need to use the Oracle copyright line. Using the original Microsoft copyright line is fine since you are the author.
Thank you for adding the license for the original code.
>
>> Did you test it on 32-bit x86?
>
> I did run `hotspot:tier1` and `jdk:tier1` on Windows-x86, Windows-x64 and Linux-x64.
>
>> It would be interesting to see the result of artificially switching off AVX and SSE:
>> '-XX:UseSSE=0 -XX:UseAVX=0'. It will make sure that only general instructions are needed.
>
> The results are below:
Very good. Thank you for testing it.
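For anyone re-deriving the speedup column in the tables quoted below, here is a small sketch of the arithmetic. It assumes the percentages are the relative throughput gain, (with intrinsic / without intrinsic - 1) * 100; the thread does not state the formula, and the class name `SpeedupCalc` is hypothetical.

```java
// Hypothetical helper for checking the speedup figures; not part of the webrev.
public class SpeedupCalc {
    // Relative throughput gain in percent, assuming gain = (intrinsic / baseline - 1) * 100.
    static double speedupPercent(double baseline, double intrinsic) {
        return (intrinsic / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms for the 64-byte case, copied from the benchmark tables in this thread.
        System.out.printf("default flags:     %.1f%%%n", speedupPercent(3512.618, 4212.156));
        System.out.printf("UseSSE=0 UseAVX=0: %.1f%%%n", speedupPercent(3462.769, 4237.219));
    }
}
```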
Regards,
Vladimir
>
> -XX:-UseMD5Intrinsics
> Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
> MessageDigests.digest md5 64 DEFAULT thrpt 10 3512.618 ± 9.384 ops/ms
> MessageDigests.digest md5 1024 DEFAULT thrpt 10 450.037 ± 1.213 ops/ms
> MessageDigests.digest md5 16384 DEFAULT thrpt 10 29.887 ± 0.057 ops/ms
> MessageDigests.digest md5 1048576 DEFAULT thrpt 10 0.485 ± 0.002 ops/ms
>
> -XX:+UseMD5Intrinsics
> Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
> MessageDigests.digest md5 64 DEFAULT thrpt 10 4212.156 ± 7.781 ops/ms => 19% speedup
> MessageDigests.digest md5 1024 DEFAULT thrpt 10 548.609 ± 1.374 ops/ms => 22% speedup
> MessageDigests.digest md5 16384 DEFAULT thrpt 10 37.961 ± 0.079 ops/ms => 27% speedup
> MessageDigests.digest md5 1048576 DEFAULT thrpt 10 0.596 ± 0.006 ops/ms => 23% speedup
>
> -XX:-UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
> MessageDigests.digest md5 64 DEFAULT thrpt 10 3462.769 ± 4.992 ops/ms
> MessageDigests.digest md5 1024 DEFAULT thrpt 10 443.858 ± 0.576 ops/ms
> MessageDigests.digest md5 16384 DEFAULT thrpt 10 29.723 ± 0.480 ops/ms
> MessageDigests.digest md5 1048576 DEFAULT thrpt 10 0.470 ± 0.001 ops/ms
>
> -XX:+UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark (digesterName) (length) (provider) Mode Cnt Score Error Units
> MessageDigests.digest md5 64 DEFAULT thrpt 10 4237.219 ± 15.627 ops/ms => 22% speedup
> MessageDigests.digest md5 1024 DEFAULT thrpt 10 564.625 ± 1.510 ops/ms => 27% speedup
> MessageDigests.digest md5 16384 DEFAULT thrpt 10 38.004 ± 0.078 ops/ms => 28% speedup
> MessageDigests.digest md5 1048576 DEFAULT thrpt 10 0.597 ± 0.002 ops/ms => 27% speedup
>
> Thank you,
> Ludovic
>