RFR[M]: Adding MD5 Intrinsic on x86-64

Tue Aug 4 17:19:56 UTC 2020

Hi Ludovic,

On 8/3/20 9:07 PM, Ludovic Henry wrote:
> Updated webrev: http://cr.openjdk.java.net/~luhenry/8250902/webrev.02
> 
>> Next code in inline_digestBase_implCompressMB should be reversed (get_long_*() should be called for long_state):
>>
>>     if (long_state) {
>>       state = get_state_from_digestBase_object(digestBase_obj);
>>     } else {
>>       state = get_long_state_from_digestBase_object(digestBase_obj);
>>     }
> 
> Thanks for pointing that out. I tested everything with `hotspot:tier1` and `jdk:tier1` in fastdebug on Windows-x86, Windows-x64 and Linux-x64.

Code in library_call.cpp is good now.

> 
>> It seems that the algorithm can be optimized further using SSE/AVX instructions. I am not aware of any specific SSE/AVX implementation which leverages those instructions in the best possible way. Sandhya can chime in more on that.
> 
> I have done some research prior to implementing this intrinsic and the only pointers I could find to vectorized MD5 is on computing _multiple_ MD5 hashes in parallel but not a _single_ MD5 hash. Using vectors effectively parallelize the computation of many MD5 hash, but it does not accelerate the computation of a single MD5 hash. And looking at the algorithm, every step depends on the previous step's result, which make it particularly hard to parallelize/vectorize.
> 
>> As far as I know, I came across this which points to MD5 SSE/AVX implementation. https://software.intel.com/content/www/us/en/develop/articles/intel-isa-l-cryptographic-hashes-for-cloud-storage.html
> 
> That library points to computing many MD5 hashes in parallel. Quoting: "Intel® ISA-L uses a novel technique called multi-buffer hashing, which [...] compute several hashes at once within a single core." That is similar to what I found in researching how to vectorize MD5. I also did not find any reference of an ISA-level implementation of MD5, neither in x86 nor ARM.
> 
> If you can point me to a document describing how to vectorize MD5, I would be more than happy to take a look and implement the algorithm. However, my understanding is that MD5 is not vectorizable by-design.

I would leave this investigation to Intel's Java group. They are expert in this area!

For now, lets put current implementation into JDK.

> 
>> Add tests to verify intrinsic implementation. You can use test/hotspot/jtreg/compiler/intrinsics/sha/ as examples.
> 
> I looked at these tests and they already cover MD5. I am not sure what's the best way to add tests here: 1. should I rename ` compiler/intrinsics/sha` to ` compiler/intrinsics/digest` and add the md5 tests there, 2. should I just add ` compiler/intrinsics/md5`, or 3. the name doesn't matter and I can just add it in ` compiler/intrinsics/sha`?

3. Just add MD5 tests into existing SHA directory.

Note, compiler/intrinsics/sha testing is done in tier2. I ran it and it passed but it does not test MD5 a lot as I 
understand.

> 
>> In vm_version_x86.cpp move UseMD5Intrinsics flag setting near UseSHA flag setting.
> 
> Fixed.

It is not moved in webrev.02

> 
>> In new file macroAssembler_x86_md5.cpp no need empty line after copyright line. There is also typo 'rrdistribute':
>>
>>    * This code is free software; you can rrdistribute it and/or modify it
>>
>> Our validate-headers check failed. See GPL header template: ./make/templates/gpl-header
> 
> I updated the header, and added the license for the original code for the MD5 core algorithm.

You don't need to use Oracle copyright line. Using original Microsoft's copyright line is fine since you are author.
Thank you for adding license for original code.

> 
>> Did you test it on 32-bit x86?
> 
> I did run `hotspot:tier1` and `jdk:tier1` on Windows-x86, Windows-x64 and Linux-x64.
> 
>> Would be interesting to see result of artificially switching off AVX and SSE:
>> '-XX:UseSSE=0 -XX:UseAVX=0'. It will make sure that only general instructions are needed.
> 
> The results are below:

Very good. Thank you for testing it.

Regards,
Vladimir

> 
> -XX:-UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3512.618 ± 9.384  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   450.037 ± 1.213  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.887 ± 0.057  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.485 ± 0.002  ops/ms
> 
> -XX:+UseMD5Intrinsics
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4212.156 ± 7.781  ops/ ms => 19% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   548.609 ± 1.374  ops/ ms => 22% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    37.961 ± 0.079  ops/ ms => 27% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.596 ± 0.006  ops/ ms => 23% speedup
> 
> -XX:-UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score    Error   Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  3462.769 ± 4.992  ops/ms
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   443.858 ± 0.576  ops/ms
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    29.723 ± 0.480  ops/ms
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.470 ± 0.001  ops/ms
> 
> -XX:+UseMD5Intrinsics -XX:UseSSE=0 -XX:UseAVX=0
> Benchmark              (digesterName)  (length)  (provider)   Mode  Cnt     Score   Error   Units
> MessageDigests.digest             md5        64     DEFAULT  thrpt   10  4237.219 ± 15.627  ops/ms => 22% speedup
> MessageDigests.digest             md5      1024     DEFAULT  thrpt   10   564.625 ±  1.510  ops/ms => 27% speedup
> MessageDigests.digest             md5     16384     DEFAULT  thrpt   10    38.004 ±  0.078  ops/ms => 28% speedup
> MessageDigests.digest             md5   1048576     DEFAULT  thrpt   10     0.597 ±  0.002  ops/ms => 27% speedup
> 
> Thank you,
> Ludovic
>