RFR: 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions [v2]

Wed Dec 16 02:02:55 UTC 2020

On Mon, 14 Dec 2020 11:45:05 GMT, Dong Bo <dongbo at openjdk.org> wrote:

>> This patch is very hard to review because much of it is just moving things around. Please do this as two PRs, one which does all the moves and one with the substantive changes. Thanks.
>
> Thanks for the quick reply. Deleted all the moves in the updated version.
> 
> We tested this on the two servers we have by hand now. It is a pity for us that we didn't see performance improvements on `Cortex-A72`.
> Because we have detected microarchitecture in the code already, I thought we could make full use of the flexibility it provided.
> 
> We find the shift and insert instructions are used in Linux OS for serveral crypto algorithms for all CPUs [1, 2, 3].
> But as of now, we only have the micro benchmarks for JDK as shown before.
> I'll try to investigate to see if there is any workload can benifit from this.
> 
> Maybe someone from the community can help test this on other CPUs, like `ThunderX`, `Cortex-A73`? Thanks. :)
> 
> [1] https://github.com/torvalds/linux/blob/master/arch/arm64/crypto/chacha-neon-core.S
> [2] https://github.com/torvalds/linux/blob/master/arch/arm64/crypto/crct10dif-ce-core.S
> [3] https://github.com/torvalds/linux/blob/master/arch/arm64/crypto/sha512-armv8.pl

> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-dev](mailto:hotspot-dev at openjdk.java.net):_
> 
> On 14/12/2020 11:48, Dong Bo wrote:
> 
> > We tested this on the two servers we have by hand now. It is a pity
> > for us that we didn't see performance improvements on `Cortex-A72`.
> > Because we have detected microarchitecture in the code already, I
> > thought we could make full use of the flexibility it provided.
> > We find the shift and insert instructions are used in Linux OS for
> > serveral crypto algorithms for all CPUs [1, 2, 3].
> 
> I don't want to be unfair to you. There are many small optimizations
> that we have made for which we don't require large-scale benchmarks.
> I'm not going to start insisting that every small change shows a
> gain on such benchmarks.
> 
> But this patch seems to be very marginal: not only does it offer a
> modest gain one one machine, it also shows a modest loss on another.
> Not only that, but we do not know if auto-vectorization is at all
> effective on these OpenJDK's algorithms.
> 
Agree. It's marginal. These crypto algorithms in JDK are not vectorized.
I checked some other Java code in core-libs, didn't find any candidate of this optimization.
The rotate instructions are also negligible or inexistent in applications we investigated, like Spark, Flink, Tomcat.
> My reason for pushing back here is that I want HotSpot to generate "good"
> code for all AArch64 platforms, rather than containing special patterns
> for microarchitectures. There are risks if we allow that. For example,
> mistakes which affect only one machine might be made by accident, and
> they cannot be tested by anyone who does not have that machine. Also,
> the AArch64 back end becomes complex and hard to maintain.
> 
> For example, a simple mistake made during a critical patch update
> might break Kunpeng920, and unless everybody tests on all machines, we
> might not know until it's too late to fix it.
> 
> So, any change which causes divergence between microarchitectures must
> be needed, or it isn't worth the risk. It's not in your interest for
> that to happen, either.
> 
OKAY, this does make sense to us, we don't want to take the possible risk neither.
By far, I think it's not worthy, unless we see a profitable workload one day.
As of now, I am going to withdraw this.

Thanks for the comments.
Best Regards.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1761