RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
Bhateja, Jatin
jatin.bhateja at intel.com
Wed Jul 22 10:27:26 UTC 2020
Hi Vladimir,
Please find the updated patch at the following link:
http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
Change Summary:
1) Unified the handling of scalar rotate operations. All scalar rotate selection patterns now depend on the newly created RotateLeft/RotateRight nodes, which improves rotate inferencing. Currently, if the DAG nodes corresponding to a sub-pattern are shared (have multiple users), the existing complex patterns based on Or/LShiftL/URShift do not get matched, which prevents rotate nodes from being inferred; it can also cause a performance degradation if the shift-or dependency chain appears inside a hot region. Please refer to the JIT'ed assembly output with the baseline[1] and with the patch[2]: the generated code size went down from 832 bytes to 768 bytes.
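For reference, the shift-or idiom that the new nodes subsume can be written out in plain Java (a minimal sketch for illustration only; `rotlShiftOr` is a hypothetical name, not code from the patch):

```java
public class RotateIdiom {
    // The classic shift-or rotate pattern that C2 previously matched
    // structurally in the AD file: (x << s) | (x >>> (32 - s)).
    // Under the patch, this whole expression is represented by a
    // single RotateLeft node during idealization.
    static int rotlShiftOr(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        // Both forms compute the same value for every shift amount
        // (Java masks int shift counts to 5 bits).
        for (int s = 0; s < 32; s++) {
            if (rotlShiftOr(0x12345678, s) != Integer.rotateLeft(0x12345678, s)) {
                throw new AssertionError("mismatch at shift " + s);
            }
        }
        System.out.println("ok");
    }
}
```

The point of matching the idiom at the node level rather than in the matcher is exactly the sharing problem described above: once any sub-expression of the pattern has a second user, a matcher-level pattern no longer applies, while an Ideal-time rewrite still does.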
2) Due to the enhanced rotate inferencing, the new patch shows better performance even on legacy (non-AVX-512) targets. Please refer to the perf results[3] on an AVX2 machine for the JMH benchmark that is part of the patch.
3) As suggested, removed the Java API intrinsification changes; the scalar rotate transformation is now done during OrI/OrL node idealization.
4) SLP now always operates on the new scalar Rotate nodes and creates vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes if the target does not support vector rotates (non-AVX512).
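The loop shape SLP vectorizes here looks roughly like the following (a hypothetical sketch; `rotateAll` is an illustrative name, not code from the patch):

```java
public class VectorRotateLoop {
    // A counted loop rotating each element by a loop-invariant amount:
    // the shape SLP can turn into RotateLeftV on AVX-512 targets, or
    // degenerate back into OrV/LShiftV/URShiftV nodes elsewhere.
    static void rotateAll(int[] dst, int[] src, int shift) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], shift);
        }
    }

    public static void main(String[] args) {
        int[] src = new int[512];
        int[] dst = new int[512];
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 0x9E3779B9;
        }
        rotateAll(dst, src, 20);
        System.out.println(dst[1] == Integer.rotateLeft(0x9E3779B9, 20));
    }
}
```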
5) Added new instruction patterns for vector shift left/right operations with constant shift operands; this avoids emitting extra moves to XMM registers.
6) Constant-folding scenarios are covered in RotateLeft/RotateRight idealization; inferencing vector rotates through OrV idealization covers the vector patterns generated through the non-SLP route, i.e. the Vector API.
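The algebra behind the constant-folding cases can be stated as plain Java identities (a sketch of the identities themselves, not the C2 Ideal code):

```java
public class RotateFolding {
    public static void main(String[] args) {
        int x = 0xCAFEBABE;
        // A rotate of a constant input by a constant count folds to a constant.
        check(Integer.rotateLeft(0x1, 4) == 0x10);
        // Rotation counts are reduced mod 32 for int (mod 64 for long).
        check(Integer.rotateLeft(x, 36) == Integer.rotateLeft(x, 4));
        // A left rotate by s equals a right rotate by 32 - s.
        check(Integer.rotateLeft(x, 4) == Integer.rotateRight(x, 28));
        // Rotating by 0 (or any multiple of 32) is the identity.
        check(Integer.rotateLeft(x, 32) == x);
        System.out.println("ok");
    }

    static void check(boolean cond) {
        if (!cond) throw new AssertionError();
    }
}
```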
Following are the results of the JMH benchmark on an AVX3 target.
Baseline:
Benchmark (SHIFT) (TESTSIZE) Mode Cnt Score Error Units
RotateBenchmark.testRotateLeftI 20 512 thrpt 2 33541.569 ops/ms
RotateBenchmark.testRotateLeftL 20 512 thrpt 2 20363.973 ops/ms
RotateBenchmark.testRotateRightI 20 512 thrpt 2 33944.085 ops/ms
RotateBenchmark.testRotateRightL 20 512 thrpt 2 20443.967 ops/ms
With Changes:
Benchmark (SHIFT) (TESTSIZE) Mode Cnt Score Error Units
RotateBenchmark.testRotateLeftI 20 512 thrpt 2 48439.220 ops/ms
RotateBenchmark.testRotateLeftL 20 512 thrpt 2 35758.933 ops/ms
RotateBenchmark.testRotateRightI 20 512 thrpt 2 49702.219 ops/ms
RotateBenchmark.testRotateRightL 20 512 thrpt 2 35618.666 ops/ms
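For context, the kernel measured above is shaped roughly like this (a dependency-free sketch with hypothetical names; the actual JMH source is in the webrev):

```java
public class RotateKernel {
    // Parameters mirroring the benchmark table above:
    // SHIFT = 20, TESTSIZE = 512.
    static final int SHIFT = 20;
    static final int TESTSIZE = 512;

    // Rotate every element of the input array by SHIFT; this is the
    // loop body the benchmark exercises per invocation.
    static int[] testRotateLeftI(int[] a) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = Integer.rotateLeft(a[i], SHIFT);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = new int[TESTSIZE];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        int[] r = testRotateLeftI(a);
        System.out.println(r[3] == Integer.rotateLeft(3, SHIFT));
    }
}
```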
Please push the patch through your testing framework and let me know your review feedback.
Best Regards,
Jatin
[1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
[2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
[3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Saturday, July 18, 2020 12:25 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley <aph at redhat.com>
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>
> Hi Jatin,
>
> > http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>
> It definitely looks better, but IMO it hasn't reached the sweet spot yet.
> It feels like the focus is on auto-vectorizer while the burden is put on
> scalar cases.
>
> First of all, considering GVN folds relevant operation patterns into a
> single Rotate node now, what's the motivation to introduce intrinsics?
>
> Another point is there's still significant duplication for scalar cases.
>
> I'd prefer to see the legacy cases which rely on pattern matching go
> away and be substituted with instructions which match Rotate instructions
> (migrating ).
>
> I understand that it will penalize the vectorization implementation, but
> IMO reducing overall complexity is worth it. On auto-vectorizer side, I see
> 2 ways to fix it:
>
> (1) introduce additional AD instructions for RotateLeftV/RotateRightV
> specifically for pre-AVX512 hardware;
>
> (2) in SuperWord::output(), when the matcher doesn't support
> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
> generate a vectorized version of the original pattern.
>
> Overall, it looks like more and more focus is made on scalar part.
> Considering the main goal of the patch is to enable vectorization, I'm fine
> with separating cleanup of scalar part. As an interim solution, it seems
> that leaving the scalar part as it is now and matching scalar bit rotate
> pattern in VectorNode::is_rotate() should be enough to keep the
> vectorization part functioning. Then scalar Rotate nodes and relevant
> cleanups can be integrated later. (Or vice versa: clean up scalar part
> first and then follow up with vectorization.)
>
> Some other comments:
>
> * There's a lot of duplication between OrINode::Ideal and OrLNode::Ideal.
> What do you think about introducing a super type
> (OrNode) and put a unified version (OrNode::Ideal) there?
>
>
> * src/hotspot/cpu/x86/x86.ad
>
> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>
> +instruct vprorate(vec dst, vec src, vec shift) %{
> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>
> The predicates are redundant here.
>
>
> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>
> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
> +                                     XMMRegister dst, XMMRegister src,
> +                                     int shift, int vector_len) {
> +  if (opcode == Op_RotateLeftV) {
> +    if (etype == T_INT) {
> +      evprold(dst, src, shift, vector_len);
> +    } else {
> +      evprolq(dst, src, shift, vector_len);
> +    }
>
> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
>
>
> * On testing (with the previous version of the patch): -XX:UseAVX is an
> x86-specific flag, so the new/adjusted tests now fail on non-x86
> platforms. Either omitting the flag or adding
> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>
> Best regards,
> Vladimir Ivanov
>
> >
> >
> > Summary of changes:
> > 1) The optimization specifically targets the vector rotation
> > instructions added for X86 AVX512. A single rotate instruction
> > encapsulates the entire vector OR/SHIFT pattern and thus offers better
> > latency at a reduced instruction count.
> >
> > 2) There were two approaches to implement this:
> > a) Let everything remain the same and add new wide complex
> > instruction patterns in the matcher, e.g.
> > set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift)) (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
> > It would have been an overoptimistic assumption to expect that the
> > graph shape would be preserved until the matcher for correct
> > inferencing. In addition, we would have required multiple such bulky
> > patterns.
> > b) Create new RotateLeft/RotateRight scalar nodes. These get
> > generated during intrinsification as well as during additional pattern
> > matching during node idealization; later on, these nodes are consumed
> > by SLP for valid vectorization scenarios to emit their vector
> > counterparts, which eventually emit vector rotates.
> >
> > 3) I chose approach 2b) since it is cleaner. The only problem was
> > that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would
> > either have to be dismantled back into the OR/SHIFT pattern, or we
> > would penalize vectorization, which would be very costly. The other
> > option would have been to add an additional vector rotate pattern for
> > UseAVX=3 in the matcher which emits vector OR-SHIFT instructions, but
> > then we would lose the efficient instruction sequence that node
> > sharing (OrV/LShiftV/URShift) offers in the current implementation.
> > Thus it would not be beneficial for non-AVX512 targets: the only
> > saving would be the cleanup of a few existing scalar rotate matcher
> > patterns, and old targets do not offer this powerful rotate
> > instruction. Therefore the new scalar nodes are created only for
> > AVX512 targets.
> >
> > As per the suggestions, constant-folding scenarios have been covered
> > during idealization of the newly added scalar nodes.
> >
> > Please review the latest version and share your feedback and test
> results.
> >
> > Best Regards,
> > Jatin
> >
> >
> >> -----Original Message-----
> >> From: Andrew Haley <aph at redhat.com>
> >> Sent: Saturday, July 11, 2020 2:24 PM
> >> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
> >> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
> >> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> >> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for
> >> X86
> >>
> >> On 10/07/2020 18:32, Vladimir Ivanov wrote:
> >>
> >> > High-level comment: so far, there was no pressing need to
> >> > explicitly mark the methods as intrinsics. ROR/ROL instructions
> >> > were selected during matching [1]. Now the patch introduces
> >> > dedicated nodes (RotateLeft/RotateRight) specifically for
> >> > intrinsics, which partly duplicates existing logic.
> >>
> >> The lack of rotate nodes in the IR has always meant that AArch64
> >> doesn't generate optimal code for e.g.
> >>
> >> (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
> >>
> >> because, with the RotateLeft expanded to its full combination of ORs
> >> and shifts, it's too complicated to match. At the time I put this to
> >> one side because it wasn't urgent. This is a shame because, although
> >> such combinations are unusual, they are used in some crypto operations.
> >>
> >> If we can generate immediate-form rotate nodes early by pattern
> >> matching during parsing (rather than depending on intrinsics) we'll
> >> get more value than by depending on programmers calling intrinsics.
> >>
> >> --
> >> Andrew Haley (he/him)
> >> Java Platform Lead Engineer
> >> Red Hat UK Ltd. <https://www.redhat.com>
> >> https://keybase.io/andrewhaley
> >> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> >