RFR: 8283307: Vectorize unsigned shift right on signed subword types
Jie Fu
jiefu at openjdk.java.net
Mon Apr 11 13:57:37 UTC 2022
On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao <fgao at openjdk.org> wrote:
> public short[] vectorUnsignedShiftRight(short[] shorts) {
>     short[] res = new short[SIZE];
>     for (int i = 0; i < SIZE; i++) {
>         res[i] = (short) (shorts[i] >>> 3);
>     }
>     return res;
> }
>
> In C2's SLP, vectorization of unsigned right shift on signed subword types (byte/short), as in the case above, is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from what the Java spec requires. It is worthwhile to vectorize more such cases at quite low cost. Also, unsigned right shift on signed subword types is not uncommon, and similar cases can be found in the Lucene benchmark[2].
>
> Taking unsigned right shift on the short type as an example: when the shift amount is a constant not greater than the number of sign-extension bits (the 16 higher bits for the short type), the unsigned shift on signed subword types can be transformed into a signed right shift of the same amount and hence becomes vectorizable.
>
> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`, which helps vectorize the short case above. Unsigned right shift on the byte type can be handled in a similar way. The generated assembly code for one iteration on AArch64 looks like:
>
> ...
> sbfiz x13, x10, #1, #32
> add x15, x11, x13
> ldr q16, [x15, #16]
> sshr v16.8h, v16.8h, #3
> add x13, x17, x13
> str q16, [x13, #16]
> ...
>
>
> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe an improvement of about 80% with this patch.
>
> The perf data on AArch64:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
>
> After the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
>
> The perf data on X86:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
>
> After the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
>
> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
> [2] https://github.com/jpountz/decode-128-ints-benchmark/
This seems like a good idea to vectorize urshift for short/byte.
However, changing the opcode in the SuperWord code seems tricky and may not be easy to maintain.
Why not transform the scalar `urshift` --> `rshift` during the GVN phase like this?
diff --git a/src/hotspot/share/opto/memnode.cpp b/src/hotspot/share/opto/memnode.cpp
index c2e2e939bf3..190a2a44727 100644
--- a/src/hotspot/share/opto/memnode.cpp
+++ b/src/hotspot/share/opto/memnode.cpp
@@ -2867,6 +2867,24 @@ Node *StoreNode::Ideal_sign_extended_input(PhaseGVN *phase, int num_bits) {
return NULL;
}
+//------------------------------Ideal_urshift_to_rshift----------------------
+// Check for URShiftI patterns which can be transformed to RShiftI.
+// - StoreB ... (URShiftI n con) ==> StoreB ... (RShiftI n con) if con <= 24
+// - StoreC ... (URShiftI n con) ==> StoreC ... (RShiftI n con) if con <= 16
+// We perform this transformation in the hope that the shift operation may be vectorized.
+Node *StoreNode::Ideal_urshift_to_rshift(PhaseGVN *phase, int num_bits) {
+ Node *val = in(MemNode::ValueIn);
+ if( val->Opcode() == Op_URShiftI ) {
+ const TypeInt *t = phase->type( val->in(2) )->isa_int();
+ if( t && t->is_con() && (t->get_con() <= num_bits) ) {
+ Node* rshift = phase->transform(new RShiftINode(val->in(1), val->in(2)));
+ set_req_X(MemNode::ValueIn, rshift, phase);
+ return this;
+ }
+ }
+ return NULL;
+}
+
//------------------------------value_never_loaded-----------------------------------
// Determine whether there are any possible loads of the value stored.
// For simplicity, we actually check if there are any loads from the
@@ -2927,6 +2945,9 @@ Node *StoreBNode::Ideal(PhaseGVN *phase, bool can_reshape){
progress = StoreNode::Ideal_sign_extended_input(phase, 24);
if( progress != NULL ) return progress;
+ progress = StoreNode::Ideal_urshift_to_rshift(phase, 24);
+ if( progress != NULL ) return progress;
+
// Finally check the default case
return StoreNode::Ideal(phase, can_reshape);
}
@@ -2942,6 +2963,9 @@ Node *StoreCNode::Ideal(PhaseGVN *phase, bool can_reshape){
progress = StoreNode::Ideal_sign_extended_input(phase, 16);
if( progress != NULL ) return progress;
+ progress = StoreNode::Ideal_urshift_to_rshift(phase, 16);
+ if( progress != NULL ) return progress;
+
// Finally check the default case
return StoreNode::Ideal(phase, can_reshape);
}
diff --git a/src/hotspot/share/opto/memnode.hpp b/src/hotspot/share/opto/memnode.hpp
index 7c02a1b0861..7dd9d8bd268 100644
--- a/src/hotspot/share/opto/memnode.hpp
+++ b/src/hotspot/share/opto/memnode.hpp
@@ -565,6 +565,7 @@ protected:
Node *Ideal_masked_input (PhaseGVN *phase, uint mask);
Node *Ideal_sign_extended_input(PhaseGVN *phase, int num_bits);
+ Node *Ideal_urshift_to_rshift (PhaseGVN *phase, int num_bits);
public:
// We must ensure that stores of object references will be visible
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979
More information about the hotspot-compiler-dev
mailing list