RFR: 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int
Emanuel Peter
epeter at openjdk.org
Wed May 17 14:18:51 UTC 2023
On Thu, 11 May 2023 09:18:21 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:
> The canonicalization of MinI/MaxI chains into right-spline graphs within `MinINode/MaxINode::Ideal()` inhibits the vectorization of reductions of these nodes. This changeset reworks `MinINode/MaxINode::Ideal()` to perform the same algebraic optimizations without the need for canonicalization, re-enabling auto-vectorization of MinI/MaxI reductions. This is achieved by handling all four permutations of the targeted Ideal subgraph induced by the commutativity of MinI/MaxI directly. The algorithm (for the MaxI case, the MinI case is analogous) tries to apply two Ideal graph rewrites in the following order, where `c0` and `c1` are constants and `MAX` is a compile-time operation:
> 1. `max(x + c0, max(x + c1, z))` (or a permutation of it) to `max(x + MAX(c0, c1), z)`.
> 2. `max(x + c0, x + c1)` (or a permutation of it) to `x + MAX(c0, c1)`.
>
> Here is an example of the four permutations handled in step 1 with `x = RShiftI`, `c0 = 100` or `150`, `c1 = 150` or `100`, and `z = ConI (#int:200)`:
>
> 
>
> Here is an example of the two permutations handled in step 2 with `x = RShiftI`, `c0 = 10` or `11`, and `c1 = 11` or `10`:
>
> 
>
> The changeset implements `MinINode/MaxINode::Ideal()` in a common method `MaxNode::IdealI()`, since the algorithm is symmetric for both node types. The changeset also extends the existing MinI/MaxI Idealization tests with positive tests for all targeted permutations and negative tests, and adds a new test (contributed by @jbhateja) to assert that MinI/MaxI reductions are vectorized.
>
> #### Testing
>
> ##### Functionality
>
> - tier1-5, stress test, fuzzing (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64; release and debug mode).
>
> ##### Performance
>
> - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008). No significant change was observed.
src/hotspot/share/opto/addnode.cpp line 1132:
> 1130: // four possible permutations given by opcode's commutativity) into
> 1131: // opcode(x + opcode(x_off, y_off), z), where opcode is either MinI or MaxI,
> 1132: // if x == y and the additions can't overflow.
Ok, effectively we have 5, not just 4 cases here:
opcode(x + x_off, opcode(y + y_off, z))
opcode(x + x_off, opcode(z, y + y_off))
opcode(opcode(y + y_off, z), x + x_off)
opcode(opcode(z, y + y_off), x + x_off)
opcode(x + x_off, y + y_off)
I find the nested for-loop quite confusing. Maybe packing the inner stuff into a separate function could work?
// Check for opcode(x + x_con, y + y_con), no z
if (in(1)->Opcode() == Op_AddI && in(2)->Opcode() == Op_AddI) {
Node* ret = try_fold(opcode, in(1), in(2), nullptr);
if (ret != nullptr) { return ret; }
}
// Check for these 4 cases, equivalent to opcode3(addx, addy, z)
// opcode(x + x_con, opcode(y + y_con, z))
// opcode(x + x_con, opcode(z, y + y_con))
// opcode(opcode(y + y_con, z), x + x_con)
// opcode(opcode(z, y + y_con), x + x_con)
for (uint i = 1; i < 2; i++) {
Node* addx = in(i);
Node* other = in(i == 1 ? 2 : 1); // or just "2-i"
if (addx->Opcode() != Op_AddI || other->Opcode() != opcode) { continue; }
for (uint j = 1; i < 2; j++) {
Node* addy = other->in(j);
Node* z = other->in(j == 1 ? 2 : 1);
if (addy->Opcode() != Op_AddI) { continue; }
// We have opcode3(addx, addy, z)
Node* ret = try_fold(opcode, addx, addy, z);
if (ret != nullptr) { return ret; }
}
}
Where we have
Node* try_fold(int opcode, Node* addx, Node* addy, Node* z = nullptr) {
jint addx_con = 0;
jint addy_con = 0;
Node* addx_var = as_add_constant(addx, &addx_con);
Node* addy_var = as_add_constant(addy, &addy_con);
if (addx_var == nullptr || addy_var == nullptr) {
// found a top
return nullptr;
}
// could even check addx_var != addy_var, then we don't have to do that inside...
Node* folded = extract_addition(phase, addx_var, addx_con, addy_var, addy_con, opcode);
if (z != nullptr) {
folded = opcode(folded, z);
}
return folded;
}
Maybe this does a few more calls to `as_add_constant` than strictly necessary, but it is a bit easier to understand, right?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/13924#discussion_r1196592919
More information about the hotspot-compiler-dev
mailing list