RFR: 8302673: [SuperWord] MaxReduction and MinReduction should vectorize for int

Wed May 17 14:18:51 UTC 2023

On Thu, 11 May 2023 09:18:21 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

> The canonicalization of MinI/MaxI chains into right-spline graphs within `MinINode/MaxINode::Ideal()` inhibits the vectorization of reductions of these nodes. This changeset reworks `MinINode/MaxINode::Ideal()` to perform the same algebraic optimizations without the need for canonicalization, re-enabling auto-vectorization of MinI/MaxI reductions. This is achieved by handling all four permutations of the targeted Ideal subgraph induced by the commutativity of MinI/MaxI directly. The algorithm (for the MaxI case, the MinI case is analogous) tries to apply two Ideal graph rewrites in the following order, where `c0` and `c1` are constants and `MAX` is a compile-time operation:
> 1. `max(x + c0, max(x + c1, z))` (or a permutation of it) to `max(x + MAX(c0, c1), z)`.
> 2. `max(x + c0, x + c1)` (or a permutation of it) to `x + MAX(c0, c1)`.
> 
> Here is an example of the four permutations handled in step 1 with `x = RShiftI`, `c0 = 100` or `150`, `c1 = 150` or `100`, and `z = ConI (#int:200)`:
> 
> ![two-level-idealization](https://github.com/openjdk/jdk/assets/8792647/bf60a2c3-39cd-4f0d-965d-c711723e374c)
> 
> Here is an example of the two permutations handled in step 2 with `x = RShiftI`, `c0 = 10` or `11`, and `c1 = 11` or `10`:
> 
> ![one-level-idealization](https://github.com/openjdk/jdk/assets/8792647/0a1fe85b-3f30-46bc-8817-d90b3eff946c)
> 
> The changeset implements `MinINode/MaxINode::Ideal()` in a common method `MaxNode::IdealI()`, since the algorithm is symmetric for both node types. The changeset also extends the existing MinI/MaxI Idealization tests with positive tests for all targeted permutations and negative tests, and adds a new test (contributed by @jbhateja) to assert that MinI/MaxI reductions are vectorized.
> 
> #### Testing
> 
> ##### Functionality
> 
> - tier1-5, stress test, fuzzing (windows-x64, linux-x64, linux-aarch64, macosx-x64, macosx-aarch64; release and debug mode).
> 
> ##### Performance
> 
> - Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008). No significant change was observed.

src/hotspot/share/opto/addnode.cpp line 1132:

> 1130:   // four possible permutations given by opcode's commutativity) into
> 1131:   // opcode(x + opcode(x_off, y_off), z), where opcode is either MinI or MaxI,
> 1132:   // if x == y and the additions can't overflow.

Ok, effectively we have 5, not just 4 cases here:

opcode(x + x_off, opcode(y + y_off, z))
opcode(x + x_off, opcode(z, y + y_off))
opcode(opcode(y + y_off, z), x + x_off)
opcode(opcode(z, y + y_off), x + x_off)
opcode(x + x_off, y + y_off)

I find the nested for-loop quite confusing. Maybe packing the inner stuff into a separate function could work?

// Check for opcode(x + x_con,  y + y_con), no z
if (in(1)->Opcode() == Op_AddI && in(2)->Opcode() == Op_AddI) {
  Node* ret = try_fold(opcode, in(1), in(2), nullptr);
  if (ret != nullptr) { return ret; }
}

// Check for these 4 cases, equivalent to opcode3(addx, addy, z)
// opcode(x + x_con, opcode(y + y_con, z))
// opcode(x + x_con, opcode(z, y + y_con))
// opcode(opcode(y + y_con, z), x + x_con)
// opcode(opcode(z, y + y_con), x + x_con)
for (uint i = 1; i < 2; i++) {
  Node* addx = in(i);
  Node* other = in(i == 1 ? 2 : 1); // or just "2-i"
  if (addx->Opcode() != Op_AddI || other->Opcode() != opcode) { continue; }
  for (uint j = 1; i < 2; j++) {
    Node* addy = other->in(j);
    Node* z = other->in(j == 1 ? 2 : 1);
    if (addy->Opcode() != Op_AddI) { continue; }
    // We have opcode3(addx, addy, z)
    Node* ret = try_fold(opcode, addx, addy, z);
    if (ret != nullptr) { return ret; }
  }
}

Where we have

Node* try_fold(int opcode, Node* addx, Node* addy, Node* z = nullptr) {
  jint addx_con = 0;
  jint addy_con = 0;
  Node* addx_var = as_add_constant(addx, &addx_con);
  Node* addy_var = as_add_constant(addy, &addy_con);
  if (addx_var == nullptr || addy_var == nullptr) {
    // found a top
    return nullptr;
  }
  // could even check addx_var != addy_var, then we don't have to do that inside...
  Node* folded = extract_addition(phase, addx_var, addx_con, addy_var, addy_con, opcode);
  if (z != nullptr) {
    folded = opcode(folded, z);
  }
  return folded;
}

Maybe this does a few more calls to `as_add_constant` than strictly necessary, but it is a bit easier to understand, right?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13924#discussion_r1196592919