RFR: 8259609: C2: optimize long range checks in long counted loops [v4]

John R Rose jrose at openjdk.java.net
Mon Jun 7 18:44:30 UTC 2021


On Wed, 27 Jan 2021 08:36:11 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> JDK-8255150 makes it possible for Java code to explicitly perform a
>> range check on long values. JDK-8223051 provides a transformation of
>> long counted loops into loop nests with an inner int counted
>> loop. With this change I propose transforming long range checks that
>> operate on the iv of a long counted loop into range checks that
>> operate on the iv of the int inner loop once it has been
>> created. Existing range check eliminations can then kick in.
>> 
>> Transformation of range checks is piggybacked on the loop nest
>> creation for two reasons:
>> 
>> - pattern matching range checks is easier right before the loop nest
>>   is created
>> 
>> - the number of iterations of the inner loop is adjusted so scale *
>>   inner_iv doesn't overflow
>> 
>> C2 has logic to delay some split if transformations so they don't
>> break the scale * iv + offset pattern. I reused that logic for long
>> range checks and had to relax what's considered a range check because
>> initially a range check from Objects.checkIndex() may include a test
>> for range > 0 that needs a round of loop opts to be hoisted. I realize
>> there's some code duplication, but I didn't see a way to share logic
>> between IdealLoopTree::may_have_range_check() and
>> IdealLoopTree::policy_range_check() that would feel right.
>> 
>> I realize the comment in PhaseIdealLoop::transform_long_range_checks()
>> is scary. FWIW, it's not as complicated as it looks. I found that
>> drawing the range covered by the entire long loop and the range covered
>> by the inner loop helps to see how range checks can be transformed.
>> Then the comment helps make sure all cases are covered and verify that
>> the generated code actually covers all of them.
>> 
>> One issue is overflow. I think the fact that inner_iv * scale doesn't
>> overflow helps simplify things. One possible overflow is that of scale
>> * upper + offset, which is handled by forcing all range checks in that
>> case to deoptimize. I don't think other cases of overflow need special
>> handling.
>> 
>> This was tested with a Memory Segment micro benchmark (and patched
>> Memory Segment support to take advantage of the new checkIndex
>> intrinsic, both provided by Maurizio). Range checks in the micro
>> benchmark are properly optimized (and performance increases
>> significantly).
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision:
> 
>  - Tobias' comments
>  - Merge branch 'master' into JDK-8259609
>  - min_jint overflow fix
>  - Revert "assert(static_cast<T1>(result) == thing) fix"
>    
>    This reverts commit e234477df097475d503ea6f94ab6a258132d165e.
>  - Merge branch 'master' into JDK-8259609
>  - assert(static_cast<T1>(result) == thing) fix
>  - whitespaces
>  - build fix + test cleanup
>  - range checks

Thanks for your patience.  I went and wrote down the operations
in my own notation in order to reason about them more clearly.
I proceeded to do so as a sort of "background process".
My apologies for a less-than-timely response!
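To fix ideas, here is (in Java) the shape of loop the transformation targets: a long counted loop whose body performs an explicit long range check via the long overload of Objects.checkIndex from JDK-8255150. The array and bounds are made-up illustration values:

```java
// Hypothetical example of the target pattern: a long counted loop
// whose range check uses Objects.checkIndex(long, long).
import java.util.Objects;

public class LongLoopExample {
    static long sum(long[] a, long start, long stop) {
        long s = 0;
        for (long i = start; i < stop; i++) {
            // explicit long range check; the patch rewrites this
            // against the iv of the inner int loop, after which the
            // existing range check eliminations apply
            int idx = (int) Objects.checkIndex(i, a.length);
            s += a[idx];
        }
        return s;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4, 5};
        System.out.println(sum(a, 1, 4)); // a[1]+a[2]+a[3] = 9
    }
}
```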

I found a bug that happens when 64-bit overflow occurs on
the range check operands.  The fix (in my notation) involves
biasing the comparison operations by Integer.MIN_VALUE
when 64-bit overflow is possible.
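The underlying identity (stated here for N2 = 32, i.e., Java int) is that an unsigned comparison equals a signed comparison with both operands biased by Integer.MIN_VALUE; a quick sanity check:

```java
public class BiasDemo {
    // x <u y  iff  (x + MIN_VALUE) <s (y + MIN_VALUE); adding the bias
    // flips the sign bit, mapping unsigned order onto signed order
    static boolean lessUnsignedViaBias(int x, int y) {
        return (x + Integer.MIN_VALUE) < (y + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, -42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples)
            for (int y : samples)
                if (lessUnsignedViaBias(x, y) != (Integer.compareUnsigned(x, y) < 0))
                    throw new AssertionError(x + " vs " + y);
        System.out.println("unsigned-compare-via-bias identity holds");
    }
}
```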

Here is my reasoning.

Notation:  int:N declares a mathematical value which fits
into the range of a signed N-bit register.  N.someop is an
operation that consumes and produces int:N values.
Regardless of N notation, all values are ideal mathematical
integers.  Non-annotated operations are ideal mathematical
operations.

We generalize from 64/32 to N/N2 in order to make it clear
that the formulas do not rely on specific properties of
hardware registers.
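For concreteness, at N = 64 and N2 = 32 these primitives are just Java long/int operations; a possible rendering (names mine):

```java
public class NotationDemo {
    // N2.signExtend for N2 = 32: truncate to 32 bits (the int cast),
    // then sign-extend back to the wider value (implicit widening)
    static long signExtend32(long v) { return (int) v; }

    // N2.cmpu for N2 = 32: unsigned comparison of the low 32 bits
    static int cmpu32(long x, long y) {
        return Integer.compareUnsigned((int) x, (int) y);
    }

    public static void main(String[] args) {
        // 0x1_0000_0001 truncates to 1, which stays 1 after sign extension
        System.out.println(signExtend32(0x100000001L)); // 1
        // 0xFFFF_FFFF truncates to all ones, i.e. -1
        System.out.println(signExtend32(0xFFFFFFFFL));  // -1
    }
}
```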


executeLoopPieceWise(N, N2; A, B, S, K, L, R) := {
  int:N2 maxB2 = computeLargestB2(N2; S, K)
  for (i = A;;) {
    int:N C = i  //C = first element of each strip-mined piece

    // pick a strip i = [C..C+nextB2] simulated by j = [0..nextB2]
    int:N2 B2 = maybeReplaceWithLastB2(N2; maxB2, C, B, S)

    // choose parameters for optimized comparison:
    (int:N2 Q, int:N2 R2) = chooseQandR2(N, N2; C, B2, S, K, L, R)

    for (int:N2 j = 0;
       (S > 0 ? j <= B2 : j >= B2);
       j += S) {

      // compute i*K + L <u R, in range 

      // unoptimized comparison:
      currentI = N.add(C, j)
      indexExpr = N.add(N.mul(currentI, K), L)
      c0 = N.cmpu(indexExpr, R);

      // optimized comparison:
      int:N test = N2.signExtend(N2.add(N2.mul(j, K), Q))
      c = N2.cmpu(test, R2)

      // valid optimization iff X <u Y is same in N and N2
      assert ((c < 0) == (c0 < 0))
    }

    // increment to the end (inclusive) of the strip
    i += B2

    // loop control at bottom (A != B not A < B)
    if (i == B)  break

    // increment to the beginning of next strip (or past the end)
    i += S
    assert(S > 0 ? i <= B : i >= B)
  }
}
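As a sanity check on the control flow above (ignoring the range checks and specializing to S = 1, with A <= B), the driver can be rendered in Java; names are mine:

```java
public class StripMineDemo {
    // Count iterations of a strip-mined long loop (S = 1) so we can
    // compare against the plain long loop it replaces. maxB2 is the
    // full strip length; the last strip is shortened to end at B.
    static long stripMined(long A, long B, int maxB2) {
        long count = 0;
        for (long i = A;;) {
            long C = i;
            // maybeReplaceWithLastB2: shorten on overflow or overshoot
            long B2 = (C + maxB2 < C || C + maxB2 > B) ? B - C : maxB2;
            for (long j = 0; j <= B2; j++) {
                count++;         // loop body would run here with i = C + j
            }
            i += B2;             // advance to the end of the strip
            if (i == B) break;   // loop control at bottom
            i += 1;              // S = 1: step to the start of the next strip
        }
        return count;
    }

    public static void main(String[] args) {
        // the plain loop for (long i = A; i <= B; i++) runs B - A + 1 times
        System.out.println(stripMined(0, 1000, 64)); // 1001
        System.out.println(stripMined(5, 5, 64));    // 1
    }
}
```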

// pick the biggest strip mine size we can handle with N2 < N
computeLargestB2(N2; S, K) -> int:N2 := {
  // we will need this much room to maneuver
  assert(abs(K)*(abs(S)+2) + abs(S) <= N2.MAX_VALUE);

  int:N2 B2 = ((((N2.MAX_VALUE - abs(S)) / abs(K)) - 1) / abs(S)) * S
  if (B2 > 0)
      assert(B2 > 0 && B2 * K <= N2.MAX_VALUE - abs(S))
  else
      assert(B2 < 0 && B2 * K >= N2.MIN_VALUE + abs(S))
  assert(B2 % S == 0)
  assert(B2 == N2.signExtend(B2))
  return B2
}
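A Java rendering at N2 = 32, with the postconditions kept as asserts (enable with -ea); the long intermediates avoid int overflow while checking them:

```java
public class LargestB2 {
    // computeLargestB2 for N2 = 32 (Java int): the largest strip
    // length B2 (a multiple of S, same sign as S) such that B2 * K
    // stays clear of int overflow with |S| of headroom.
    static int computeLargestB2(int S, int K) {
        long absS = Math.abs((long) S), absK = Math.abs((long) K);
        // precondition from the pseudocode: room to maneuver
        assert absK * (absS + 2) + absS <= Integer.MAX_VALUE;
        long B2 = (((Integer.MAX_VALUE - absS) / absK - 1) / absS) * S;
        assert B2 % S == 0;
        assert Math.abs(B2 * K) <= Integer.MAX_VALUE - absS; // headroom holds
        return (int) B2;
    }

    public static void main(String[] args) {
        System.out.println(computeLargestB2(1, 1));  // 2147483645
        System.out.println(computeLargestB2(3, -5)); // 429496725
    }
}
```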

// If our stop is coming up soon, use a shorter B2 value at the very end.
// This is a little tricky to compute, since the big B2 could overflow.
maybeReplaceWithLastB2(N2; B2, C, B, S) -> int:N2 := {
  // C is the current 'i' value
  if (N.add(C, B2) < C || // <- that happens on overflow
      N.add(C, B2) > B) { // <- this is the normal way we overshoot
      int:N shortB2 = (S > 0 ? B - C : C - B)
      assert(shortB2 == N2.signExtend(shortB2))
      return shortB2
  }
  // for all but the last iteration, just use the same large B2 value
  return B2
}
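For S > 0 this reduces to the following (Java, names of my own choosing):

```java
public class LastStrip {
    // For S > 0: if the full strip [C .. C+B2] would overflow or
    // overshoot B, shorten it so the strip ends exactly at B.
    static long lastB2(long B2, long C, long B) {
        if (C + B2 < C ||  // overflow
            C + B2 > B)    // normal overshoot
            return B - C;
        return B2;         // all but the last iteration keep the big B2
    }

    public static void main(String[] args) {
        System.out.println(lastB2(64, 0, 1000));   // 64 (full strip fits)
        System.out.println(lastB2(64, 975, 1000)); // 25 (shortened)
    }
}
```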

// Pick an R2 value like R, but close to the range [L..H].
// We assume that R >= 0 so that R-1 does not underflow.
clamp(N; R, L, H) -> int:N := N.max(L, N.min(R-1, H)+1)
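In Java (N = 64) the clamp is a one-liner; the sample values below show the three regimes:

```java
public class Clamp {
    // clamp(N; R, L, H) = max(L, min(R-1, H) + 1); assumes R >= 0
    // so that R - 1 cannot underflow
    static long clamp(long R, long L, long H) {
        return Math.max(L, Math.min(R - 1, H) + 1);
    }

    public static void main(String[] args) {
        System.out.println(clamp(50, 10, 100));  // 50  (R already in range)
        System.out.println(clamp(5, 10, 100));   // 10  (clamped up to L)
        System.out.println(clamp(500, 10, 100)); // 101 (clamped down to H+1)
    }
}
```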

// Given a loop strip we are about to execute, pick
// Q and R2 values to substitute in j*K + Q <uN2 R2 as
// replacement for i*K + L <uN R for i in [C..C+B2].
chooseQandR2(N, N2; C, B2, S, K, L, R) -> (int:N2, int:N2) := {
  // Start with 64-bit values:
  //   i*K + L <u64 R
  //   (C+j)*K + L <u64 R
  //   j*K + L2 <u64 R    where L2 = C*K+L
  L2 = N.add(N.mul(C, K), L)

  // Compute endpoints of the range of values j*K.
  //  Q_min = (j=0)*K + L2;  Q_max = (j=B2)*K + L2
  Q_min = L2
  Q_max = N.add(N.mul(B2, K), L2)
  //if (S * K < 0)  (Q_min, Q_max) = (Q_max, Q_min)

  if (N.sgn(Q_min) < 0) {
    // Case A: Some Negatives (but no overflow).
    assert(Q_max > Q_min)  // no N-bit overflow (b/c N2-bit range)
    // Number line:
    // |s64_min   .    .    .    0    .    .    .   s64_max|
    // |    .  Q_min..Q_max .    0    .    .    .     .    |  s64 negative
    // |    .     .    .  Q_min..0..Q_max  .    .     .    |  small mixed

    // if Q_min <s64 0, then use this test:
    // j*K + Q_min <u32 clamp(R, 0, Q_max)
    Q = N2.signExtend(Q_min)
    R2 = clamp(N; R, 0, Q_max)
    assert(R2 == 0 || Q == Q_min)

  } else if (N.sgn(Q_min) >= 0 && N.sgn(Q_max) >= 0) {
    // Case B: No Negatives.
    // Number line:
    // |s64_min   .    .    .    0    .    .    .   s64_max|
    // |    .     .    .    .    0 Q_min..Q_max .     .    |  small positive
    // |    .     .    .    .    0    . Q_min..Q_max  .    |  s64 positive

    // if both Q_min, Q_max >=s64 0, then use this test:
    // j*K + 0 <u32 clamp(R, Q_min, Q_max) - Q_min
    Q = 0
    R2 = N.sub(clamp(N; R, Q_min, Q_max), Q_min)

  } else {
    // Case C: Overflow in the int:N domain
    assert(Q_max < Q_min)  // wrap around the wrong way!
    // Number line:
    // |..Q_max-2^64   .    .    0    .    .    .   Q_min..|  s64 overflow

    // if Q_min >=s64 0 but Q_max <s64 0, then just truncate
    // and add s32_min to both sides of the test from Case A:
    // j*K + s32(Q_min) + s32_min  <u32  s32(R') + s32_min
    //   where  R' = clamp(R, Q_min, R)

    Rprime   = clamp(N; R, Q_min, R)
    assert(Rprime == R || Rprime == Q_min)

    // flip values to the overflow region of int:N2
    Q  = N.add(Q_min,  N2.MIN_VALUE)
    R2 = N.add(Rprime, N2.MIN_VALUE)

    // normalize (N-N2+1) sign bits after int:N2 overflow
    // (for the following assert, and later comparisons)
    Q  = N2.signExtend(Q);
    R2 = N2.signExtend(R2);
  }

  // Exercise:  Fold the above three cases together
  // using min/max/mask logic to a flow-free formula.
  // Start with the common use of the clamp operation.

  // The resulting mathematical values must fit into
  // an N2-bit register.
  assert(Q  == N2.signExtend(Q))
  assert(R2 == N2.signExtend(R2))
  return (Q, R2)
}
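To convince myself the Case C bias is right end to end, here is a brute-force check at N = 64, N2 = 32 (Java long/int). The parameter values are made up, chosen so that Q_min >= 0 but Q_min + B2 overflows 64 bits, and K = 1 so Q_min plays the role of C*K + L:

```java
public class CaseCDemo {
    // Check (c < 0) == (c0 < 0) for every j in [0..B2] in the
    // s64-overflow case, using the biased-and-truncated Q and R2.
    static boolean allMatch(long Qmin, long B2, long R) {
        long Rprime = Math.max(Qmin, Math.min(R - 1, R) + 1); // clamp(R, Qmin, R)
        int Q  = (int) (Qmin   + Integer.MIN_VALUE);  // bias, then truncate;
        int R2 = (int) (Rprime + Integer.MIN_VALUE);  // the int cast sign-extends
        for (long j = 0; j <= B2; j++) {
            boolean c0 = Long.compareUnsigned(Qmin + j, R) < 0;        // 64-bit test (wraps past 2^63)
            boolean c  = Integer.compareUnsigned((int) j + Q, R2) < 0; // 32-bit biased test
            if (c != c0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Qmin >= 0 but Qmin + B2 wraps negative: the s64 overflow case
        System.out.println(allMatch(Long.MAX_VALUE - 10, 100, Long.MAX_VALUE - 5)); // true
    }
}
```

This exercises only one sample of Case C (with R > Q_min, so R' = R); it is a sanity check, not a proof.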

-------------

PR: https://git.openjdk.java.net/jdk/pull/2045


More information about the hotspot-compiler-dev mailing list