RFR: 8259609: C2: optimize long range checks in long counted loops [v4]
John R Rose
jrose at openjdk.java.net
Mon Jun 7 18:44:30 UTC 2021
On Wed, 27 Jan 2021 08:36:11 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> JDK-8255150 makes it possible for java code to explicitly perform a
>> range check on long values. JDK-8223051 provides a transformation of
>> long counted loops into loop nests with an inner int counted
>> loop. With this change I propose transforming long range checks that
>> operate on the iv of a long counted loop into range checks that
>> operate on the iv of the int inner loop once it has been
>> created. Existing range check eliminations can then kick in.
>>
>> Transformation of range checks is piggy backed on the loop nest
>> creation for 2 reasons:
>>
>> - pattern matching range checks is easier right before the loop nest
>> is created
>>
>> - the number of iterations of the inner loop is adjusted so scale *
>> inner_iv doesn't overflow
>>
>> C2 has logic to delay some split if transformations so they don't
>> break the scale * iv + offset pattern. I reused that logic for long
>> range checks and had to relax what's considered a range check because
>> initially a range check from Object.checkIndex() may include a test
>> for range > 0 that needs a round of loop opts to be hoisted. I realize
>> there's some code duplication but I didn't see a way to share logic
>> between IdealLoopTree::may_have_range_check()
>> IdealLoopTree::policy_range_check() that would feel right.
>>
>> I realize the comment in PhaseIdealLoop::transform_long_range_checks()
>> is scary. FWIW, it's not as complicated as it looks. I found drawing
>> the range covered by the entire long loop and the range covered by the
>> inner loop help see how range checks can be transformed. Then the
>> comment helps make sure all cases are covered and verify the generated
>> code actually covers all of them.
>>
>> One issue is overflow. I think the fact that inner_iv * scale doesn't
>> overflow helps simplify thing. One possible overflow is that of scale
>> * upper + offset which is handled by forcing all range checks in that
>> case to deoptimize. I don't think other case of overflow needs special
>> handling.
>>
>> This was tested with a Memory Segment micro benchmark (and patched
>> Memory Segment support to take advantage of the new checkIndex
>> intrinsic, both provided by Maurizio). Range checks in the micro
>> benchmark are properly optimized (and performance increases
>> significantly).
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision:
>
> - Tobias' comments
> - Merge branch 'master' into JDK-8259609
> - min_jint overflow fix
> - Revert "assert(static_cast<T1>(result) == thing) fix"
>
> This reverts commit e234477df097475d503ea6f94ab6a258132d165e.
> - Merge branch 'master' into JDK-8259609
> - assert(static_cast<T1>(result) == thing) fix
> - whitespaces
> - build fix + test cleanup
> - range checks
Thanks for your patience. I went and wrote down the operations
in my own notation in order to reason about them more clearly.
I proceeded to do so as a sort of "background process".
My apologies for a less-than-timely response!
I found a bug that happens when 64-bit overflow occurs on
the range check operands. The fix (in my notation) involves
biasing the comparison operations by Integer.MIN_VALUE
when 64-bit overflow is possible.
Here is my reasoning.
Notation: int:N declares a mathematical value which fits
into the range of a signed N-bit register. N.someop is an
operation that consumes and produces int:N values.
Regardless of N notation, all values are ideal mathematical
integers. Non-annotated operations are ideal mathematical
operations.
We generalize from 64/32 to N/N2 in order to make it clear
that the formulas do not rely on specific properties of
hardware registers.
executeLoopPieceWise(N, N2; A, B, S, K, L, R) := {
int:N2 maxB2 = computeLargestB2(N2; S, K)
for (i = A;;) {
int:N C = i //C = first element of each strip-mined piece
// pick a strip i = [C..C+nextB2] simulated by j = [0..nextB2]
int:N2 B2 = maybeReplaceWithLastB2(N2; maxB2, C, B, S)
// choose parameters for optimized comparison:
(int:N2 Q, int:N2 R2) = chooseQandR2(N, N2; C, B2, S, K, L, R, N2)
for (int:N2 j = 0;
(S > 0 ? j <= B2 : j > B2);
j += S) {
// compute i*K + L <u R, in range
// unoptimized comparison:
currentI = N.add(C, j)
indexExpr = N.add(N.mul(currentI, K), L)
c0 = N.cmpu(indexExpr, R);
// optimized comparison:
long test = N2.signExtend(N2.add(N2.mul(j, K), Q))
int c = N2.cmpu(test, R2)
// valid optimization iff X <u Y is same in N and N2
assert ((c < 0) == (c0 < 0))
}
// increment to the end (inclusive) of the strip
i += B2
// loop control at bottom (A != B not A < B)
if (i == B) break
// increment to the beginning of next strip (or past the end)
i += S
assert(S > 0 ? i <= B : i >= B)
}
}
// pick the biggest strip mine size we can handle with N2 < N
computeLargestB2(N2; S, K) -> int:N2 := {
// we will need this much room to maneuver
assert(abs(K)*(abs(S)+2) + abs(S) <= N2.MAX_VALUE);
int:N B2 = ((((N2.MAX_VALUE - abs(S)) / abs(K)) - 1) / abs(S)) * S
if (B2 > 0)
assert(B2 > 0 && B2 * K <= N2.MAX_VALUE - abs(S))
else
assert(B2 < 0 && B2 * K >= N2.MIN_VALUE + abs(S))
assert(B2 % S == 0)
assert(B2 == N2.signExtend(B2))
return B2
}
// If our stop is coming up soon, use a shorter B2 value at the very end.
// This is a little tricky to compute, since the big B2 could overflow.
maybeReplaceWithLastB2(N2; B2, C, B, S) -> int:N2 := {
// C is the current 'i' value
if (N.add(C, B2) < C || // <- that happens on overflow
N.add(C, B2) > B) { // <- this is the normal way we overshoot
int:N shortB2 = (S > 0 ? B - C : C - B)
assert(shortB2 == N2.signExtend(shortB2))
return shortB2
}
// for all but the last iteration, just use the same large B2 value
return B2
}
// Pick an R2 value like R, but close to the range [L..H].
// We assume that R >= 0 so that R-1 does not underflow.
clamp(N; R, L, H) -> int:N := N.max(L, N.min(R-1, H)+1)
// Given a loop strip we are about to execute, pick
// Q and R2 values to substitute in j*K + Q <uN2 R2 as
// replacement for i*K + L <uN R2 for i in [C..C+B2].
chooseQandR2(N, N2; C, B2, S, K, L, R) -> (int:N2, int:N2) := {
// Start with 64-bit values:
// i*K + L <u64 R
// (C+j)*K + L <u64 R
// j*K + L2 <u64 R where L2 = C*K+L
L2 = N.add(N.mul(C, K), L)
// Compute endpoints of the range of values j*K.
// Q_min = (j=0)*K + L2; Q_max = (j=B2)*K + L2
Q_min = L2
Q_max = N.add(N.mul(B2, K), L2)
//if (S * K < 0) (Q_min, Q_max) = (Q_max, Q_min)
if (N.sgn(Q_min) < 0) {
// Case A: Some Negatives (but no overflow).
assert(Q_max > Q_min) // no N-bit overflow (b/c N2-bit range)
// Number line:
// |s64_min . . . 0 . . . s64_max|
// | . Q_min..Q_max . 0 . . . . | s64 negative
// | . . . Q_min..0..Q_max . . . | small mixed
// if Q_min <s64 0, then use this test:
// j*K + Q_min <u32 clamp(R, 0, Q_max)
Q = N2.signExtend(Q_min)
R2 = clamp(N; R, 0, Q_max)
assert(R2 == 0 || Q == Q_min)
} else if (N.sgn(Q_min) >= 0 && N.sgn(Q_max) >= 0) {
// Case B: No Negatives.
// Number line:
// |s64_min . . . 0 . . . s64_max|
// | . . . . 0 Q_min..Q_max . . | small positive
// | . . . . 0 . Q_min..Q_max . | s64 positive
// if both Q_min, Q_max >=s64 0, then use this test:
// j*K + 0 <u32 clamp(R, Q_min, Q_max) - Q_min
Q = 0
R2 = N.sub(clamp(N; R, Q_min, Q_max), Q_min)
} else {
// Case C: Overflow in the int:N domain
assert(Q_max < Q_min) // wrap around the wrong way!
// Number line:
// |..Q_max-2^64 . . 0 . . . Q_min..| s64 overflow
// if Q_min >=s64 0 but Q_max <s64 0, then just truncate
// and add s32_min to both sides of the test from Case A:
// j*K + s32(Q_min) + s32_min <u32 s32(R') + s32_min
// where R' = clamp(R, Q_min, R)
Rprime = clamp(N; R, Q_min, R)
assert(Rprime == R || Rprime == Q_min)
// flip values to the overflow region of int:N2
Q = N.add(Q_min, N2.MIN_VALUE)
R2 = N.add(Rprime, N2.MIN_VALUE)
// normalize (N-N2+1) sign bits after int:N2 overflow
// (for the following assert, and later comparisons)
Q = N2.signExtend(Q);
R2 = N2.signExtend(R2);
}
// Exercise: Fold the above three cases together
// using min/max/mask logic to a flow-free formula.
// Start with the common use of the clamp operation.
// The resulting mathematical values must fit into
// an N2-bit register.
assert(Q == N2.signExtend(Q))
assert(R2 == N2.signExtend(R2))
return (Q, R2)
}
-------------
PR: https://git.openjdk.java.net/jdk/pull/2045
More information about the hotspot-compiler-dev
mailing list