Math.min(II) polluted compilation
Galder Zamarreno
galder at redhat.com
Thu Feb 27 16:28:59 UTC 2025
Hi,
While I was working on testing min/max long intrinsic as part of [1] I
encountered an oddity benchmarking Math.min(II) when called inside a loop.
As part of that PR I was trying to measure potential regressions that
adding this intrinsic can cause when the code is not vectorized. To test
this:
1. I emulated code not being vectorized by passing in -XX:-UseSuperWord
2. I emulated with/without the intrinsic by disabling the minL/maxL
intrinsic, allowing me to easily test with/without the changes in my PR.
For comparison I tested with both longs and ints. The int test is:
```
public int intReductionSimpleMin(LoopState state) {
    int result = 0;
    for (int i = 0; i < state.size; i++) {
        final int v = state.minIntA[i];
        result = Math.min(result, v);
    }
    return result;
}
```
The results can sometimes look like this:
```
Benchmark                            (probability)  (size)   Mode  Cnt  -min/-max  +min/+max   Units
MinMaxVector.intReductionSimpleMin             100    2048  thrpt    4    460.530    460.490  ops/ms  (2)
MinMaxVector.longReductionSimpleMin            100    2048  thrpt    4    959.507    459.197  ops/ms  (-52%)
```
The probability parameter controls the branchiness of Math.min. With the
value of 100 shown above, the benchmark fills the `state.minIntA` array such
that every iteration takes the same side of the branch.
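As a rough illustration of what "probability" means here, the sketch below fills an array so that a given percentage of elements become a new running minimum. This is not the actual MinMaxVector setup code: the class `BranchData` and the helpers `fillForMin` and `countNewMinimums` are hypothetical names made up for this example, and the real benchmark may generate its data differently.

```java
import java.util.Random;

// Hypothetical helper class, NOT the actual MinMaxVector setup code.
class BranchData {

    // Fills an array so that roughly `probability` percent of the elements
    // are a new running minimum (assumed meaning of the benchmark parameter).
    static int[] fillForMin(int size, int probability, long seed) {
        Random random = new Random(seed);
        int[] values = new int[size];
        int currentMin = 0; // the benchmark loop also starts with result = 0
        for (int i = 0; i < size; i++) {
            if (random.nextInt(100) < probability) {
                currentMin--;               // strictly smaller: new minimum
                values[i] = currentMin;
            } else {
                values[i] = currentMin + 1; // larger: minimum is unchanged
            }
        }
        return values;
    }

    // Counts how often the running minimum is replaced, i.e. how often the
    // "new value is smaller" side of the comparison is taken.
    static int countNewMinimums(int[] values) {
        int result = 0;
        int updates = 0;
        for (int v : values) {
            if (v < result) {
                result = v;
                updates++;
            }
        }
        return updates;
    }
}
```

With a probability of 100 every element replaces the running minimum, and with 0 none does, so in both cases the comparison inside Math.min always goes the same way.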
The odd thing is that, on certain occasions, when the code runs scalar and
with the min intrinsic disabled, it is a lot slower than what is observed
with Math.max(II) or Math.min(JJ).
By running with perfasm I observed that on slow runs the Math.min(II)
version uses cmov instead of cmp+mov. When the branch is this one-sided,
the cmov version is much slower.
Since the data in the array is laid out such that one side of the branch
is always taken, one would not expect cmov to be used:
```
Node *PhaseIdealLoop::conditional_move( Node *region ) {
  ...
  // Check for highly predictable branch. No point in CMOV'ing if
  // we are going to predict accurately all the time.
  if (C->use_cmove() && (cmp_op == Op_CmpF || cmp_op == Op_CmpD)) {
    //keep going
  } else if (iff->_prob < infrequent_prob ||
             iff->_prob > (1.0f - infrequent_prob))
    return nullptr;
```
Looking at the PrintMethodData output for the slow runs shows this:
```
static java.lang.Math::min(II)I
interpreter_invocation_count: 18171
invocation_counter: 18171
...
0 bci: 2 BranchData taken(7732) displacement(56)
not taken(10180)
...
org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
...
23 invokestatic 32 <java/lang/Math.min(II)I>
32 bci: 23 CounterData count(192512)...
```
There are two odd things there. First, the invocation counter for Math.min
is way lower than the number of times the benchmark invokes Math.min. Second,
the taken/not taken split is nowhere near 100% on either side. I verified
that the data in the array was correct.
What has happened is that Math.min was profiled and compiled before the
benchmark code ran, with a branch profile different from the one the test
expects. That causes Math.min(II) to use cmov in this scenario.
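To make the arithmetic concrete, here is a small sketch that applies the same kind of predictability test as the `conditional_move` excerpt to the BranchData counts shown above. The class `BranchProfile` and its methods are hypothetical names for this example, and the threshold value used in `main` is an assumption for illustration only, not the value HotSpot actually computes for `infrequent_prob`.

```java
// Hypothetical illustration of the predictability test, not HotSpot code.
class BranchProfile {

    // Fraction of executions in which the branch was taken.
    static double takenProbability(long taken, long notTaken) {
        long total = taken + notTaken;
        if (total == 0) {
            throw new IllegalArgumentException("no profile data");
        }
        return (double) taken / (double) total;
    }

    // Mirrors the shape of the check in the excerpt above: a branch counts
    // as highly predictable when its probability is very close to 0 or 1.
    static boolean isHighlyPredictable(double prob, double infrequentProb) {
        return prob < infrequentProb || prob > 1.0 - infrequentProb;
    }

    public static void main(String[] args) {
        double assumedThreshold = 0.001; // assumption, for illustration only

        // Counts from the polluted Math.min(II) profile shown above.
        double polluted = takenProbability(7732, 10180);
        System.out.println("polluted profile: prob=" + polluted
                + " predictable=" + isHighlyPredictable(polluted, assumedThreshold));

        // What a profile fed only by the benchmark at probability 100 could
        // look like: the branch always goes the same way.
        double clean = takenProbability(0, 192512);
        System.out.println("one-sided profile: prob=" + clean
                + " predictable=" + isHighlyPredictable(clean, assumedThreshold));
    }
}
```

With the polluted counts the taken probability is about 0.43, so the branch does not look predictable and the early return that would skip the cmov conversion is not reached. A profile fed only by the one-sided benchmark data would pass the predictability test and keep the branch.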
So, where are these other Math.min invocations coming from? Looking at the
PrintMethodData it looks like the majority of them come from the Java
Serialization layer that JMH forked processes depend on:
```
static java.util.Arrays::copyOfRange([BII)[B
73 invokestatic 304 <java/lang/Math.min(II)I>
416 bci: 73 CounterData count(6878)
java.io.ObjectOutputStream$BlockDataOutputStream::write([BIIZ)V
107 invokestatic 64 <java/lang/Math.min(II)I>
488 bci: 107 CounterData count(3611)
sun.nio.ch.NioSocketImpl::write([BII)V
41 invokestatic 255 <java/lang/Math.min(II)I>
128 bci: 41 CounterData count(3623)
sun.nio.cs.UTF_8$Encoder::encodeArrayLoop(Ljava/nio/CharBuffer;Ljava/nio/ByteBuffer;)Ljava/nio/charset/CoderResult;
75 invokestatic 62 <java/lang/Math.min(II)I>
480 bci: 75 CounterData count(3599)
sun.nio.cs.StreamEncoder::growByteBufferIfNeeded(I)V
34 invokestatic 252 <java/lang/Math.min(II)I>
144 bci: 34 CounterData count(3597)
```
Although this test setup does not occur under normal circumstances (it
requires the Math.min intrinsic to be disabled), anyone benchmarking
Math.min(II) could potentially see odd results caused by unintended profile
pollution from other code that runs in the forked process.
Is there a way to avoid this issue? Can JMH somehow instruct HotSpot to
deopt Math.min(II) before the warmup phase of the benchmark runs to avoid
pollution? Any other ideas?
Thanks
Galder
[1] https://github.com/openjdk/jdk/pull/20098