Math.min(II) polluted compilation

Thu Feb 27 16:28:59 UTC 2025

Hi,

While I was testing the long min/max intrinsic as part of [1], I
encountered an oddity when benchmarking Math.min(II) called inside a loop.

As part of that PR I was trying to measure potential regressions that
adding this intrinsic can cause when the code is not vectorized. To test
this:

1. I emulated code not being vectorized by passing in -XX:-UseSuperWord
2. I emulated with/without the intrinsic by disabling the minL/maxL
intrinsic, allowing me to easily test with/without the changes in my PR.

To compare things I also tested with both long and ints. The test is:

```
    public int intReductionSimpleMin(LoopState state) {
        int result = 0;
        for (int i = 0; i < state.size; i++) {
            final int v = state.minIntA[i];
            result = Math.min(result, v);
        }
        return result;
    }
```

The results can sometimes look like this:
```
Benchmark                              (probability)  (size)   Mode  Cnt  -min/-max  +min/+max   Units
MinMaxVector.intReductionSimpleMin               100    2048  thrpt    4    460.530    460.490  ops/ms (2)
MinMaxVector.longReductionSimpleMin              100    2048  thrpt    4    959.507    459.197  ops/ms (-52%)
```

The probability parameter controls the branchiness of Math.min. With the
value of 100 shown above, the benchmark fills the `state.minIntA` array so
that every iteration takes the same side of the branch.
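To make the role of the probability parameter concrete, here is a minimal sketch of how such an array could be filled. It is not the benchmark's actual setup code: the class `BranchyMinData`, the methods `fill` and `countNewMinimums`, and the seed are all hypothetical names chosen for illustration. The assumption is that "probability" means the percentage of iterations in which the new element becomes the running minimum.

```java
import java.util.Random;

// Hypothetical helper, NOT the benchmark's real setup code.
class BranchyMinData {

    // Fills an array so that roughly `probability` percent of the elements
    // are smaller than the running minimum of the elements before them.
    static int[] fill(int size, int probability, long seed) {
        Random random = new Random(seed);
        int[] data = new int[size];
        int runningMin = 0; // mirrors `int result = 0` in the reduction loop
        for (int i = 0; i < size; i++) {
            if (random.nextInt(100) < probability) {
                runningMin--;              // v < result: Math.min picks v
                data[i] = runningMin;
            } else {
                data[i] = runningMin + 1;  // v > result: Math.min keeps result
            }
        }
        return data;
    }

    // Replays the reduction and counts how often the element wins.
    static int countNewMinimums(int[] data) {
        int result = 0;
        int count = 0;
        for (int v : data) {
            if (v < result) {
                count++;
            }
            result = Math.min(result, v);
        }
        return count;
    }

    public static void main(String[] args) {
        for (int p : new int[] {0, 50, 100}) {
            int[] data = fill(2048, p, 42L);
            System.out.println("probability=" + p + " newMinimums="
                    + countNewMinimums(data) + " of " + data.length);
        }
    }
}
```

With probability 100 (or 0) under this scheme, every comparison resolves the same way, which is the fully one-sided case discussed here.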

The odd thing is that, on certain occasions when the code runs scalar and
with the min intrinsic disabled, it runs a lot slower than what is
observed with Math.max(II) or Math.min(LL).

Running with perfasm, I observed that on slow runs the Math.min(II)
version uses cmov instead of cmp+mov. When the branch is this one-sided,
the cmov version is much slower.

This is surprising: since the data in the array is arranged so that one
side of the branch is always taken, one would not expect cmov to be used:

```
Node *PhaseIdealLoop::conditional_move( Node *region ) {
...
  // Check for highly predictable branch.  No point in CMOV'ing if
  // we are going to predict accurately all the time.
  if (C->use_cmove() && (cmp_op == Op_CmpF || cmp_op == Op_CmpD)) {
    //keep going
  } else if (iff->_prob < infrequent_prob ||
      iff->_prob > (1.0f - infrequent_prob))
    return nullptr;
```

The PrintMethodData output for the slow runs shows this:

```
static java.lang.Math::min(II)I
  interpreter_invocation_count:       18171
  invocation_counter:                 18171
...
  0    bci: 2    BranchData         taken(7732) displacement(56)
                                    not taken(10180)
...
org.openjdk.bench.java.lang.MinMaxVector::intReductionSimpleMin(Lorg/openjdk/bench/java/lang/MinMaxVector$LoopState;)I
...
  23 invokestatic 32 <java/lang/Math.min(II)I>
  32   bci: 23   CounterData        count(192512)...
```

There are two odd things there. First, the invocation counter for Math.min
is way lower than the number of times the benchmark invokes Math.min.
Second, the taken/not taken split is nowhere near 100% on either side. I
verified that the data in the array was correct.

What has happened is that Math.min was compiled before the benchmark
code ran, and it was compiled with a different branch profile from the
one the test expects. That is causing Math.min(II) in this scenario to
use cmov.
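As a rough illustration of why this profile defeats the predictability check quoted earlier, the sketch below computes the taken probability from the BranchData counts and applies the same `prob < infrequent_prob || prob > 1 - infrequent_prob` test. The class and method names are hypothetical, and the threshold value of 1e-3 is an assumption made for illustration; the excerpt above does not show how `infrequent_prob` is set, and C2 may adjust profile probabilities before this check.

```java
// Hypothetical sketch; it mirrors only the `else if` condition in the
// conditional_move excerpt, not the rest of C2's heuristics.
class BranchProfileCheck {

    // ASSUMPTION: illustrative threshold, not taken from the post.
    static final float ASSUMED_INFREQUENT_PROB = 1e-3f;

    // Fraction of executions in which the branch was taken.
    static float takenProbability(long taken, long notTaken) {
        return (float) taken / (float) (taken + notTaken);
    }

    // True when the branch is so one-sided that the quoted check bails out
    // (returns nullptr) and no cmov is generated.
    static boolean highlyPredictable(float prob, float infrequentProb) {
        return prob < infrequentProb || prob > (1.0f - infrequentProb);
    }

    public static void main(String[] args) {
        // Counts from the PrintMethodData output for Math.min(II).
        float polluted = takenProbability(7732, 10180);
        System.out.println("polluted profile: prob=" + polluted
                + " highlyPredictable="
                + highlyPredictable(polluted, ASSUMED_INFREQUENT_PROB));

        // What a profile fed only by the benchmark loop would look like if
        // the branch always went the same way (counts are illustrative).
        float oneSided = takenProbability(0, 192512);
        System.out.println("one-sided profile: prob=" + oneSided
                + " highlyPredictable="
                + highlyPredictable(oneSided, ASSUMED_INFREQUENT_PROB));
    }
}
```

With 7732 taken against 10180 not taken, the probability is roughly 0.43, so under this assumed threshold the branch does not count as highly predictable and the cmov conversion is not ruled out by this check.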

So, where are these other Math.min invocations coming from? Looking at the
PrintMethodData it looks like the majority of them come from the Java
Serialization layer that JMH forked processes depend on:

```
static java.util.Arrays::copyOfRange([BII)[B
  73 invokestatic 304 <java/lang/Math.min(II)I>
  416  bci: 73   CounterData        count(6878)

java.io.ObjectOutputStream$BlockDataOutputStream::write([BIIZ)V
  107 invokestatic 64 <java/lang/Math.min(II)I>
  488  bci: 107  CounterData        count(3611)

sun.nio.ch.NioSocketImpl::write([BII)V
  41 invokestatic 255 <java/lang/Math.min(II)I>
  128  bci: 41   CounterData        count(3623)

sun.nio.cs.UTF_8$Encoder::encodeArrayLoop(Ljava/nio/CharBuffer;Ljava/nio/ByteBuffer;)Ljava/nio/charset/CoderResult;
  75 invokestatic 62 <java/lang/Math.min(II)I>
  480  bci: 75   CounterData        count(3599)

sun.nio.cs.StreamEncoder::growByteBufferIfNeeded(I)V
  34 invokestatic 252 <java/lang/Math.min(II)I>
  144  bci: 34   CounterData        count(3597)
```

Although this scenario does not occur under normal circumstances (it
requires the Math.min intrinsic to be disabled), anyone benchmarking
Math.min(II) could potentially see odd results because of unintended
profile pollution from other code that runs in the forked process.

Is there a way to avoid this issue? Can JMH somehow instruct HotSpot to
deopt Math.min(II) before the warmup phase of the benchmark runs to avoid
pollution? Any other ideas?

Thanks
Galder

[1] https://github.com/openjdk/jdk/pull/20098