Comments / metadata in assembly listings don't make sense for code vectorized using Vector API

Paul Sandoz paul.sandoz at oracle.com
Tue Jan 19 20:17:09 UTC 2021


Hi Piotr,

Thanks for further sharing. I am glad you managed to make progress. I was not aware there were some benchmark rules you needed to adhere to.

Re: masks, yes there is still work to do for some mask operations.

Re: execution from the command line. You can run with -XX:-TieredCompilation (Remi, thanks for the correction in the prior email :-) ), and it's also possible to reduce the compilation threshold (at the expense of potentially less accurate profiling information) using, say, -XX:CompileThreshold=1000 (the default is 10000).
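For example, something along these lines (just a sketch; adjust the class
name and size argument to however you launch the benchmark, and either
flag can also be used on its own):

  java --add-modules jdk.incubator.vector \
       -XX:-TieredCompilation -XX:CompileThreshold=1000 \
       mandelbrot_simd_1 <sideLen>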
It’s always a bit tricky to compare a static system (Rust) against a dynamic one that needs to warm up.

Paul.

On Jan 16, 2021, at 4:12 AM, Piotr Tarsa <piotr.tarsa at gmail.com> wrote:

Hi Paul,

Thanks for replying.

As per your advice I've prepared JMH benchmarks. I've also copied some
optimizations from other mandelbrot benchmark implementations and
achieved a further speedup. The new version is here:
https://urldefense.com/v3/__https://gist.github.com/tarsa/7a9c80bb84c2dcd807be9cd16a655ee0/4ced690e20ad561515094995a852adc95820955e__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO8MCOGPNQ$
It also has simplified buffer management and multithreading, so
there's less boilerplate.

Results from JMH are quite good (a bit reordered for clarity):
# Run complete. Total time: 00:16:42
Benchmark                  Mode  Cnt        Score      Error  Units
benchRowMT                thrpt    5    17755,337 ±  279,118  ops/s
benchRowST                thrpt    5     4535,280 ±    7,471  ops/s
benchScalarPairsMT        thrpt    5     4583,354 ±   89,532  ops/s
benchScalarPairsST        thrpt    5     1163,925 ±    0,469  ops/s
benchScalarMT             thrpt    5     2666,210 ±    5,004  ops/s
benchScalarST             thrpt    5      673,234 ±    0,167  ops/s
benchVectorMT             thrpt    5    18020,397 ±   54,230  ops/s
benchVectorST             thrpt    5     4567,873 ±   10,339  ops/s
benchVectorWithTransfer   thrpt    5     4557,361 ±    9,450  ops/s
benchScalarRemainderOnly  thrpt    5  7105989,167 ± 4691,311  ops/s

The mandelbrot Java #2 implementation is named benchScalarPairsMT here
(it's manually unrolled 2x). My new vectorized version (similarly
manually unrolled) is benchVectorMT, and it has about 4x higher
throughput, which is quite good.

However, when I run the benchmark from the command line (in non-JMH
mode) to replicate the benchmark rules, the performance difference is
much smaller. Mandelbrot Java #2 (the unvectorized one) takes about 3s,
while mine takes about 1.2s - 1.3s and sometimes fluctuates up to about
1.5s. It seems that compilation takes up a lot of the benchmark time.
Is it possible to reduce compilation times for code using the Vector API?

On Tue, Jan 5, 2021 at 00:11, Paul Sandoz <paul.sandoz at oracle.com> wrote:

Hi Piotr,

Thanks for experimenting. The Intel folks have also experimented with Mandelbrot generation and might be able to comment with their experience in comparison.

I would be interested to know what your expectations were with regards to speedup.

I expected more than a 2x speedup vs. the scalar version, as I have
256-bit SIMD on my Haswell machine. Luckily, I've managed to achieve
that, as I wrote at the beginning.

It’s hard to evaluate without a JMH benchmark, which can more easily present the code hotspots and isolate them from other areas, such as thread parallelism. My recommendation would be to extract the computeChunksVector kernel and embed it within such a benchmark.
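Something along these lines, as a rough sketch (not the actual benchmark:
computeChunksVector would need to be made accessible to the benchmark
class, and the side length here is only illustrative):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = "--add-modules=jdk.incubator.vector")
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
public class ComputeChunksVectorBench {
    // illustrative row width; use the benchmark's real side length as needed
    static final int SIDE_LEN = 1024;

    double[] aCr;
    long[] rowChunks;

    @Setup
    public void setup() {
        var fac = 2.0 / SIDE_LEN;
        aCr = new double[SIDE_LEN];
        for (var x = 0; x < SIDE_LEN; x++) {
            aCr[x] = x * fac - 1.5;
        }
        rowChunks = new long[SIDE_LEN / 64];
    }

    @Benchmark
    public long[] benchComputeChunksVector() {
        // assumes computeChunksVector is made visible (it is private in the gist)
        mandelbrot_simd_1.computeChunksVector(-0.5, aCr, rowChunks);
        return rowChunks; // return the sink so the JIT cannot eliminate the work
    }
}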

Switching off tiered compilation should help (-XX:-TieredCompilation), i.e. only C2, not C1, so you get to the C2-generated code faster.

Good point. I've made a JMH benchmark (which I've presented at the
beginning) and saw where the inefficiencies are.

To your question about the assembly listing: I believe that as HotSpot goes through its various compiler passes it tries to preserve the bytecode index associated with each generated instruction, but naturally, as the code is lowered, this becomes an approximation, especially so with the Vector API.

Hmmm, it's sad that it's only an approximation.

In the case of "*synchronization entry", this states the pseudo bytecode index just before a method is entered. However, I think there is tech debt here; see

https://urldefense.com/v3/__https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.hpp*L67__;Iw!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO-g6St52w$

And the usages of SynchronizationEntryBCI in hotspot code.

Running in fast debug mode will present a slightly higher-level view of generated code. Here’s a snippet:

26e     vmulpd  XMM12,XMM7,XMM7 ! mul packedD
272     vaddpd  XMM8,XMM0,XMM11    ! add packedD
277     vmulpd  XMM9,XMM8,XMM8 ! mul packedD
27c     vaddpd  XMM0,XMM12,XMM9    ! add packedD
281     vector_compare XMM10,XMM5,XMM0,#3  !
286     # TLS is in R15
286     cmpq    RCX, [R15 + #344 (32-bit)] # raw ptr
28d     jnb,u   B47  P=0.000100 C=-1.000000

…

2cf     vmovdqu [rsp + 64],XMM1    # spill

2d5     B25: # out( B44 B26 ) <- in( B48 B24 )  Freq: 9114.56
2d5
2d5     # checkcastPP of RAX
2d5     vector_store_mask XMM1,XMM10   ! using XMM13 as TEMP
2f4     vector_loadmask_byte XMM15,XMM1

302     vmovdqu XMM1,[rsp + 448]   # spill
30b     vpor    XMM1,XMM1,XMM15    ! or vectors
310     vector_store_mask XMM0,XMM1    ! using XMM14 as TEMP
32e     store_vector [RAX + #16 (8-bit)],XMM0

333     vector_test RCX,XMM1, XMM2 ! using RFLAGS as TEMP
       nop    # 2 bytes pad for loops and calls
340     testl   RCX, RCX
342     je     B44  P=0.108889 C=184362.000000


The mask instructions, such as vector_store_mask, are lowered to a more complex sequence of x86 instructions on AVX2.

Thanks. For now the speedup in JMH is good enough for me, so I won't
dig into the assembly code, but I'll consider a fastdebug Java build
the next time I explore the assembly.

I do notice that the inner loop (upper bound of 5) does unroll (FWIW, making the inner bound a power of 2 is more friendly for unrolling). There also appear to be many register spills, suggesting non-optimal vector register allocation/usage by C2.

I've tried a factor of 4, but that didn't improve performance. I think
it even made it worse: 50 is not a multiple of 4, and the benchmark
rules dictate that there must be exactly 50 iterations, so I needed
extra code for the 50 % 4 = 2 remaining iterations. That probably made
compilation take even longer and worsened the benchmark execution time.

Instead of unrolling the inner loop, I've duplicated all the vectors
(i.e. unrolled the chunk-level loop instead of the inner iteration-level
loop), so I was working on 256 bits / 64 bits * 2 = 8 results at a
time. That worked well, similarly to mandelbrot Java #2.
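The shape of it is roughly as follows (an illustrative sketch, not the
exact code from my gist; the names are made up and the bookkeeping
mirrors computeChunksVector from the listing below). The A and B states
have no data dependency on each other, so their multiply/add chains can
overlap in the CPU pipeline:

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

class ComputeChunksVectorX2Sketch {
   private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256;
   private static final int LANES = SPECIES.length();

   // processes 2 * LANES pixels per trip of the xInc loop by keeping two
   // independent sets of vector state (A and B)
   static void computeChunksVectorX2(double Ci, double[] aCr, long[] rowChunks) {
       var vCi = DoubleVector.broadcast(SPECIES, Ci);
       var vZeroes = DoubleVector.zero(SPECIES);
       var vTwos = DoubleVector.broadcast(SPECIES, 2.0);
       var vFours = DoubleVector.broadcast(SPECIES, 4.0);
       for (var xBase = 0; xBase + 64 <= aCr.length; xBase += 64) {
           var cmpFlags = 0L;
           for (var xInc = 0; xInc < 64; xInc += 2 * LANES) {
               var vZrA = vZeroes; var vZrB = vZeroes;
               var vZiA = vZeroes; var vZiB = vZeroes;
               var vCrA = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc);
               var vCrB = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc + LANES);
               var vZrNA = vZeroes; var vZrNB = vZeroes;
               var vZiNA = vZeroes; var vZiNB = vZeroes;
               var cmpMaskA = VectorMask.fromLong(SPECIES, 0);
               var cmpMaskB = VectorMask.fromLong(SPECIES, 0);
               for (var outer = 0; outer < 10; outer++) {
                   for (var inner = 0; inner < 5; inner++) {
                       vZiA = vTwos.mul(vZrA).mul(vZiA).add(vCi);
                       vZiB = vTwos.mul(vZrB).mul(vZiB).add(vCi);
                       vZrA = vZrNA.sub(vZiNA).add(vCrA);
                       vZrB = vZrNB.sub(vZiNB).add(vCrB);
                       vZiNA = vZiA.mul(vZiA);  vZiNB = vZiB.mul(vZiB);
                       vZrNA = vZrA.mul(vZrA);  vZrNB = vZrB.mul(vZrB);
                   }
                   cmpMaskA = cmpMaskA.or(vFours.lt(vZiNA.add(vZrNA)));
                   cmpMaskB = cmpMaskB.or(vFours.lt(vZiNB.add(vZrNB)));
                   if (cmpMaskA.allTrue() && cmpMaskB.allTrue()) {
                       break;
                   }
               }
               // place each group's escape flags at its lane offset in the chunk
               cmpFlags |= cmpMaskA.toLong() << xInc;
               cmpFlags |= cmpMaskB.toLong() << (xInc + LANES);
           }
           rowChunks[xBase >> 6] = cmpFlags;
       }
   }
}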

I noticed this in the code:

// in Rust version this works fine, so where's the bug then?
// cmpMask = vFours.lt(vZiN.add(vZrN));

What problem did you encounter? It works for me on the tip of https://urldefense.com/v3/__https://github.com/openjdk/jdk__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO80IYo13A$ .

The results were different, but the benchmark dictates that the output
must be bit-for-bit identical to the correct output. I've nailed it
down: numeric overflow makes values go to Double.POSITIVE_INFINITY,
subtracting Infinity from Infinity then yields Double.NaN, and
comparisons involving NaN are always false. Anyway, I've tried to make
a version without `cmpMask = cmpMask.or(newValue)` and made a correct
one, but that turned out to be slower. Perhaps some of the mask
operations are not intrinsified yet, or something. Rust was doing the
comparisons a little bit differently, properly accounting for NaNs.
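For illustration, the effect in isolation (e.g. pasted into jshell) is
just standard IEEE 754 behaviour:

double inf = 1e200 * 1e200;   // overflows to Double.POSITIVE_INFINITY
double nan = inf - inf;       // Infinity - Infinity is NaN
4.0 < nan + nan               // false: every ordered comparison involving NaN is false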

All tests were done on:
openjdk 16-ea 2021-03-16
OpenJDK Runtime Environment (build 16-ea+30-2130)
OpenJDK 64-Bit Server VM (build 16-ea+30-2130, mixed mode, sharing)


Paul.

On Dec 30, 2020, at 6:17 AM, Piotr Tarsa <piotr.tarsa at gmail.com> wrote:

Hi all,

Thanks for creating Project Panama! It looks promising. However, I've
made an attempt to vectorize some code and got somewhat disappointing
results. Therefore I wanted to look at the generated machine code to
see if it looks optimal. I've attached hsdis to the JVM and enabled
assembly printing, but the output doesn't make sense to me, i.e. the
instructions and the comments / metadata don't seem to match. I may be
wrong, as I've very rarely looked at assembly listings produced by the JVM.
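(For reference, assembly printing of this kind can be enabled with
something like the following, with the hsdis library on the JVM's
library path:

  java --add-modules jdk.incubator.vector \
       -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
       mandelbrot_simd_1 <sideLen>

-XX:CompileCommand=print,mandelbrot_simd_1::computeChunksVector can be
used instead of -XX:+PrintAssembly to limit the output to just the
kernel.)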

Performance:
As a baseline I took
https://urldefense.com/v3/__https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-java-2.html__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO9Hidpnxg$
which takes about 3.05s to finish on my system. After vectorization
I've managed to achieve timings of about 1.80s. That's quite
disappointing to me, as I have a Haswell machine with AVX2, high-speed
L1 caches, etc. I've tested on a recent JDK 16 EA build from
https://urldefense.com/v3/__http://jdk.java.net/16/__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO8eAZxJgQ$

Link to the code and assembly listing:
https://urldefense.com/v3/__https://gist.github.com/tarsa/7a9c80bb84c2dcd807be9cd16a655ee0__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO9lqtRFkA$
I'll copy the source code again at the end of this mail.

What I see in the assembly listings is e.g.

0x00007f0e208b8ab9:   cmp    r13d,0x7fffffc0
0x00007f0e208b8ac0:   jg     0x00007f0e208b932c
0x00007f0e208b8ac6:   vmulpd ymm0,ymm6,ymm4
0x00007f0e208b8aca:   vsubpd ymm1,ymm4,ymm4
0x00007f0e208b8ace:   vmovdqu YMMWORD PTR [rsp+0xc0],ymm1
0x00007f0e208b8ad7:   vmulpd ymm0,ymm0,ymm4
;*synchronization entry
                      ; - jdk.internal.vm.vector.VectorSupport$VectorPayload::getPayload@-1 (line 101)
                      ; - jdk.incubator.vector.Double256Vector$Double256Mask::getBits@1 (line 557)
                      ; - jdk.incubator.vector.AbstractMask::toLong@24 (line 77)
                      ; - mandelbrot_simd_1::computeChunksVector@228 (line 187)
0x00007f0e208b8adb:   vaddpd ymm0,ymm0,ymm2               ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
                      ; - jdk.incubator.vector.DoubleVector::fromArray0Template@34 (line 3119)
                      ; - jdk.incubator.vector.Double256Vector::fromArray0@3 (line 777)
                      ; - jdk.incubator.vector.DoubleVector::fromArray@24 (line 2564)
                      ; - mandelbrot_simd_1::computeChunksVector@95 (line 169)
0x00007f0e208b8adf:   vmovdqu YMMWORD PTR [rsp+0xe0],ymm0
0x00007f0e208b8ae8:   vmulpd ymm0,ymm0,ymm0
0x00007f0e208b8aec:   vmovdqu YMMWORD PTR [rsp+0x100],ymm0

How does vmulpd relate to a synchronization entry and
AbstractMask::toLong? It seems way off to me. However, there may be
some trick to understanding it. Could you give me some guidelines on
how to interpret that? Are the comments describing the lines below or
above them?

Regards,
Piotr

mandelbrot_simd_1.java source code:
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class mandelbrot_simd_1 {
  private static final VectorSpecies<Double> SPECIES =
          DoubleVector.SPECIES_PREFERRED.length() <= 8 ?
                  DoubleVector.SPECIES_PREFERRED : DoubleVector.SPECIES_512;

  private static final int LANES = SPECIES.length();

  private static final int LANES_LOG = Integer.numberOfTrailingZeros(LANES);

  public static void main(String[] args) throws IOException {
      if ((LANES > 8) || (LANES != (1 << LANES_LOG))) {
          var errorMsg = "LANES must be a power of two and at most 8. " +
                  "Change SPECIES in the source code.";
          throw new RuntimeException(errorMsg);
      }
      var sideLen = Integer.parseInt(args[0]);
      try (var out = new BufferedOutputStream(makeOut1())) {
          out.write(String.format("P4\n%d %d\n", sideLen, sideLen).getBytes());
          computeAndOutputRows(out, sideLen);
      }
  }

  @SuppressWarnings("unused")
  // the version that avoids mixing up output with JVM diagnostic messages
  private static OutputStream makeOut1() throws IOException {
      return Files.newOutputStream(Path.of("mandelbrot_simd_1.pbm"));
  }

  // the version that is compatible with benchmark requirements
  private static OutputStream makeOut2() {
      return System.out;
  }

  private static void computeAndOutputRows(OutputStream out, int sideLen) {
      var poolFactor = 1000000 / sideLen;
      if (poolFactor < 10) {
          throw new RuntimeException("Too small poolFactor");
      }
      var numCpus = Runtime.getRuntime().availableProcessors();
      var rowsPerBatch = numCpus * poolFactor;
      var fac = 2.0 / sideLen;
      var aCr = IntStream.range(0, sideLen).parallel()
              .mapToDouble(x -> x * fac - 1.5).toArray();
      var bitsReversalMapping = computeBitsReversalMapping();
      var rowsPools = new byte[2][rowsPerBatch][(sideLen + 7) / 8];
      var rowsChunksPools = new long[2][rowsPerBatch][sideLen / 64];
      var batchSizes = new int[2];
      var batchCountDowns = new CountDownLatch[2];
      var computeEc = Executors.newWorkStealingPool(numCpus);
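      // Two batches are double-buffered: while the worker pool computes one
      // batch, the master thread below writes out the batch that has already
      // finished, then the roles flip (poolId ^= 1).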
      var masterThread = new Thread(() -> {
          var rowsToProcess = sideLen;
          var nextBatchStart = 0;
          batchSizes[0] = 0;
          batchCountDowns[0] = new CountDownLatch(0);
          for (var poolId = 0; rowsToProcess > 0; poolId ^= 1) {
              while (batchCountDowns[poolId].getCount() != 0) {
                  try {
                      batchCountDowns[poolId].await();
                  } catch (InterruptedException ignored) {
                  }
              }
              batchCountDowns[poolId] = null;

              var nextBatchSize =
                      Math.min(sideLen - nextBatchStart, rowsPerBatch);
              var nextPoolId = poolId ^ 1;
              batchSizes[nextPoolId] = nextBatchSize;
              batchCountDowns[nextPoolId] = new CountDownLatch(nextBatchSize);
              sendTasks(fac, aCr, bitsReversalMapping,
                      rowsPools[nextPoolId], rowsChunksPools[nextPoolId],
                      nextBatchStart, nextBatchSize,
                      batchCountDowns[nextPoolId], computeEc);
              nextBatchStart += nextBatchSize;

              var batchSize = batchSizes[poolId];
              try {
                  for (var rowIdx = 0; rowIdx < batchSize; rowIdx++) {
                      out.write(rowsPools[poolId][rowIdx]);
                  }
                  out.flush();
              } catch (IOException e) {
                  e.printStackTrace();
                  System.exit(-1);
              }
              rowsToProcess -= batchSize;
          }

          computeEc.shutdown();
      });
      masterThread.start();
      while (masterThread.isAlive() || !computeEc.isTerminated()) {
          try {
              @SuppressWarnings("unused")
              var ignored = computeEc.awaitTermination(1, TimeUnit.DAYS);
              masterThread.join();
          } catch (InterruptedException ignored) {
          }
      }
  }

  private static void sendTasks(double fac, double[] aCr,
                                byte[] bitsReversalMapping,
                                byte[][] rows, long[][] rowsChunks,
                                int batchStart, int batchSize,
                                CountDownLatch poolsActiveWorkersCount,
                                ExecutorService computeEc) {
      for (var i = 0; i < batchSize; i++) {
          var indexInBatch = i;
          var y = batchStart + i;
          var Ci = y * fac - 1.0;
          computeEc.submit(() -> {
              try {
                  computeRow(Ci, aCr, bitsReversalMapping,
                          rows[indexInBatch], rowsChunks[indexInBatch]);
                  poolsActiveWorkersCount.countDown();
              } catch (Exception e) {
                  e.printStackTrace();
                  System.exit(-1);
              }
          });
      }
  }

  private static byte[] computeBitsReversalMapping() {
      var bitsReversalMapping = new byte[256];
      for (var i = 0; i < 256; i++) {
          bitsReversalMapping[i] = (byte) (Integer.reverse(i) >>> 24);
      }
      return bitsReversalMapping;
  }

  private static void computeRow(double Ci, double[] aCr,
                                 byte[] bitsReversalMapping,
                                 byte[] row, long[] rowChunks) {
      computeChunksVector(Ci, aCr, rowChunks);
      transferRowFlags(rowChunks, row, bitsReversalMapping);
      computeRemainderScalar(aCr, row, Ci);
  }

  private static void computeChunksVector(double Ci, double[] aCr,
                                          long[] rowChunks) {
      var sideLen = aCr.length;
      var vCi = DoubleVector.broadcast(SPECIES, Ci);
      var vZeroes = DoubleVector.zero(SPECIES);
      var vTwos = DoubleVector.broadcast(SPECIES, 2.0);
      var vFours = DoubleVector.broadcast(SPECIES, 4.0);
      var zeroMask = VectorMask.fromLong(SPECIES, 0);
      // (1 << 6) = 64 = length of long in bits
      for (var xBase = 0; xBase < (sideLen & ~((1 << 6) - 1)); xBase += (1 << 6)) {
          var cmpFlags = 0L;
          for (var xInc = 0; xInc < (1 << 6); xInc += LANES) {
              var vZr = vZeroes;
              var vZi = vZeroes;
              var vCr = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc);
              var vZrN = vZeroes;
              var vZiN = vZeroes;
              var cmpMask = zeroMask;
              for (var outer = 0; outer < 10; outer++) {
                  for (var inner = 0; inner < 5; inner++) {
                      vZi = vTwos.mul(vZr).mul(vZi).add(vCi);
                      vZr = vZrN.sub(vZiN).add(vCr);
                      vZiN = vZi.mul(vZi);
                      vZrN = vZr.mul(vZr);
                  }
                  cmpMask = cmpMask.or(vFours.lt(vZiN.add(vZrN)));
                  // in Rust version this works fine, so where's the bug then?
                  // cmpMask = vFours.lt(vZiN.add(vZrN));
                  if (cmpMask.allTrue()) {
                      break;
                  }
              }
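              // pack this group's per-lane escape flags into the 64-bit
              // chunk at bit offset xInc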
              cmpFlags |= cmpMask.toLong() << xInc;
          }
          rowChunks[xBase >> 6] = cmpFlags;
      }
  }

  private static void transferRowFlags(long[] rowChunks, byte[] row,
                                       byte[] bitsReversalMapping) {
      for (var i = 0; i < rowChunks.length; i++) {
          var group = ~rowChunks[i];
          row[i * 8 + 7] = bitsReversalMapping[0xff & (byte) (group >>> 56)];
          row[i * 8 + 6] = bitsReversalMapping[0xff & (byte) (group >>> 48)];
          row[i * 8 + 5] = bitsReversalMapping[0xff & (byte) (group >>> 40)];
          row[i * 8 + 4] = bitsReversalMapping[0xff & (byte) (group >>> 32)];
          row[i * 8 + 3] = bitsReversalMapping[0xff & (byte) (group >>> 24)];
          row[i * 8 + 2] = bitsReversalMapping[0xff & (byte) (group >>> 16)];
          row[i * 8 + 1] = bitsReversalMapping[0xff & (byte) (group >>> 8)];
          row[i * 8] = bitsReversalMapping[0xff & (byte) group];
      }
  }

  private static void computeRemainderScalar(double[] aCr, byte[] row,
                                             double Ci) {
      var sideLen = aCr.length;
      var bits = 0;
      for (var x = sideLen & ~((1 << 6) - 1); x < sideLen; x++) {
          var Zr = 0.0;
          var Zi = 0.0;
          var Cr = aCr[x];
          var i = 50;
          var ZrN = 0.0;
          var ZiN = 0.0;
          do {
              Zi = 2.0 * Zr * Zi + Ci;
              Zr = ZrN - ZiN + Cr;
              ZiN = Zi * Zi;
              ZrN = Zr * Zr;
          } while (ZiN + ZrN <= 4.0 && --i > 0);
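          // the pixel is in the set (bit = 1) only if it never escaped
          // within the 50 iterations, i.e. the countdown reached 0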
          bits <<= 1;
          bits += i == 0 ? 1 : 0;
          if (x % 8 == 7) {
              row[x / 8] = (byte) bits;
              bits = 0;
          }
      }
      if (sideLen % 8 != 0) {
          row[sideLen / 8] = (byte) bits;
      }
  }
}


