Comments / metadata in assembly listings don't make sense for code vectorized using Vector API
Remi Forax
forax at univ-mlv.fr
Tue Jan 5 09:23:33 UTC 2021
----- Original Message -----
> From: "Paul Sandoz" <paul.sandoz at oracle.com>
> To: "Piotr Tarsa" <piotr.tarsa at gmail.com>
> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Tuesday, January 5, 2021 00:11:43
> Subject: Re: Comments / metadata in assembly listings don't make sense for code vectorized using Vector API
> Hi Piotr,
>
> Thanks for experimenting. The Intel folks have also experimented with Mandelbrot
> generation and may be able to comment on how their experience compares.
>
> I would be interested to know what your expectations were with regard to
> speedup.
>
> It’s hard to evaluate without a JMH benchmark, which can more easily present
> the code hotspots and isolate them from other areas, such as thread
> parallelism. My recommendation would be to extract the computeChunksVector
> kernel and embed it within such a benchmark, along the lines of the sketch
> below.
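>
> For illustration, a minimal JMH harness for the kernel (a sketch, untested; it
> assumes computeChunksVector is made accessible to the benchmark class, and the
> class and parameter names here are mine, not anything prescribed):
>
>     import java.util.concurrent.TimeUnit;
>     import org.openjdk.jmh.annotations.*;
>
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.MICROSECONDS)
>     @State(Scope.Thread)
>     @Warmup(iterations = 5)
>     @Measurement(iterations = 5)
>     @Fork(value = 1, jvmArgsAppend = {"--add-modules", "jdk.incubator.vector"})
>     public class ComputeChunksVectorBench {
>
>         @Param({"1024"})
>         int sideLen;
>
>         double[] aCr;       // per-column Cr values, as in the original program
>         long[] rowChunks;   // 64 pixels per long, as in the original program
>
>         @Setup
>         public void setup() {
>             var fac = 2.0 / sideLen;
>             aCr = new double[sideLen];
>             for (var x = 0; x < sideLen; x++) {
>                 aCr[x] = x * fac - 1.5;
>             }
>             rowChunks = new long[sideLen / 64];
>         }
>
>         @Benchmark
>         public long[] kernel() {
>             // Ci is arbitrary here; any row of the image would do.
>             mandelbrot_simd_1.computeChunksVector(-0.5, aCr, rowChunks);
>             return rowChunks;   // return the result to defeat dead-code elimination
>         }
>     }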
>
> Switching off tired compilation should help (-XX:-TiredCompilation) i.e. only
> C2, not C1, in getting to the C2 generated code faster.
tiered compilation / -XX:-TieredCompilation
:)
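
For the record, a full command line to get at the C2 code would be something
like this (assuming hsdis is on the library path and a JDK 16 EA build; 16000
is the image size used by the benchmarksgame):

  java -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions \
      -XX:+PrintAssembly --add-modules jdk.incubator.vector \
      mandelbrot_simd_1.java 16000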
>
> To your question about the assembly listing: I believe that as HotSpot goes
> through various compiler passes, it tries to preserve the bytecode index
> associated with generated instructions, but naturally, as code is lowered,
> this becomes an approximation, especially so with the Vector API.
>
> In the case of "*synchronization entry", this is stating the pseudo bytecode
> index just before a method is entered. However, I think there is tech debt
> here; see
>
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.hpp#L67
>
> And the usages of SynchronizationEntryBCI in hotspot code.
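>
> (SynchronizationEntryBCI is defined as InvocationEntryBci, i.e. -1, which is
> why frames such as getPayload at -1 show up in the listing below.)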
>
> Running in fastdebug mode will present a slightly higher-level view of the
> generated code. Here’s a snippet:
>
> 26e vmulpd XMM12,XMM7,XMM7 ! mul packedD
> 272 vaddpd XMM8,XMM0,XMM11 ! add packedD
> 277 vmulpd XMM9,XMM8,XMM8 ! mul packedD
> 27c vaddpd XMM0,XMM12,XMM9 ! add packedD
> 281 vector_compare XMM10,XMM5,XMM0,#3 !
> 286 # TLS is in R15
> 286 cmpq RCX, [R15 + #344 (32-bit)] # raw ptr
> 28d jnb,u B47 P=0.000100 C=-1.000000
>
> …
>
> 2cf vmovdqu [rsp + 64],XMM1 # spill
>
> 2d5 B25: # out( B44 B26 ) <- in( B48 B24 ) Freq: 9114.56
> 2d5
> 2d5 # checkcastPP of RAX
> 2d5 vector_store_mask XMM1,XMM10 ! using XMM13 as TEMP
> 2f4 vector_loadmask_byte XMM15,XMM1
>
> 302 vmovdqu XMM1,[rsp + 448] # spill
> 30b vpor XMM1,XMM1,XMM15 ! or vectors
> 310 vector_store_mask XMM0,XMM1 ! using XMM14 as TEMP
> 32e store_vector [RAX + #16 (8-bit)],XMM0
>
> 333 vector_test RCX,XMM1, XMM2 ! using RFLAGS as TEMP
> nop # 2 bytes pad for loops and calls
> 340 testl RCX, RCX
> 342 je B44 P=0.108889 C=184362.000000
>
>
> The mask pseudo-instructions, such as vector_store_mask, each stand in for a
> more complex sequence of x86 instructions on AVX2.
>
> I do notice that the inner loop (upper bound of 5) does unroll (FWIW, making
> the inner bound a power of 2 is more friendly for unrolling; see the sketch
> below). There also appear to be many register spills, suggesting non-optimal
> vector register allocation/usage by C2.
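>
> For illustration, one way to keep the 50-iteration budget while giving the
> unroller a power-of-two inner bound is to redistribute the nesting (a sketch,
> untested; the escape check then runs every 2 iterations instead of every 5,
> which should produce the same mask, since escaped orbits stay escaped and
> cmpMask accumulates with or):
>
>     for (var outer = 0; outer < 25; outer++) {
>         for (var inner = 0; inner < 2; inner++) {  // 2 is a power of two
>             vZi = vTwos.mul(vZr).mul(vZi).add(vCi);
>             vZr = vZrN.sub(vZiN).add(vCr);
>             vZiN = vZi.mul(vZi);
>             vZrN = vZr.mul(vZr);
>         }
>         cmpMask = cmpMask.or(vFours.lt(vZiN.add(vZrN)));
>         if (cmpMask.allTrue()) {
>             break;
>         }
>     }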
>
> —
>
> I noticed this in the code:
>
> // in Rust version this works fine, so where's the bug then?
> // cmpMask = vFours.lt(vZiN.add(vZrN));
>
> What problem did you encounter? It works for me on the tip of
> https://github.com/openjdk/jdk.
>
> Paul.
Rémi
>
>> On Dec 30, 2020, at 6:17 AM, Piotr Tarsa <piotr.tarsa at gmail.com> wrote:
>>
>> Hi all,
>>
>> Thanks for creating Project Panama! It looks promising. However, I've
>> made an attempt to vectorize some code and got somewhat disappointing
>> results. Therefore I wanted to look at the generated machine code to
>> see if it looks optimal. I've attached hsdis to the JVM and enabled
>> assembly printing, but the output doesn't make sense to me, i.e. the
>> instructions and comments / metadata don't seem to match. I may be
>> wrong, as I've very rarely looked at assembly listings produced by the JVM.
>>
>> Performance:
>> As a baseline I took
>> https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-java-2.html
>> which takes about 3.05s to finish on my system. After vectorization
>> I've managed to achieve timings of around 1.80s. That's quite
>> disappointing to me, as I have a Haswell machine with AVX2, high-speed
>> L1 caches, etc. I've tested on a recent JDK 16 EA build from
>> http://jdk.java.net/16/
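>>
>> For context: 3.05s / 1.80s is only about a 1.7x speedup, whereas AVX2
>> provides 4 double lanes per 256-bit vector, so I had hoped for more.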
>>
>> Link to the code and assembly listing:
>> https://gist.github.com/tarsa/7a9c80bb84c2dcd807be9cd16a655ee0 I'll
>> copy the source code again in this mail at the end.
>>
>> What I see in the assembly listings is e.g.
>>
>> 0x00007f0e208b8ab9: cmp r13d,0x7fffffc0
>> 0x00007f0e208b8ac0: jg 0x00007f0e208b932c
>> 0x00007f0e208b8ac6: vmulpd ymm0,ymm6,ymm4
>> 0x00007f0e208b8aca: vsubpd ymm1,ymm4,ymm4
>> 0x00007f0e208b8ace: vmovdqu YMMWORD PTR [rsp+0xc0],ymm1
>> 0x00007f0e208b8ad7: vmulpd ymm0,ymm0,ymm4
>> ;*synchronization entry
>> ; -
>> jdk.internal.vm.vector.VectorSupport$VectorPayload::getPayload at -1
>> (line 101)
>> ; -
>> jdk.incubator.vector.Double256Vector$Double256Mask::getBits at 1 (line
>> 557)
>> ; -
>> jdk.incubator.vector.AbstractMask::toLong at 24 (line 77)
>> ; -
>> mandelbrot_simd_1::computeChunksVector at 228 (line 187)
>> 0x00007f0e208b8adb: vaddpd ymm0,ymm0,ymm2 ;*checkcast
>> {reexecute=0 rethrow=0 return_oop=0}
>> ; -
>> jdk.incubator.vector.DoubleVector::fromArray0Template at 34 (line 3119)
>> ; -
>> jdk.incubator.vector.Double256Vector::fromArray0 at 3 (line 777)
>> ; -
>> jdk.incubator.vector.DoubleVector::fromArray at 24 (line 2564)
>> ; -
>> mandelbrot_simd_1::computeChunksVector at 95 (line 169)
>> 0x00007f0e208b8adf: vmovdqu YMMWORD PTR [rsp+0xe0],ymm0
>> 0x00007f0e208b8ae8: vmulpd ymm0,ymm0,ymm0
>> 0x00007f0e208b8aec: vmovdqu YMMWORD PTR [rsp+0x100],ymm0
>>
>> How does vmulpd relate to a synchronization entry and
>> AbstractMask::toLong? It seems way off to me. However, there may be
>> some trick to understanding it. Could you give me some guidelines on
>> how to interpret that? Are the comments describing the lines below or
>> above them?
>>
>> Regards,
>> Piotr
>>
>> mandelbrot_simd_1.java source code:
>> import jdk.incubator.vector.DoubleVector;
>> import jdk.incubator.vector.VectorMask;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> import java.io.BufferedOutputStream;
>> import java.io.IOException;
>> import java.io.OutputStream;
>> import java.nio.file.Files;
>> import java.nio.file.Path;
>> import java.util.concurrent.CountDownLatch;
>> import java.util.concurrent.ExecutorService;
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.TimeUnit;
>> import java.util.stream.IntStream;
>>
>> public class mandelbrot_simd_1 {
>> private static final VectorSpecies<Double> SPECIES =
>> DoubleVector.SPECIES_PREFERRED.length() <= 8 ?
>> DoubleVector.SPECIES_PREFERRED : DoubleVector.SPECIES_512;
>>
>> private static final int LANES = SPECIES.length();
>>
>> private static final int LANES_LOG = Integer.numberOfTrailingZeros(LANES);
>>
>> public static void main(String[] args) throws IOException {
>> if ((LANES > 8) || (LANES != (1 << LANES_LOG))) {
>> var errorMsg = "LANES must be a power of two and at most 8. " +
>> "Change SPECIES in the source code.";
>> throw new RuntimeException(errorMsg);
>> }
>> var sideLen = Integer.parseInt(args[0]);
>> try (var out = new BufferedOutputStream(makeOut1())) {
>> out.write(String.format("P4\n%d %d\n", sideLen,
>> sideLen).getBytes());
>> computeAndOutputRows(out, sideLen);
>> }
>> }
>>
>> @SuppressWarnings("unused")
>> // the version that avoids mixing up output with JVM diagnostic messages
>> private static OutputStream makeOut1() throws IOException {
>> return Files.newOutputStream(Path.of("mandelbrot_simd_1.pbm"));
>> }
>>
>> // the version that is compatible with benchmark requirements
>> private static OutputStream makeOut2() {
>> return System.out;
>> }
>>
>> private static void computeAndOutputRows(OutputStream out, int sideLen) {
>> var poolFactor = 1000000 / sideLen;
>> if (poolFactor < 10) {
>> throw new RuntimeException("Too small poolFactor");
>> }
>> var numCpus = Runtime.getRuntime().availableProcessors();
>> var rowsPerBatch = numCpus * poolFactor;
>> var fac = 2.0 / sideLen;
>> var aCr = IntStream.range(0, sideLen).parallel()
>> .mapToDouble(x -> x * fac - 1.5).toArray();
>> var bitsReversalMapping = computeBitsReversalMapping();
>> var rowsPools = new byte[2][rowsPerBatch][(sideLen + 7) / 8];
>> var rowsChunksPools = new long[2][rowsPerBatch][sideLen / 64];
>> var batchSizes = new int[2];
>> var batchCountDowns = new CountDownLatch[2];
>> var computeEc = Executors.newWorkStealingPool(numCpus);
>> var masterThread = new Thread(() -> {
>> var rowsToProcess = sideLen;
>> var nextBatchStart = 0;
>> batchSizes[0] = 0;
>> batchCountDowns[0] = new CountDownLatch(0);
>> for (var poolId = 0; rowsToProcess > 0; poolId ^= 1) {
>> while (batchCountDowns[poolId].getCount() != 0) {
>> try {
>> batchCountDowns[poolId].await();
>> } catch (InterruptedException ignored) {
>> }
>> }
>> batchCountDowns[poolId] = null;
>>
>> var nextBatchSize =
>> Math.min(sideLen - nextBatchStart, rowsPerBatch);
>> var nextPoolId = poolId ^ 1;
>> batchSizes[nextPoolId] = nextBatchSize;
>> batchCountDowns[nextPoolId] = new CountDownLatch(nextBatchSize);
>> sendTasks(fac, aCr, bitsReversalMapping,
>> rowsPools[nextPoolId], rowsChunksPools[nextPoolId],
>> nextBatchStart, nextBatchSize,
>> batchCountDowns[nextPoolId], computeEc);
>> nextBatchStart += nextBatchSize;
>>
>> var batchSize = batchSizes[poolId];
>> try {
>> for (var rowIdx = 0; rowIdx < batchSize; rowIdx++) {
>> out.write(rowsPools[poolId][rowIdx]);
>> }
>> out.flush();
>> } catch (IOException e) {
>> e.printStackTrace();
>> System.exit(-1);
>> }
>> rowsToProcess -= batchSize;
>> }
>>
>> computeEc.shutdown();
>> });
>> masterThread.start();
>> while (masterThread.isAlive() || !computeEc.isTerminated()) {
>> try {
>> @SuppressWarnings("unused")
>> var ignored = computeEc.awaitTermination(1, TimeUnit.DAYS);
>> masterThread.join();
>> } catch (InterruptedException ignored) {
>> }
>> }
>> }
>>
>> private static void sendTasks(double fac, double[] aCr,
>> byte[] bitsReversalMapping,
>> byte[][] rows, long[][] rowsChunks,
>> int batchStart, int batchSize,
>> CountDownLatch poolsActiveWorkersCount,
>> ExecutorService computeEc) {
>> for (var i = 0; i < batchSize; i++) {
>> var indexInBatch = i;
>> var y = batchStart + i;
>> var Ci = y * fac - 1.0;
>> computeEc.submit(() -> {
>> try {
>> computeRow(Ci, aCr, bitsReversalMapping,
>> rows[indexInBatch], rowsChunks[indexInBatch]);
>> poolsActiveWorkersCount.countDown();
>> } catch (Exception e) {
>> e.printStackTrace();
>> System.exit(-1);
>> }
>> });
>> }
>> }
>>
>> private static byte[] computeBitsReversalMapping() {
>> var bitsReversalMapping = new byte[256];
>> for (var i = 0; i < 256; i++) {
>> bitsReversalMapping[i] = (byte) (Integer.reverse(i) >>> 24);
>> }
>> return bitsReversalMapping;
>> }
>>
>> private static void computeRow(double Ci, double[] aCr,
>> byte[] bitsReversalMapping,
>> byte[] row, long[] rowChunks) {
>> computeChunksVector(Ci, aCr, rowChunks);
>> transferRowFlags(rowChunks, row, bitsReversalMapping);
>> computeRemainderScalar(aCr, row, Ci);
>> }
>>
>> private static void computeChunksVector(double Ci, double[] aCr,
>> long[] rowChunks) {
>> var sideLen = aCr.length;
>> var vCi = DoubleVector.broadcast(SPECIES, Ci);
>> var vZeroes = DoubleVector.zero(SPECIES);
>> var vTwos = DoubleVector.broadcast(SPECIES, 2.0);
>> var vFours = DoubleVector.broadcast(SPECIES, 4.0);
>> var zeroMask = VectorMask.fromLong(SPECIES, 0);
>> // (1 << 6) = 64 = length of long in bits
>> for (var xBase = 0; xBase < (sideLen & ~(1 << 6)); xBase += (1 << 6)) {
>> var cmpFlags = 0L;
>> for (var xInc = 0; xInc < (1 << 6); xInc += LANES) {
>> var vZr = vZeroes;
>> var vZi = vZeroes;
>> var vCr = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc);
>> var vZrN = vZeroes;
>> var vZiN = vZeroes;
>> var cmpMask = zeroMask;
>> for (var outer = 0; outer < 10; outer++) {
>> for (var inner = 0; inner < 5; inner++) {
>> vZi = vTwos.mul(vZr).mul(vZi).add(vCi);
>> vZr = vZrN.sub(vZiN).add(vCr);
>> vZiN = vZi.mul(vZi);
>> vZrN = vZr.mul(vZr);
>> }
>> cmpMask = cmpMask.or(vFours.lt(vZiN.add(vZrN)));
>> // in Rust version this works fine, so where's the bug then?
>> // cmpMask = vFours.lt(vZiN.add(vZrN));
>> if (cmpMask.allTrue()) {
>> break;
>> }
>> }
>> cmpFlags |= cmpMask.toLong() << xInc;
>> }
>> rowChunks[xBase >> 6] = cmpFlags;
>> }
>> }
>>
>> private static void transferRowFlags(long[] rowChunks, byte[] row,
>> byte[] bitsReversalMapping) {
>> for (var i = 0; i < rowChunks.length; i++) {
>> var group = ~rowChunks[i];
>> row[i * 8 + 7] = bitsReversalMapping[0xff & (byte) (group >>> 56)];
>> row[i * 8 + 6] = bitsReversalMapping[0xff & (byte) (group >>> 48)];
>> row[i * 8 + 5] = bitsReversalMapping[0xff & (byte) (group >>> 40)];
>> row[i * 8 + 4] = bitsReversalMapping[0xff & (byte) (group >>> 32)];
>> row[i * 8 + 3] = bitsReversalMapping[0xff & (byte) (group >>> 24)];
>> row[i * 8 + 2] = bitsReversalMapping[0xff & (byte) (group >>> 16)];
>> row[i * 8 + 1] = bitsReversalMapping[0xff & (byte) (group >>> 8)];
>> row[i * 8] = bitsReversalMapping[0xff & (byte) group];
>> }
>> }
>>
>> private static void computeRemainderScalar(double[] aCr, byte[]
>> row, double Ci) {
>> var sideLen = aCr.length;
>> var bits = 0;
>> for (var x = sideLen & ~(1 << 6); x < sideLen; x++) {
>> var Zr = 0.0;
>> var Zi = 0.0;
>> var Cr = aCr[x];
>> var i = 50;
>> var ZrN = 0.0;
>> var ZiN = 0.0;
>> do {
>> Zi = 2.0 * Zr * Zi + Ci;
>> Zr = ZrN - ZiN + Cr;
>> ZiN = Zi * Zi;
>> ZrN = Zr * Zr;
>> } while (ZiN + ZrN <= 4.0 && --i > 0);
>> bits <<= 1;
>> bits += i == 0 ? 1 : 0;
>> if (x % 8 == 7) {
>> row[x / 8] = (byte) bits;
>> bits = 0;
>> }
>> }
>> if (sideLen % 8 != 0) {
>> row[sideLen / 8] = (byte) bits;
>> }
>> }
>> }