Comments / metadata in assembly listings don't make sense for code vectorized using Vector API
Piotr Tarsa
piotr.tarsa at gmail.com
Sat Jan 30 13:14:06 UTC 2021
Hi,
I was busy with other things, which is why my reply was delayed.
On Tue, Jan 19, 2021 at 21:17, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi Piotr,
>
> Thanks for further sharing. I am glad you managed to make progress. I was not aware there were some benchmark rules you needed to adhere to.
Rules are here:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/mandelbrot.html#mandelbrot
They require bit-perfect output and also the same algorithm. Well, in
the end it's a benchmark of programming languages, not of algorithms.
> Re: masks, yes there is still work to do for some mask operations.
OK, good to know.
> Re: execution from the command line. You can run with -XX:-TieredCompilation (Remi, thanks for the correction in the prior email :-) ), and it's also possible to reduce the compilation threshold (at the expense of potentially less accurate profiling information) using, say, -XX:CompileThreshold=1000 (the default is 10000).
> It’s always a bit tricky to compare a static system (Rust) against a dynamic one that needs to warm up.
>
> Paul.
I've tested the provided options, but they don't improve performance
on the real benchmark:
$ time for run in {1..30}; do ~/devel/jdk-16/bin/java
-XX:-TieredCompilation -XX:CompileThreshold=1000 --add-modules
jdk.incubator.vector -cp
target/classes/:/home/piotrek/.m2/repository/org/openjdk/jmh/jmh-core/1.27/jmh-core-1.27.jar
pl.tarsa.mandelbrot_simd_1 16000 > /dev/null; done
WARNING: Using incubator modules: jdk.incubator.vector
... (repeated for each run)
real 0m42,213s
user 2m35,740s
sys 0m4,094s
$ time for run in {1..30}; do ~/devel/jdk-16/bin/java
-XX:-TieredCompilation --add-modules jdk.incubator.vector -cp
target/classes/:/home/piotrek/.m2/repository/org/openjdk/jmh/jmh-core/1.27/jmh-core-1.27.jar
pl.tarsa.mandelbrot_simd_1 16000 > /dev/null; done
WARNING: Using incubator modules: jdk.incubator.vector
... (repeated for each run)
real 0m40,038s
user 2m23,227s
sys 0m3,515s
$ time for run in {1..30}; do ~/devel/jdk-16/bin/java --add-modules
jdk.incubator.vector -cp
target/classes/:/home/piotrek/.m2/repository/org/openjdk/jmh/jmh-core/1.27/jmh-core-1.27.jar
pl.tarsa.mandelbrot_simd_1 16000 > /dev/null; done
WARNING: Using incubator modules: jdk.incubator.vector
... (repeated for each run)
real 0m37,743s
user 2m16,316s
sys 0m3,758s
Looks like the default settings yield the best performance. That's a
positive thing, actually.
I'll probably send my version to the Benchmarks Game maintainer when
he switches to Java 16 and then leave the tuning to others.
Thanks for the conversation,
Piotr
>
> On Jan 16, 2021, at 4:12 AM, Piotr Tarsa <piotr.tarsa at gmail.com> wrote:
>
> Hi Paul,
>
> Thanks for replying.
>
> As per your advice I've prepared JMH benchmarks. I've also copied some
> optimizations from other mandelbrot benchmark implementations and
> achieved a further speedup. The new version is here:
> https://urldefense.com/v3/__https://gist.github.com/tarsa/7a9c80bb84c2dcd807be9cd16a655ee0/4ced690e20ad561515094995a852adc95820955e__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO8MCOGPNQ$
> It also has simplified buffer management and multithreading, so
> there's less boilerplate.
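>
> For the curious, the benchmarks are roughly of this shape (an
> illustrative sketch, not the exact benchmark code; it assumes the
> computeChunksVector kernel from the source below is made accessible
> to the benchmark class):
>
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
>
> @State(Scope.Benchmark)
> public class MandelbrotKernelBench {
>     static final int SIDE_LEN = 16000;
>     double[] aCr;
>     long[] rowChunks;
>
>     @Setup
>     public void setup() {
>         // same coordinate mapping as in the full program
>         var fac = 2.0 / SIDE_LEN;
>         aCr = new double[SIDE_LEN];
>         for (var x = 0; x < SIDE_LEN; x++) {
>             aCr[x] = x * fac - 1.5;
>         }
>         rowChunks = new long[SIDE_LEN / 64];
>     }
>
>     @Benchmark
>     public long[] vectorKernelOneRow() {
>         // one full row at an arbitrary fixed Ci; returning the array
>         // keeps the result alive so the JIT can't eliminate the work
>         mandelbrot_simd_1.computeChunksVector(-1.0, aCr, rowChunks);
>         return rowChunks;
>     }
> }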
>
> Results from JMH are quite good (a bit reordered for clarity):
> # Run complete. Total time: 00:16:42
> Benchmark                  Mode  Cnt        Score      Error  Units
> benchRowMT                thrpt    5    17755,337 ±  279,118  ops/s
> benchRowST                thrpt    5     4535,280 ±    7,471  ops/s
> benchScalarPairsMT        thrpt    5     4583,354 ±   89,532  ops/s
> benchScalarPairsST        thrpt    5     1163,925 ±    0,469  ops/s
> benchScalarMT             thrpt    5     2666,210 ±    5,004  ops/s
> benchScalarST             thrpt    5      673,234 ±    0,167  ops/s
> benchVectorMT             thrpt    5    18020,397 ±   54,230  ops/s
> benchVectorST             thrpt    5     4567,873 ±   10,339  ops/s
> benchVectorWithTransfer   thrpt    5     4557,361 ±    9,450  ops/s
> benchScalarRemainderOnly  thrpt    5  7105989,167 ± 4691,311  ops/s
>
> The mandelbrot Java #2 is named benchScalarPairsMT here (it's
> manually unrolled x2). My new vectorized version (similarly manually
> unrolled) is benchVectorMT, and it has about 4x higher performance,
> which is quite good.
>
> However, when I run the benchmark from the command line (in non-JMH
> mode) to replicate the benchmark rules, the performance difference is
> much smaller. Mandelbrot Java #2 (the unvectorized one) takes about
> 3s, while mine takes about 1.2s-1.3s and sometimes fluctuates up to
> about 1.5s. It seems that compilation takes up a lot of the benchmark
> time. Is it possible to reduce compilation times for code using the
> Vector API?
>
> On Tue, Jan 5, 2021 at 00:11, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
>
> Hi Piotr,
>
> Thanks for experimenting. The Intel folks have also experimented with Mandelbrot generation and might be able to comment on their experience for comparison.
>
> I would be interested to know what your expectations were with regards to speedup.
>
>
> I expected more than a 2x speedup vs. the scalar version, as I have
> 256-bit SIMD (four double lanes) on my Haswell machine. Luckily, I've
> managed to achieve that, as I wrote at the beginning.
>
> It’s hard to evaluate without a JMH benchmark, which can more easily expose the code hotspots and isolate them from other areas, such as thread parallelism. My recommendation would be to extract the computeChunksVector kernel and embed it within such a benchmark.
>
> Switching off tiered compilation should help (-XX:-TieredCompilation), i.e. only C2, not C1, in getting to the C2-generated code faster.
>
>
> Good point. I've made a JMH benchmark (the one I presented at the
> beginning) and saw where the inefficiencies are.
>
> To your question about the assembly listing: I believe that as HotSpot goes through its various compiler passes it tries to preserve the bytecode index associated with generated instructions, but naturally, as code is lowered, this becomes an approximation, especially so with the Vector API.
>
>
> Hmmm, it's sad that it's only an approximation.
>
> In the case of "*synchronization entry", this states the pseudo bytecode index just before a method is entered. However, I think there is tech debt here; see
>
> https://urldefense.com/v3/__https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/debugInfoRec.hpp*L67__;Iw!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO-g6St52w$
>
> And the usages of SynchronizationEntryBCI in hotspot code.
>
> Running in fastdebug mode will present a slightly higher-level view of the generated code. Here’s a snippet:
>
> 26e vmulpd XMM12,XMM7,XMM7 ! mul packedD
> 272 vaddpd XMM8,XMM0,XMM11 ! add packedD
> 277 vmulpd XMM9,XMM8,XMM8 ! mul packedD
> 27c vaddpd XMM0,XMM12,XMM9 ! add packedD
> 281 vector_compare XMM10,XMM5,XMM0,#3 !
> 286 # TLS is in R15
> 286 cmpq RCX, [R15 + #344 (32-bit)] # raw ptr
> 28d jnb,u B47 P=0.000100 C=-1.000000
>
> …
>
> 2cf vmovdqu [rsp + 64],XMM1 # spill
>
> 2d5 B25: # out( B44 B26 ) <- in( B48 B24 ) Freq: 9114.56
> 2d5
> 2d5 # checkcastPP of RAX
> 2d5 vector_store_mask XMM1,XMM10 ! using XMM13 as TEMP
> 2f4 vector_loadmask_byte XMM15,XMM1
>
> 302 vmovdqu XMM1,[rsp + 448] # spill
> 30b vpor XMM1,XMM1,XMM15 ! or vectors
> 310 vector_store_mask XMM0,XMM1 ! using XMM14 as TEMP
> 32e store_vector [RAX + #16 (8-bit)],XMM0
>
> 333 vector_test RCX,XMM1, XMM2 ! using RFLAGS as TEMP
> nop # 2 bytes pad for loops and calls
> 340 testl RCX, RCX
> 342 je B44 P=0.108889 C=184362.000000
>
>
> The mask instructions, such as vector_store_mask, are each substituted with a more complex sequence of x86 instructions on AVX2.
>
>
> Thanks. For now the speedup in JMH is good enough for me, so I won't
> dig into the assembly code, but I'll consider a fastdebug Java build
> the next time I explore assembly.
>
> I do notice that the inner loop (upper bound of 5) does unroll (FWIW, making the inner bound a power of 2 is more friendly for unrolling). There also appear to be many register spills, suggesting non-optimal vector register allocation/usage by C2.
>
>
> I've tried a factor of 4, but that didn't improve performance. I
> think it even made things worse: 50 is not a multiple of 4, and the
> benchmark rules dictate that there must be exactly 50 iterations, so
> I needed additional code for the 50 % 4 = 2 extra iterations. That
> probably made compilation take even longer and worsened the benchmark
> execution time.
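>
> Concretely, the structure I tried looked roughly like this (a sketch
> to show the split only, not the exact code):
>
> // 50 = 12 * 4 + 2
> for (var outer = 0; outer < 12; outer++) {
>     for (var inner = 0; inner < 4; inner++) {
>         // one Mandelbrot step, as in the inner loop of the source below
>     }
>     // cmpMask update and early-exit check here
> }
> // separate code for the remaining 50 % 4 = 2 iterations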
>
> Instead of unrolling the inner loop, I've duplicated all the vectors
> (i.e. unrolled the chunk-level loop instead of the inner
> iteration-level loop), so I was working on (256 bits / 64 bits) * 2 =
> 8 results at a time. That worked well, similarly to mandelbrot Java #2.
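>
> In sketch form the loop body looked like this (names illustrative,
> following the source below; A and B are the two duplicated sets of
> vectors inside computeChunksVector):
>
> var vZrA = vZeroes; var vZiA = vZeroes;
> var vZrNA = vZeroes; var vZiNA = vZeroes;
> var vZrB = vZeroes; var vZiB = vZeroes;
> var vZrNB = vZeroes; var vZiNB = vZeroes;
> var vCrA = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc);
> var vCrB = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc + LANES);
> for (var inner = 0; inner < 5; inner++) {
>     // two independent dependency chains per iteration
>     vZiA = vTwos.mul(vZrA).mul(vZiA).add(vCi);
>     vZiB = vTwos.mul(vZrB).mul(vZiB).add(vCi);
>     vZrA = vZrNA.sub(vZiNA).add(vCrA);
>     vZrB = vZrNB.sub(vZiNB).add(vCrB);
>     vZiNA = vZiA.mul(vZiA);
>     vZiNB = vZiB.mul(vZiB);
>     vZrNA = vZrA.mul(vZrA);
>     vZrNB = vZrB.mul(vZrB);
> }
>
> With two independent chains the CPU can overlap the latencies of the
> vector multiplies and adds, which is presumably where the extra
> throughput comes from.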
>
> I noticed this in the code:
>
> // in Rust version this works fine, so where's the bug then?
> // cmpMask = vFours.lt(vZiN.add(vZrN));
>
> What problem did you encounter? It works for me on the tip of https://urldefense.com/v3/__https://github.com/openjdk/jdk__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO80IYo13A$ .
>
>
> The results were different, but the benchmark dictates that the
> output must be bit-for-bit identical to the correct output. I've
> nailed it down: numeric overflow pushes numbers to infinity,
> subtracting infinity from infinity then yields Double.NaN, and after
> that the comparisons are always false. Anyway, I've tried to make a
> version without `cmpMask = cmpMask.or(newValue)` and got a correct
> one, but it turned out to be slower. Perhaps some of the mask
> operations are not intrinsified yet, or something. Rust was doing the
> comparisons a little differently, properly accounting for NaNs.
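>
> The overflow effect is easy to reproduce in plain Java (e.g. in
> jshell):
>
> double zr = 1e200;
> double zrN = zr * zr;            // overflows to Infinity
> double d = zrN - zrN;            // Infinity - Infinity = NaN
> System.out.println(4.0 < d);     // false
> System.out.println(d <= 4.0);    // also false: every comparison with NaN is false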
>
> All tests were done on:
> openjdk 16-ea 2021-03-16
> OpenJDK Runtime Environment (build 16-ea+30-2130)
> OpenJDK 64-Bit Server VM (build 16-ea+30-2130, mixed mode, sharing)
>
>
> Paul.
>
> On Dec 30, 2020, at 6:17 AM, Piotr Tarsa <piotr.tarsa at gmail.com> wrote:
>
> Hi all,
>
> Thanks for creating Project Panama! It looks promising. However, I've
> tried to vectorize some code and got somewhat disappointing results,
> so I wanted to look at the generated machine code to see if it looks
> optimal. I've attached hsdis to the JVM and enabled assembly
> printing, but the output doesn't make sense to me, i.e. the
> instructions and the comments / metadata don't seem to match. I may
> be wrong, as I've very rarely looked at assembly listings produced by
> the JVM.
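>
> (By "enabled assembly printing" I mean the usual diagnostic flags,
> roughly:
>
> java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ...
>
> with the hsdis library on the JVM's library path.)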
>
> Performance:
> As a baseline I took
> https://urldefense.com/v3/__https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/mandelbrot-java-2.html__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO9Hidpnxg$
> which takes about 3.05s to finish on my system. After vectorization
> I've managed to achieve timings of about 1.80s. That's quite
> disappointing to me, as I have a Haswell machine with AVX2,
> high-speed L1 caches, etc. I've tested on a recent JDK 16 EA build from
> https://urldefense.com/v3/__http://jdk.java.net/16/__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO8eAZxJgQ$
>
> Link to the code and assembly listing:
> https://urldefense.com/v3/__https://gist.github.com/tarsa/7a9c80bb84c2dcd807be9cd16a655ee0__;!!GqivPVa7Brio!PPG_y5O-1X6iaHMBpRFVBkTI7zM4hvwD-CWKID54DAHe_q47jEN4GoDyUO9lqtRFkA$
> I'll copy the source code again at the end of this mail.
>
> What I see in the assembly listings is, e.g.:
>
> 0x00007f0e208b8ab9: cmp r13d,0x7fffffc0
> 0x00007f0e208b8ac0: jg 0x00007f0e208b932c
> 0x00007f0e208b8ac6: vmulpd ymm0,ymm6,ymm4
> 0x00007f0e208b8aca: vsubpd ymm1,ymm4,ymm4
> 0x00007f0e208b8ace: vmovdqu YMMWORD PTR [rsp+0xc0],ymm1
> 0x00007f0e208b8ad7: vmulpd ymm0,ymm0,ymm4
>                     ;*synchronization entry
>                     ; - jdk.internal.vm.vector.VectorSupport$VectorPayload::getPayload@-1 (line 101)
>                     ; - jdk.incubator.vector.Double256Vector$Double256Mask::getBits@1 (line 557)
>                     ; - jdk.incubator.vector.AbstractMask::toLong@24 (line 77)
>                     ; - mandelbrot_simd_1::computeChunksVector@228 (line 187)
> 0x00007f0e208b8adb: vaddpd ymm0,ymm0,ymm2
>                     ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
>                     ; - jdk.incubator.vector.DoubleVector::fromArray0Template@34 (line 3119)
>                     ; - jdk.incubator.vector.Double256Vector::fromArray0@3 (line 777)
>                     ; - jdk.incubator.vector.DoubleVector::fromArray@24 (line 2564)
>                     ; - mandelbrot_simd_1::computeChunksVector@95 (line 169)
> 0x00007f0e208b8adf: vmovdqu YMMWORD PTR [rsp+0xe0],ymm0
> 0x00007f0e208b8ae8: vmulpd ymm0,ymm0,ymm0
> 0x00007f0e208b8aec: vmovdqu YMMWORD PTR [rsp+0x100],ymm0
>
> How does vmulpd relate to a synchronization entry and
> AbstractMask::toLong? It seems way off to me. However, there may be
> some trick to understanding it. Could you give me some guidelines on
> how to interpret it? Do the comments describe the lines below or
> above them?
>
> Regards,
> Piotr
>
> mandelbrot_simd_1.java source code:
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorMask;
> import jdk.incubator.vector.VectorSpecies;
>
> import java.io.BufferedOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import java.util.stream.IntStream;
>
> public class mandelbrot_simd_1 {
>     private static final VectorSpecies<Double> SPECIES =
>             DoubleVector.SPECIES_PREFERRED.length() <= 8 ?
>                     DoubleVector.SPECIES_PREFERRED : DoubleVector.SPECIES_512;
>
>     private static final int LANES = SPECIES.length();
>
>     private static final int LANES_LOG = Integer.numberOfTrailingZeros(LANES);
>
>     public static void main(String[] args) throws IOException {
>         if ((LANES > 8) || (LANES != (1 << LANES_LOG))) {
>             var errorMsg = "LANES must be a power of two and at most 8. " +
>                     "Change SPECIES in the source code.";
>             throw new RuntimeException(errorMsg);
>         }
>         var sideLen = Integer.parseInt(args[0]);
>         try (var out = new BufferedOutputStream(makeOut1())) {
>             out.write(String.format("P4\n%d %d\n", sideLen, sideLen).getBytes());
>             computeAndOutputRows(out, sideLen);
>         }
>     }
>
>     @SuppressWarnings("unused")
>     // the version that avoids mixing up output with JVM diagnostic messages
>     private static OutputStream makeOut1() throws IOException {
>         return Files.newOutputStream(Path.of("mandelbrot_simd_1.pbm"));
>     }
>
>     // the version that is compatible with benchmark requirements
>     private static OutputStream makeOut2() {
>         return System.out;
>     }
>
>     private static void computeAndOutputRows(OutputStream out, int sideLen) {
>         var poolFactor = 1000000 / sideLen;
>         if (poolFactor < 10) {
>             throw new RuntimeException("Too small poolFactor");
>         }
>         var numCpus = Runtime.getRuntime().availableProcessors();
>         var rowsPerBatch = numCpus * poolFactor;
>         var fac = 2.0 / sideLen;
>         var aCr = IntStream.range(0, sideLen).parallel()
>                 .mapToDouble(x -> x * fac - 1.5).toArray();
>         var bitsReversalMapping = computeBitsReversalMapping();
>         var rowsPools = new byte[2][rowsPerBatch][(sideLen + 7) / 8];
>         var rowsChunksPools = new long[2][rowsPerBatch][sideLen / 64];
>         var batchSizes = new int[2];
>         var batchCountDowns = new CountDownLatch[2];
>         var computeEc = Executors.newWorkStealingPool(numCpus);
>         var masterThread = new Thread(() -> {
>             var rowsToProcess = sideLen;
>             var nextBatchStart = 0;
>             batchSizes[0] = 0;
>             batchCountDowns[0] = new CountDownLatch(0);
>             for (var poolId = 0; rowsToProcess > 0; poolId ^= 1) {
>                 while (batchCountDowns[poolId].getCount() != 0) {
>                     try {
>                         batchCountDowns[poolId].await();
>                     } catch (InterruptedException ignored) {
>                     }
>                 }
>                 batchCountDowns[poolId] = null;
>
>                 var nextBatchSize =
>                         Math.min(sideLen - nextBatchStart, rowsPerBatch);
>                 var nextPoolId = poolId ^ 1;
>                 batchSizes[nextPoolId] = nextBatchSize;
>                 batchCountDowns[nextPoolId] = new CountDownLatch(nextBatchSize);
>                 sendTasks(fac, aCr, bitsReversalMapping,
>                         rowsPools[nextPoolId], rowsChunksPools[nextPoolId],
>                         nextBatchStart, nextBatchSize,
>                         batchCountDowns[nextPoolId], computeEc);
>                 nextBatchStart += nextBatchSize;
>
>                 var batchSize = batchSizes[poolId];
>                 try {
>                     for (var rowIdx = 0; rowIdx < batchSize; rowIdx++) {
>                         out.write(rowsPools[poolId][rowIdx]);
>                     }
>                     out.flush();
>                 } catch (IOException e) {
>                     e.printStackTrace();
>                     System.exit(-1);
>                 }
>                 rowsToProcess -= batchSize;
>             }
>
>             computeEc.shutdown();
>         });
>         masterThread.start();
>         while (masterThread.isAlive() || !computeEc.isTerminated()) {
>             try {
>                 @SuppressWarnings("unused")
>                 var ignored = computeEc.awaitTermination(1, TimeUnit.DAYS);
>                 masterThread.join();
>             } catch (InterruptedException ignored) {
>             }
>         }
>     }
>
>     private static void sendTasks(double fac, double[] aCr,
>                                   byte[] bitsReversalMapping,
>                                   byte[][] rows, long[][] rowsChunks,
>                                   int batchStart, int batchSize,
>                                   CountDownLatch poolsActiveWorkersCount,
>                                   ExecutorService computeEc) {
>         for (var i = 0; i < batchSize; i++) {
>             var indexInBatch = i;
>             var y = batchStart + i;
>             var Ci = y * fac - 1.0;
>             computeEc.submit(() -> {
>                 try {
>                     computeRow(Ci, aCr, bitsReversalMapping,
>                             rows[indexInBatch], rowsChunks[indexInBatch]);
>                     poolsActiveWorkersCount.countDown();
>                 } catch (Exception e) {
>                     e.printStackTrace();
>                     System.exit(-1);
>                 }
>             });
>         }
>     }
>
>     private static byte[] computeBitsReversalMapping() {
>         var bitsReversalMapping = new byte[256];
>         for (var i = 0; i < 256; i++) {
>             bitsReversalMapping[i] = (byte) (Integer.reverse(i) >>> 24);
>         }
>         return bitsReversalMapping;
>     }
>
>     private static void computeRow(double Ci, double[] aCr,
>                                    byte[] bitsReversalMapping,
>                                    byte[] row, long[] rowChunks) {
>         computeChunksVector(Ci, aCr, rowChunks);
>         transferRowFlags(rowChunks, row, bitsReversalMapping);
>         computeRemainderScalar(aCr, row, Ci);
>     }
>
>     private static void computeChunksVector(double Ci, double[] aCr,
>                                             long[] rowChunks) {
>         var sideLen = aCr.length;
>         var vCi = DoubleVector.broadcast(SPECIES, Ci);
>         var vZeroes = DoubleVector.zero(SPECIES);
>         var vTwos = DoubleVector.broadcast(SPECIES, 2.0);
>         var vFours = DoubleVector.broadcast(SPECIES, 4.0);
>         var zeroMask = VectorMask.fromLong(SPECIES, 0);
>         // (1 << 6) = 64 = length of long in bits
>         for (var xBase = 0; xBase < (sideLen & ~(1 << 6)); xBase += (1 << 6)) {
>             var cmpFlags = 0L;
>             for (var xInc = 0; xInc < (1 << 6); xInc += LANES) {
>                 var vZr = vZeroes;
>                 var vZi = vZeroes;
>                 var vCr = DoubleVector.fromArray(SPECIES, aCr, xBase + xInc);
>                 var vZrN = vZeroes;
>                 var vZiN = vZeroes;
>                 var cmpMask = zeroMask;
>                 for (var outer = 0; outer < 10; outer++) {
>                     for (var inner = 0; inner < 5; inner++) {
>                         vZi = vTwos.mul(vZr).mul(vZi).add(vCi);
>                         vZr = vZrN.sub(vZiN).add(vCr);
>                         vZiN = vZi.mul(vZi);
>                         vZrN = vZr.mul(vZr);
>                     }
>                     cmpMask = cmpMask.or(vFours.lt(vZiN.add(vZrN)));
>                     // in Rust version this works fine, so where's the bug then?
>                     // cmpMask = vFours.lt(vZiN.add(vZrN));
>                     if (cmpMask.allTrue()) {
>                         break;
>                     }
>                 }
>                 cmpFlags |= cmpMask.toLong() << xInc;
>             }
>             rowChunks[xBase >> 6] = cmpFlags;
>         }
>     }
>
>     private static void transferRowFlags(long[] rowChunks, byte[] row,
>                                          byte[] bitsReversalMapping) {
>         for (var i = 0; i < rowChunks.length; i++) {
>             var group = ~rowChunks[i];
>             row[i * 8 + 7] = bitsReversalMapping[0xff & (byte) (group >>> 56)];
>             row[i * 8 + 6] = bitsReversalMapping[0xff & (byte) (group >>> 48)];
>             row[i * 8 + 5] = bitsReversalMapping[0xff & (byte) (group >>> 40)];
>             row[i * 8 + 4] = bitsReversalMapping[0xff & (byte) (group >>> 32)];
>             row[i * 8 + 3] = bitsReversalMapping[0xff & (byte) (group >>> 24)];
>             row[i * 8 + 2] = bitsReversalMapping[0xff & (byte) (group >>> 16)];
>             row[i * 8 + 1] = bitsReversalMapping[0xff & (byte) (group >>> 8)];
>             row[i * 8] = bitsReversalMapping[0xff & (byte) group];
>         }
>     }
>
>     private static void computeRemainderScalar(double[] aCr, byte[] row, double Ci) {
>         var sideLen = aCr.length;
>         var bits = 0;
>         for (var x = sideLen & ~(1 << 6); x < sideLen; x++) {
>             var Zr = 0.0;
>             var Zi = 0.0;
>             var Cr = aCr[x];
>             var i = 50;
>             var ZrN = 0.0;
>             var ZiN = 0.0;
>             do {
>                 Zi = 2.0 * Zr * Zi + Ci;
>                 Zr = ZrN - ZiN + Cr;
>                 ZiN = Zi * Zi;
>                 ZrN = Zr * Zr;
>             } while (ZiN + ZrN <= 4.0 && --i > 0);
>             bits <<= 1;
>             bits += i == 0 ? 1 : 0;
>             if (x % 8 == 7) {
>                 row[x / 8] = (byte) bits;
>                 bits = 0;
>             }
>         }
>         if (sideLen % 8 != 0) {
>             row[sideLen / 8] = (byte) bits;
>         }
>     }
> }
>
>