RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics [v5]

Tue Jan 9 12:11:56 UTC 2024

On Mon, 4 Dec 2023 17:33:00 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Vectorizing Poly1305 is quite tricky. We already have a highly-
>> efficient scalar Poly1305 implementation that runs on the core integer
>> unit, but it's highly serialized, so it does not make good use of
>> the parallelism available.
>> 
>> The scalar implementation takes advantage of some particular features
>> of the Poly1305 keys. In particular, certain bits of r, the secret
>> key, are required to be 0. These make it possible to use a full
>> 64-bit-wide multiply-accumulate operation without needing to process
>> carries between partial products.
>> 
>> While this works well for a serial implementation, a parallel
>> implementation cannot do this because rather than multiplying by r,
>> each step multiplies by some integer power of r, modulo
>> 2^130-5.
>> 
>> In order to avoid processing carries between partial products we use a
>> redundant representation, in which each 130-bit integer is encoded
>> either as a 5-digit integer in base 2^26 or as a 3-digit integer in
>> base 2^52, depending on whether we are using a 64- or 32-bit
>> multiply-accumulate.
>> 
>> In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
>> operation available to us, so we must use 32*32 -> 64-bit operations.
>> 
>> In order to achieve maximum performance we'd like to get close to the
>> processor's decode bandwidth, so that every clock cycle does something
>> useful. In a typical high-end AArch64 implementation, the core integer
>> unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
>> fast(ish) two-way 32-bit multiplier, which may be slower than the
>> core integer unit's. It is not at all obvious whether it's best to use
>> ASIMD or core instructions.
>> 
>> Fortunately, if we have a wide-bandwidth instruction decode, we can do
>> both at the same time, by feeding alternating instructions to the core
>> and the ASIMD units. This also allows us to make good use of all of
>> the available core and ASIMD registers, in parallel.
>> 
>> To do this we use generators, which here are a kind of iterator that
>> emits a group of instructions each time it is called. In this case we
>> use 4 parallel generators, and by calling them alternately we interleave
>> the ASIMD and the core instructions. We also take care to ensure that
>> each generator finishes at about the same time, to maximize the
>> distance between instructions which generate and consume data.
>> 
>> The results are pretty good, ranging from 2* - 3* speedup. It is
>> possible that a pure in-order processor (Raspberry Pi?) migh...
>
> Andrew Haley has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Whitespace
>  - Whitespace

This is an outstanding piece of work which, setting aside the pleasure afforded by its beauty, achieves some highly important goals, both immediate and long-term.

The base achievement is to implement the poly1305 algorithm with an intrinsic which drives both the vector and integer units in combination, maximising pipeline parallelism while also profiting from vector (2-way) SIMD parallelism. This design enables high-end, out-of-order processors like Apple's M-series to attain close to 6 instructions per cycle. However, the implementation achieves some much more important goals.

A further achievement is to generate this highly efficient code using a generation strategy that renders its correctness ascertainable by direct review of the methods employed in the generator code, rather than by resort to eyeballing the highly complex, interleaved streams of parallel instructions that they generate.

The third and, perhaps, most significant achievement is to reach that goal by implementing the generator using a toolkit which simplifies handling of the many complexities involved in structuring and interleaving the generated instruction sequences and managing the independent and shared register sets those instructions employ.
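
To make the interleaving idea concrete, here is a minimal sketch. It is not the patch's AsmGenerator: `ToyGenerator`, `interleave`, `CodeBuffer` and the instruction strings are hypothetical names chosen for illustration. It models a generator as a queue of closures, each of which emits one group of instructions, and a driver that calls the generators alternately.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a code buffer: a list of instruction strings.
using CodeBuffer = std::vector<std::string>;

// A toy generator: a queue of closures, each emitting one instruction group.
class ToyGenerator {
  std::vector<std::function<void(CodeBuffer&)>> _steps;
  std::size_t _next = 0;
public:
  // Queue another instruction group; nothing is emitted yet.
  ToyGenerator& operator<<(std::function<void(CodeBuffer&)> step) {
    _steps.push_back(std::move(step));
    return *this;
  }
  bool done() const { return _next == _steps.size(); }
  // Emit the next pending group, if there is one.
  void step(CodeBuffer& out) { if (!done()) _steps[_next++](out); }
};

// Call the generators round-robin until all of them are exhausted.
inline void interleave(std::vector<ToyGenerator*> gens, CodeBuffer& out) {
  bool progress = true;
  while (progress) {
    progress = false;
    for (ToyGenerator* g : gens) {
      if (!g->done()) { g->step(out); progress = true; }
    }
  }
}
```

Queuing "mul", "umulh" on one generator and "umull", "umlal", "addv" on another, then calling `interleave`, yields the two streams alternating until the shorter one runs out. The point of the design is that each stream can be reviewed on its own while the emitted code is still interleaved.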

This last goal is arguably the most important one as it presents a paradigm for how to generate highly efficient, correct parallel code. Its importance is that the same technique and technology might be retrofitted to other intrinsics with great benefits for maintainability, reliability and confidence in the correctness of the code.

The generator toolkit and generation code appear to be entirely correct and are mostly very clean, suffering only from a few format details and one or two now redundant methods that appear to be hangovers from earlier versions. However, the code is missing documentation comments that will be critical to ensure that maintainers can quickly understand what the generated code is doing.

The generator code itself needs some commenting to clarify how it is used and how it operates. I have made a few suggestions to that end.

The largest omission is commenting of 1) the data layouts used to manage 130-bit data values and 2) the purpose and operation of the various macro-functions that generate smaller and larger instruction sequences. I have likewise suggested comments to clarify these parts of the patch.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1745:

> 1743:     poly1305_multiply(acc, u, s, r, RR2, scratch);
> 1744:     acc.gen();
> 1745:   }

Comment

    /*
     * Appends instructions to the current code buffer implementing
     * a vector parallel 2-way SIMD widening 130-bit multiply
     * u <--(s*r) mod 2^130-5 by calling
     *   poly1305_multiply_vec(acc, u, m, r, rr, scratch)
     *   acc.gen();
     */

Also, change m[] to s[] in the method signature
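
For readers new to the trick behind rr, the following scalar sketch may help. It is an illustration under assumptions, not the vector code in the patch: the name `mul_mod_p_26` is hypothetical, the values are held as five radix-2^26 limbs, and rr[i] is assumed to hold 5 * r[i]. Partial products that would land at or above 2^130 are folded back using 2^130 = 5 (mod 2^130-5), and no carries are propagated.

```cpp
#include <cstdint>

// Schoolbook 5x5 limb product in radix 2^26, folding terms at or above
// 2^130 back in via 2^130 = 5 (mod 2^130 - 5). rr[i] must hold 5 * r[i].
// The outputs u[] keep their carries; no carry propagation is done here.
inline void mul_mod_p_26(uint64_t u[5], const uint32_t s[5],
                         const uint32_t r[5], const uint32_t rr[5]) {
  for (int k = 0; k < 5; k++) {
    uint64_t acc = 0;
    for (int i = 0; i < 5; i++) {
      // s[i] * r[j] contributes to limb i + j; when i + j >= 5 it wraps
      // to limb i + j - 5 with an extra factor of 5, hence rr.
      acc += (i <= k) ? (uint64_t)s[i] * r[k - i]
                      : (uint64_t)s[i] * rr[k - i + 5];
    }
    u[k] = acc;
  }
}
```

With 26-bit digits each product stays well below 2^64 even after five are summed, which is why the carries can be left for a later reduction step.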

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1752:

> 1750:     acc.gen();
> 1751:   }
> 1752: 

Comment

    /*
     * Appends instructions to the generator which:
     *
     * load two 128-bit values from input_start into s[0], s[1] and
     * s[2] in vec_4s3_26 format.
     *
     * set bit 24 of each DWORD in s[2] to 1.
     *
     * add each of the 26 bit limbs of the two values passed in u[] in
     * vec_2d5_26 format to the corresponding limbs and values of s[]
     * in vec_4s3_26 format.
     *
     * shuffle the values in s from vec_4s3_26 format to vec_4s3_26_I
     * format.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1755:

> 1753:   void poly1305_step_vec(AsmGenerator &acc,
> 1754:                          const FloatRegister s[], const FloatRegister u[],
> 1755:                          const FloatRegister zero, Register input_start);

The method below appears to be redundant?

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1761:

> 1759:                            const FloatRegister s_v[],
> 1760:                            const FloatRegister r_v[],
> 1761:                            const FloatRegister rr_v[]);

Comment

    /*
     * Appends instructions to acc that perform a modulo 2^130-5
     * Goll-Gueron reduction on the pair of cross-multiplied 130-bit
     * products presented in u[].
     *
     * u[] inputs the five 26-bit 'digits' and associated 'carry' bits
     * for two 130-bit cross-products in vec_2d5_26 format (as output
     * by poly1305_multiply_vec). On return the reduced 130-bit values
     * are output in u[] in the same format.
     *
     * zero is used to zero out bits 26 to 63 of the low and high
     * DWORDs in u[]. Both the low and high DWORDs of this input
     * argument must be set to 0 by the caller.
     *
     * scratch provides at least two scratch registers that can
     * be used by the generated code.
     */

Argument upper_bits should be renamed to zero.
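
As a reminder of what such a reduction has to achieve, here is a scalar sketch of one carry-propagation pass over five radix-2^26 limbs. The name `carry_reduce_26` is hypothetical, and the order in which the patch moves carries between limbs (and across the two vector lanes) may well differ; this only models the arithmetic.

```cpp
#include <cstdint>

// One pass of carry propagation over five radix-2^26 limbs, modulo 2^130 - 5.
inline void carry_reduce_26(uint64_t u[5]) {
  const uint64_t mask = (UINT64_C(1) << 26) - 1;
  for (int i = 0; i < 4; i++) {
    u[i + 1] += u[i] >> 26;   // move the carry bits up one limb
    u[i] &= mask;             // keep the 26-bit digit
  }
  // The carry out of the top limb has weight 2^130 = 5 (mod 2^130 - 5).
  u[0] += 5 * (u[4] >> 26);
  u[4] &= mask;
  // u[0] may now slightly exceed 26 bits; push that small carry into u[1].
  u[1] += u[0] >> 26;
  u[0] &= mask;
}
```

The result is only partially reduced: u[1] can end up marginally above 26 bits, which is harmless as input to a further multiply but is not the canonical value in [0, 2^130-5).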

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1765:

> 1763:                            const FloatRegister u[],
> 1764:                            const FloatRegister upper_bits,
> 1765:                            AbstractRegSet<FloatRegister> scratch);

Comment

    /*
     * Appends instructions to acc that perform a 130-bit
     * cross-multiply and reduction by calling
     * poly1305_multiply_vec(acc, u, s, r, rr) and
     * poly1305_reduce_vec(acc, u, zero, scratch).
     */

Also, it would flag what is happening more clearly if this method were renamed poly1305_field_multiply_vec. Obviously, the suffix is redundant, because the signature clarifies which variant of the two methods with this name is being called. However, it does no harm to ensure that apples are clearly labelled apples and pears clearly labelled pears.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1776:

> 1774:     poly1305_reduce_vec(acc, u, zero, scratch);
> 1775:   }
> 1776: 

Comment

    /*
     * Appends instructions to acc which load a 128-bit value from
     * input_start, stores it as a 130-bit value in s[] in gpr_d3_56
     * format and sets bit 24 of s[2] to 1.
     */
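
A scalar model of that load may make the "bit 24" remark easier to check. This is a sketch under assumptions: it takes the limbs to be 52 bits wide (base 2^52, as in the PR description), so the shifts would change if the gpr_d3_56 name implies a different width, and `load_le64` and `load_block_52` are hypothetical helper names.

```cpp
#include <cstdint>

// Read a 64-bit little-endian value from memory, byte by byte.
inline uint64_t load_le64(const unsigned char* p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
  return v;
}

// Unpack a 16-byte block into three limbs, assuming 52 bits per limb
// (bits 0..51, 52..103 and 104..127), then set bit 128 of the value,
// which is bit 24 of the top limb because 128 - 104 = 24.
inline void load_block_52(uint64_t s[3], const unsigned char block[16]) {
  const uint64_t mask52 = (UINT64_C(1) << 52) - 1;
  uint64_t lo = load_le64(block);
  uint64_t hi = load_le64(block + 8);
  s[0] = lo & mask52;
  s[1] = ((lo >> 52) | (hi << 12)) & mask52;
  s[2] = (hi >> 40) | (UINT64_C(1) << 24);
}
```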

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1778:

> 1776: 
> 1777:   void poly1305_load(AsmGenerator &acc, const Register s[],
> 1778:                      const Register input_start);

Comment

    /*
     * Appends instructions to the current code buffer to load a
     * 128-bit value by calling
     *
     *  poly1305_load(acc, s, input_start);
     *  acc.gen(); 
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1783:

> 1781:     poly1305_load(acc, s, input_start);
> 1782:     acc.gen();
> 1783:   }

Comment

    /*
     * Appends instructions to acc which load a 128-bit value from
     * input_start into s and then add it to the value in u[] by
     * calling
     *
     *  poly1305_load(acc, s, input_start);
     *  _ { poly1305_add(s, u); };
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1784:

> 1782:     acc.gen();
> 1783:   }
> 1784:   void poly1305_step(AsmGenerator &acc, const Register s[], const RegPair u[], const Register input_start);

Comment

    /*
     * Appends instructions to the current code buffer which load a
     * 128-bit value from input_start into s and then add it to the
     * value in u[] by calling
     *
     *  poly1305_step(acc, s, u, input_start);
     *  acc.gen();
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1789:

> 1787:     poly1305_step(acc, s, u, input_start);
> 1788:     acc.gen();
> 1789:   }

Comment

    /*
     * Appends instructions to the current code buffer which add the
     * 130-bit value in src to the 130-bit value in dest by calling
     *
     *  poly1305_add(acc, dest, src);
     *  acc.gen();
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1790:

> 1788:     acc.gen();
> 1789:   }
> 1790:   void poly1305_add(const Register dest[], const RegPair src[]);

Comment

    /*
     * Appends instructions to the current code buffer which add each
     * of the 3 56-bit limbs in src to the corresponding 56 bit limb
     * in dest.
     *
     * src is a 130-bit value in gpr_d3_56 format.
     *
     * dest is a 130-bit value in gpr_d3_56 format. Carry bits may
     * accumulate in the higher bits of each limb as a result of
     * successive additions.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1793:

> 1791:   void poly1305_add(AsmGenerator &acc,
> 1792:                     const Register dest[], const RegPair src[]);
> 1793: 

Method mov26 appears to be redundant

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1794:

> 1792:                     const Register dest[], const RegPair src[]);
> 1793: 
> 1794:   void mov26(FloatRegister d, Register s, int lsb);

Add comment

    /*
     * Split the 56 bit digit passed in r into two low and high 26-bit
     * digits and insert them, respectively, into the lower and upper
     * 32-bit half-words of d.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1795:

> 1793: 
> 1794:   void mov26(FloatRegister d, Register s, int lsb);
> 1795:   void expand26(Register d, Register r);

Add comment

    /*
     * Split the 56 bit digit passed in r into two low and high 26-bit
     * digits and insert them into the low DWORD of, respectively,
     * d[0] and d[1].
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1796:

> 1794:   void mov26(FloatRegister d, Register s, int lsb);
> 1795:   void expand26(Register d, Register r);
> 1796:   void split26(const FloatRegister d[], Register s);

Add comment

    /*
     * Copy a 130-bit value from general purpose registers s0, s1, s2 into
     * the vector register array d[5].
     *
     * s0, s1 and s2 input a 130-bit value in gpr_d3_56 format.
     *
     * d outputs a 130-bit value in vec_d5_26 format.
     */
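
The conversion is cheap because two 26-bit digits fit exactly in one 52-bit digit. A scalar sketch, assuming 52-bit limbs whose carries have already been propagated (`limbs_3_to_5` is a hypothetical name, and the real method writes vector registers rather than an array):

```cpp
#include <cstdint>

// Split three limbs (assumed 52 bits each, fully carried) into five
// 26-bit limbs. Each 52-bit limb yields a low and a high 26-bit digit.
inline void limbs_3_to_5(uint32_t d[5], const uint64_t s[3]) {
  const uint64_t mask26 = (UINT64_C(1) << 26) - 1;
  d[0] = (uint32_t)(s[0] & mask26);
  d[1] = (uint32_t)((s[0] >> 26) & mask26);
  d[2] = (uint32_t)(s[1] & mask26);
  d[3] = (uint32_t)((s[1] >> 26) & mask26);
  d[4] = (uint32_t)(s[2] & mask26);   // top limb holds bits 104..129
}
```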

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1798:

> 1796:   void split26(const FloatRegister d[], Register s);
> 1797:   void copy_3_to_5_regs(const FloatRegister d[],
> 1798:                         const Register s0, const Register s1, const Register s2);

Add comment

    /*
     * Copy a 130-bit value from general purpose registers s0, s1, s2 into
     * the vector register array d[2].
     *
     * s0, s1 and s2 input a 130-bit value in gpr_d3_56 format.
     *
     * d outputs a 130-bit value in vec_4s2_26 format.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1801:

> 1799:   void copy_3_regs_to_5_elements(const FloatRegister d[],
> 1800:                                  const Register s0, const Register s1, const Register s2);
> 1801: 

Add comment

    /*
     * Appends instructions to acc that perform a modulo 2^130-5
     * Goll-Gueron reduction on the cross-multiplied 130-bit product
     * presented in u[].
     *
     * u[] inputs 3 56-bit 'digits' and associated 'carry' bits for a
     * 130-bit cross-product in gpr_d3_56 format (as output by
     * poly1305_multiply). On return the reduced 130-bit value is
     * output via u[] in the same format.
     */

Also, argument 's' seems to be redundant. Can it be removed?

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1802:

> 1800:                                  const Register s0, const Register s1, const Register s2);
> 1801: 
> 1802:   void poly1305_reduce(AsmGenerator &acc, const RegPair u[], const char *s = nullptr);

Add comment

    /*
     * Append instructions to the current code buffer that perform a
     * modulo 2^130-5 Goll-Gueron reduction on the 130-bit value passed in
     * u[] by calling:
     *
     *   poly1305_reduce(acc, u, "redc");
     *   acc.gen();
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1807:

> 1805:     poly1305_reduce(acc, u, "redc");
> 1806:     acc.gen();
> 1807:   }

Add comment

    /*
     * Appends instructions to acc that add carry bits from s to
     * d and then zero out the carry bits in s.
     *
     * s stores a single limb of a 130-bit value in reg_d5_26 format
     * comprising a 26-bit 'digit' combined with up to 30 higher
     * 'carry' bits.
     *
     * d stores a single limb of a 130-bit value in reg_d3_52 format
     * comprising a 26-bit 'digit' combined with up to 30 higher
     * 'carry' bits.
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1809:

> 1807:   }
> 1808:   void poly1305_reduce_step(AsmGenerator &acc,
> 1809:                             FloatRegister d, FloatRegister s, FloatRegister upper_bits, FloatRegister scratch);

Add comment

    /*
     * Appends instructions to acc that reformat a 130-bit value
     * stored in the low words of u[] in reg_d5_26 format into the
     * registers passed in dest[] in reg_d3_52 format and clamp the
     * result to range [0, 2^130-5].
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1810:

> 1808:   void poly1305_reduce_step(AsmGenerator &acc,
> 1809:                             FloatRegister d, FloatRegister s, FloatRegister upper_bits, FloatRegister scratch);
> 1810:   void poly1305_fully_reduce(Register dest[], const RegPair u[]);

Add comment

    /*
     * Appends instructions to acc that transfer five 26-bit 'digits'
     * of a 130-bit value input in s[] in vec_d5_26 format into three
     * 52-bit 'digits' output in the low words of u[] in gpr_d3_56
     * format.
     */

Also, the first argument in the declaration is d[] but it is u0[] in the
definition. It should probably be u[] but d[] would do.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1812:

> 1810:   void poly1305_fully_reduce(Register dest[], const RegPair u[]);
> 1811:   void poly1305_transfer(const RegPair d[], const FloatRegister s[],
> 1812:                          int lane, FloatRegister vscratch);

Add comment

    /*
     * Copies a 64-bit value from each of the 3 low registers of src[]
     * to the corresponding register in dest[].
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp line 1813:

> 1811:   void poly1305_transfer(const RegPair d[], const FloatRegister s[],
> 1812:                          int lane, FloatRegister vscratch);
> 1813:   void copy_3_regs(const Register dest[], const Register src[]);

Add comment

    /*
     * Adds the 64-bit value in each of the 3 low registers of src[]
     * to the corresponding register in dest[].
     */

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 48:

> 46:   mul(prod._lo, n, m);
> 47:   umulh(prod._hi, n, m);
> 48: }

nit: new line needed here

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 55:

> 53: }
> 54: 
> 55: void MacroAssembler::poly1305_transfer(const RegPair u0[],

This argument should be named u (or possibly d) not u0.

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 105:

> 103:   ubfx(rscratch1, s, lsb, 26);
> 104:   mov(d, S, 0, rscratch1);
> 105: }

nit: new line needed here

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 249:

> 247:                                            const FloatRegister s[],
> 248:                                            const FloatRegister r[],
> 249:                                            const FloatRegister rr[]) {

The comment I suggested for the declaration already defines the layout of r and RR. So, this comment might be redundant.

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 252:

> 250:   // Five limbs of r and rr (5·r) are packed as 32-bit integers into
> 251:   // two 128-bit vectors.
> 252: 

I'm not sure what the next line is meant to explain. Is it needed?

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 253:

> 251:   // two 128-bit vectors.
> 252: 
> 253:   // // (h + c) * r, without carry propagation

The comment below needs to refer to s0, s1 etc rather than m0, m1, etc.

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 300:

> 298:     trn1(u[1], T4S, u[2], u[3]);
> 299: 
> 300:     // The incoming sum is packed into u[0], u[1], u[4]

Better to explain the layout change and include full stops:

    // The incoming sum is packed into u[0], u[1], u[4] in
    // vec_4s3_26 format. u[2] and u[3] are now free.

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 322:

> 320:     sli(s[0], T4S, zero, 26);
> 321:   };
> 322: 

Comment

    // set bit 129 of each value

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 326:

> 324:   _ { addv(s[2], T2D, s[2], scratch1); };
> 325:   _ { sli(s[2], T2D, zero, 32); };
> 326: 

Comment

    // add the current sum into the next input

src/hotspot/cpu/aarch64/macroAssembler_aarch64_poly1305.cpp line 330:

> 328:   _ { addv(s[1], T4S, s[1], u[1]); };
> 329:   _ { addv(s[2], T4S, s[2], u[4]); };
> 330: 

Add comment

    // Interleave the lower and upper pairs of SWORD lanes so
    // that paired values are now in even and odd SWORD lanes
    // i.e. reformat from vec_2d3_26 to vec_2d3_26_I

I know this could be left as an exercise for the reader, but it's better to spell it out as a reminder for anyone who might need to fix the code so they don't have to spend too much effort (re-)familiarising.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7300:

> 7298:   // the ASIMD and the core instructions. We also take care to ensure that
> 7299:   // each generator finishes at about the same time, to maximize the
> 7300:   // distance between instructions which generate and consume data.

We ought to mention here that the parallelism is six-way because the 2 vector instruction streams use 2-way data (SIMD) parallelism.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7331:

> 7329:   public:
> 7330:     RegPair _reg_pairs[3];
> 7331:     RegPairs(RegSetIterator<Register> &it, int n) {

Not sure we need n here as it is always passed as 3 in the latest version of the patch.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7366:

> 7364:     __ pack_26(R[0], R[1], R[2], r_start);
> 7365: 
> 7366:     // Sn is to be the sum of Un and the next block of data

Should this say

     // Sn is to be the sum of Un * r and the next block of data

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7392:

> 7390:     }
> 7391: 
> 7392:     // We're going to use R**6

I think this comment is a tad ...  minimal!

The following would be more helpful

      // The following instructions implement 6 parallel streams of
      // computation. Each stream processes input elements separated
      // by a distance of 6. Hence each stream needs to multiply its
      // accumulated sum by R**6 before adding the next input value.
      // Once all 6 partial sums are computed they constitute a
      // subsequence which is combined using successive multiply
      // by R and add operations. Likewise, any remaining tail (up to
      // 5 extra values) is folded in using multiply by R and
      // add operations.
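
The equivalence this comment relies on can be checked with a small model. The sketch below is an illustration only: it uses a toy modulus (2^31 - 1) in place of 2^130-5 so that products fit in 64 bits, assumes every input block is already less than that modulus, and the names `poly_serial`, `poly_streams` and `mulmod` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy modulus (2^31 - 1) standing in for 2^130 - 5 so products fit in 64 bits.
constexpr uint64_t P = 2147483647;

inline uint64_t mulmod(uint64_t a, uint64_t b) { return (a % P) * (b % P) % P; }

// Reference: h = (h + m) * r for every block, in order.
inline uint64_t poly_serial(const std::vector<uint64_t>& m, uint64_t r) {
  uint64_t h = 0;
  for (uint64_t x : m) h = mulmod(h + x, r);
  return h;
}

// Same result computed with `ways` independent partial sums.
inline uint64_t poly_streams(const std::vector<uint64_t>& m, uint64_t r,
                             std::size_t ways = 6) {
  uint64_t rn = 1;                       // r ** ways
  for (std::size_t i = 0; i < ways; i++) rn = mulmod(rn, r);

  std::size_t bulk = m.size() - m.size() % ways;
  std::vector<uint64_t> s(ways, 0);
  for (std::size_t i = 0; i < bulk; i += ways)
    for (std::size_t j = 0; j < ways; j++)
      s[j] = (mulmod(s[j], rn) + m[i + j]) % P;  // stream j: blocks j, j+ways, ...

  uint64_t h = 0;
  for (std::size_t j = 0; j < ways; j++)
    h = mulmod(h + s[j], r);             // combine the partial sums
  for (std::size_t i = bulk; i < m.size(); i++)
    h = mulmod(h + m[i], r);             // fold in the tail serially
  return h;
}
```

Stream j ends up holding its blocks weighted by powers of r**6, and the combining loop supplies the remaining factor r**(6-j), so both functions return the same value for any input length.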

-------------

Changes requested by adinn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16812#pullrequestreview-1750837151
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1442998908
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1444906166
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1444928221
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445009971
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1444981051
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445051387
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445057606
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445065854
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445068969
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445072860
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445082701
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445085266
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445092662
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445095710
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445868262
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445871377
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445885641
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445886003
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445897957
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445928677
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445938026
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445941212
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445942640
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1406413151
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1442032172
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1406413966
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445947233
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445946608
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445946018
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1409086567
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1444448111
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445949487
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445953540
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1442773290
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1442040097
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445964681
PR Review Comment: https://git.openjdk.org/jdk/pull/16812#discussion_r1445973376