JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
B. Blaser
bsrbnd at gmail.com
Thu Nov 7 19:30:09 UTC 2019
Hi Vladimir, Sandhya and John,
Thanks for your respective answers.
The suggested fix focuses on x86_64 and on pure 64-bit immediates, which
means that all other cases are left unchanged, as shown by the initial
benchmark. For example:
andq &= ~MASK00;
orq |= MASK00;
would still give:
03c  andq  [RSI + #16 (8-bit)], #-2  # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
041  orq   [RSI + #24 (8-bit)], #1   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
046  ...
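(As an aside, those immediates still fit a single andq/orq because they
survive sign extension from 32 bits. A minimal sketch of that check, with an
illustrative helper name rather than the actual C2 predicate:)

// An immediate can be encoded directly by ANDQ/ORQ only if sign-extending
// its low 32 bits gives back the original value.
static boolean fitsSimm32(long imm) {
    return imm == (int) imm;
}
// ~MASK00 == -2L            -> fitsSimm32(-2L) == true   (single andq above)
//  MASK00 ==  1L            -> fitsSimm32(1L)  == true   (single orq above)
// ~MASK63 == Long.MAX_VALUE -> fitsSimm32(...) == false  (movq + andq, see below)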
Now, the interesting point is that pure 64-bit immediates (which
cannot be treated as sign-extended 8/32-bit values) are assembled
using two instructions (not one), because AND/OR cannot take such an
immediate directly. For example:
andq &= ~MASK63;
orq |= MASK63;
gives:
03e  movq  R10, #9223372036854775807   # long
048  andq  [RSI + #16 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
04c  movq  R10, #-9223372036854775808  # long
056  orq   [RSI + #24 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
05a  ...
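(For reference, the condition such a BTS/BTR pattern has to match is simply
that the OR mask, or the complement of the AND mask, has exactly one bit set;
the bit index is then the log2(#...) shown in the listings below. A rough
sketch, with illustrative names rather than the actual x86_64.ad predicates:)

// A memory OR with a single-bit mask can become BTS; a memory AND whose
// mask clears a single bit can become BTR.
static boolean isSingleBit(long imm) {
    return imm != 0 && (imm & (imm - 1)) == 0;   // exactly one bit set
}
static int bitIndex(long imm) {
    return Long.numberOfTrailingZeros(imm);      // the log2(#...) in the disassembly
}
// orq  |=  MASK63:  isSingleBit(1L << 63)        -> btsq [dst], 63
// andq &= ~MASK63:  isSingleBit(~Long.MAX_VALUE) -> btrq [dst], 63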
So, even though Sandhya mentioned a better throughput for AND/OR, the
additional MOV cost (I didn't find it in Table C-17, but I assume
something close to MOVS/Z, i.e. latency = 1 and throughput = 0.25) seems
to tip the balance in favor of a single BTR/BTS instruction, as shown by
the initial benchmark.
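As a back-of-the-envelope check with those figures: movq (~0.25) plus orq
(~0.25) adds up to roughly the same reciprocal throughput as a single btsq
(~0.5), so the change should be at worst throughput-neutral, while still
saving one instruction, the R10 scratch register and 8 bytes of code per
field (14 vs. 6 bytes, going by the offsets in the listings).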
However, as John suggested, I tried another benchmark which focuses on
throughput, to make sure there isn't any regression in such situations:
private long orq63, orq62, orq61, orq60;

@Benchmark
public void throughput(Blackhole bh) {
    for (int i = 0; i < COUNT; i++) {
        orq63 = orq62 = orq61 = orq60 = 0;
        bh.consume(testTp());
    }
}

private long testTp() {
    orq63 |= MASK63;
    orq62 |= MASK62;
    orq61 |= MASK61;
    orq60 |= MASK60;
    return 0L;
}
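(For completeness, the masks used above are just single high bits matching
the immediates in the disassembly below; COUNT is an iteration count whose
actual value isn't shown here:)

private static final long MASK60 = 1L << 60;
private static final long MASK61 = 1L << 61;
private static final long MASK62 = 1L << 62;
private static final long MASK63 = 1L << 63;
private static final int  COUNT  = 10_000;   // illustrative value only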
Before, we had:
03e  movq  R10, #-9223372036854775808  # long
048  orq   [RSI + #32 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
04c  movq  R10, #4611686018427387904   # long
056  orq   [RSI + #40 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
05a  movq  R10, #2305843009213693952   # long
064  orq   [RSI + #48 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
068  movq  R10, #1152921504606846976   # long
072  orq   [RSI + #56 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
Benchmark                  Mode  Cnt      Score      Error  Units
BitSetAndReset.throughput  avgt    9  25912.455 ± 2527.041  ns/op
And after, we would have:
03c  btsq  [RSI + #32 (8-bit)], log2(#-9223372036854775808)  # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
042  btsq  [RSI + #40 (8-bit)], log2(#4611686018427387904)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
048  btsq  [RSI + #48 (8-bit)], log2(#2305843009213693952)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
04e  btsq  [RSI + #56 (8-bit)], log2(#1152921504606846976)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
Benchmark                  Mode  Cnt      Score      Error  Units
BitSetAndReset.throughput  avgt    9  25803.195 ± 2434.009  ns/op
Fortunately, we still see a tiny performance gain, along with the large
code-size reduction and the saved scratch register.
Should we go ahead with this optimization? If so, I'll post an RFR with
Vladimir's requested changes soon.
Thanks,
Bernard
On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>
> I recently saw LLVM compile a classification switch into a really tidy BTR instruction,
> something like this:
>
> switch (ch) {
> case ';': case '/': case '.': case '[': return 0;
> default: return 1;
> }
> =>
> … range check …
> movabsq 0x200000002003, %rcx
> btq %rdi, %rcx
>
> It made me wish for this change, plus some more to switch itself.
> Given Sandhya’s report, though, BTR may only be helpful in limited
> cases. In the case above, it subsumes a shift instruction.
>
> Bernard’s JMH experiment suggests something else is going on besides
> the throughput difference which Sandhya cites. Maybe it’s a benchmark
> artifact, or maybe it’s a good effect from smaller code. I suggest jamming
> more back-to-back BTRs together, to see if the throughput effect appears.
>
> — John
>
> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
> >
> > Hi Vladimir/Bernard,
> >
> >
> >
> > I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported on 64-bit format as usual so should be restricted to 64-bit JVM only.
> >
> > The code size improvement that Bernard demonstrates is significant for operation on longs.
> >
> > It looks like the throughput for AND/OR is better than BTR/BTS (0.25 vs 0.5) though. Please refer Table C-17 in the document below:
>