JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits

Thu Nov 7 21:58:45 UTC 2019

Would you consider adding patterns for non-constant masks also?
It would be something like (And (LShift n) x), etc.
It could be in this set or in a follow-on.
Thanks (says John who always wants more).
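
For concreteness, here is a minimal Java sketch of the source shapes that
produce such non-constant masks. The class and method names are
hypothetical, and whether C2 would turn any of these into a single
BTS/BTR/BT is exactly the open question, not something this sketch shows.

```java
// Hypothetical helper class: source-level shapes with a variable bit index.
class VariableBitOps {
    // x | (1L << n): Or of x with a left-shifted 1
    static long setBit(long x, int n) {
        return x | (1L << n);
    }

    // x & ~(1L << n): And of x with the complement of a left-shifted 1
    static long clearBit(long x, int n) {
        return x & ~(1L << n);
    }

    // (x & (1L << n)) != 0: And of x with a left-shifted 1, then a test
    static boolean testBit(long x, int n) {
        return (x & (1L << n)) != 0;
    }

    public static void main(String[] args) {
        long v = setBit(0L, 63);
        System.out.println(Long.toHexString(v));
        System.out.println(testBit(v, 63));
        System.out.println(Long.toHexString(clearBit(v, 63)));
    }
}
```

Note that Java only uses the low six bits of the shift count for a long
shift, so n and n + 64 select the same bit here.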

> On Nov 7, 2019, at 11:30 AM, B. Blaser <bsrbnd at gmail.com> wrote:
> 
> Hi Vladimir, Sandhya and John,
> 
> Thanks for your respective answers.
> 
> The suggested fix focuses on x86_64 and pure 64-bit immediates which
> means that all other cases are left unchanged as shown by the initial
> benchmark, for example:
> 
> andq &= ~MASK00;
> orq |= MASK00;
> 
> would still give:
> 
> 03c       andq    [RSI + #16 (8-bit)], #-2    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 041       orq     [RSI + #24 (8-bit)], #1    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 046       ...
> 
> Now, the interesting point is that pure 64-bit immediates (which
> cannot be treated as sign-extended 8/32-bit values) are assembled
> using two instructions (not one) because AND/OR cannot be used
> directly in such cases, for example:
> 
> andq &= ~MASK63;
> orq |= MASK63;
> 
> gives:
> 
> 03e       movq    R10, #9223372036854775807    # long
> 048       andq    [RSI + #16 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 04c       movq    R10, #-9223372036854775808    # long
> 056       orq     [RSI + #24 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 05a       ...
> 
> So, even though Sandhya mentioned a better throughput for AND/OR, the
> additional MOV cost (I didn't find it in table C-17 but I assume
> something close to MOVS/Z with latency=1/throughput=0.25) seems to
> tip the balance in favor of a single BTR/BTS instruction, as shown by
> the initial benchmark.
> 
> However, as John suggested, I tried another benchmark which focuses on
> the throughput to make sure there isn't any regression in such
> situations:
> 
>    private long orq63, orq62, orq61, orq60;
> 
>    @Benchmark
>    public void throughput(Blackhole bh) {
>        for (int i=0; i<COUNT; i++) {
>            orq63 = orq62 = orq61 = orq60 = 0;
>            bh.consume(testTp());
>        }
>    }
> 
>    private long testTp() {
>        orq63 |= MASK63;
>        orq62 |= MASK62;
>        orq61 |= MASK61;
>        orq60 |= MASK60;
>        return 0L;
>    }
> 
> Before, we had:
> 
> 03e       movq    R10, #-9223372036854775808    # long
> 048       orq     [RSI + #32 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 04c       movq    R10, #4611686018427387904    # long
> 056       orq     [RSI + #40 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 05a       movq    R10, #2305843009213693952    # long
> 064       orq     [RSI + #48 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 068       movq    R10, #1152921504606846976    # long
> 072       orq     [RSI + #56 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
> 
> Benchmark                  Mode  Cnt      Score      Error  Units
> BitSetAndReset.throughput  avgt    9  25912.455 ± 2527.041  ns/op
> 
> And after, we would have:
> 
> 03c       btsq    [RSI + #32 (8-bit)], log2(#-9223372036854775808)    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 042       btsq    [RSI + #40 (8-bit)], log2(#4611686018427387904)    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 048       btsq    [RSI + #48 (8-bit)], log2(#2305843009213693952)    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 04e       btsq    [RSI + #56 (8-bit)], log2(#1152921504606846976)    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
> 
> Benchmark                  Mode  Cnt      Score      Error  Units
> BitSetAndReset.throughput  avgt    9  25803.195 ± 2434.009  ns/op
> 
> Fortunately, we still see a tiny performance gain along with the large
> size reduction and register saving.
> Should we go ahead with this optimization? If so, I'll post an RFR with
> Vladimir's requested changes soon.
> 
> Thanks,
> Bernard
> 
> On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>> 
>> I recently saw LLVM compile a classification switch into a really tidy BT instruction,
>> something like this:
>> 
>>  switch (ch) {
>>  case ';': case '/': case '.': case '[':  return 0;
>>  default: return 1;
>>  }
>> =>
>>  … range check …
>>  movabsq       $0x200000002003, %rcx
>>  btq   %rdi, %rcx
>> 
>> It made me wish for this change, plus some more to switch itself.
>> Given Sandhya’s report, though, BTR may only be helpful in limited
>> cases.  In the case above, it subsumes a shift instruction.
>> 
>> Bernard’s JMH experiment suggests something else is going on besides
>> the throughput difference which Sandhya cites.  Maybe it’s a benchmark
>> artifact, or maybe it’s a good effect from smaller code.  I suggest jamming
>> more back-to-back BTRs together, to see if the throughput effect appears.
>> 
>> — John
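
The classification switch quoted above can be written by hand as a bit
test in Java, roughly as below. This is a sketch only: the class and
method names are hypothetical, and the base of '.' (46) is inferred from
the mask, whose set bits 0, 1, 13 and 45 line up with '.', '/', ';' and
'[' once 46 is subtracted. The quoted assembly elides the range check, so
that part is an assumption.

```java
// Hypothetical hand-written equivalent of the switch-to-bit-test lowering.
class CharClassifier {
    static final long MASK = 0x200000002003L; // bits 0, 1, 13, 45
    static final int BASE = '.';              // lowest case label (46)

    // The original switch, for reference.
    static int classifySwitch(int ch) {
        switch (ch) {
            case ';': case '/': case '.': case '[':
                return 0;
            default:
                return 1;
        }
    }

    // Range check, then test one bit of the constant mask.
    static int classifyBitTest(int ch) {
        int idx = ch - BASE;
        if (idx < 0 || idx > '[' - BASE) {
            return 1; // outside the range covered by the mask
        }
        // Yields 0 when the bit for this character is set, 1 otherwise.
        return (int) (~(MASK >>> idx) & 1L);
    }

    public static void main(String[] args) {
        for (char c : new char[] {';', '/', '.', '[', 'a', '0'}) {
            System.out.println(c + " " + classifySwitch(c) + " " + classifyBitTest(c));
        }
    }
}
```

The shift by idx is the step that a bit-test instruction with a register
index would subsume.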
>> 
>> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>> 
>>> Hi Vladimir/Bernard,
>>> 
>>> I don’t see any restrictions/limitations on these instructions, other than that the “long” operation is, as usual, only supported in 64-bit mode, so it should be restricted to the 64-bit JVM.
>>> 
>>> The code size improvement that Bernard demonstrates is significant for operation on longs.
>>> 
>>> It looks like the throughput for AND/OR is better than BTR/BTS (0.25 vs 0.5) though. Please refer to Table C-17 in the document below:
>>