JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Nov 7 19:51:08 UTC 2019
I agree with you, Bernard.
I think throughput is limited by memory accesses, which are the same in both cases. But the
code-size reduction is a very nice improvement. We can squeeze more code into the CPU's
instruction buffer, which is very good for small loops.
Please send an official RFR and submit the change to testing. It would also be nice to have a test
which verifies the result of these operations.
Thanks,
Vladimir
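A minimal sketch of such a verification test, assuming a plain standalone check with illustrative names (the real test would follow the repository's jtreg conventions):

```java
// Sketch: verify the result of single-bit set/clear on longs, i.e. the
// OR/AND patterns that the new rules would match to btsq/btrq.
public class BitSetAndResetCheck {
    public static void main(String[] args) {
        for (int bit = 0; bit < 64; bit++) {
            long mask = 1L << bit;

            long set = 0L;
            set |= mask;               // candidate for btsq
            if (set != mask)
                throw new AssertionError("set failed for bit " + bit);

            long cleared = -1L;
            cleared &= ~mask;          // candidate for btrq
            if (cleared != ~mask)
                throw new AssertionError("clear failed for bit " + bit);
        }
        System.out.println("ok");
    }
}
```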
On 11/7/19 11:30 AM, B. Blaser wrote:
> Hi Vladimir, Sandhya and John,
>
> Thanks for your respective answers.
>
> The suggested fix focuses on x86_64 and pure 64-bit immediates which
> means that all other cases are left unchanged as shown by the initial
> benchmark, for example:
>
> andq &= ~MASK00;
> orq |= MASK00;
>
> would still give:
>
> 03c andq [RSI + #16 (8-bit)], #-2 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 041 orq [RSI + #24 (8-bit)], #1 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 046 ...
>
> Now, the interesting point is that pure 64-bit immediates (which
> cannot be represented as sign-extended 8/32-bit values) are assembled
> using two instructions (not one), because AND/OR cannot take such an
> immediate directly, for example:
>
> andq &= ~MASK63;
> orq |= MASK63;
>
> gives:
>
> 03e movq R10, #9223372036854775807 # long
> 048 andq [RSI + #16 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 04c movq R10, #-9223372036854775808 # long
> 056 orq [RSI + #24 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 05a ...
>
> So, even though Sandhya mentioned a better throughput for AND/OR, the
> additional MOV cost (I didn't find it in Table C-17, but I assume
> something close to MOVS/Z with latency=1/throughput=0.25) seems to
> favor a single BTR/BTS instruction, as shown by the initial
> benchmark.
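To spell out the immediate classes involved (the helper names below are illustrative, not the actual matcher predicates), the distinction is roughly:

```java
// Sketch: which long immediates each encoding can handle.
public class ImmediateClasses {
    // Fits a sign-extended 32-bit immediate, so andq/orq take it directly
    // (e.g. ~MASK00 == -2).
    static boolean fitsSimm32(long imm) {
        return imm == (long) (int) imm;
    }
    // Exactly one bit set: orq |= imm is a candidate for btsq
    // (e.g. MASK63 == 1L << 63).
    static boolean singleBitSet(long imm) {
        return Long.bitCount(imm) == 1;
    }
    // Exactly one bit clear: andq &= imm is a candidate for btrq
    // (e.g. ~MASK63 == Long.MAX_VALUE).
    static boolean singleBitClear(long imm) {
        return Long.bitCount(~imm) == 1;
    }
    public static void main(String[] args) {
        // Pure 64-bit immediates fail the simm32 check, hence the
        // movq + andq/orq sequence today.
        System.out.println(fitsSimm32(Long.MAX_VALUE));    // false
        System.out.println(singleBitClear(Long.MAX_VALUE)); // true
    }
}
```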
>
> However, as John suggested, I tried another benchmark which focuses on
> the throughput to make sure there isn't any regression in such
> situations:
>
> private long orq63, orq62, orq61, orq60;
>
> @Benchmark
> public void throughput(Blackhole bh) {
>     for (int i = 0; i < COUNT; i++) {
>         orq63 = orq62 = orq61 = orq60 = 0;
>         bh.consume(testTp());
>     }
> }
>
> private long testTp() {
>     orq63 |= MASK63;
>     orq62 |= MASK62;
>     orq61 |= MASK61;
>     orq60 |= MASK60;
>     return 0L;
> }
>
> Before, we had:
>
> 03e movq R10, #-9223372036854775808 # long
> 048 orq [RSI + #32 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 04c movq R10, #4611686018427387904 # long
> 056 orq [RSI + #40 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 05a movq R10, #2305843009213693952 # long
> 064 orq [RSI + #48 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 068 movq R10, #1152921504606846976 # long
> 072 orq [RSI + #56 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25912.455 ± 2527.041 ns/op
>
> And after, we would have:
>
> 03c btsq [RSI + #32 (8-bit)], log2(#-9223372036854775808) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 042 btsq [RSI + #40 (8-bit)], log2(#4611686018427387904) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 048 btsq [RSI + #48 (8-bit)], log2(#2305843009213693952) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 04e btsq [RSI + #56 (8-bit)], log2(#1152921504606846976) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25803.195 ± 2434.009 ns/op
>
> Fortunately, we still see a tiny performance gain along with the large
> size reduction and register saving.
> Should we go ahead with this optimization? If so, I'll post an RFR with
> Vladimir's requested changes soon.
>
> Thanks,
> Bernard
>
> On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>>
>> I recently saw LLVM compile a classification switch into a really tidy BTR instruction,
>> something like this:
>>
>> switch (ch) {
>>     case ';': case '/': case '.': case '[': return 0;
>>     default: return 1;
>> }
>> =>
>> … range check …
>> movabsq $0x200000002003, %rcx
>> btq %rdi, %rcx
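Written out in Java for reference (my reconstruction: the constant 0x200000002003 has bits 0, 1, 13 and 45 set, i.e. the offsets of '.', '/', ';' and '[' from '.'):

```java
// Sketch: the classification above as one 64-bit mask plus a bit test,
// the shape that lowers to movabsq + btq.
public class CharClassify {
    static int classify(char ch) {
        int off = ch - '.';                 // offset from the lowest member
        if (off < 0 || off > 63) return 1;  // range check before the bit test
        return ((0x200000002003L >>> off) & 1) != 0 ? 0 : 1;
    }
    public static void main(String[] args) {
        System.out.println(classify(';'));  // 0
        System.out.println(classify('a'));  // 1
    }
}
```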
>>
>> It made me wish for this change, plus some more to switch itself.
>> Given Sandhya’s report, though, BTR may only be helpful in limited
>> cases. In the case above, it subsumes a shift instruction.
>>
>> Bernard’s JMH experiment suggests something else is going on besides
>> the throughput difference which Sandhya cites. Maybe it’s a benchmark
>> artifact, or maybe it’s a good effect from smaller code. I suggest jamming
>> more back-to-back BTRs together, to see if the throughput effect appears.
>>
>> — John
>>
>> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir/Bernard,
>>>
>>>
>>>
>>> I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported in 64-bit mode, as usual, so it should be restricted to the 64-bit JVM.
>>>
>>> The code size improvement that Bernard demonstrates is significant for operation on longs.
>>>
>>> It looks like the throughput for AND/OR is better than that of BTR/BTS (0.25 vs 0.5), though. Please refer to Table C-17 in the document below:
>>