JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
John Rose
john.r.rose at oracle.com
Thu Nov 7 21:58:45 UTC 2019
Would you consider adding patterns for non-constant masks also?
It would be something like (And (LShift n) x), etc.
It could be in this set or in an a follow-on.
Thanks (says John who always wants more).
> On Nov 7, 2019, at 11:30 AM, B. Blaser <bsrbnd at gmail.com> wrote:
>
> Hi Vladimir, Sandhya and John,
>
> Thanks for your respective answers.
>
> The suggested fix focuses on x86_64 and pure 64-bit immediates which
> means that all other cases are left unchanged as shown by the initial
> benchmark, for example:
>
> andq &= ~MASK00;
> orq |= MASK00;
>
> would still give:
>
> 03c andq [RSI + #16 (8-bit)], #-2 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 041 orq [RSI + #24 (8-bit)], #1 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 046 ...
>
> Now, the interesting point is that pure 64-bit immediates (which
> cannot be treated as sign-extended 8/32-bit values) are assembled
> using two instructions (not one) because AND/OR cannot be used
> directly in such cases, for example:
>
> andq &= ~MASK63;
> orq |= MASK63;
>
> gives:
>
> 03e movq R10, #9223372036854775807 # long
> 048 andq [RSI + #16 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 04c movq R10, #-9223372036854775808 # long
> 056 orq [RSI + #24 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 05a ...
>
> So, even though Sandhya mentioned a better throughput for AND/OR, the
> additional MOV cost (I didn't find it in table C-17 but I assume
> something close to MOVS/Z with latency=1/throughput=0.25) seems to be
> in favor of a sole BTR/BTS instruction as shown by the initial
> benchmark.
>
> However, as John suggested, I tried another benchmark which focuses on
> the throughput to make sure there isn't any regression in such
> situations:
>
> private long orq63, orq62, orq61, orq60;
>
> @Benchmark
> public void throughput(Blackhole bh) {
> for (int i=0; i<COUNT; i++) {
> orq63 = orq62 = orq61 = orq60 = 0;
> bh.consume(testTp());
> }
> }
>
> private long testTp() {
> orq63 |= MASK63;
> orq62 |= MASK62;
> orq61 |= MASK61;
> orq60 |= MASK60;
> return 0L;
> }
>
> Before, we had:
>
> 03e movq R10, #-9223372036854775808 # long
> 048 orq [RSI + #32 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 04c movq R10, #4611686018427387904 # long
> 056 orq [RSI + #40 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 05a movq R10, #2305843009213693952 # long
> 064 orq [RSI + #48 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 068 movq R10, #1152921504606846976 # long
> 072 orq [RSI + #56 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25912.455 ± 2527.041 ns/op
>
> And after, we would have:
>
> 03c btsq [RSI + #32 (8-bit)], log2(#-9223372036854775808)
> # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 042 btsq [RSI + #40 (8-bit)], log2(#4611686018427387904) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 048 btsq [RSI + #48 (8-bit)], log2(#2305843009213693952) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 04e btsq [RSI + #56 (8-bit)], log2(#1152921504606846976) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25803.195 ± 2434.009 ns/op
>
> Fortunately, we still see a tiny performance gain along with the large
> size reduction and register saving.
> Should we go ahead with this optimization? If so, I'll post a RFR with
> Vladimir's requested changes soon.
>
> Thanks,
> Bernard
>
> On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>>
>> I recently saw LLVM compile a classification switch into a really tidy BTR instruction,
>> something like this:
>>
>> switch (ch) {
>> case ';': case '/': case '.': case '[': return 0;
>> default: return 1;
>> }
>> =>
>> … range check …
>> movabsq 0x200000002003, %rcx
>> btq %rdi, %rcx
>>
>> It made me wish for this change, plus some more to switch itself.
>> Given Sandhya’s report, though, BTR may only be helpful in limited
>> cases. In the case above, it subsumes a shift instruction.
>>
>> Bernard’s JMH experiment suggests something else is going on besides
>> the throughput difference which Sandhya cites. Maybe it’s a benchmark
>> artifact, or maybe it’s a good effect from smaller code. I suggest jamming
>> more back-to-back BTRs together, to see if the throughput effect appears.
>>
>> — John
>>
>> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir/Bernard,
>>>
>>>
>>>
>>> I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported on 64-bit format as usual so should be restricted to 64-bit JVM only.
>>>
>>> The code size improvement that Bernard demonstrates is significant for operation on longs.
>>>
>>> It looks like the throughput for AND/OR is better than BTR/BTS (0.25 vs 0.5) though. Please refer Table C-17 in the document below:
>>
More information about the hotspot-compiler-dev
mailing list