JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
Viswanathan, Sandhya
sandhya.viswanathan at intel.com
Fri Nov 8 00:57:28 UTC 2019
Thanks a lot Bernard, for identifying these optimizations.
Best Regards,
Sandhya
-----Original Message-----
From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Vladimir Kozlov
Sent: Thursday, November 07, 2019 11:51 AM
To: B. Blaser <bsrbnd at gmail.com>; John Rose <john.r.rose at oracle.com>
Cc: hotspot-compiler-dev at openjdk.java.net
Subject: Re: JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
I agree with you, Bernard.
I think throughput performance is limited by memory accesses which is the same in both cases. But code reduction is very nice improvement. We can squeeze more code into CPU buffer which is very good for small loops.
Please, send official RFR and to testing. Also would be nice to have a test which verifies result of these operations.
Thanks,
Vladimir
On 11/7/19 11:30 AM, B. Blaser wrote:
> Hi Vladimir, Sandhya and John,
>
> Thanks for your respective answers.
>
> The suggested fix focuses on x86_64 and pure 64-bit immediates which
> means that all other cases are left unchanged as shown by the initial
> benchmark, for example:
>
> andq &= ~MASK00;
> orq |= MASK00;
>
> would still give:
>
> 03c andq [RSI + #16 (8-bit)], #-2 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 041 orq [RSI + #24 (8-bit)], #1 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 046 ...
>
> Now, the interesting point is that pure 64-bit immediates (which
> cannot be treated as sign-extended 8/32-bit values) are assembled
> using two instructions (not one) because AND/OR cannot be used
> directly in such cases, for example:
>
> andq &= ~MASK63;
> orq |= MASK63;
>
> gives:
>
> 03e movq R10, #9223372036854775807 # long
> 048 andq [RSI + #16 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 04c movq R10, #-9223372036854775808 # long
> 056 orq [RSI + #24 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 05a ...
>
> So, even though Sandhya mentioned a better throughput for AND/OR, the
> additional MOV cost (I didn't find it in table C-17 but I assume
> something close to MOVS/Z with latency=1/throughput=0.25) seems to be
> in favor of a sole BTR/BTS instruction as shown by the initial
> benchmark.
>
> However, as John suggested, I tried another benchmark which focuses on
> the throughput to make sure there isn't any regression in such
> situations:
>
> private long orq63, orq62, orq61, orq60;
>
> @Benchmark
> public void throughput(Blackhole bh) {
> for (int i=0; i<COUNT; i++) {
> orq63 = orq62 = orq61 = orq60 = 0;
> bh.consume(testTp());
> }
> }
>
> private long testTp() {
> orq63 |= MASK63;
> orq62 |= MASK62;
> orq61 |= MASK61;
> orq60 |= MASK60;
> return 0L;
> }
>
> Before, we had:
>
> 03e movq R10, #-9223372036854775808 # long
> 048 orq [RSI + #32 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 04c movq R10, #4611686018427387904 # long
> 056 orq [RSI + #40 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 05a movq R10, #2305843009213693952 # long
> 064 orq [RSI + #48 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 068 movq R10, #1152921504606846976 # long
> 072 orq [RSI + #56 (8-bit)], R10 # long ! Field:
> org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25912.455 ± 2527.041 ns/op
>
> And after, we would have:
>
> 03c btsq [RSI + #32 (8-bit)], log2(#-9223372036854775808)
> # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 042 btsq [RSI + #40 (8-bit)], log2(#4611686018427387904) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 048 btsq [RSI + #48 (8-bit)], log2(#2305843009213693952) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 04e btsq [RSI + #56 (8-bit)], log2(#1152921504606846976) #
> long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25803.195 ± 2434.009 ns/op
>
> Fortunately, we still see a tiny performance gain along with the large
> size reduction and register saving.
> Should we go ahead with this optimization? If so, I'll post a RFR with
> Vladimir's requested changes soon.
>
> Thanks,
> Bernard
>
> On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>>
>> I recently saw LLVM compile a classification switch into a really
>> tidy BTR instruction, something like this:
>>
>> switch (ch) {
>> case ';': case '/': case '.': case '[': return 0;
>> default: return 1;
>> }
>> =>
>> … range check …
>> movabsq 0x200000002003, %rcx
>> btq %rdi, %rcx
>>
>> It made me wish for this change, plus some more to switch itself.
>> Given Sandhya’s report, though, BTR may only be helpful in limited
>> cases. In the case above, it subsumes a shift instruction.
>>
>> Bernard’s JMH experiment suggests something else is going on besides
>> the throughput difference which Sandhya cites. Maybe it’s a
>> benchmark artifact, or maybe it’s a good effect from smaller code. I
>> suggest jamming more back-to-back BTRs together, to see if the throughput effect appears.
>>
>> — John
>>
>> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir/Bernard,
>>>
>>>
>>>
>>> I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported on 64-bit format as usual so should be restricted to 64-bit JVM only.
>>>
>>> The code size improvement that Bernard demonstrates is significant for operation on longs.
>>>
>>> It looks like the throughput for AND/OR is better than BTR/BTS (0.25 vs 0.5) though. Please refer Table C-17 in the document below:
>>
More information about the hotspot-compiler-dev
mailing list