JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Nov 7 19:51:08 UTC 2019
I agree with you, Bernard.
I think throughput is limited by memory accesses, which are the same in both cases. But the
code-size reduction is a very nice improvement. We can squeeze more code into the CPU's
instruction buffer, which is very good for small loops.
Please send an official RFR and submit the change to testing. It would also be nice to have a test
which verifies the result of these operations.
Thanks,
Vladimir
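A minimal sketch of such a verification test, assuming a plain standalone check with illustrative names (the real test would follow the repository's jtreg conventions):

```java
// Sketch: verify the result of single-bit set/clear on longs, i.e. the
// OR/AND patterns that the new rules would match to btsq/btrq.
public class BitSetAndResetCheck {
    public static void main(String[] args) {
        for (int bit = 0; bit < 64; bit++) {
            long mask = 1L << bit;

            long set = 0L;
            set |= mask;               // candidate for btsq
            if (set != mask)
                throw new AssertionError("set failed for bit " + bit);

            long cleared = -1L;
            cleared &= ~mask;          // candidate for btrq
            if (cleared != ~mask)
                throw new AssertionError("clear failed for bit " + bit);
        }
        System.out.println("ok");
    }
}
```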
On 11/7/19 11:30 AM, B. Blaser wrote:
> Hi Vladimir, Sandhya and John,
>
> Thanks for your respective answers.
>
> The suggested fix focuses on x86_64 and pure 64-bit immediates which
> means that all other cases are left unchanged as shown by the initial
> benchmark, for example:
>
> andq &= ~MASK00;
> orq |= MASK00;
>
> would still give:
>
> 03c andq [RSI + #16 (8-bit)], #-2 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 041 orq [RSI + #24 (8-bit)], #1 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 046 ...
>
> Now, the interesting point is that pure 64-bit immediates (which
> cannot be represented as sign-extended 8/32-bit values) are assembled
> using two instructions (not one), because AND/OR cannot take such an
> immediate directly, for example:
>
> andq &= ~MASK63;
> orq |= MASK63;
>
> gives:
>
> 03e movq R10, #9223372036854775807 # long
> 048 andq [RSI + #16 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
> 04c movq R10, #-9223372036854775808 # long
> 056 orq [RSI + #24 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
> 05a ...
>
> So, even though Sandhya mentioned a better throughput for AND/OR, the
> additional MOV cost (I didn't find it in Table C-17, but I assume
> something close to MOVS/Z with latency=1/throughput=0.25) seems to
> favor a single BTR/BTS instruction, as shown by the initial
> benchmark.
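To spell out the immediate classes involved (the helper names below are illustrative, not the actual matcher predicates), the distinction is roughly:

```java
// Sketch: which long immediates each encoding can handle.
public class ImmediateClasses {
    // Fits a sign-extended 32-bit immediate, so andq/orq take it directly
    // (e.g. ~MASK00 == -2).
    static boolean fitsSimm32(long imm) {
        return imm == (long) (int) imm;
    }
    // Exactly one bit set: orq |= imm is a candidate for btsq
    // (e.g. MASK63 == 1L << 63).
    static boolean singleBitSet(long imm) {
        return Long.bitCount(imm) == 1;
    }
    // Exactly one bit clear: andq &= imm is a candidate for btrq
    // (e.g. ~MASK63 == Long.MAX_VALUE).
    static boolean singleBitClear(long imm) {
        return Long.bitCount(~imm) == 1;
    }
    public static void main(String[] args) {
        // Pure 64-bit immediates fail the simm32 check, hence the
        // movq + andq/orq sequence today.
        System.out.println(fitsSimm32(Long.MAX_VALUE));    // false
        System.out.println(singleBitClear(Long.MAX_VALUE)); // true
    }
}
```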
>
> However, as John suggested, I tried another benchmark which focuses on
> the throughput to make sure there isn't any regression in such
> situations:
>
> private long orq63, orq62, orq61, orq60;
>
> @Benchmark
> public void throughput(Blackhole bh) {
>     for (int i = 0; i < COUNT; i++) {
>         orq63 = orq62 = orq61 = orq60 = 0;
>         bh.consume(testTp());
>     }
> }
>
> private long testTp() {
>     orq63 |= MASK63;
>     orq62 |= MASK62;
>     orq61 |= MASK61;
>     orq60 |= MASK60;
>     return 0L;
> }
>
> Before, we had:
>
> 03e movq R10, #-9223372036854775808 # long
> 048 orq [RSI + #32 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 04c movq R10, #4611686018427387904 # long
> 056 orq [RSI + #40 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 05a movq R10, #2305843009213693952 # long
> 064 orq [RSI + #48 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 068 movq R10, #1152921504606846976 # long
> 072 orq [RSI + #56 (8-bit)], R10 # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25912.455 ± 2527.041 ns/op
>
> And after, we would have:
>
> 03c btsq [RSI + #32 (8-bit)], log2(#-9223372036854775808) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
> 042 btsq [RSI + #40 (8-bit)], log2(#4611686018427387904) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
> 048 btsq [RSI + #48 (8-bit)], log2(#2305843009213693952) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
> 04e btsq [RSI + #56 (8-bit)], log2(#1152921504606846976) # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
>
> Benchmark Mode Cnt Score Error Units
> BitSetAndReset.throughput avgt 9 25803.195 ± 2434.009 ns/op
>
> Fortunately, we still see a tiny performance gain along with the large
> size reduction and register saving.
> Should we go ahead with this optimization? If so, I'll post an RFR with
> Vladimir's requested changes soon.
>
> Thanks,
> Bernard
>
> On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>>
>> I recently saw LLVM compile a classification switch into a really tidy BTR instruction,
>> something like this:
>>
>> switch (ch) {
>>     case ';': case '/': case '.': case '[': return 0;
>>     default: return 1;
>> }
>> =>
>> … range check …
>> movabsq $0x200000002003, %rcx
>> btq %rdi, %rcx
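Written out in Java for reference (my reconstruction: the constant 0x200000002003 has bits 0, 1, 13 and 45 set, i.e. the offsets of '.', '/', ';' and '[' from '.'):

```java
// Sketch: the classification above as one 64-bit mask plus a bit test,
// the shape that lowers to movabsq + btq.
public class CharClassify {
    static int classify(char ch) {
        int off = ch - '.';                 // offset from the lowest member
        if (off < 0 || off > 63) return 1;  // range check before the bit test
        return ((0x200000002003L >>> off) & 1) != 0 ? 0 : 1;
    }
    public static void main(String[] args) {
        System.out.println(classify(';'));  // 0
        System.out.println(classify('a'));  // 1
    }
}
```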
>>
>> It made me wish for this change, plus some more to switch itself.
>> Given Sandhya’s report, though, BTR may only be helpful in limited
>> cases. In the case above, it subsumes a shift instruction.
>>
>> Bernard’s JMH experiment suggests something else is going on besides
>> the throughput difference which Sandhya cites. Maybe it’s a benchmark
>> artifact, or maybe it’s a good effect from smaller code. I suggest jamming
>> more back-to-back BTRs together, to see if the throughput effect appears.
>>
>> — John
>>
>> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir/Bernard,
>>>
>>>
>>>
>>> I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported in 64-bit mode, as usual, so it should be restricted to the 64-bit JVM.
>>>
>>> The code size improvement that Bernard demonstrates is significant for operation on longs.
>>>
>>> It looks like the throughput for AND/OR is better than that of BTR/BTS (0.25 vs 0.5), though. Please refer to Table C-17 in the document below:
>>