JDK-8214239 (?): Missing x86_64.ad patterns for clearing and setting long vector bits
B. Blaser
bsrbnd at gmail.com
Thu Nov 7 19:30:09 UTC 2019
Hi Vladimir, Sandhya and John,
Thanks for your respective answers.
The suggested fix focuses on x86_64 and on pure 64-bit immediates, which
means that all other cases are left unchanged, as shown by the initial
benchmark. For example:
andq &= ~MASK00;
orq |= MASK00;
would still give:
03c  andq  [RSI + #16 (8-bit)], #-2  # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
041  orq   [RSI + #24 (8-bit)], #1   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
046  ...
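(As an aside, those immediates still fit a single andq/orq because they
survive sign extension from 32 bits. A minimal sketch of that check, with an
illustrative helper name rather than the actual C2 predicate:)

// An immediate can be encoded directly by ANDQ/ORQ only if sign-extending
// its low 32 bits gives back the original value.
static boolean fitsSimm32(long imm) {
    return imm == (int) imm;
}
// ~MASK00 == -2L            -> fitsSimm32(-2L) == true   (single andq above)
//  MASK00 ==  1L            -> fitsSimm32(1L)  == true   (single orq above)
// ~MASK63 == Long.MAX_VALUE -> fitsSimm32(...) == false  (movq + andq, see below)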
Now, the interesting point is that pure 64-bit immediates (which
cannot be treated as sign-extended 8/32-bit values) are assembled
using two instructions (not one), because AND/OR cannot take such an
immediate directly. For example:
andq &= ~MASK63;
orq |= MASK63;
gives:
03e  movq  R10, #9223372036854775807   # long
048  andq  [RSI + #16 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq
04c  movq  R10, #-9223372036854775808  # long
056  orq   [RSI + #24 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq
05a  ...
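(For reference, the condition such a BTS/BTR pattern has to match is simply
that the OR mask, or the complement of the AND mask, has exactly one bit set;
the bit index is then the log2(#...) shown in the listings below. A rough
sketch, with illustrative names rather than the actual x86_64.ad predicates:)

// A memory OR with a single-bit mask can become BTS; a memory AND whose
// mask clears a single bit can become BTR.
static boolean isSingleBit(long imm) {
    return imm != 0 && (imm & (imm - 1)) == 0;   // exactly one bit set
}
static int bitIndex(long imm) {
    return Long.numberOfTrailingZeros(imm);      // the log2(#...) in the disassembly
}
// orq  |=  MASK63:  isSingleBit(1L << 63)        -> btsq [dst], 63
// andq &= ~MASK63:  isSingleBit(~Long.MAX_VALUE) -> btrq [dst], 63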
So, even though Sandhya mentioned a better throughput for AND/OR, the
additional MOV cost (I didn't find it in Table C-17, but I assume
something close to MOVS/Z, i.e. latency = 1 and throughput = 0.25) seems
to tip the balance in favor of a single BTR/BTS instruction, as shown by
the initial benchmark.
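As a back-of-the-envelope check with those figures: movq (~0.25) plus orq
(~0.25) adds up to roughly the same reciprocal throughput as a single btsq
(~0.5), so the change should be at worst throughput-neutral, while still
saving one instruction, the R10 scratch register and 8 bytes of code per
field (14 vs. 6 bytes, going by the offsets in the listings).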
However, as John suggested, I tried another benchmark which focuses on
throughput, to make sure there isn't any regression in such situations:
private long orq63, orq62, orq61, orq60;

@Benchmark
public void throughput(Blackhole bh) {
    for (int i = 0; i < COUNT; i++) {
        orq63 = orq62 = orq61 = orq60 = 0;
        bh.consume(testTp());
    }
}

private long testTp() {
    orq63 |= MASK63;
    orq62 |= MASK62;
    orq61 |= MASK61;
    orq60 |= MASK60;
    return 0L;
}
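(For completeness, the masks used above are just single high bits matching
the immediates in the disassembly below; COUNT is an iteration count whose
actual value isn't shown here:)

private static final long MASK60 = 1L << 60;
private static final long MASK61 = 1L << 61;
private static final long MASK62 = 1L << 62;
private static final long MASK63 = 1L << 63;
private static final int  COUNT  = 10_000;   // illustrative value only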
Before, we had:
03e  movq  R10, #-9223372036854775808  # long
048  orq   [RSI + #32 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
04c  movq  R10, #4611686018427387904   # long
056  orq   [RSI + #40 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
05a  movq  R10, #2305843009213693952   # long
064  orq   [RSI + #48 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
068  movq  R10, #1152921504606846976   # long
072  orq   [RSI + #56 (8-bit)], R10    # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
Benchmark                  Mode  Cnt      Score      Error  Units
BitSetAndReset.throughput  avgt    9  25912.455 ± 2527.041  ns/op
And after, we would have:
03c  btsq  [RSI + #32 (8-bit)], log2(#-9223372036854775808)  # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq63
042  btsq  [RSI + #40 (8-bit)], log2(#4611686018427387904)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq62
048  btsq  [RSI + #48 (8-bit)], log2(#2305843009213693952)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq61
04e  btsq  [RSI + #56 (8-bit)], log2(#1152921504606846976)   # long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.orq60
Benchmark                  Mode  Cnt      Score      Error  Units
BitSetAndReset.throughput  avgt    9  25803.195 ± 2434.009  ns/op
Fortunately, we still see a tiny performance gain, along with the large
code-size reduction and the saved scratch register.
Should we go ahead with this optimization? If so, I'll post an RFR with
Vladimir's requested changes soon.
Thanks,
Bernard
On Thu, 7 Nov 2019 at 02:02, John Rose <john.r.rose at oracle.com> wrote:
>
> I recently saw LLVM compile a classification switch into a really tidy BTR instruction,
> something like this:
>
> switch (ch) {
> case ';': case '/': case '.': case '[': return 0;
> default: return 1;
> }
> =>
> … range check …
> movabsq 0x200000002003, %rcx
> btq %rdi, %rcx
>
> It made me wish for this change, plus some more to switch itself.
> Given Sandhya’s report, though, BTR may only be helpful in limited
> cases. In the case above, it subsumes a shift instruction.
>
> Bernard’s JMH experiment suggests something else is going on besides
> the throughput difference which Sandhya cites. Maybe it’s a benchmark
> artifact, or maybe it’s a good effect from smaller code. I suggest jamming
> more back-to-back BTRs together, to see if the throughput effect appears.
>
> — John
>
> On Nov 6, 2019, at 4:34 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
> >
> > Hi Vladimir/Bernard,
> >
> >
> >
> > I don’t see any restrictions/limitations on these instructions other than the fact that the “long” operation is only supported on 64-bit format as usual so should be restricted to 64-bit JVM only.
> >
> > The code size improvement that Bernard demonstrates is significant for operation on longs.
> >
> > It looks like the throughput for AND/OR is better than BTR/BTS (0.25 vs 0.5) though. Please refer Table C-17 in the document below:
>