RFR 8214239 (S): Missing x86_64.ad patterns for clearing and setting long vector bits

Tue Nov 12 12:49:31 UTC 2019

Hi Vladimir Ivanov,

On Tue, 12 Nov 2019 at 10:00, Vladimir Ivanov
<vladimir.x.ivanov at oracle.com> wrote:
>
> Hi Bernard,
>
> Nice improvement!

Thanks.

> > http://cr.openjdk.java.net/~bsrbnd/jdk8214239/webrev.00/
>
> I don't see cases for non-constant masks John suggested covered. Have
> you tried to implement them? Any problems encountered or did you just
> leave them for future improvement?

I didn't experiment with non-constant masks yet, which is why I left
them for future improvements (as told to John).

> ====================================================================
>
> I'd like to see asserts in new MacroAssembler methods that imm8 fits
> into a byte, like:
>
> +void Assembler::btsq(Address dst, int imm8) {
> +  assert(isByte(imm8), "not a byte");
> +  InstructionMark im(this);

I'll do this.

> ====================================================================
>
> src/hotspot/cpu/x86/x86_64.ad:
>
> +operand immPow2L()
> +%{
> +  // n should be a pure 64-bit power of 2 immediate.
> +  predicate(is_power_of_2_long(n->get_long()) &&
> log2_long(n->get_long()) > 31);
>
> +operand immPow2NotL()
> +%{
> +  // n should be a pure 64-bit immediate given that not(n) is a power of 2.
>
> Why do you limit the optimization to bits in upper half? Is it because
> ordinary andq/orq instructions work well for the rest? If that's the
> case, it deserves a comment.

On a pure specification basis (Intel optimization manual that Sandhya
pointed me to), AND/OR and BTR/BTS have the same latency=1 but a
slightly better throughput for the former and when experimenting with
values <= 32-bit, I didn't observed much difference or quite
imperceptibly in favor of AND/OR. But with pure 64-bit values, the
benefit is much more evident because BTR/BTS replaces both a MOV and
an AND/OR which is simply better on specification basis (latency=1 for
BTR/BTS vs latency=1+1 for MOV + AND/OR). So, I'll update the comments
as next:

// n should be a pure 64-bit power of 2 immediate because AND/OR works
well enough for 8/32-bit values.
// n should be a pure 64-bit immediate given that not(n) is a power of
2 because AND/OR works well enough for 8/32-bit values.

> (immPow2NotL is a bit misleading: I read it as "power of 2, but not a
> long". What do you think about immL_NegPow2/immL_Pow2? Not sure how to
> encode that it's > 2^32, but I would just skip it for now.)

I agree with immL_NotPow2/immL_Pow2, for the encoding, see below.

> ====================================================================
>
> +instruct btrL_mem_imm(memory dst, immPow2NotL src, rFlagsReg cr) %{
> +  match(Set dst (StoreL dst (AndL (LoadL dst) src)));
>
> +instruct btsL_mem_imm(memory dst, immPow2L    src, rFlagsReg cr) %{
> +  match(Set dst (StoreL dst (OrL (LoadL dst) src)));
>
> Does it make sense to cover 32-/16-bit cases the same way? (Relates to
> the earlier question on bits from upper half.)

Given my above answer, I don't think this would make sense (but I'll
do some more experiments as part of further enhancement issues as
discussed with John).

> Do you leave out in-register updates because they don't give any
> benefits compared to andq/orq?

Same answer, I'd have to do further experiments as part of complementary issues.
Note that this very one is focusing on the most frequent cases which
will likely to become even more common with final fields trusted as
constants.

> Also, please, use "con" instead of "src". It's easier to read when the
> name reflects that the value is a constant.

Yes, I'll do this.

> ====================================================================
>
> test/hotspot/jtreg/compiler/c2/TestBitSetAndReset.java:
>
> As I understand, the only test case which exercises new code is:
>
>    64     private static void test63() {
>    65         andq &= ~MASK63;
>    66         orq |= MASK63;
>    67     }
>
> Please, at least, add a case for MASK32.

The existing MASK31 exercises the new code because ffff_ffff_7fff_ffff
cannot be written as an AND/OR sign-extended 32-bit value:

03c       btrq    [RSI + #16 (8-bit)], log2(not(#-2147483649))    #
long ! Field: org/openjdk/bench/vm/compiler/BitSetAndReset.andq

See 'immPow2NotL' predicate 'log2_long(~n->get_long()) > 30'.

Note that 8000_0000 is also a corner case as it can be written as a
MOV zero-extended 32-bit value which doesn't seem to be worse than
BTR/BTS on experiment basis:

044       movl    R10, #2147483648    # long (unsigned 32-bit)
04a       orq     [RSI + #24 (8-bit)], R10    # long ! Field:
org/openjdk/bench/vm/compiler/BitSetAndReset.orq

That's why I kept it, see 'immPow2L' predicate 'log2_long(n->get_long()) > 31'.
But I agree that a test case for MASK32 would be nice in this
situation and I'll add it.

>    29  * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions
> -XX:+UnlockDiagnosticVMOptions
>    30  *                   -XX:-TieredCompilation
> -XX:CompileThresholdScaling=0.1 -XX:-Inline
>    31  *
> -XX:CompileCommand=print,compiler/c2/TestBitSetAndReset.test*
>    32  *
> -XX:CompileCommand=compileonly,compiler/c2/TestBitSetAndReset.test*
>    33  *                   compiler.c2.TestBitSetAndReset
>
> Since you explicitly disable tiered, you can just directly set the
> threshold instead (-XX:CompileThreshold=1000).
>
> -XX:-Inline is redundant and you can replace compileonly directive with
> dontinline to speed up the test.

Yes, thanks, I'll do this too.

Best regards,
Bernard

> Best regards,
> Vladimir Ivanov
> >
> > It includes:
> > * Vladimir's requested changes about instruction encoding and testing
> > along with:
> > * John's suggested complementary benchmark following Sandhya's note
> > regarding throughput.
> >
> > Thanks (hotspot:tier1 is OK on Linux/x86_64),
> > Bernard
> >
> > [1] https://bugs.openjdk.java.net/browse/JDK-8214239
> > [2] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-November/035699.html
> >