RFR 8214239 (S): Missing x86_64.ad patterns for clearing and setting long vector bits

Tue Nov 12 22:11:13 UTC 2019

> http://cr.openjdk.java.net/~bsrbnd/jdk8214239/webrev.01/

Looks good.

PS: it would be nice to be able to reference instruction arguments
directly from predicates (similar to how this is supported in the
ins_encode section).

Then:

+  predicate(log2_long(~n->in(3)->in(2)->get_long()) > 30);
+  predicate(log2_long( n->in(3)->in(2)->get_long()) > 31);

could be turned into:

+  predicate(log2_long(~$con->get_long()) > 30);
+  predicate(log2_long( $con->get_long()) > 31);

or even:

+  predicate(log2_long(~$con$$constant) > 30);
+  predicate(log2_long( $con$$constant) > 31);
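
For reference, here is a minimal standalone sketch of which constants the
two predicates above accept once combined with the power-of-2 checks on the
operands. It is only an illustration: is_pow2, log2_u64, btr_candidate and
bts_candidate are hypothetical stand-ins written for this example, not the
HotSpot helpers, and the real predicates inspect the matched node rather
than a plain integer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for is_power_of_2_long(): true when exactly one bit is set.
constexpr bool is_pow2(uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Hypothetical stand-in for log2_long(): index of the highest set bit.
// Returns 0 for x == 0 or x == 1; callers check is_pow2() first.
constexpr int log2_u64(uint64_t x) {
    return x <= 1 ? 0 : 1 + log2_u64(x >> 1);
}

// AND mask case: ~con must be a single bit, and that bit must be above bit 30.
constexpr bool btr_candidate(int64_t con) {
    return is_pow2(~static_cast<uint64_t>(con)) &&
           log2_u64(~static_cast<uint64_t>(con)) > 30;
}

// OR mask case: con must be a single bit, and that bit must be above bit 31.
constexpr bool bts_candidate(int64_t con) {
    return is_pow2(static_cast<uint64_t>(con)) &&
           log2_u64(static_cast<uint64_t>(con)) > 31;
}

// Compile-time checks of the boundary cases.
static_assert(btr_candidate(~(int64_t{1} << 31)), "clearing bit 31 is accepted");
static_assert(!btr_candidate(~(int64_t{1} << 30)), "clearing bit 30 is rejected");
static_assert(bts_candidate(int64_t{1} << 32), "setting bit 32 is accepted");
static_assert(!bts_candidate(int64_t{1} << 31), "setting bit 31 is rejected");
```

Under these assumptions, the AND form accepts cleared-bit positions 31..63
and the OR form accepts set-bit positions 32..63.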

Best regards,
Vladimir Ivanov

> 
> I've pushed it to jdk/submit as second changeset on branch
> "JDK-8214239" and tests are OK:
> 
> http://hg.openjdk.java.net/jdk/submit/rev/f961f7a454e4
> 
> Any feedback is welcome.
> 
> Thanks,
> Bernard
> 
> On Tue, 12 Nov 2019 at 14:32, Vladimir Ivanov
> <vladimir.x.ivanov at oracle.com> wrote:
>>
>> Thanks for the clarifications, Bernard.
>>
>>>>> http://cr.openjdk.java.net/~bsrbnd/jdk8214239/webrev.00/
>>>>
>>>> I don't see cases for non-constant masks John suggested covered. Have
>>>> you tried to implement them? Any problems encountered or did you just
>>>> leave them for future improvement?
>>>
>>> I haven't experimented with non-constant masks yet, which is why I
>>> left them for future improvement (as I told John).
>>
>> Sounds good.
>>
>>
>>>> Why do you limit the optimization to bits in upper half? Is it because
>>>> ordinary andq/orq instructions work well for the rest? If that's the
>>>> case, it deserves a comment.
>>>
>>> Going purely by the specification (the Intel optimization manual that
>>> Sandhya pointed me to), AND/OR and BTR/BTS have the same latency=1,
>>> with slightly better throughput for the former. When experimenting
>>> with values of 32 bits or fewer, I didn't observe much difference, or
>>> only an imperceptible one in favor of AND/OR. But with pure 64-bit
>>> values, the benefit is much more evident because BTR/BTS replaces both
>>> a MOV and an AND/OR, which is simply better going by the specification
>>> (latency=1 for BTR/BTS vs latency=1+1 for MOV + AND/OR). So, I'll
>>> update the comments as follows:
>>>
>>> // n should be a pure 64-bit power of 2 immediate because AND/OR works
>>> // well enough for 8/32-bit values.
>>> // n should be a pure 64-bit immediate given that not(n) is a power of
>>> // 2 because AND/OR works well enough for 8/32-bit values.
>>
>> Looks good.
>>
>>>
>>>> (immPow2NotL is a bit misleading: I read it as "power of 2, but not a
>>>> long". What do you think about immL_NegPow2/immL_Pow2? Not sure how to
>>>> encode that it's > 2^32, but I would just skip it for now.)
>>>
>>> I agree with immL_NotPow2/immL_Pow2, for the encoding, see below.
>>
>> One idea to try: you can move "log2_long(n->get_long()) > ..." check
>> from operand declaration to the instruction.
>>
>> operand immL_Pow2() %{
>>     // ...
>>     predicate(is_power_of_2_long(n->get_long()));
>>     ...
>>
>> operand immL_NotPow2() %{
>>     // ...
>>     predicate(is_power_of_2_long(~n->get_long()));
>>     ...
>>
>> instruct btrL_mem_imm(memory dst, immL_NotPow2 con, rFlagsReg cr) %{
>>     predicate(log2_long(~in(2)->in(2)->get_long()) > 30);
>>     match(Set dst (StoreL dst (AndL (LoadL dst) con)));
>> ...
>>
>> instruct btsL_mem_imm(memory dst, immL_Pow2 con, rFlagsReg cr) %{
>>     predicate(log2_long(in(2)->in(2)->get_long()) > 31);
>>     match(Set dst (StoreL dst (OrL (LoadL dst) con)));
>> ...
>>
>> It looks more natural (although it also requires more code) to do such
>> operation-specific dispatching on instructions rather than on operands.
>>
>> Best regards,
>> Vladimir Ivanov