RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5]

Wed Apr 6 18:10:39 UTC 2022

On Sun, 27 Mar 2022 09:40:27 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>> Hi, this patch improves some operations on x86_64:
>> 
>> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
>>   + Bounded operands
>>   + Multiple uops both in fused and unfused domains
>>   + May result in flag stall since the operations have unpredictable flag output
>> 
>> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence:
>> 
>>         xorl dst, dst
>>         sometest
>>         movl tmp, 0x01
>>         cmovlcc dst, tmp
>> 
>>         into:
>> 
>>         xorl dst, dst
>>         sometest
>>         setbcc dst
>> 
>> This sequence does not need a spare register and without any drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output)
>> 
>> - Some small improvements:
>>   + Add memory variances to `tzcnt` and `lzcnt`
>>   + Add memory variances to `rolx` and `rorx`
>>   + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>> 
>> The speedup can be observed for variable shift instructions
>> 
>>         Before:
>>         Benchmark               (size)  Mode  Cnt   Score   Error  Units
>>         Integers.shiftLeft         500  avgt    5   0.836 ± 0.030  us/op
>>         Integers.shiftRight        500  avgt    5   0.843 ± 0.056  us/op
>>         Integers.shiftURight       500  avgt    5   0.830 ± 0.057  us/op
>>         Longs.shiftLeft            500  avgt    5   0.827 ± 0.026  us/op
>>         Longs.shiftRight           500  avgt    5   0.828 ± 0.018  us/op
>>         Longs.shiftURight          500  avgt    5   0.829 ± 0.038  us/op
>> 
>>         After:
>>         Benchmark               (size)  Mode  Cnt   Score   Error  Units
>>         Integers.shiftLeft         500  avgt    5   0.761 ± 0.016  us/op
>>         Integers.shiftRight        500  avgt    5   0.762 ± 0.071  us/op
>>         Integers.shiftURight       500  avgt    5   0.765 ± 0.056  us/op
>>         Longs.shiftLeft            500  avgt    5   0.755 ± 0.026  us/op
>>         Longs.shiftRight           500  avgt    5   0.753 ± 0.017  us/op
>>         Longs.shiftURight          500  avgt    5   0.759 ± 0.031  us/op
>> 
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   movzx is not elided with same input and output

Few comments.

src/hotspot/cpu/x86/x86_64.ad line 9061:

> 9059: %{
> 9060:   // This is to match that of the memory variance
> 9061:   predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con());

Why you check for constant shift? With bmi2 check you excluded other reg_reg instruction. And for constant shift we have `salI_rReg_imm`.

src/hotspot/cpu/x86/x86_64.ad line 9438:

> 9436: 
> 9437: // Arithmetic Shift Right by 8-bit immediate
> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr)

Why this change to type of constant?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7968