RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7]

Thu Jun 2 21:50:26 UTC 2022

On Sat, 16 Apr 2022 11:24:57 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

>> Hi, this patch improves some operations on x86_64:
>> 
>> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
>>   + Bounded operands
>>   + Multiple uops both in fused and unfused domains
>>   + May result in flag stall since the operations have unpredictable flag output
>> 
>> - Flag-to-general-purpose-register operations currently use `cmovcc`, which requires setup and one more spare register for the constant. This could be replaced by `setcc`, which transforms the sequence:
>> 
>>         xorl dst, dst
>>         sometest
>>         movl tmp, 0x01
>>         cmovlcc dst, tmp
>> 
>> into:
>> 
>>         xorl dst, dst
>>         sometest
>>         setbcc dst
>> 
>> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output)
>> 
>> - Some small improvements:
>>   + Add memory variants to `tzcnt` and `lzcnt`
>>   + Add memory variants to `rolx` and `rorx`
>>   + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>> 
>> The speedup can be observed for variable shift instructions:
>> 
>>         Before:
>>         Benchmark               (size)  Mode  Cnt   Score   Error  Units
>>         Integers.shiftLeft         500  avgt    5   0.836 ± 0.030  us/op
>>         Integers.shiftRight        500  avgt    5   0.843 ± 0.056  us/op
>>         Integers.shiftURight       500  avgt    5   0.830 ± 0.057  us/op
>>         Longs.shiftLeft            500  avgt    5   0.827 ± 0.026  us/op
>>         Longs.shiftRight           500  avgt    5   0.828 ± 0.018  us/op
>>         Longs.shiftURight          500  avgt    5   0.829 ± 0.038  us/op
>> 
>>         After:
>>         Benchmark               (size)  Mode  Cnt   Score   Error  Units
>>         Integers.shiftLeft         500  avgt    5   0.761 ± 0.016  us/op
>>         Integers.shiftRight        500  avgt    5   0.762 ± 0.071  us/op
>>         Integers.shiftURight       500  avgt    5   0.765 ± 0.056  us/op
>>         Longs.shiftLeft            500  avgt    5   0.755 ± 0.026  us/op
>>         Longs.shiftRight           500  avgt    5   0.753 ± 0.017  us/op
>>         Longs.shiftURight          500  avgt    5   0.759 ± 0.031  us/op
>> 
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
> 
>  - Resolve conflict
>  - ins_cost
>  - movzx is not elided with same input and output
>  - fix only the needs
>  - fix
>  - cisc
>  - delete benchmark command
>  - pipe
>  - fix, benchmarks
>  - pipe_class
>  - ... and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3

src/hotspot/cpu/x86/x86_64.ad line 10766:

> 10764:   format %{ "xorl    $dst, $dst\t# ci2b\n\t"
> 10765:             "testl   $src, $src\n\t"
> 10766:             "setnz   $dst" %}

What's the advantage of this change?  The disadvantage is a spare TEMP register is needed -- we can't reuse src as dst.
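
To make the register constraint concrete, here is a minimal sketch in Java. It is a hypothetical model of the three-instruction sequence, not HotSpot code: the class `Ci2bSketch`, the method `ci2bWithSetcc`, and the use of an `int[]` as a register file are all assumptions made for illustration. It shows that zeroing dst before the test gives the wrong answer when dst and src are the same register.

```java
public class Ci2bSketch {
    // Hypothetical model: each register is a slot in an int array,
    // and a local boolean stands in for the zero flag (ZF).
    static int ci2bWithSetcc(int[] regs, int dst, int src) {
        regs[dst] = 0;                   // xorl  dst, dst  (clears dst before the test)
        boolean zf = (regs[src] == 0);   // testl src, src  (ZF set iff src is zero)
        if (!zf) {
            regs[dst] = 1;               // setnz dst       (low byte = 1, upper bits already zero)
        }
        return regs[dst];
    }

    public static void main(String[] args) {
        // Distinct registers: src holds 42, so the result is 1.
        System.out.println(ci2bWithSetcc(new int[] {42, 0}, 1, 0));
        // dst aliases src: the xor destroys the input, so the result is 0.
        System.out.println(ci2bWithSetcc(new int[] {42, 0}, 0, 0));
    }
}
```

In this model the aliased call returns 0 for a nonzero input, which is why the xor-first form needs dst to be a register distinct from src.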

-------------

PR: https://git.openjdk.java.net/jdk/pull/7968