RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7]
Dean Long
dlong at openjdk.java.net
Thu Jun 2 21:50:26 UTC 2022
On Sat, 16 Apr 2022 11:24:57 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:
>> Hi, this patch improves some operations on x86_64:
>>
>> - Variable scalar shifts from the base instruction set (`shl`, `sar`, `shr` with the count in `cl`) have performance drawbacks and should be replaced by their BMI2 counterparts where possible:
>> + Bounded operands (the shift count must be in `cl`)
>> + Multiple uops in both the fused and unfused domains
>> + May result in a flag stall, since the operations have unpredictable flag output
>>
>> - Flag to general-purpose register conversion currently uses `cmovcc`, which requires setup and one more spare register for the constant. This could be replaced by `setcc`, which transforms the sequence:
>>
>> xorl dst, dst
>> sometest
>> movl tmp, 0x01
>> cmovlcc dst, tmp
>>
>> into:
>>
>> xorl dst, dst
>> sometest
>> setbcc dst
>>
>> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output)
>>
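For context, a minimal Java sketch of the kind of source expression the flag-to-register conversion described above is about: a comparison whose result is materialized as 0 or 1. The class and method names are hypothetical, and whether C2 actually picks the new rule for these exact methods is an assumption, not something this message confirms.

```java
final class FlagToInt {
    // Comparison result materialized as an int (0 or 1).
    static int lessThan(int a, int b) {
        return a < b ? 1 : 0;
    }

    // Boolean-to-integer conversion in its plainest form.
    static int toInt(boolean b) {
        return b ? 1 : 0;
    }

    // Uses the conversion in a loop: counts elements strictly below limit.
    static int countBelow(int[] values, int limit) {
        int count = 0;
        for (int v : values) {
            count += lessThan(v, limit);
        }
        return count;
    }
}
```

In a loop like `countBelow`, not needing a spare register for the constant 1 is where a register-allocation benefit would be expected to show up, if at all.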
>> - Some small improvements:
>> + Add memory variants of `tzcnt` and `lzcnt`
>> + Add memory variants of `rolx` and `rorx`
>> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>>
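The identity behind the `rolx` note (a left rotate by `imm` equals a right rotate by `size - imm`) can be checked at the Java level. This is only a sketch of the arithmetic, with hypothetical names; it says nothing about which instruction the JIT emits.

```java
final class RotateIdentity {
    // Left rotate of a 32-bit value expressed as a right rotate by (32 - imm).
    static int rolViaRor(int x, int imm) {
        return Integer.rotateRight(x, Integer.SIZE - imm);
    }

    // Same identity for 64-bit values, with size = 64.
    static long rolViaRor(long x, int imm) {
        return Long.rotateRight(x, Long.SIZE - imm);
    }
}
```

For every `imm` in range, `rolViaRor(x, imm)` matches `Integer.rotateLeft(x, imm)` (or `Long.rotateLeft` for the 64-bit overload), which is why a single rotate-right-by-immediate instruction can serve both directions.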
>> The speedup can be observed for variable shift instructions:
>>
>> Before:
>> Benchmark             (size)  Mode  Cnt  Score   Error  Units
>> Integers.shiftLeft       500  avgt    5  0.836 ± 0.030  us/op
>> Integers.shiftRight      500  avgt    5  0.843 ± 0.056  us/op
>> Integers.shiftURight     500  avgt    5  0.830 ± 0.057  us/op
>> Longs.shiftLeft          500  avgt    5  0.827 ± 0.026  us/op
>> Longs.shiftRight         500  avgt    5  0.828 ± 0.018  us/op
>> Longs.shiftURight        500  avgt    5  0.829 ± 0.038  us/op
>>
>> After:
>> Benchmark             (size)  Mode  Cnt  Score   Error  Units
>> Integers.shiftLeft       500  avgt    5  0.761 ± 0.016  us/op
>> Integers.shiftRight      500  avgt    5  0.762 ± 0.071  us/op
>> Integers.shiftURight     500  avgt    5  0.765 ± 0.056  us/op
>> Longs.shiftLeft          500  avgt    5  0.755 ± 0.026  us/op
>> Longs.shiftRight         500  avgt    5  0.753 ± 0.017  us/op
>> Longs.shiftURight        500  avgt    5  0.759 ± 0.031  us/op
>>
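A rough sketch of the kind of kernel such shift benchmarks exercise: a loop in which the shift count is a runtime value rather than a constant. This is not the actual benchmark source; the class name, method names, and array layout are assumptions for illustration.

```java
final class VariableShifts {
    // Variable left shift: the count comes from data, so it cannot be
    // encoded as an immediate. Java uses only the low 5 bits of an int count.
    static void shiftLeft(int[] values, int[] counts, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] << counts[i];
        }
    }

    // Unsigned right shift on longs: Java uses only the low 6 bits of the count.
    static void shiftURight(long[] values, int[] counts, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] >>> counts[i];
        }
    }
}
```

Because Java already defines the shift count modulo the operand width, no explicit masking is needed in the source; for example, an `int` shifted left by 33 behaves like a shift by 1.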
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
>
> - Resolve conflict
> - ins_cost
> - movzx is not elided with same input and output
> - fix only the needs
> - fix
> - cisc
> - delete benchmark command
> - pipe
> - fix, benchmarks
> - pipe_class
> - ... and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3
src/hotspot/cpu/x86/x86_64.ad line 10766:
> 10764: format %{ "xorl $dst, $dst\t# ci2b\n\t"
> 10765: "testl $src, $src\n\t"
> 10766: "setnz $dst" %}
What's the advantage of this change? The disadvantage is that a spare TEMP register is needed -- we can't reuse src as dst.
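The register constraint can be seen in a toy Java model of the two orderings. Registers are modeled as slots in an `int[]`, and everything here is hypothetical, including the names. The "earlier" sequence (`testl`, `setnz`, `movzbl`) is an assumption about what the rule looked like before the patch; it is not shown in this message.

```java
final class Ci2bModel {
    // Sequence from the patch: xorl dst,dst ; testl src,src ; setnz dst.
    // The xor runs before the test, so dst must not alias src.
    static int xorThenSet(int[] regs, int dst, int src) {
        regs[dst] = 0;                                     // xorl dst, dst
        boolean zero = regs[src] == 0;                     // testl src, src (sets ZF)
        regs[dst] = (regs[dst] & ~0xFF) | (zero ? 0 : 1);  // setnz dst (writes low byte only)
        return regs[dst];
    }

    // Assumed earlier sequence: testl src,src ; setnz dst ; movzbl dst,dst.
    // src is read before dst is written, so dst may alias src.
    static int setThenExtend(int[] regs, int dst, int src) {
        boolean zero = regs[src] == 0;                     // testl src, src (sets ZF)
        regs[dst] = (regs[dst] & ~0xFF) | (zero ? 0 : 1);  // setnz dst (writes low byte only)
        regs[dst] &= 0xFF;                                 // movzbl dst, dst
        return regs[dst];
    }
}
```

With `dst == src` and a nonzero input, `xorThenSet` returns 0 because the input is cleared before it is tested, while `setThenExtend` still returns 1. That is the sense in which the new form needs a separate register.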
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968