RFR: 8294194: [AArch64] Create intrinsics compress and expand
Eric Liu
eliu at openjdk.org
Tue Oct 25 03:05:48 UTC 2022
On Mon, 3 Oct 2022 14:00:51 GMT, Stuart Monteith <smonteith at openjdk.org> wrote:
> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on aarch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT.
>
> Only the first lane of each vector will be used. Two MOV instructions will move the inputs from GPRs into temporary vector registers, and another will move the result back to a GPR. Autovectorization for this functionality is/will be implemented separately.
>
> Running on an SVE2 enabled system, I ran the following benchmarks:
>
> org.openjdk.bench.java.lang.Integers
> org.openjdk.bench.java.lang.Longs
>
> The time for each operation was reduced to between 56% and 72% of the original run time:
>
>
> Benchmark               Result   Error   Unit    % of non-SVE2 time
> Integers.expand         2.106    0.011   us/op
> Integers.expand-SVE     1.431    0.009   us/op   67.95%
> Longs.expand            2.606    0.006   us/op
> Longs.expand-SVE        1.46     0.003   us/op   56.02%
> Integers.compress       1.982    0.004   us/op
> Integers.compress-SVE   1.427    0.003   us/op   72.00%
> Longs.compress          2.501    0.002   us/op
> Longs.compress-SVE      1.441    0.003   us/op   57.62%
>
>
> These methods can be specifically tested with:
> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"`
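For readers unfamiliar with the two operations named in the description above, here is a minimal scalar sketch of what compress and expand compute for 32-bit values. It is only an illustration of the semantics, not the JDK's implementation and not the code the intrinsic emits; the class name `BitShuffleSketch` and its method names are hypothetical.

```java
// Hypothetical reference sketch of the compress/expand semantics.
// Not the JDK implementation; written bit-by-bit for clarity, not speed.
final class BitShuffleSketch {

    // Gathers the bits of 'value' selected by 'mask' and packs them
    // contiguously into the low-order bits of the result.
    static int compress(int value, int mask) {
        int result = 0;
        int outBit = 0; // next free position in the packed result
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatters the low-order bits of 'value' to the positions
    // where 'mask' has a set bit; all other result bits are zero.
    static int expand(int value, int mask) {
        int result = 0;
        int inBit = 0; // next low-order bit of 'value' to place
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0xFF00FFF0;
        int packed = compress(0xCAFEBABE, mask);
        System.out.println(Integer.toHexString(packed));               // prints "cabab"
        System.out.println(Integer.toHexString(expand(packed, mask))); // prints "ca00bab0"
    }
}
```

The loops make it clear why the non-intrinsic path costs many instructions per call, whereas BEXT and BDEP each perform the whole gather or scatter in one vector instruction.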
Sorry for the delay.
Only a few trivial style issues. Otherwise it looks okay to me.
src/hotspot/cpu/aarch64/aarch64.ad line 16948:
> 16946: instruct compressBitsI_reg(iRegINoSp dst, iRegIorL2I src, iRegIorL2I mask,
> 16947: vRegF tdst, vRegF tsrc, vRegF tmask) %{
> 16948: match(Set dst (CompressBits src mask));
I would suggest aligning the predicate with the conditions in Matcher::match_rule_supported(int opcode).
Suggestion:
predicate(UseSVE > 1 && VM_Version::supports_svebitperm());
match(Set dst (CompressBits src mask));
src/hotspot/cpu/aarch64/aarch64.ad line 16977:
> 16975: __ mov($tmask$$FloatRegister, __ D, 0, $mask$$Register);
> 16976: __ sve_bext($tdst$$FloatRegister, __ D, $tsrc$$FloatRegister, $tmask$$FloatRegister);
> 16977: __ mov($dst$$Register, $tdst$$FloatRegister, __ D, 0); %}
This was evidently written by hand rather than generated by m4; the closing %} should go on its own line.
Suggestion:
__ mov($dst$$Register, $tdst$$FloatRegister, __ D, 0);
%}
-------------
PR: https://git.openjdk.org/jdk/pull/10537