RFR(S): 8247307: C2: Loop array fill stub routines are not called
Pengfei Li
Pengfei.Li at arm.com
Tue Jun 16 06:18:36 UTC 2020
Hi Vladimir,
Thanks for looking at this.
> I don't see referenced link [1] in e-mail.
Sorry, I forgot to paste my JMH URL.
[1] http://cr.openjdk.java.net/~pli/rfr/8247307/TestArrayFill.java
> Are these performance data for Aarch64?
Yes. I didn't paste the x86 result since there's no difference after my patch.
But if I turn OptimizeFill on manually, there's a regression on x86 (see below).
Before (x86)
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt   25   1793.206 ±  15.337  ns/op
TestArrayFill.fillIntArray    avgt   25   6679.491 ±  14.729  ns/op
TestArrayFill.fillShortArray  avgt   25   3412.708 ±  12.005  ns/op
TestArrayFill.zeroByteArray   avgt   25   1785.940 ±  15.174  ns/op
TestArrayFill.zeroIntArray    avgt   25   6666.709 ±  11.735  ns/op
TestArrayFill.zeroShortArray  avgt   25   3404.146 ±  23.045  ns/op
After (x86)
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.fillByteArray   avgt   25   2281.374 ± 191.220  ns/op
TestArrayFill.fillIntArray    avgt   25   9009.679 ± 901.541  ns/op
TestArrayFill.fillShortArray  avgt   25   4828.686 ±  49.199  ns/op
TestArrayFill.zeroByteArray   avgt   25   2463.745 ±  47.640  ns/op
TestArrayFill.zeroIntArray    avgt   25   9062.682 ± 939.538  ns/op
TestArrayFill.zeroShortArray  avgt   25   4837.231 ±  50.026  ns/op
> What x86 CPU you tested on? (avx512?)
The results above were produced on an Intel® Xeon® Gold 6152, but UseAVX=2 is the default in the latest JDK master.
> What size of arrays you tested.
It's 65536 in my test, see [1].
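For readers who cannot open [1], the loops being measured have roughly the shape below. This is only a hypothetical sketch, not the code from [1]: the class name FillLoops, the constant SIZE and the plain main() driver are assumptions made for illustration, and the real test is a JMH benchmark. The method names follow the benchmark names in the tables.

```java
// Hypothetical sketch only; the actual benchmark is the JMH test at [1].
class FillLoops {
    // Array length quoted in this mail.
    static final int SIZE = 65536;

    // Stores one loop-invariant value into every element.
    // Loops of this shape are the ones the fill stubs are meant to replace.
    static void fillByteArray(byte[] a, byte v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;
        }
    }

    // Same loop shape with a constant zero (the zero* cases).
    static void zeroIntArray(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = 0;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[SIZE];
        int[] ints = new int[SIZE];
        java.util.Arrays.fill(ints, 1);
        // Repeat the calls so the JIT has a chance to compile the loops.
        for (int iter = 0; iter < 2000; iter++) {
            fillByteArray(bytes, (byte) 1);
            zeroIntArray(ints);
        }
        System.out.println(bytes[SIZE - 1] + " " + ints[SIZE - 1]);
    }
}
```

A plain driver like this only shows the loop shape; it is not a substitute for the JMH numbers in this mail.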
> Few years ago OptimizedFill wins over vectorized loops but CPU and
> vectorization are improved since then. May be we can deprecate this code if
> it does not have performance benefits. Or we should revisit stub's code for
> modern CPUs.
I think it's still valuable, since it does have a performance benefit on AArch64 when the value to be filled is zero.
See the TestArrayFill.zero* cases below.
Before (AArch64)
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.zeroByteArray   avgt   25   2080.313 ±   7.516  ns/op
TestArrayFill.zeroIntArray    avgt   25  10961.331 ± 527.750  ns/op
TestArrayFill.zeroShortArray  avgt   25   4126.386 ±  20.997  ns/op
After (AArch64)
Benchmark                     Mode  Cnt      Score     Error  Units
TestArrayFill.zeroByteArray   avgt   25    903.434 ±  10.944  ns/op
TestArrayFill.zeroIntArray    avgt   25   8141.533 ± 946.341  ns/op
TestArrayFill.zeroShortArray  avgt   25   1784.124 ±  24.618  ns/op
--
Thanks,
Pengfei