performance degradation in Array::newInstance on -XX:TieredStopAtLevel=1
Claes Redestad
claes.redestad at oracle.com
Wed Jan 2 23:46:26 UTC 2019
Hi,
what you're seeing here is most likely native-call overhead:
Array::newInstance calls into the native method Array::newArray, and C1
(-XX:TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.
C1 and the interpreter will instead call into
Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
which adds a rather expensive constant overhead.
TieredStopAtLevel=1/C1 performance is expected to be relatively slower
than C2 in general, and often much worse in cases like this, where
optimized intrinsics are at play.
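Where the component type is statically known at the call site, the reflective path can be sidestepped entirely, e.g. via an array constructor reference. A minimal sketch (class and method names here are hypothetical) contrasting the two paths:

```java
import java.lang.reflect.Array;
import java.util.function.IntFunction;

public class ArrayAlloc {
    // Reflective path: ends up in the native Array.newArray,
    // which C1 does not intrinsify.
    static Object reflective(Class<?> componentType, int length) {
        return Array.newInstance(componentType, length);
    }

    // Direct path: compiles to a plain anewarray bytecode,
    // cheap under the interpreter, C1 and C2 alike.
    static <T> T[] direct(IntFunction<T[]> factory, int length) {
        return factory.apply(length);
    }

    public static void main(String[] args) {
        Object a = reflective(String.class, 10);
        String[] b = direct(String[]::new, 10);
        System.out.println(a.getClass() == b.getClass()); // prints "true"
    }
}
```

This only helps callers that can name the array type at compile time; truly generic code still needs Array.newInstance.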
Have you seen a regression here compared to some older JDK release?
It would also be very helpful if you could shed more light on the use
case and point out what particular startup issues you're seeing that
prevent you from using full tiered compilation with Spring Boot.
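For a quick reproduction outside JMH, a rough timing loop along these lines (naive, and subject to the usual microbenchmark pitfalls that JMH exists to avoid; the class name is made up) shows the gap when run with -XX:TieredStopAtLevel=1:

```java
import java.lang.reflect.Array;

public class QuickCheck {
    // Returns {directNanos, reflectiveNanos}: total time spent allocating
    // 'iters' arrays of length 'len' via each path.
    static long[] measure(int len, int iters) {
        Object sink;
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) { sink = new Object[len]; }
        long direct = System.nanoTime() - t0;
        t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) { sink = Array.newInstance(Object.class, len); }
        long reflective = System.nanoTime() - t0;
        return new long[] { direct, reflective };
    }

    public static void main(String[] args) {
        int iters = 1_000_000;
        long[] ns = measure(10, iters);
        System.out.printf("direct: ~%d ns/op, reflective: ~%d ns/op%n",
                ns[0] / iters, ns[1] / iters);
    }
}
```

Running it once with java QuickCheck and once with java -XX:TieredStopAtLevel=1 QuickCheck should make the C1-only overhead visible, though the absolute numbers from such a loop shouldn't be trusted too far.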
/Claes
On 2019-01-02 22:56, Сергей Цыпанов wrote:
> Hello,
>
> The -XX:TieredStopAtLevel=1 flag is often used (e.g. in Spring Boot based applications) to reduce start-up time.
>
> With this flag I've spotted a huge performance degradation of Array::newInstance compared to a plain constructor call.
>
> I've used this benchmark:
>
> import java.lang.reflect.Array;
> import java.util.concurrent.TimeUnit;
>
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class ArrayInstantiationBenchmark {
>
>     @Param({"10", "100", "1000"})
>     private int length;
>
>     @Benchmark
>     public Object newInstance() {
>         return Array.newInstance(Object.class, length);
>     }
>
>     @Benchmark
>     public Object constructor() {
>         return new Object[length];
>     }
> }
>
> On C2 (JDK 11) both methods perform the same:
>
> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
> ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
> ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op
>
> ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
> ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op
>
> On C1, however, there's a huge difference (roughly 8x!) for length = 10:
>
> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
> ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
> ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op
>
> ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
> ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op
>
> Note that for length = {100, 1000} the performance is almost the same.
>
> I suppose it's a bug somewhere in the VM, because both methods just allocate (and zero) memory, so there shouldn't be such a huge difference between them.
>
> Sergey Tsypanov
>
>
More information about the core-libs-dev mailing list