performance degradation in Array::newInstance on -XX:TieredStopAtLevel=1

Сергей Цыпанов sergei.tsypanov at yandex.ru
Fri Jan 4 07:25:03 UTC 2019


Hi Claes,

thanks for the explanation, I suspected something like that.

I ran into this performance effect while investigating the creation of Spring's ConcurrentReferenceHashMap:
it turned out that it used Array::newInstance to create the array of References stored in a map's Segment:

private Reference<K, V>[] createReferenceArray(int size) {
  return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
}

The code above was rewritten into a plain array constructor call, gaining some performance improvement:

private Reference<K, V>[] createReferenceArray(int size) {
  return (Reference<K, V>[]) new Reference[size];
}

This was the reason to go deeper and look at how both methods behave.
The actual behaviour is the same on both JDK 8 and JDK 11.
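
To compare the two variants in isolation, a JMH benchmark along these lines can be used (only a sketch: the class
name and sizes are illustrative, and java.lang.ref.Reference merely stands in for Spring's own Reference interface):

import java.lang.ref.Reference;
import java.lang.reflect.Array;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ReferenceArrayBenchmark {

    @Param({"16", "64", "256"})   // illustrative sizes, not Segment's actual ones
    private int size;

    @Benchmark
    @SuppressWarnings("unchecked")
    public Reference<Object>[] newInstance() {
        // without the C2 intrinsic this goes through the native Array::newArray over JNI
        return (Reference<Object>[]) Array.newInstance(Reference.class, size);
    }

    @Benchmark
    @SuppressWarnings("unchecked")
    public Reference<Object>[] constructor() {
        // compiles to a plain anewarray bytecode
        return (Reference<Object>[]) new Reference[size];
    }
}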

Creation of ConcurrentReferenceHashMap matters on some workloads; in my case it's
database access via Spring Data, where creation of ConcurrentReferenceHashMap takes approximately 1/5
of the execution profile.

Speaking of Spring Boot, it's possible to run a Spring Boot application in IntelliJ IDEA in a special mode
that adds the -XX:TieredStopAtLevel=1 and -noverify VM options.
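
Outside the IDE that amounts to something like the following command (the jar name is only a placeholder):

  java -XX:TieredStopAtLevel=1 -noverify -jar application.jar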

With full tiered compilation the simplest application takes this long to start up

Benchmark Mode  Cnt     Score     Error  Units
start-up    ss  100  2885,493 ± 167,660  ms/op

and with `-XX:TieredStopAtLevel=1 -noverify`:

Benchmark Mode  Cnt     Score    Error  Units
start-up    ss  100  1707,342 ± 75,166  ms/op

> Hi,
> 
> what you're seeing specifically here is likely the native overhead:
> Array::newInstance calls into the native method Array::newArray, and C1
> (TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.
> 
> C1 and the interpreter will instead call into
> Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
> which will add a rather expensive constant overhead.
> 
> TieredStopAtLevel=1/C1 performance is expected to be relatively slower
> than C2 in general, and often much worse in cases like this where there
> are optimized intrinsics at play.
> 
> Have you seen a regression here compared to some older JDK release?
> 
> It would also be very helpful if you could shed more light on the use
> case and point out what particular startup issues you're seeing that
> prevent you from using full tiered compilation and Spring Boot.
> 
> /Claes
> 
> On 2019-01-02 22:56, Сергей Цыпанов wrote:
> 
>> Hello,
>>
>> The -XX:TieredStopAtLevel=1 flag is often used in some applications (e.g. Spring Boot based ones) to reduce start-up time.
>>
>> With this flag I've spotted a huge performance degradation of Array::newInstance compared to a plain constructor call.
>>
>> I've used this benchmark:
>>
>> import java.lang.reflect.Array;
>> import java.util.concurrent.TimeUnit;
>>
>> import org.openjdk.jmh.annotations.*;
>>
>> @State(Scope.Thread)
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class ArrayInstantiationBenchmark {
>>
>>     @Param({"10", "100", "1000"})
>>     private int length;
>>
>>     @Benchmark
>>     public Object newInstance() {
>>         return Array.newInstance(Object.class, length);
>>     }
>>
>>     @Benchmark
>>     public Object constructor() {
>>         return new Object[length];
>>     }
>> }
>>
>> On C2 (JDK 11) both methods perform the same:
>>
>> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
>> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
>> ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
>> ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op
>>
>> ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
>> ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
>> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op
>>
>> On C1, however, there's a huge difference (approximately 8 times!) for length = 10:
>>
>> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
>> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
>> ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
>> ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op
>>
>> ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
>> ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
>> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op
>>
>> Note that the performance for length = {100, 1000} is almost the same.
>>
>> I suppose it's a bug somewhere in the VM, because both methods just allocate memory (with zeroing elimination where possible), so there shouldn't be such a huge difference between them.
>>
>> Sergey Tsypanov

