performance degradation in Array::newInstance on -XX:TieredStopAtLevel=1

Claes Redestad claes.redestad at oracle.com
Fri Jan 4 09:57:35 UTC 2019


Hi,

I've also taken a look at your microbenchmark and seen a few regressions
from 9 through 12, some of which I've identified - and some of which might
be (partially) actionable. They're mostly related to recent additions of
low-overhead heap sampling and to allocator/GC changes. All of the blame is
in HotSpot, so let's leave core-libs-dev alone for now. :-)

Some additional comments below...

On 2019-01-04 08:25, Сергей Цыпанов wrote:
> Hi Claes,
> 
> thanks for the explanation, I suspected something like that.
> 
> I ran into this performance effect while investigating the creation of Spring's ConcurrentReferenceHashMap:
> it turned out to use Array::newInstance to create the array of References stored in a map's Segment:
> 
> @SuppressWarnings("unchecked")
> private Reference<K, V>[] createReferenceArray(int size) {
>     return (Reference<K, V>[]) Array.newInstance(Reference.class, size);
> }
> 
> The code above was rewritten into a plain array constructor call, gaining some performance improvement:
> 
> private Reference<K, V>[] createReferenceArray(int size) {
>    return new Reference[size];
> }

While that's a point fix, avoiding reflection seems like the right thing
to do when the array type is known statically, anyhow.
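
For the cases where the component type really is only known at runtime,
the reflective path is still the way to go, and C2 will intrinsify it
anyway. An illustrative sketch (not the Spring code; note the component
type has to be a reference type for the cast to hold):

static <T> T[] newArray(Class<T> componentType, int size) {
    // Array.newInstance returns Object, so an unchecked cast is needed
    @SuppressWarnings("unchecked")
    T[] array = (T[]) Array.newInstance(componentType, size);
    return array;
}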

> 
> This was the reason to dig deeper and look at how both methods behave.
> The behaviour is the same on both JDK 8 and JDK 11.
> 
> Creation of ConcurrentReferenceHashMap matters on some workloads; in my case it's
> database access via Spring Data, where creation of ConcurrentReferenceHashMap takes approximately 1/5
> of the execution profile.
> 
> Speaking of Spring Boot, it's possible to run a Spring Boot application in IntelliJ IDEA in a mode that adds
> the -XX:TieredStopAtLevel=1 and -noverify VM options.
> 
> With full tiered compilation the simplest application takes this long to start up:
> 
> Benchmark  Mode  Cnt     Score     Error  Units
> start-up     ss  100  2885,493 ± 167,660  ms/op
> 
> and with `-XX:TieredStopAtLevel=1 -noverify`
> 
> Benchmark  Mode  Cnt     Score     Error  Units
> start-up     ss  100  1707,342 ±  75,166  ms/op

Thanks! Which JDK version are you using?

-noverify can be used without -XX:TieredStopAtLevel=1 (but don't use
this in production!). You might gain some by enabling CDS (run java
-Xshare:dump once, then add -Xshare:auto to your command lines). There
are a few other tricks to pull that might help startup without
sacrificing peak performance.
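
For example (app.jar here is just a placeholder for your application jar):

java -Xshare:dump
java -Xshare:auto -noverify -jar app.jar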

/Claes

> 
>> Hi,
>>
>> What you're seeing specifically here is likely the native overhead:
>> Array::newInstance calls into the native method Array::newArray, and C1
>> (TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.
>>
>> C1 and the interpreter will instead call into
>> Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
>> which adds a rather expensive constant overhead.
>>
>> TieredStopAtLevel=1/C1 performance is expected to be relatively slower
>> than C2 in general, and often much worse in cases like this where
>> optimized intrinsics are at play.
>>
>> Have you seen a regression here compared to some older JDK release?
>>
>> It would also be very helpful if you could shed more light on the use
>> case and point out what particular startup issues you're seeing that
>> prevent you from using full tiered compilation with Spring Boot.
>>
>> /Claes
>>
>> On 2019-01-02 22:56, Сергей Цыпанов wrote:
>>
>>> Hello,
>>>
>>> The -XX:TieredStopAtLevel=1 flag is often used in some applications (e.g. Spring Boot based ones) to reduce start-up time.
>>>
>>> With this flag I've spotted a huge performance degradation of Array::newInstance compared to a plain constructor call.
>>>
>>> I've used this benchmark:
>>>
>>> import java.lang.reflect.Array;
>>> import java.util.concurrent.TimeUnit;
>>>
>>> import org.openjdk.jmh.annotations.*;
>>>
>>> @State(Scope.Thread)
>>> @BenchmarkMode(Mode.AverageTime)
>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>> public class ArrayInstantiationBenchmark {
>>>
>>>     @Param({"10", "100", "1000"})
>>>     private int length;
>>>
>>>     @Benchmark
>>>     public Object newInstance() {
>>>         return Array.newInstance(Object.class, length);
>>>     }
>>>
>>>     @Benchmark
>>>     public Object constructor() {
>>>         return new Object[length];
>>>     }
>>> }
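>>>
>>> (For the C1-only runs, one way to pin the forked benchmark JVM to C1 is
>>> via JMH's fork options, as in the sketch below; passing -jvmArgs on the
>>> JMH command line works as well. The method name is just illustrative.)
>>>
>>> @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
>>> @Benchmark
>>> public Object newInstanceC1() {
>>>     return Array.newInstance(Object.class, length);
>>> }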
>>>
>>> On C2 (JDK 11) both methods perform the same:
>>>
>>> Benchmark                                 (length)  Mode  Cnt    Score    Error  Units
>>> ArrayInstantiationBenchmark.constructor         10  avgt   50   11,557 ±  0,316  ns/op
>>> ArrayInstantiationBenchmark.constructor        100  avgt   50   86,944 ±  4,945  ns/op
>>> ArrayInstantiationBenchmark.constructor       1000  avgt   50  520,722 ± 28,068  ns/op
>>>
>>> ArrayInstantiationBenchmark.newInstance         10  avgt   50   11,899 ±  0,569  ns/op
>>> ArrayInstantiationBenchmark.newInstance        100  avgt   50   86,805 ±  5,103  ns/op
>>> ArrayInstantiationBenchmark.newInstance       1000  avgt   50  488,647 ± 20,829  ns/op
>>>
>>> On C1, however, there's a huge difference (approximately 8 times!) for length = 10:
>>>
>>> Benchmark                                 (length)  Mode  Cnt    Score    Error  Units
>>> ArrayInstantiationBenchmark.constructor         10  avgt   50   11,183 ±  0,168  ns/op
>>> ArrayInstantiationBenchmark.constructor        100  avgt   50   92,215 ±  4,425  ns/op
>>> ArrayInstantiationBenchmark.constructor       1000  avgt   50  838,303 ± 33,161  ns/op
>>>
>>> ArrayInstantiationBenchmark.newInstance         10  avgt   50   86,696 ±  1,297  ns/op
>>> ArrayInstantiationBenchmark.newInstance        100  avgt   50  106,751 ±  2,796  ns/op
>>> ArrayInstantiationBenchmark.newInstance       1000  avgt   50  840,582 ± 24,745  ns/op
>>>
>>> Note that performance for length = {100, 1000} is almost the same.
>>>
>>> I suppose it's a bug somewhere in the VM, because both methods just allocate memory and do zeroing elimination, so there shouldn't be such a huge difference between them.
>>>
>>> Sergey Tsypanov

