performance degradation in Array::newInstance on -XX:TieredStopAtLevel=1

Wed Jan 2 23:46:26 UTC 2019

Hi,

what you're seeing specifically here is likely the native overhead:
Array::newInstance calls into the native method Array::newArray, and C1
(TieredStopAtLevel=1) doesn't have an intrinsic for this, while C2 does.

C1 and the interpreter will instead call into
Java_java_lang_reflect_Array_newArray in libjava / Array.c over JNI,
which will add a rather expensive constant overhead.
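
To make the two paths concrete, here is a minimal sketch. The class and
method names are made up for illustration; only Array.newInstance and the
plain array constructor come from the benchmark:

```java
import java.lang.reflect.Array;

public class ArrayAllocPaths {

    // Reflective path: Array.newInstance delegates to the native method
    // Array.newArray. C2 has an intrinsic for it; C1 and the interpreter
    // go through JNI instead.
    static Object[] viaReflection(int length) {
        return (Object[]) Array.newInstance(Object.class, length);
    }

    // Direct path: javac emits an anewarray bytecode, so no native
    // method call is involved.
    static Object[] viaConstructor(int length) {
        return new Object[length];
    }

    public static void main(String[] args) {
        Object[] a = viaReflection(10);
        Object[] b = viaConstructor(10);
        // Both paths yield an Object[] of the requested length;
        // only the way the allocation is reached differs.
        System.out.println(a.getClass() == b.getClass());
        System.out.println(a.length == b.length);
    }
}
```

The result is the same either way, so the difference you measure is the
fixed cost of getting to the allocation, not the allocation itself.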

TieredStopAtLevel=1/C1 performance is expected to be somewhat slower
than C2 in general, and often much worse in cases like this, where there
are optimized intrinsics at play.

Have you seen a regression here compared to some older JDK release?

It would also be very helpful if you could shed more light on the use
case and point out what particular startup issues you're seeing that
prevent you from using full tiered compilation with Spring Boot.

/Claes

On 2019-01-02 22:56, Сергей Цыпанов wrote:
> Hello,
> 
> -XX:TieredStopAtLevel=1 flag is often used in some applications (e.g. Spring Boot based) to reduce start-up time.
> 
> With this flag I've spotted a huge performance degradation of Array::newInstance compared to a plain constructor call.
> 
> I've used this benchmark
> 
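> import java.lang.reflect.Array;
> import java.util.concurrent.TimeUnit;
> 
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Param;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.State;
> 
> @State(Scope.Thread)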
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class ArrayInstantiationBenchmark {
> 
>    @Param({"10", "100", "1000"})
>    private int length;
> 
>    @Benchmark
>    public Object newInstance() {
>      return Array.newInstance(Object.class, length);
>    }
> 
>    @Benchmark
>    public Object constructor() {
>      return new Object[length];
>    }
> 
> }
> 
> On C2 (JDK 11) both methods perform the same:
> 
> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,557 ±  0,316  ns/op
> ArrayInstantiationBenchmark.constructor       100  avgt   50   86,944 ±  4,945  ns/op
> ArrayInstantiationBenchmark.constructor      1000  avgt   50  520,722 ± 28,068  ns/op
> 
> ArrayInstantiationBenchmark.newInstance        10  avgt   50   11,899 ±  0,569  ns/op
> ArrayInstantiationBenchmark.newInstance       100  avgt   50   86,805 ±  5,103  ns/op
> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  488,647 ± 20,829  ns/op
> 
> On C1 however there's a huge difference (approximately 8 times!) for length = 10:
> 
> Benchmark                                (length)  Mode  Cnt    Score    Error  Units
> ArrayInstantiationBenchmark.constructor        10  avgt   50   11,183 ±  0,168  ns/op
> ArrayInstantiationBenchmark.constructor       100  avgt   50   92,215 ±  4,425  ns/op
> ArrayInstantiationBenchmark.constructor      1000  avgt   50  838,303 ± 33,161  ns/op
> 
> ArrayInstantiationBenchmark.newInstance        10  avgt   50   86,696 ±  1,297  ns/op
> ArrayInstantiationBenchmark.newInstance       100  avgt   50  106,751 ±  2,796  ns/op
> ArrayInstantiationBenchmark.newInstance      1000  avgt   50  840,582 ± 24,745  ns/op
> 
> Note that performance for length = {100, 1000} is almost the same.
> 
> I suppose it's a bug somewhere in the VM, because both methods just allocate memory and do zeroing elimination, so there shouldn't be such a huge difference between them.
> 
> Sergey Tsypanov
> 
>