RFR: 8271820: Implementation of JEP 416: Reimplement Core Reflection with Method Handle [v3]

Claes Redestad redestad at openjdk.java.net
Tue Aug 31 21:08:44 UTC 2021


On Fri, 27 Aug 2021 13:12:33 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> Mandy Chung has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Remove separate accessor for static vs instance method
>>   
>>   There is no effective difference when using MethodHandle::dropArgument for static method.   Removing Static*Accessor and Instance*Accessor simplifies the implementation.
>
> src/java.base/share/classes/jdk/internal/reflect/DirectConstructorAccessorImpl.java line 106:
> 
>> 104:     Object invokeImpl(Object[] args) throws Throwable {
>> 105:         var mhInvoker = mhInvoker();
>> 106:         return switch (paramCount) {
> 
> As @plevart observed, I'm also a bit surprised to see this, but I note your comments regarding performance - especially in cold case. Every adaptation counts, I guess, if you're not in the hot path. But let's make sure that the pay off to add the extra hand-specialized cases is worth it - I'd be surprised if spreading arguments is that expensive.

While I recall it was motivated primarily by startup (note the startup numbers for `-Djdk.reflect.useSpreader=true` in @mlchung's reply earlier in this PR), the specialization to avoid the spreader for low-arity arguments also improves performance on throughput microbenchmarks. Surprisingly, it also reduces the per-invocation allocation rate.

Allocation profiling suggests that with the specialization a varargs array is eliminated entirely. This is of course somewhat fragile and depends on a number of things - and might ultimately be an artifact of a rather synthetic microbenchmark with little benefit in practice - but it's a rather consistent gain across argument counts, even when actually going through the spreader (I have some hypotheses as to why this might happen; most likely the selector helps the JIT narrow things down for each kind of invocation scheme):
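For anyone reading along, the shape of the technique under discussion can be sketched roughly as below. This is a simplified, hypothetical standalone version - the class and field names are mine, not the ones in the PR - showing a switch on the parameter count that calls `invokeExact` with a fixed arity for low-arity methods and falls back to an `asSpreader`-adapted handle otherwise:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch of arity specialization for reflective invocation.
public class AritySpecializedInvoker {
    private final MethodHandle target;    // erased to (Object, ...) -> Object
    private final MethodHandle spreader;  // target adapted to take Object[]
    private final int paramCount;

    AritySpecializedInvoker(MethodHandle mh) {
        this.paramCount = mh.type().parameterCount();
        // Erase all parameter and return types to Object so every call site
        // below uses the same symbolic descriptor.
        this.target = mh.asType(mh.type().generic());
        this.spreader = this.target.asSpreader(Object[].class, paramCount);
    }

    Object invoke(Object[] args) throws Throwable {
        // Hand-specialized low-arity cases pass arguments positionally and
        // avoid the spreader; higher arities spread the Object[] as before.
        return switch (paramCount) {
            case 0 -> target.invokeExact();
            case 1 -> target.invokeExact(args[0]);
            case 2 -> target.invokeExact(args[0], args[1]);
            case 3 -> target.invokeExact(args[0], args[1], args[2]);
            default -> spreader.invokeExact(args);
        };
    }

    public static int sum(int a, int b) { return a + b; }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = MethodHandles.lookup().findStatic(
                AritySpecializedInvoker.class, "sum",
                MethodType.methodType(int.class, int.class, int.class));
        var inv = new AritySpecializedInvoker(mh);
        System.out.println(inv.invoke(new Object[] { 3, 4 }));  // prints 7
    }
}
```

The `default` arm is the only path that needs the `Object[]` to survive as an array; in the specialized arms the JIT can often scalar-replace the caller's argument array once the call is inlined, which matches the allocation numbers below.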

18 baseline
Benchmark                                                                  Mode  Cnt    Score     Error   Units
ReflectionMethods.static_method                                            avgt   10   57.329 ±   4.217   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm                        avgt   10   72.006 ±   0.001    B/op
ReflectionMethods.static_method_3arg                                       avgt   10   70.940 ±   7.457   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm                   avgt   10   96.008 ±   0.002    B/op
ReflectionMethods.static_method_4arg                                       avgt   10   90.453 ±   4.252   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm                   avgt   10  112.010 ±   0.001    B/op

pull/5027
Benchmark                                                                  Mode  Cnt    Score     Error   Units
ReflectionMethods.static_method                                            avgt   10   41.518 ±   2.444   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm                        avgt   10   48.004 ±   0.001    B/op
ReflectionMethods.static_method_3arg                                       avgt   10   57.603 ±   3.299   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm                   avgt   10   64.006 ±   0.001    B/op
ReflectionMethods.static_method_4arg                                       avgt   10   56.772 ±   3.971   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm                   avgt   10   80.007 ±   0.001    B/op

With a patch that removes the specialization on top of pull/5027, performance reverts to numbers very similar to the baseline:

remove-spreader-patch:
Benchmark                                                                  Mode  Cnt    Score     Error   Units
ReflectionMethods.static_method                                            avgt   10   56.644 ±   4.322   ns/op
ReflectionMethods.static_method:·gc.alloc.rate.norm                        avgt   10   72.006 ±   0.001    B/op
ReflectionMethods.static_method_3arg                                       avgt   10   72.353 ±   6.815   ns/op
ReflectionMethods.static_method_3arg:·gc.alloc.rate.norm                   avgt   10   96.008 ±   0.001    B/op
ReflectionMethods.static_method_4arg                                       avgt   10   82.820 ±   8.303   ns/op
ReflectionMethods.static_method_4arg:·gc.alloc.rate.norm                   avgt   10  112.009 ±   0.002    B/op


I think the cold-start reduction alone is worth the extra 100 lines or so of code, but if the gain on these microbenchmarks translates into real throughput gains, then I think the complexity is more than paid for. Simplifying it while retaining similar characteristics would be great, of course, but I think such an exploration should be done as a follow-up.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5027
