[foreign-memaccess+abi] RFR: 8262118: Specialize upcalls

Mon Feb 22 13:24:54 UTC 2021

On Mon, 22 Feb 2021 13:10:52 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> Hi,
>> 
>> This patch adds specialization of upcalls, which includes both the specialization of the binding recipe using MethodHandle combinators, as for downcalls, as well as VM support for generating a customized wrapper for a 'low-level' method handle taking and returning only primitive types, which can be used as a base handle in the binding specialization.
>> 
>> The latter is based on Vladimir Ivanov's earlier linkToNative work, with a few changes to the frame layout, exception handling, and added support for attaching non-java threads, as well as re-writing it to support an arbitrary caller ABI.
>> 
>> The current implementation supports x86 only for now - though I've made sure all the tests still pass on AArch64. Stack argument passing is disabled, as for downcalls - there is a mismatch between the encoding for stack slots we use on the Java side, and the encoding we use in the VM, which needs some more consideration. And finally, as with downcalls, multi-register returns are disabled as well. We need to figure out what the right protocol for those should be, maybe taking inspiration from Valhalla.
>> 
>> In the code I've moved some things from ProgrammableInvoker to SharedUtils in order to reuse for upcalls. I also added a bit of debugging code to BufferLayout for dumping stack arguments. Note that there is quite a bit of motion in ProgrammableUpcallHandler as well. This is mostly to untangle the part of an upcall that is specialized using MethodHandle combinators (invokeInterpBindings), from the part that is replaced by the VM's wrapper stub (invokeMoves). I've for instance introduced a simple helper class, called InterpretedHandler, which acts as a wrapper around the target MethodHandle for doing the old-style interpreted calls, instead of using the whole ProgrammableUpcallHandler instance.
>> 
>> I've also removed some of the test configurations. We now only test the default settings of various specialization and intrinsification flags when running TestDowncall, TestUpcall, and TestUpcallHighArity. This patch 2 more of these flags, making a total of 16 combinations, which would take way too long to test, and is probably not necessary. I gave the full testing matrix a run, and found no issues. We can repeat that process periodically to check that all the combinations still work. I've also added some more benchmark cases for the upcalls benchmark.
>> 
>> I also gave the Windows jextract samples a try with this patch and found no issues.
>> 
>> Here are some of the benchmark numbers (from Linux-x64):
>> 
>> Benchmark                Mode  Cnt    Score   Error  Units
>> Upcalls.jni_args10       avgt   30  134.220 ? 1.553  ns/op
>> Upcalls.jni_args5        avgt   30   87.340 ? 1.753  ns/op
>> Upcalls.jni_blank        avgt   30   57.590 ? 0.877  ns/op
>> Upcalls.jni_identity     avgt   30  109.859 ? 0.930  ns/op
>> Upcalls.panama_args10    avgt   30   30.606 ? 0.088  ns/op
>> Upcalls.panama_args5     avgt   30   23.908 ? 0.121  ns/op
>> Upcalls.panama_blank     avgt   30   20.314 ? 0.137  ns/op
>> Upcalls.panama_identity  avgt   30   24.894 ? 0.111  ns/op
>> While these are  the numbers with the optimizations disabled (quite a bit worse):
>> 
>> Upcalls.panama_args10    avgt   30  886.885 ? 22.133  ns/op
>> Upcalls.panama_args5     avgt   30  566.600 ? 12.442  ns/op
>> Upcalls.panama_blank     avgt   30  212.962 ?  2.386  ns/op
>> Upcalls.panama_identity  avgt   30  361.358 ?  3.212  ns/op
>> 
>> We beat JNI across the board, and also seem to scale a lot better for larger numbers of arguments. So, this seems like a success :)
>> 
>> Thanks,
>> Jorn
>
> test/jdk/java/foreign/TestUpcall.java line 33:
> 
>> 31:  * @run testng/othervm
>> 32:  *   -Dforeign.restricted=permit
>> 33:  *   TestUpcall
> 
> Here, in addition to reverting these changes, we should also test with all the new flavors of upcall specialization/intrinsics?

I've tried to explain this in the PR body. Testing the current configuration already takes 15 minutes for TestUpcall. Adding the new configurations would multiply the number of configurations by 4. This seems excessive, since we mostly only use the default configuration any ways. The full matrix can be tested periodically (maybe as part of a higher tier, or CI), but for regular testing runs we could only run the default configuration?

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/457