RFR: 8313406: nep_invoker_blob can be simplified more

Yasumasa Suenaga ysuenaga at openjdk.org
Wed Aug 2 12:38:51 UTC 2023


On Wed, 2 Aug 2023 02:12:43 GMT, Jorn Vernee <jvernee at openjdk.org> wrote:

>> In FFM, a native function is called via `nep_invoker_blob`. For a function with two arguments, the stub looks like this:
>> 
>> 
>> Decoding RuntimeStub - nep_invoker_blob 0x00007fcae394cd10
>> --------------------------------------------------------------------------------
>>   0x00007fcae394cd80: pushq %rbp
>>   0x00007fcae394cd81: movq %rsp, %rbp
>>   0x00007fcae394cd84: subq $0, %rsp
>>  ;; { argument shuffle
>>   0x00007fcae394cd88: movq %r8, %rax
>>   0x00007fcae394cd8b: movq %rsi, %r10
>>   0x00007fcae394cd8e: movq %rcx, %rsi
>>   0x00007fcae394cd91: movq %rdx, %rdi
>>  ;; } argument shuffle
>>   0x00007fcae394cd94: callq *%r10
>>   0x00007fcae394cd97: leave
>>   0x00007fcae394cd98: retq
>> 
>> 
>> `subq $0, %rsp` is for shadow space on the stack, and `movq %r8, %rax` is the number of args for a variadic function, so they are not necessary in some cases. They should be removed when they are not needed, as follows:
>> 
>> 
>> Decoding RuntimeStub - nep_invoker_blob 0x00007fd8778e2810
>> --------------------------------------------------------------------------------
>>   0x00007fd8778e2880: pushq %rbp
>>   0x00007fd8778e2881: movq %rsp, %rbp
>>  ;; { argument shuffle
>>   0x00007fd8778e2884: movq %rsi, %r10
>>   0x00007fd8778e2887: movq %rcx, %rsi
>>   0x00007fd8778e288a: movq %rdx, %rdi
>>  ;; } argument shuffle
>>   0x00007fd8778e288d: callq *%r10
>>   0x00007fd8778e2890: leave
>>   0x00007fd8778e2891: retq
>> 
>> 
>> All java/foreign jtreg tests pass.
>> 
>> This stub code can be seen by running the [ffmasm testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/examples/cpumodel) with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintStubCode` and the hsdis library. This testcase links the code with `Linker.Option.isTrivial()`.
>> 
>> After this change, FFM performance on [another ffmasm testcase](https://github.com/YaSuenag/ffmasm/tree/ef7a466ca9607164dbe7be7e68ea509d4bdac998/benchmarks/funccall) was improved:
>> 
>> before:
>> 
>> Benchmark                           Mode  Cnt          Score          Error  Units
>> FuncCallComparison.invokeFFMRDTSC  thrpt    3  106664071.816 ± 14396524.718  ops/s
>> FuncCallComparison.rdtsc           thrpt    3  108024079.738 ± 13223921.011  ops/s
>> 
>> 
>> after:
>> 
>> Benchmark                           Mode  Cnt          Score          Error  Units
>> FuncCallComparison.invokeFFMRDTSC  thrpt    3  107622971.525 ± 12249767.134  ops/s
>> FuncCallComparison.rdtsc           thrpt    3  107695741.608 ± 23983281.346  ops/s
>> 
>> 
>> Environment:
>> * CPU: AMD Ry...
>
> FWIW, if you want to look into reducing the generated code further, I think we can potentially reduce the amount of shuffling between registers that's needed by reordering the arguments on the Java side so that each VMStorage corresponding to an argument of the leaf method handle is the same as the register for that argument in the Java calling convention.
> 
> I think the right place to do this is in DowncallLinker where we are creating the NativeEntryPoint. The way I think it should work: 
> 1. compute the Java calling convention's argument registers for the leaf method type.
> 2. compute a re-ordered VMStorage[] for the arguments, and a re-ordered method type, such that the VMStorage/type for a particular argument index matches the register for the same index used in the Java calling convention as much as possible.
> 3. use the re-ordered VMStorage[] + MethodType to create the native entry point + native method handle
> 4. apply the same reordering in reverse to the arguments of the created native method handle (using MethodHandles::permuteArguments) so that the resulting method handle has the original argument order/method type.
> 
> Pushing this shuffling to the Java side will allow the JIT to reduce data motion, and this should result in reduced shuffling being needed overall I think.

@JornVernee Thanks for your review! I will integrate this once I get a second reviewer.

> I think we can potentially reduce the amount of shuffling between registers that's needed by reordering the arguments on the Java side so that each VMStorage corresponding to an argument of the leaf method handle is the same as the register for that argument in the Java calling convention.

That would be great! I guess you are suggesting that `ArgumentShuffle` in HotSpot should move into `DowncallLinker`, right? To be honest, I do not fully understand this yet, and I have no testbed other than Linux x64, so it is difficult for me to work on this now.

Again, this idea is great. I'd like to call native functions via FFM with less overhead, so I'm happy to help if I can.
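
For reference, step 4 of the quoted suggestion (undoing the reordering with `MethodHandles::permuteArguments`) could be sketched roughly as below. This is only an illustration under assumptions: `PermuteSketch`, `reorderedTarget`, `restoreOriginalOrder` and `demo` are hypothetical names, an ordinary static Java method stands in for the native method handle, and the reordering `(String, int, long)` is chosen arbitrarily rather than derived from any real calling convention. It is not the `DowncallLinker` code.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PermuteSketch {
    // Hypothetical stand-in for the handle created from the re-ordered
    // MethodType; its parameters are in the re-ordered order (String, int, long).
    static String reorderedTarget(String c, int a, long b) {
        return "a=" + a + ", b=" + b + ", c=" + c;
    }

    // Wraps the re-ordered handle so that callers see the original
    // order (int, long, String) again.
    static MethodHandle restoreOriginalOrder() throws ReflectiveOperationException {
        MethodHandle reordered = MethodHandles.lookup().findStatic(
                PermuteSketch.class, "reorderedTarget",
                MethodType.methodType(String.class, String.class, int.class, long.class));
        MethodType original =
                MethodType.methodType(String.class, int.class, long.class, String.class);
        // reorder[i] is the index in the original argument list that feeds
        // parameter i of the re-ordered target: c <- 2, a <- 0, b <- 1.
        return MethodHandles.permuteArguments(reordered, original, 2, 0, 1);
    }

    // Calls the wrapped handle with arguments in the original order.
    static String demo() {
        try {
            return (String) restoreOriginalOrder().invoke(1, 2L, "x");
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The caller keeps the original signature, while the underlying handle receives its arguments in the re-ordered positions; the idea in the suggestion is that the JIT can then handle this data motion on the Java side.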

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15089#issuecomment-1662132113


More information about the core-libs-dev mailing list