Towards finalizing the linker implementation

Tue Oct 5 11:17:53 UTC 2021

Hi,

Now that we are talking about finalizing the linker and memory access 
APIs, I thought it would be good to talk about what I think still needs 
to be done to finalize the implementation of the linker, mostly so that 
potential porters know what to expect.

The linker has 2 main parts: downcall and upcall support. For both of 
these there are currently 4 flavors. Both the downcall and upcall 
support consists of a low-level and a high-level part. The low-level 
part takes care of shuffling primitives back and forth between registers 
in VM code, and the high-level part takes care of boxing and unboxing
these primitives into, for instance, MemoryAddresses or MemorySegments.
The low-level part has 2 modes: 'buffered' and 'optimized'. The
'optimized' mode is faster than the 'buffered' mode, but currently
doesn't support all kinds of functions. The high-level part also has 2
modes: 'interpreted' and 'specialized'. The 2 modes for each part make
2 x 2 = 4 flavors. One important note is that the low-level part
requires implementations in the VM for each architecture, while the 
high-level part is implemented completely in Java.
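
To make the 2 x 2 breakdown concrete, here is a minimal sketch that just
enumerates the combinations. The class and enum names are made up for
illustration and do not correspond to anything in the actual
implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names, for illustration only.
class FlavorSketch {
    // Modes of the low-level part (implemented in the VM, per architecture).
    enum LowLevel { BUFFERED, OPTIMIZED }

    // Modes of the high-level part (implemented in Java).
    enum HighLevel { INTERPRETED, SPECIALIZED }

    // Every low-level mode can be paired with every high-level mode.
    static List<String> allFlavors() {
        List<String> flavors = new ArrayList<>();
        for (LowLevel low : LowLevel.values()) {
            for (HighLevel high : HighLevel.values()) {
                flavors.add(low + "+" + high);
            }
        }
        return flavors;
    }

    public static void main(String[] args) {
        // Prints the 4 flavors, one per line.
        allFlavors().forEach(System.out::println);
    }
}
```

The same 4 flavors exist for both downcalls and upcalls.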

Two years ago, we thought that we only needed the buffered invocation
strategy for the low-level part, and that C2 could handle the heavy
lifting as far as optimization was concerned. This turned out to be
harder than expected, partly because of instruction scheduling issues,
and partly because current VM and GC code expect calls to native code to
go through an intermediate 'native wrapper' which has its own frame. As
a result, the current implementation ended up with 2 pretty much
separate implementations of the low-level part of downcalls. This makes
maintenance and porting efforts harder, so I think we should ultimately
add the missing support for certain function types (namely those that
pass arguments on the stack, and those that return values in multiple
registers) to the 'optimized' mode, and then remove the 'buffered'
invocation strategy.
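
As a rough sketch of the current situation, mode selection can be thought
of as a check on the shape of the target function. Everything below
(class, field and method names) is hypothetical and only mirrors the two
unsupported cases named above; it is not how the real code is structured.

```java
// Hypothetical names, for illustration only.
class ModeSelectionSketch {
    enum LowLevelMode { BUFFERED, OPTIMIZED }

    // Hypothetical summary of what a native function needs from the
    // calling convention.
    static final class CallShape {
        final boolean passesArgsOnStack;
        final boolean returnsInMultipleRegisters;

        CallShape(boolean passesArgsOnStack, boolean returnsInMultipleRegisters) {
            this.passesArgsOnStack = passesArgsOnStack;
            this.returnsInMultipleRegisters = returnsInMultipleRegisters;
        }
    }

    // Use the fast path only when neither unsupported feature is needed;
    // otherwise fall back to the buffered strategy.
    static LowLevelMode choose(CallShape shape) {
        boolean supported = !shape.passesArgsOnStack
                && !shape.returnsInMultipleRegisters;
        return supported ? LowLevelMode.OPTIMIZED : LowLevelMode.BUFFERED;
    }

    public static void main(String[] args) {
        System.out.println(choose(new CallShape(false, false)));
        System.out.println(choose(new CallShape(true, false)));
    }
}
```

Once the 'optimized' mode handles both cases, a check like this and the
fallback it guards would no longer be needed.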

I have been working on a patch towards this goal. It makes some of the 
work that C2 does for downcalls more eager (namely spinning the 
mentioned 'native wrapper'), so that the 'optimized' mode for downcalls 
can in the future replace the 'buffered' mode completely. As a side 
effect of this, 'virtual calls', calls where the address of the target 
function is passed in as an argument, become a lot faster, and support 
for passing stack arguments is almost a matter of 'just turning it on'.

The 'specialized' mode of the high-level part for up/downcalls is 
currently implemented completely using method handle combinators. I have 
discussed this with Maurizio and, while using method handle combinators 
can work, the code has reached a level of complexity where it has become 
hard to maintain. We think switching from method handle
combinators to bytecode spinning with ASM will make this code
easier to maintain (also because a lot more people are familiar with 
ASM). The 'interpreted' mode is really simple, so there is probably no 
need to remove that, for now.
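
For readers less familiar with the combinator style, here is a small,
self-contained sketch of the general idea: a primitive-only target is
wrapped with boxing and unboxing filters. The Addr class and the
lowLevelCall method are hypothetical stand-ins (not MemoryAddress or a
real native entry point); only the java.lang.invoke calls are real API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical names, for illustration only.
class CombinatorSketch {
    // Stand-in for a boxed pointer type such as MemoryAddress.
    static final class Addr {
        final long raw;
        Addr(long raw) { this.raw = raw; }
    }

    // Stand-in for the low-level part, which only deals in primitives.
    static long lowLevelCall(long ptr, int offset) { return ptr + offset; }

    static long unbox(Addr a) { return a.raw; }
    static Addr box(long raw) { return new Addr(raw); }

    // Builds an (Addr, int) -> Addr handle out of a (long, int) -> long one.
    static MethodHandle specialize() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle target = lookup.findStatic(CombinatorSketch.class,
                "lowLevelCall",
                MethodType.methodType(long.class, long.class, int.class));
        MethodHandle unbox = lookup.findStatic(CombinatorSketch.class,
                "unbox", MethodType.methodType(long.class, Addr.class));
        MethodHandle box = lookup.findStatic(CombinatorSketch.class,
                "box", MethodType.methodType(Addr.class, long.class));
        // Unbox argument 0 before the call: (Addr, int) -> long
        MethodHandle unboxed = MethodHandles.filterArguments(target, 0, unbox);
        // Box the result after the call: (Addr, int) -> Addr
        return MethodHandles.filterReturnValue(unboxed, box);
    }

    // Convenience wrapper that hides the checked exceptions.
    static long run(long ptr, int offset) {
        try {
            Addr result = (Addr) specialize().invokeExact(new Addr(ptr), offset);
            return result.raw;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(40L, 2)); // prints 42
    }
}
```

With one pointer argument this is easy to follow, but the real code has to
compose many such adaptations per call, which is where the complexity
comes from.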

I have the following timeline in mind:

1. Finish the patch I'm working on right now: making the 'native
wrapper' generation more eager, and decoupled from C2.
2. Switch from method handle combinators to ASM for the 'specialized' mode.
3. Implement stack argument and multi-register return support for both 
downcalls and upcalls in the 'optimized' mode.
4. Bring the AArch64 port up to the same level (currently missing the 
'optimized' mode for upcalls).
5. Remove the buffered invocation strategy.

After #3 is implemented, I think it would be a good time for porters 
working on other platforms to start looking at this as well. They would 
only need to implement the 'optimized' modes of up/downcalls. Nick
Gasson from ARM has already done a stellar job so far with the AArch64 
port, but at the time that they started, I don't think we anticipated 
the amount of changes still needed to the VM code.

As for the timeline: none of these things are blockers for finalizing
the API, and they could be implemented afterwards as well.

Cheers,
Jorn