RFR: 8255397: x86: coalesce reference and int entry points into vtos bytecodes

Wed Oct 28 08:20:27 UTC 2020

On Tue, 27 Oct 2020 19:46:29 GMT, Claes Redestad <redestad at openjdk.org> wrote:

>> It rubs me the wrong way that we are effectively changing `push_ptr` to `push_i` for `aep`. While it is implemented in the same manner in `interp_masm_x86.cpp` -- delegating to `push`, it still means if `push_i` implementation changes, `aep` would do the `push_i` _as if_ it is integer, not pointer. Ditto a change in `push_ptr` (adding verification, maybe?) would miss this code.
> 
> Verification is done explicitly with `__ verify_oop(..)` and friends, so it seems unlikely we'll overload `push_ptr` any time soon (and they have been semantically identical for many years, even before the merging of 32- and 64-bit `interp_masm_x86...`). I acknowledge this adds some fragility, but perhaps there are assertions we can add to check that `push_ptr` and `push_i` stay semantically the same?
> 
>> 
>> So, how much of the improvement we are talking about to sacrifice this?
> 
> A few hundred thousand instructions and branches on Hello World (seems unconditional jumps are logged as branches by `perf`?):
> 
> Baseline:
>        103,795,433      instructions              #    0.59  insn per cycle           ( +-  0.07% )
>         20,263,519      branches                  #  200.867 M/sec                    ( +-  0.08% )
>            731,187      branch-misses             #    3.61% of all branches          ( +-  0.15% )
> 
>        0.067306367 seconds time elapsed                                          ( +-  0.24% )
> 
> Patch:
>        103,466,523      instructions              #    0.59  insn per cycle           ( +-  0.07% )
>         20,068,162      branches                  #  201.935 M/sec                    ( +-  0.08% )
>            727,575      branch-misses             #    3.63% of all branches          ( +-  0.13% )
> 
>        0.066568115 seconds time elapsed                                          ( +-  0.27% )
> 
> For Hello World maybe half of that comes from reduced overhead of generating, the rest from quickening quite a few bytecode transitions. There's a scaling component (seen a few million instruction gains on slightly larger apps), but it's nothing huge.

Okay, so that is about 0.3% fewer instructions and ~1% fewer branches on Hello World. That's interesting.
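
For reference, this is how I get those percentages from the `perf` counters quoted above. It is only a small sketch of the arithmetic; `percent_reduction` is a name made up here for illustration.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction from baseline to patch, in percent.
// Hypothetical helper, only used to check the arithmetic in this mail.
double percent_reduction(double baseline, double patched) {
  return (baseline - patched) / baseline * 100.0;
}

// Counters copied from the quoted perf output.
double instructions_reduction() {
  return percent_reduction(103795433.0, 103466523.0);  // roughly 0.3
}

double branches_reduction() {
  return percent_reduction(20263519.0, 20068162.0);    // roughly 1
}
```

Both differences are well above the reported run-to-run variance of about 0.07% to 0.08%.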

Would rebalancing the order of the entry points give a similar improvement without messing up the code? For example, what happens if we move `aep` to be the last entry point, and set up `[bcsi]ep` for a short jump?

There is a middle ground, I think: introduce `push_i_or_ptr` and delegate it to `push`. That would make it clear which usages expect the `push_i` and `push_ptr` shapes to match, and if that later proves to be a problem, we could easily revert all the new usages to the old form.
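
To make the suggestion concrete, here is a rough standalone sketch of the shape I have in mind. It is not HotSpot code: `ToyInterpMasm`, the `Register` alias and `push_shapes_match` are stand-ins invented for this mail, and the only thing taken from the discussion is that `push_i` and `push_ptr` both delegate to `push`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for the interpreter macro assembler; NOT HotSpot code.
// A register is modelled as a plain machine word for illustration.
using Register = std::intptr_t;

class ToyInterpMasm {
 public:
  // Existing shape described in the thread: both helpers delegate to push().
  void push_i(Register r)   { push(r); }
  void push_ptr(Register r) { push(r); }

  // Hypothetical helper: used only by entry points that rely on int and
  // pointer pushes having the same shape, so those call sites are easy to find.
  void push_i_or_ptr(Register r) { push(r); }

  const std::vector<Register>& expression_stack() const { return _stack; }

 private:
  void push(Register r) { _stack.push_back(r); }
  std::vector<Register> _stack;
};

// Returns true when all three helpers leave the same stack contents.
bool push_shapes_match(Register value) {
  ToyInterpMasm a, b, c;
  a.push_i(value);
  b.push_ptr(value);
  c.push_i_or_ptr(value);
  return a.expression_stack() == b.expression_stack() &&
         b.expression_stack() == c.expression_stack();
}
```

If `push_i` or `push_ptr` ever diverge, only the `push_i_or_ptr` call sites need to be revisited, and a check along the lines of `push_shapes_match` is where an assertion could live.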

-------------

PR: https://git.openjdk.java.net/jdk/pull/865