Misaligned memory accesses from JDK

Fri Mar 17 10:50:19 UTC 2023

Hello
Continuing the discussion on misaligned memory accesses from the JDK.
Having heard no news from Yadong's team [4], I decided to take a look myself.
I have an FPGA with RISC-V cores; in this configuration it has no hardware support for misaligned loads/stores.
When such a memory access happens, it traps and the access is emulated in M-mode (very similar to what OpenSBI does).
I also have two perf counters, trp_lam and trp_sam (for misaligned loads and stores, respectively).
Using them together with perfasm.jar, I can track every misaligned access.
I started with the patch from Xiaolin Zheng (which fixes misaligned memory accesses when writing/reading instructions). It completely removed the trp_sam events, but trp_lam was mostly unaffected.
Using perfasm.jar, I found that the remaining trp_lam events originate from code generated by the Template Interpreter.
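
For readers less familiar with the issue, here is a minimal sketch of what "misaligned" means in this context, assuming the core requires natural alignment (an N-byte access must be at an address that is a multiple of N). The helper name is_misaligned is hypothetical and exists only for illustration; it is not part of HotSpot or of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only, not HotSpot code):
// returns true if an access of `size` bytes at address `addr`
// is not naturally aligned. `size` must be a power of two.
inline bool is_misaligned(std::uintptr_t addr, std::size_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);
    // For a power-of-two size, the low bits of an aligned address are zero.
    return (addr & (size - 1)) != 0;
}
```

Java bytecode operands follow the one-byte opcode without padding, so a 2-byte operand can easily start at an odd address. For example, is_misaligned(0x1001, 2) is true, and a single halfword load at that address would trap on a core like this one.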

Here are the numbers on current jdk21-dev (without Xiaolin's patch):
	java -Xint -version

   239163      trp_lam                                                     
     16289      trp_sam                                                     

      5.602736519 seconds time elapsed
      5.260201000 seconds user

Total executed instructions: 380M
That is about 1:1600 (trp_lam:total), a pretty high ratio.

I was able to identify and fix all of them (with Xiaolin's patch also applied).
New results:
	java -Xint -version

          0      trp_lam                                                     
          0      trp_sam                                                     

      4.273510055 seconds time elapsed
      3.926482000 seconds user

Notice the time improvement: elapsed time drops from 5.60 s to 4.27 s (about 24%).

I also ran renaissance philosophers in Xint mode for about 20 minutes:

    0      trp_lam                                                     
    0      trp_sam                                                     

   1290.397695196 seconds time elapsed

   2099.607472000 seconds user
       40.825845000 seconds sys

A clear win for this FPGA.

I can still get some trp_lam events when running java -Xcomp -version, but their number is pretty low (fewer than 50), and they come from C2-generated code.

Now I need to check whether these changes affect performance on real hardware (I don't want to hurt its performance).
java -Xint -version finishes too quickly there to measure, so I ran renaissance philosophers in Xint mode, just one repetition, multiple runs.

Checking on T-Head (C910 core):
before:
671-684 seconds

after:
657-689 seconds

The ranges overlap, so at least it is not worse.

On HiFive Unmatched:
before:
2638-2663 seconds

after:
1489-1504 seconds

The HiFive Unmatched clearly benefits from it: run time drops by roughly 44% (about 1.77x faster).

I would like to get some pre-review for my patch [1].
Main points:
 - the safety of using the t0/t1 registers.
 - the method void TemplateTable::load_unsigned_short_at_bcp(Register dst, int offset, Register tmp); maybe it should be designed differently [2] [3]

The patch [1] has some comments describing how many trp_lam events each change eliminated.
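
To make the second point easier to discuss, here is a plain C++ sketch of the general idea of reading an unsigned short from the bytecode stream with two byte loads instead of one halfword load. This is not the MacroAssembler code from the patch; the function name load_unsigned_short_bytewise and its signature are made up for illustration, and the sketch assumes the operand is stored big-endian, as bytecode operands are.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration (not HotSpot code): read a big-endian
// unsigned 16-bit operand at bcp + offset using two single-byte loads.
// Byte loads are always aligned, so this cannot raise a misaligned-load
// trap regardless of the value of bcp + offset.
inline std::uint16_t load_unsigned_short_bytewise(const std::uint8_t* bcp,
                                                  int offset) {
    assert(bcp != nullptr);
    std::uint16_t hi = bcp[offset];      // most significant byte comes first
    std::uint16_t lo = bcp[offset + 1];  // least significant byte follows
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

The trade-off is one extra load and a shift/or per operand in exchange for never trapping, which is the kind of cost/benefit the timings above are meant to check.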

Regards, Vladimir

 [1] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616
 [2] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-ecc50a63ee11d784ec34c55425afb755500a58f9ef4065cdc691fe18fce3692dR148
 [3] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-412c07ae1ae7770f87b04175c0d65ed3cc1f60dca186e3cfaf0af6b6d00b597eR104
 [4] https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000563.html