Misaligned memory accesses from JDK
vladimir.kempik at gmail.com
Fri Mar 17 10:50:19 UTC 2023
Continuing on misaligned memory accesses from JDK.
Hearing no news from Yadong's team, I have decided to take a look myself.
I have an FPGA with RISC-V cores; in this configuration it has no hardware support for misaligned loads/stores.
When such a memory access happens, it leads to a trap and the access is then handled by the M-mode emulator (very similar to OpenSBI).
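To illustrate what the trap-and-emulate path has to do, here is a minimal C++ sketch. It is not the actual M-mode emulator code: the names is_misaligned and emulate_load_u32_le are hypothetical, and it only assumes that RISC-V is little-endian and that single-byte loads never trap for alignment reasons.

```cpp
#include <cstddef>
#include <cstdint>

// True when an access of `size` bytes at `addr` is not naturally aligned.
// `size` must be a power of two (1, 2, 4, 8).
inline bool is_misaligned(std::uintptr_t addr, std::size_t size) {
    return (addr & (size - 1)) != 0;
}

// Hypothetical emulation of a little-endian 32-bit load using only byte
// loads, which are always aligned. A real emulator would also have to decode
// the trapping instruction and write the result to its destination register.
inline std::uint32_t emulate_load_u32_le(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

The byte loads themselves are cheap; the expensive part is the trap into M-mode and back, which is why avoiding the misaligned instruction in the first place pays off.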
Also I have two perf counters - trp_lam/trp_sam (for misaligned loads and stores resp.)
Using them and perfasm.jar I can track every misaligned access.
I started with the patch from Xiaolin Zheng (which fixes misaligned memory accesses when writing/reading instructions). That completely removed the trp_sam events, but trp_lam was mostly unaffected.
Using perfasm.jar I found that the remaining trp_lam events originate from the Template Interpreter's generated code.
Here are the numbers on current jdk21-dev (without Xiaolin's patch):
java -Xint -version
5.602736519 seconds time elapsed
5.260201000 seconds user
Total executed instructions: 380M
1:1600 (trp_lam:total), a pretty high ratio (about 237,500 misaligned loads).
I was able to identify and fix all of them. The numbers with my fixes and Xiaolin's patch applied:
java -Xint -version
4.273510055 seconds time elapsed
3.926482000 seconds user
Notice the time improvement: elapsed time drops from 5.60 s to 4.27 s, about 24%.
Also running renaissance philosophers in Xint mode for 20 minutes:
1290.397695196 seconds time elapsed
2099.607472000 seconds user
40.825845000 seconds sys
Clear win for this FPGA.
I can still get some trp_lam events when running java -Xcomp -version, but their number is pretty low (fewer than 50) and they come from C2-generated code.
Now I need to check whether these changes affect performance on real hardware (I don't want to regress it).
java -Xint -version finishes too quickly there to measure, so I ran renaissance philosophers in Xint mode, just one repetition, multiple runs.
Checking on T-Head (C910 core):
Good: it is not worse.
On HiFive Unmatched:
The HiFive benefits from it.
I would like to get some pre-review for my patch, in particular:
- the safety of using the t0/t1 registers.
- the method void TemplateTable::load_unsigned_short_at_bcp(Register dst, int offset, Register tmp); maybe it has to be designed differently.
The patch has some comments describing how many trp_lam events each change eliminated.
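For reviewers who want the idea without reading the assembler, here is a plain C++ sketch of what a helper like load_unsigned_short_at_bcp has to compute. This is an illustration under assumptions, not the code from the patch: load_u2_at_bcp is a hypothetical name, and the only facts relied on are that two-byte bytecode operands are stored big-endian and that byte loads cannot be misaligned. In generated code the extra tmp register would presumably hold the second byte before the two are combined.

```cpp
#include <cstdint>

// Hypothetical stand-in: read the big-endian u2 operand at bcp + offset
// using two byte loads instead of one (possibly misaligned) halfword load
// followed by a byte swap.
inline std::uint16_t load_u2_at_bcp(const std::uint8_t* bcp, int offset) {
    const std::uint16_t hi = bcp[offset];      // first byte is the high half
    const std::uint16_t lo = bcp[offset + 1];  // second byte is the low half
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

For example, for the byte sequence b6 01 02 (invokevirtual with constant pool index 0x0102), load_u2_at_bcp(code, 1) yields 0x0102 regardless of whether code + 1 is an even address. The trade-off is one extra load and a shift/or on every such access, which is why the measurements on hardware that handles misaligned accesses natively matter.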