Missaligned memory accesses from JDK

Mon Mar 20 07:29:36 UTC 2023

Hi,

  Just did a quick scan of the changes. I have a few comments.

  It's interesting to see that changes were made in hotspot shared code, especially in file: src/hotspot/share/asm/codeBuffer.hpp 
  For each emit_intX function modified, I see there is a corresponding version that handles unaligned access, for example 'void emit_int16(uint8_t x1, uint8_t x2)' for 'void emit_int16(uint16_t x)'. 
  So if we encounter an unaligned access issue when using 'emit_int16(uint16_t x)', shouldn't we change the callsite to use 'emit_int16(uint8_t x1, uint8_t x2)' instead? 
  I think this will also be a potential issue for other platforms with respect to functionality or performance. It doesn't look nice for us to handle that in a platform-dependent way. 
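
  To make the two flavours concrete, here is a minimal standalone sketch. ToyBuffer is a hypothetical stand-in and not HotSpot's real code section class. It only mirrors the two emit_int16 signatures quoted above, and the memcpy in the whole-value overload is my assumption of a portable alternative, not what codeBuffer.hpp actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical toy buffer, only for illustration (not HotSpot code).
struct ToyBuffer {
    uint8_t data[16] = {};
    std::size_t end = 0;  // offset of the next free byte

    // Byte-wise variant: two single-byte stores, so the current
    // alignment of 'end' does not matter.
    void emit_int16(uint8_t x1, uint8_t x2) {
        assert(end + 2 <= sizeof(data));
        data[end++] = x1;
        data[end++] = x2;
    }

    // Whole-value variant: memcpy leaves it to the compiler to decide
    // whether a single halfword store is safe on the target.
    // The resulting byte order is that of the host.
    void emit_int16(uint16_t x) {
        assert(end + 2 <= sizeof(data));
        std::memcpy(data + end, &x, sizeof(x));
        end += sizeof(x);
    }
};
```

  With 'end' at an odd offset, both overloads stay well defined here, whereas a store through a casted uint16_t pointer at that address would be the kind of access that traps on cores without misaligned support.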

  Also, instead of changing src/hotspot/share/interpreter/templateTable.hpp to add the new function 'load_unsigned_short_at_bcp', I personally prefer inlining it at its call sites. 

  And what about C1 & C2 JIT compilers? 

Thanks,
Fei Yang

-----Original Messages-----
From:"Vladimir Kempik" <vladimir.kempik at gmail.com>
Sent Time:2023-03-17 18:50:19 (Friday)
To: riscv-port-dev <riscv-port-dev at openjdk.org>
Cc: yunyao.zxl at alibaba-inc.com
Subject: Missaligned memory accesses from JDK

Hello  
  Continuing on misaligned memory accesses from JDK.  
  Hearing no news from Yadong's team [4], I have decided to take a look myself.   

  I have an FPGA with RISC-V cores; in this configuration it has no support for misaligned loads/stores.  
  When such a memory access happens, it leads to a trap and then the M-mode emulator is used (very similar to OpenSBI).  
  I also have two perf counters, trp_lam and trp_sam (for misaligned loads and stores, respectively).  
  Using them and perfasm.jar I can track every misaligned access.  
  I have started with the patch from Xiaolin Zheng (which fixes misaligned memory access when writing/reading instructions), that completely removed trp_sam events, but trp_lam was mostly unaffected.  
  Using perfasm.jar I have found the rest of trp_lam to originate from Template Interpreter's generated code.  

  Here are the numbers on current jdk21-dev (without Xiaolin's patch):  
  java -Xint -version  

  239163      trp_lam                                                       
  16289      trp_sam                                                       

  5.602736519 seconds time elapsed  
  5.260201000 seconds user  

  Total executed instructions - 380M  
  1:1600 (trp_lam:total) - a pretty high ratio.  

  I was able to identify and fix all of them (also applying Xiaolin's patch)  
  New results:  
  java -Xint -version  

  0      trp_lam                                                       
  0      trp_sam                                                       

  4.273510055 seconds time elapsed  
  3.926482000 seconds user  

  Notice the time improvement.  

  Also running renaissance philosophers in Xint mode for 20 minutes:  

  0      trp_lam                                                       
  0      trp_sam                                                       

  1290.397695196 seconds time elapsed  

  2099.607472000 seconds user  
  40.825845000 seconds sys  

  Clear win, for this fpga.  

  I can still get some trp_lam when running java -Xcomp -version, but their number is pretty low (less than 50) and they come from C2 generated code.  

  Now I need to check whether these changes affect performance on real hardware (I don't want to hurt it there).  
  java -Xint -version is too fast for that, so I ran renaissance philosophers in Xint mode, just one repetition, multiple runs.  

  Checking on Thead (c910 core):  
  before:  
  671-684 seconds  

  after:  
  657-689 seconds  

  Good, it's not worse.  

  On HiFive Unmatched:  
  before:  
  2638-2663 seconds  

  after:  
  1489-1504 seconds  

  HiFive benefits from it.  

  I would like to get some pre-review for my patch [1]  
  Main points:  
  - safety of using the t0/t1 registers.  
  - the method void TemplateTable::load_unsigned_short_at_bcp(Register dst, int offset, Register tmp), maybe it has to be designed differently [2] [3]  
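
  For readers who have not opened the patch, the idea behind such a helper can be sketched in plain C++. This is only an illustration under assumptions: load_unsigned_short_bytewise is a hypothetical name, and the real method emits RISC-V instructions rather than running C++. It shows how a big-endian u2 bytecode operand can be assembled from two byte loads, so no halfword load ever touches an odd address.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the method from the patch: reads the
// big-endian unsigned 16-bit operand at bcp + offset with two
// single-byte loads, then combines them with a shift and an or.
inline uint16_t load_unsigned_short_bytewise(const uint8_t* bcp, int offset) {
    assert(bcp != nullptr && offset >= 0);
    uint16_t hi = bcp[offset];      // first byte is the high half
    uint16_t lo = bcp[offset + 1];  // second byte is the low half
    return static_cast<uint16_t>((hi << 8) | lo);
}
```

  Since Java bytecode operands are stored big-endian, this form also removes the need for a separate byte swap after the load on a little-endian core.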

  The patch [1] has some comments describing how many trp_lam events each change eliminated.  

  Regards, Vladimir  

  [1] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616  
  [2] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-ecc50a63ee11d784ec34c55425afb755500a58f9ef4065cdc691fe18fce3692dR148  
  [3] https://github.com/VladimirKempik/jdk/commit/18d7f399ce1bc213b2495411193938d914d3f616#diff-412c07ae1ae7770f87b04175c0d65ed3cc1f60dca186e3cfaf0af6b6d00b597eR104   
  [4] https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000563.html
