RFR: 8338379: Accesses to class init state should be properly synchronized [v2]

Mon Sep 30 09:32:35 UTC 2024

On Mon, 23 Sep 2024 07:17:50 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> See the bug for the discussion. We have not seen clear evidence that this is _the_ problem in the field, nor were we able to come up with a reproducer. We found this gap by inspecting the code while chasing a production bug.
>> 
>> In short, `InstanceKlass::_init_state` is used as the "witness" for initialized class state. When class initialization completes, it needs to publish the class state by writing `_init_state = _fully_initialized` with release semantics.
>> 
>> Various accessors that poll `IK::_init_state`, looking for class initialization to complete, need to read the field with acquire semantics. This is where the change fans out, touching VM, interpreter and compiler paths that, e.g., implement clinit barriers. In some cases in assembler code, we can rely on the hardware memory model to do what we need (i.e. acquire barriers/fences are nops).
>> 
>> I made my best _guess_ at what the ARM32, S390X, PPC64, and RISC-V code should look like, based on what related code does for volatile loads. It would be good if port maintainers could sanity-check those.
>> 
>> Additional testing:
>>  - [x] Linux x86_64 server fastdebug, `all`
>>  - [x] Linux AArch64 server fastdebug, `all`
>>  - [x] GHA to test platform buildability + adhoc platform cross-compilation
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Relax to just a release

I've done a bit of research, and it seems that the C2 clinit barrier is used only in a rare corner case, while the C1 parts are used more often. Peak performance doesn't seem to be affected, so I don't see any reason to optimize C2 either. The shared code LGTM. The more frequently used parts are in platform-specific code, so it might make sense to optimize the PPC64 parts. Also note that the "isync trick" (compare the loaded value, branch on it, then `isync`) is a faster acquire barrier than "lwsync". What do you think about this?

diff --git a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
index 61f654c9cfa..684c06614a9 100644
--- a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
@@ -2274,7 +2274,7 @@ void LIR_Assembler::emit_alloc_obj(LIR_OpAllocObj* op) {
     }
     __ lbz(op->tmp1()->as_register(),
            in_bytes(InstanceKlass::init_state_offset()), op->klass()->as_register());
-    __ lwsync(); // acquire
+    // acquire barrier included in membar_storestore() which follows the allocation immediately.
     __ cmpwi(CCR0, op->tmp1()->as_register(), InstanceKlass::fully_initialized);
     __ bc_far_optimized(Assembler::bcondCRbiIs0, __ bi0(CCR0, Assembler::equal), *op->stub()->entry());
   }
diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index e73e617b8ca..bf2b2540e35 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2410,7 +2410,7 @@ void MacroAssembler::verify_secondary_supers_table(Register r_sub_klass,
 void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fast_path, Label* L_slow_path) {
   assert(L_fast_path != nullptr || L_slow_path != nullptr, "at least one is required");
 
-  Label L_fallthrough;
+  Label L_check_thread, L_fallthrough;
   if (L_fast_path == nullptr) {
     L_fast_path = &L_fallthrough;
   } else if (L_slow_path == nullptr) {
@@ -2419,11 +2419,14 @@ void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fa
 
   // Fast path check: class is fully initialized
   lbz(R0, in_bytes(InstanceKlass::init_state_offset()), klass);
-  lwsync(); // acquire
+  // acquire by cmp-branch-isync if fully_initialized
   cmpwi(CCR0, R0, InstanceKlass::fully_initialized);
-  beq(CCR0, *L_fast_path);
+  bne(CCR0, L_check_thread);
+  isync();
+  b(*L_fast_path);
 
   // Fast path check: current thread is initializer thread
+  bind(L_check_thread);
   ld(R0, in_bytes(InstanceKlass::init_thread_offset()), klass);
   cmpd(CCR0, thread, R0);
   if (L_slow_path == &L_fallthrough) {
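
For readers less familiar with the pattern under discussion, here is a minimal portable sketch of the release/acquire publication that `_init_state` is meant to provide. It is an illustration only, not HotSpot code: `ToyKlass`, `initialize`, `is_initialized` and `run_publication_demo` are hypothetical names, and it uses `std::atomic` in place of HotSpot's own atomic primitives and of the PPC64 `lwsync`/`isync` sequences in the diff above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the class init states; values are illustrative.
enum ToyState : uint8_t { allocated = 0, being_initialized = 1, fully_initialized = 2 };

// Hypothetical stand-in for InstanceKlass.
struct ToyKlass {
  int static_field = 0;                        // data written during "initialization"
  std::atomic<uint8_t> init_state{allocated};  // the "witness" field
};

// Initializer side: write the data, then publish the state with release
// semantics so the data write cannot be reordered after the state write.
inline void initialize(ToyKlass& k) {
  k.init_state.store(being_initialized, std::memory_order_relaxed);
  k.static_field = 42;
  k.init_state.store(fully_initialized, std::memory_order_release);
}

// Reader side: poll the state with acquire semantics. A reader that observes
// fully_initialized is then guaranteed to observe the initializer's writes.
inline bool is_initialized(const ToyKlass& k) {
  return k.init_state.load(std::memory_order_acquire) == fully_initialized;
}

// Runs one initializer and one polling reader; returns what the reader saw
// in static_field after it observed fully_initialized.
inline int run_publication_demo() {
  ToyKlass k;
  int observed = -1;
  std::thread reader([&] {
    while (!is_initialized(k)) {
      std::this_thread::yield();
    }
    observed = k.static_field;
  });
  std::thread writer([&] { initialize(k); });
  writer.join();
  reader.join();
  return observed;
}
```

In this sketch the acquire load plays the role that `lbz` followed by `lwsync` (or by compare, branch and `isync`) plays in the PPC64 code; on platforms with a stronger hardware memory model the acquire may compile to a plain load.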

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21110#issuecomment-2382609010