RFR: 8363620: AArch64: reimplement emit_static_call_stub() [v3]

Andrew Haley aph at openjdk.org
Thu Dec 4 11:52:00 UTC 2025


On Tue, 2 Dec 2025 13:54:01 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In the existing implementation, the static call stub typically emits a sequence like:
>> `isb; movk; movz; movz; movk; movz; movz; br`.
>> 
>> This patch reimplements it using a more compact and patch-friendly sequence:
>> 
>> ldr x12, Label_data
>> ldr x8, Label_entry
>> br x8
>> Label_data:
>>   0x00000000
>>   0x00000000
>> Label_entry:
>>   0x00000000
>>   0x00000000
>> 
>> The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an `isb`[1], which are both relatively expensive operations.
>> 
>> While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an `isb`. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
>> 
>> A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%.
>> 
>> 
>> Benchmark                       (length)   Mode   Cnt Master     Patch      Units
>> StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
>> StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
>> StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
>> StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
>> StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
>> StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op
>> 
>> 
>> All tests in Tier1 to Tier3, under both release and debug builds, have passed.
>> 
>> [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
> 
>  - Update comments and fix benchmarks
>  - The patch is contributed by @theRealAph
>  - Merge branch 'master' into reimplement-static-call-stub
>  - Patch 'isb' to 'nop'
>  - Merge branch 'master' into reimplement-static-call-stub
>  - 8363620: AArch64: reimplement emit_static_call_stub()
>    
>    In the existing implementation, the static call stub typically
>    emits a sequence like:
>    `isb; movk; movz; movz; movk; movz; movz; br`.
>    
>    This patch reimplements it using a more compact and patch-friendly
>    sequence:
>    ```
>    ldr x12, Label_data
>    ldr x8, Label_entry
>    br x8
>    Label_data:
>      0x00000000
>      0x00000000
>    Label_entry:
>      0x00000000
>      0x00000000
>    ```
>    The new approach places the target addresses adjacent to the code
>    and loads them dynamically. This allows us to update the call
>    target by modifying only the data in memory, without changing any
>    instructions. This avoids the need for I-cache flushes or
>    issuing an `isb`[1], which are both relatively expensive
>    operations.
>    
>    While emitting direct branches in static stubs for small code
>    caches can save 2 bytes compared to the new implementation,
>    modifying those branches still requires I-cache flushes or an
>    `isb`. This patch unifies the code generation by emitting the
>    same static stubs for both small and large code caches.
>    
>    A microbenchmark (StaticCallStub.java) demonstrates a performance
>    uplift of approximately 43%.
>    
>    Benchmark                       (length)   Mode   Cnt Master     Patch      Units
>    StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
>    StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
>    StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
>    StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
>    StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
>    StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op
>    
>    All tests in Tier1 to Tier3, under both release and debug builds,
>    have passed.
>    
>    [1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-...

src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp line 180:

> 178:     nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
> 179:   }
> 180: 

Suggestion:


  // This code is executed while other threads are running. We must
  // ensure that at all times there is a valid path of execution. A
  // racing thread either observes a call (possibly via a trampoline)
  // to SharedRuntime::resolve_static_call_C or a complete call to the
  // interpreter.
  //
  // If a racing thread observes an updated direct branch at a call
  // site, it must also observe all of the updated instructions in the
  // static interpreter stub.
  //
  // To ensure this, we first update the static interpreter stub, then
  // the trampoline, then the direct branch at the call site.
  //
  // AArch64 stub_via_BL
  // {
  // 0:X0=instr:"MOV w0, #2";
  // 0:X1=instr:"BL .+16";
  // 0:X10=P1:new;
  // 0:X11=P1:L0;
  // }
  //
  // P0              |  P1            ;
  // STR W0, [X10]   |L0:             ;
  // DC CVAU, X10    |  BL old        ;
  // DSB ISH         |  B end         ;
  // IC IVAU, X10    |old:            ;
  // DSB ISH         |  MOV w0, #0    ;
  //                 |  RET           ;
  // STR W1, [X11]   |new:            ;
  //                 |  MOV w0, #1    ;
  //                 |  RET           ;
  //                 |end:            ;
  // forall(1:X0=0 \/ 1:X0=2)
  //
  // We maintain an invariant: every call site either points directly
  // to the call destination or to the call site's trampoline. The
  // trampoline points to the call destination. Even if the trampoline
  // is not in use, and therefore not reachable, it still points to
  // the call destination.
  //
  // If a racing thread reaches the static call stub via a trampoline,
  // we must ensure that it observes the static call stub in
  // full. Initially we place an ISB at the start of the static call
  // stub. After we update the static call stub we rewrite the ISB
  // with 'B .+4'. A racing thread either observes the ISB or the
  // branch. Once the stub has been rewritten and the instruction and
  // data caches have been synchronized to the point of unification by
  // ICache::invalidate_range, either is sufficient to ensure that the
  // subsequent instructions are observed.
  //
  // As confirmed by the litmus test below, when a racing thread
  // reaches the static call stub:
  //   - If it observes the 'B .+4', it will also observe the updated 'MOV's
  //   - Or, it will execute the 'ISB' - the instruction fetch ensures
  //     the updated 'MOV's are observed.
  //
  // AArch64 stub_via_BR
  // {
  // [target] = P1:old;
  //
  //                               1:X0 = 0;
  // 0:X1 = instr:"MOV X0, #3";
  // 0:X2 = instr:"b .+4";
  // 0:X3 = target;                1:X3 = target;
  // 0:X4 = P1:new;
  // 0:X5 = P1:patch;
  // }
  //
  // P0                          | P1                        ;
  // STR W1, [X5]                |  LDR X2, [X3]             ;
  // DC CVAU, X5                 |  BR X2                    ;
  // DSB ISH                     |new:                       ;
  // IC IVAU, X5                 |  ISB                      ;
  // DSB ISH                     |patch:                     ;
  // STR W2, [X4]                |  MOV X0, #2               ;
  // STR X4, [X3]                |  B end                    ;
  //                             |old:                       ;
  //                             |  MOV X0, #1               ;
  //                             |  B end                    ;
  //                             |end:                       ;
  // forall (1:X0=1 \/ 1:X0=3)

  NativeJump::insert(stub, stub + NativeJump::instruction_size);

  address trampoline_stub_addr = _call->get_trampoline();
  if (trampoline_stub_addr != nullptr) {
    nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
  }

  // Update jump to call.                                                                                                                                                           
  _call->set_destination(stub);
}
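
As a rough illustration of the update order described in the comment above (stub first, then trampoline, then call site), here is a small standalone C++ model. It is only an analogy in terms of the C++ memory model: every type and function name in it is hypothetical and none of it is HotSpot code. It also says nothing about instruction-fetch coherence, which is what the litmus tests and the cache maintenance are actually about. It shows only the publication order for the data words.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of the static call stub's two data words.
struct StaticStubModel {
  std::atomic<std::uint64_t> data{0};   // models Label_data
  std::atomic<std::uint64_t> entry{0};  // models Label_entry
};

// Hypothetical model of a call site and its trampoline. nullptr stands
// for "still calls the resolver" (SharedRuntime::resolve_static_call_C).
struct CallSiteModel {
  std::atomic<const StaticStubModel*> trampoline_dest{nullptr};
  std::atomic<const StaticStubModel*> call_dest{nullptr};
};

// What a racing thread sees when it executes the call.
struct Observed {
  bool via_stub;
  std::uint64_t data;
  std::uint64_t entry;
};

// Writer side: fill in the stub, then the trampoline, then the call
// site. The release stores publish the stub contents written before them.
inline void set_to_interpreted_model(StaticStubModel& stub, CallSiteModel& site,
                                     std::uint64_t data, std::uint64_t entry) {
  stub.data.store(data, std::memory_order_relaxed);
  stub.entry.store(entry, std::memory_order_relaxed);
  site.trampoline_dest.store(&stub, std::memory_order_release);
  site.call_dest.store(&stub, std::memory_order_release);
}

// Reader side: either the call still goes to the resolver, or it reaches
// the stub and (via the acquire load) sees both data words fully written.
inline Observed execute_call_model(const CallSiteModel& site) {
  const StaticStubModel* stub = site.call_dest.load(std::memory_order_acquire);
  if (stub == nullptr) {
    return Observed{false, 0, 0};
  }
  return Observed{true,
                  stub->data.load(std::memory_order_relaxed),
                  stub->entry.load(std::memory_order_relaxed)};
}

// Single-threaded self-check of the two states a reader may observe.
inline void model_self_check() {
  StaticStubModel stub;
  CallSiteModel site;
  assert(!execute_call_model(site).via_stub);
  set_to_interpreted_model(stub, site, 0x1234, 0x5678);
  Observed o = execute_call_model(site);
  assert(o.via_stub && o.data == 0x1234 && o.entry == 0x5678);
}
```

In this model a reader can never reach the stub through the call site and find it half-initialized, which is the "valid path of execution at all times" property the comment asks for.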

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26638#discussion_r2588743655

