The usage of fence.i in OpenJDK
Palmer Dabbelt
palmer at dabbelt.com
Sat Aug 6 18:15:12 UTC 2022
On Fri, 05 Aug 2022 15:06:32 PDT (-0700), vladimir.kempik at gmail.com wrote:
> More on this subject
> I can see that the use of ifence() in the code is identical to the use of isb() in aarch64.
> Checking the documentation for fence.i and isb, I don’t see them as 1:1 identical.
>
> fence.i ( https://five-embeddev.com/riscv-isa-manual/latest/zifencei.html ):
> The FENCE.I instruction provides explicit synchronization between writes to instruction memory and instruction fetches on the same hart.
>
> ISB ( https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail ):
> An ISB flushes the pipeline, and re-fetches the instructions from the cache or memory and ensures that the effects of any completed context-changing operation before the ISB are visible to any instruction after the ISB. It also ensures that any context-changing operations after the ISB instruction only take effect after the ISB has been executed and are not seen by instructions before the ISB.
>
> To me it sounds like isb (in aarch64) does the job a bit differently than fence.i (in rv64).
Broadly speaking I'd agree, with the caveat that there's no formal
description of the RISC-V instruction fetch ordering requirements so
it's kind of hard to tell exactly what fence.i does in detail.
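That said, the one ordering the Zifencei text quoted above clearly does
promise is the single-hart, store-to-fetch one. A minimal sketch of just
that guarantee (the RWX buffer helper is hypothetical; 0x00008067 encodes
a RISC-V "ret"):

    #include <cstdint>

    uint32_t *buf = get_rwx_code_buffer();  // hypothetical: an RWX mapping
    buf[0] = 0x00008067;                    // store a new instruction ("ret")
    asm volatile ("fence.i" ::: "memory");  // order the store before fetch,
                                            // on the executing hart only
    ((void (*)())buf)();                    // safe, but only while we stay
                                            // on this same hart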
> So, I think here:
>
> __ la_patchable(t0, RuntimeAddress(CAST_FROM_FN_PTR(address, SharedRuntime::fixup_callers_callsite)), offset);
> __ jalr(x1, t0, offset);
>
> // Explicit fence.i required because fixup_callers_callsite may change the code
> // stream.
> __ safepoint_ifence();
>
> __ pop_CPU_state();
> // restore sp
> __ leave();
> __ bind(L);
>
> we still have a small chance of starting to execute invalid (old) code from L1I if, right after safepoint_ifence(), our thread is moved to another hart. Conversely, if fixup_callers_callsite called icache_flush() somewhere inside, then safepoint_ifence() wouldn’t be needed here.
I don't know enough about the port to tell for sure, but generally
speaking any code that executes a fence.i directly (i.e., not via the
VDSO call) can't rely on the orderings implied by the ISA for
correctness. I can imagine cases where a fence.i is the right thing to
do for a performance reason (maybe the old code is safe, but it's better
for performance to eagerly move to the new code), but with such a loose
definition of fetch ordering that's going to require a lot of
implementation-specific dependencies for correctness.
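To make the migration hazard concrete, the pattern being discussed looks
roughly like this (patch_code() and code_addr are hypothetical):

    patch_code(code_addr);                  // hart 1: write new instructions
    asm volatile ("fence.i" ::: "memory");  // hart 1: local fetch sync only
    // <- if the kernel migrates this thread to hart 2 at this point, hart
    //    2's L1I was never invalidated and may still hold old instructions
    ((void (*)())code_addr)();              // may execute stale code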
I wouldn't be at all surprised if the VDSO call is way too slow for some
important workloads, but that's an issue we'd need to fix. Exactly what
the fix is will depend on what's wrong with the VDSO call, but I think
there are two main possibilities: either the VDSO call is too slow because it
enters the kernel, or userspace can't emit a call at all because of some
other restrictions (maybe it doesn't know where the VDSO is yet, there
are stack issues, etc.). Those should both be fixable, but there are lots
of things to fix so it just hasn't been a priority.
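For completeness, the supported userspace path on Linux is the
riscv_flush_icache syscall, reachable through the VDSO or through glibc's
wrapper. A sketch, assuming glibc >= 2.27 and a freshly patched range
[start, end):

    #include <sys/cachectl.h>

    // flags == 0 requests a flush visible to all threads of the process;
    // the kernel also covers harts the thread may later migrate to, which
    // is exactly what a bare fence.i cannot do.
    __riscv_flush_icache(start, end, 0);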
> Regards, Vladimir
>
>> On 30 July 2022, at 13:29, Vladimir Kempik <vladimir.kempik at gmail.com> wrote:
>>
>> Hello
>> Thanks for the explanation.
>> That sounds like the fence.i in userspace code is not needed at all.
>> Regards, Vladimir
>>> On 30 July 2022, at 05:41, wangyadong (E) <yadonn.wang at huawei.com> wrote:
>>>
>>>> Let's say you have a thread A running on hart 1.
>>>> You've changed some code in region 0x11223300 and need fence.i before executing that code.
>>>> You execute fence.i in your thread A running on hart 1.
>>>> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
>>>> If hart 2 had something in L1I corresponding to region 0x11223300, then you are going to have a problem: L1I on hart 2 has old code; it wasn’t refreshed, because fence.i was executed on hart 1 (and never on hart 2). Your thread is going to execute old code, or a mix of old and new code.
>>>
>>> @vladimir Thanks for your explanation. I understand your concern now. We know fence.i's scope, so the write hart does not rely solely on fence.i in the RISC-V port, but calls the icache_flush syscall in ICache::invalidate_range() every time after modifying the code.
>>>
>>> For example:
>>> Hart 1:
>>> void MacroAssembler::emit_static_call_stub() {
>>>   // CompiledDirectStaticCall::set_to_interpreted knows the
>>>   // exact layout of this stub.
>>>
>>>   ifence();
>>>   mov_metadata(xmethod, (Metadata*)NULL);  // <- patchable code here
>>>
>>>   // Jump to the entry point of the i2c stub.
>>>   int32_t offset = 0;
>>>   movptr_with_offset(t0, 0, offset);
>>>   jalr(x0, t0, offset);
>>> }
>>>
>>> Hart 2 (write hart):
>>> void NativeMovConstReg::set_data(intptr_t x) {
>>>   // ...
>>>   // Store x into the instruction stream.
>>>   MacroAssembler::pd_patch_instruction_size(instruction_address(), (address)x);  // <- write code
>>>   ICache::invalidate_range(instruction_address(), movptr_instruction_size);      // <- syscall here
>>>   // ...
>>> }
>>>
>>
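For reference, an invalidate_range()-style helper doesn't need a
hand-written fence.i at all. A minimal sketch, assuming the GCC/Clang
builtin (which on riscv64 Linux is lowered to the same icache-flush
VDSO/syscall, so the effect is process-wide rather than per-hart):

    #include <cstddef>

    // Hypothetical sketch, not the actual ICache::invalidate_range() body.
    static void flush_code_range(char *start, size_t nbytes) {
      __builtin___clear_cache(start, start + nbytes);
    }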