The usage of fence.i in openjdk

wangyadong (E) yadonn.wang at huawei.com
Sat Jul 30 02:41:36 UTC 2022


> Let's say you have a thread A running on hart 1.
> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
> You execute fence.i in your thread A running on hart 1.
> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it was never refreshed, because fence.i was executed on hart 1 (and never on hart 2). And your thread will execute the old code, or a mix of old and new code.

@vladimir Thanks for your explanation. I understand your concern now. We are aware of fence.i's hart-local scope, so in the RISC-V port the writing hart does not rely on fence.i alone: it calls the icache flush syscall via ICache::invalidate_range() every time it modifies code.

For example:
Hart 1
void MacroAssembler::emit_static_call_stub() {
  // CompiledDirectStaticCall::set_to_interpreted knows the
  // exact layout of this stub.

  ifence();
  mov_metadata(xmethod, (Metadata*)NULL); // <- patchable code here

  // Jump to the entry point of the i2c stub.
  int32_t offset = 0;
  movptr_with_offset(t0, 0, offset);
  jalr(x0, t0, offset);
}

Hart 2 (write hart)
void NativeMovConstReg::set_data(intptr_t x) {
// ...
    // Store x into the instruction stream.
    MacroAssembler::pd_patch_instruction_size(instruction_address(), (address)x); // <- write code
    ICache::invalidate_range(instruction_address(), movptr_instruction_size);     // <- syscall here
// ...
}  
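
For reference, ICache::invalidate_range() on the RISC-V port boils down to something like the sketch below (a sketch only, not the port's exact code, assuming the GCC builtin __builtin___clear_cache, which on riscv-linux is lowered to the riscv_flush_icache system call):

void ICache::invalidate_range(address start, int nbytes) {
  // Sketch: __builtin___clear_cache is turned by GCC/libgcc on
  // riscv-linux into the riscv_flush_icache syscall, which synchronizes
  // the instruction stream on all harts, not just the current one.
  __builtin___clear_cache((char*) start, (char*) (start + nbytes));
}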

The syscall ultimately lands in the kernel's flush_icache_mm() (arch/riscv/mm/cacheflush.c):
void flush_icache_mm(struct mm_struct *mm, bool local)
{
        unsigned int cpu;
        cpumask_t others, *mask;

        preempt_disable();

        /* Mark every hart's icache as needing a flush for this MM. */
        mask = &mm->context.icache_stale_mask;
        cpumask_setall(mask);
        /* Flush this hart's I$ now, and mark it as flushed. */
        cpu = smp_processor_id();
        cpumask_clear_cpu(cpu, mask);
        local_flush_icache_all();

        /*
         * Flush the I$ of other harts concurrently executing, and mark them as
         * flushed.
         */
        cpumask_andnot(&others, mm_cpumask(mm), cpumask_of(cpu));
        local |= cpumask_empty(&others);
        if (mm == current->active_mm && local) {
                /*
                 * It's assumed that at least one strongly ordered operation is
                 * performed on this hart between setting a hart's cpumask bit
                 * and scheduling this MM context on that hart.  Sending an SBI
                 * remote message will do this, but in the case where no
                 * messages are sent we still need to order this hart's writes
                 * with flush_icache_deferred().
                 */
                smp_mb();
        } else if (IS_ENABLED(CONFIG_RISCV_SBI)) {
                sbi_remote_fence_i(&others);
        } else {
                on_each_cpu_mask(&others, ipi_remote_fence_i, NULL, 1);
        }

        preempt_enable();
}
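
The smp_mb() above pairs with the deferred flush on the scheduling side: when a hart later switches to this mm and finds its bit still set in icache_stale_mask, the context-switch path performs the fence.i that the writing hart skipped. Roughly (a paraphrased sketch of flush_icache_deferred() in arch/riscv/mm/context.c, not a verbatim quote):

static inline void flush_icache_deferred(struct mm_struct *mm)
{
#ifdef CONFIG_SMP
        unsigned int cpu = smp_processor_id();
        cpumask_t *mask = &mm->context.icache_stale_mask;

        if (cpumask_test_cpu(cpu, mask)) {
                cpumask_clear_cpu(cpu, mask);
                /*
                 * Ensure the remote hart's writes are visible to this
                 * hart; pairs with the smp_mb() in flush_icache_mm().
                 */
                smp_mb();
                local_flush_icache_all();   /* the deferred fence.i */
        }
#endif
}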

> Maybe it would be good to get rid of them.
So perhaps we can remove the fence.i instructions used in user space for performance reasons, but not because they are unsafe.
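
In userspace terms, the safe write-side pattern looks like the sketch below; a bare fence.i would only cover the current hart, while the syscall covers them all. (A minimal hypothetical example, not code from the port; __NR_riscv_flush_icache comes from the Linux riscv headers, and the full data fence before the syscall is the one discussed at the end of the quoted thread below.)

#include <stdint.h>
#include <sys/syscall.h>   /* defines __NR_riscv_flush_icache on riscv-linux */
#include <unistd.h>

static void patch_insn(uint32_t *insn, uint32_t new_insn)
{
    *insn = new_insn;                         /* modify the code */
    __asm__ volatile ("fence" ::: "memory");  /* order the store before the flush */
    /* flags = 0: synchronize the instruction stream on all harts,
     * not just the current one (unlike a bare fence.i). */
    syscall(__NR_riscv_flush_icache, (uintptr_t)insn,
            (uintptr_t)(insn + 1), (uintptr_t)0);
}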

Please correct me if I'm wrong.

Yadong

-----Original Message-----
From: Palmer Dabbelt [mailto:palmer at dabbelt.com] 
Sent: Saturday, July 30, 2022 2:08 AM
To: vladimir.kempik at gmail.com
Cc: wangyadong (E) <yadonn.wang at huawei.com>; riscv-port-dev at openjdk.org
Subject: Re: The usage of fence.i in openjdk

On Fri, 29 Jul 2022 10:41:18 PDT (-0700), vladimir.kempik at gmail.com wrote:
>
>
>> On 29 July 2022, at 18:12, wangyadong (E) <yadonn.wang at huawei.com> wrote:
>> 
>> Hi, Vladimir,
>> 
>>> I believe Java’s threads can migrate to a different hart at any moment, hence the use of fence.i is dangerous.
>> Could you describe in detail why the use of fence.i is dangerous? I think it may be merely inefficient, not dangerous.
>> To a certain extent, this code is a hangover from the aarch64 port, and we use fence.i to mimic isb.
>> 
> Basically, fence.i executed on a hart only affects that hart. And your
> thread can be rescheduled to another hart soon after the fence.i.
>
> Let's say you have a thread A running on hart 1.
> You've changed some code in region 0x11223300 and need a fence.i before executing that code.
> You execute fence.i in your thread A running on hart 1.
> Right after that, your thread (for some reason) gets rescheduled (by the kernel) to hart 2.
> If hart 2 had something in its L1I corresponding to region 0x11223300, then you have a problem: the L1I on hart 2 holds the old code; it was never refreshed, because fence.i was executed on hart 1 (and never on hart 2). And your thread will execute the old code, or a mix of old and new code.

Sorry for forking the thread; I saw this come in right after I sent my message.  This is correct; the performance-related reasons we're not doing a fence.i when scheduling are described in that message:
<https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000570.html>

>
> Regards, Vladimir.
>
>>> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
>> You're right. It would be better to place a full data fence before the syscall, because we cannot guarantee that the syscall performs a data fence before issuing the remote fence.i (IPI) to the other harts.
>> 
>> Yadong
>> 

