答复: The usage of fence.i in openjdk
Palmer Dabbelt
palmer at dabbelt.com
Thu Aug 4 02:08:53 UTC 2022
On Sat, 30 Jul 2022 06:35:11 PDT (-0700), yangfei at iscas.ac.cn wrote:
> Hi Palmer,
>
>
> > -----Original Messages-----
> > From: "Palmer Dabbelt" <palmer at dabbelt.com>
> > Sent Time: 2022-07-30 02:02:59 (Saturday)
> > To: yadonn.wang at huawei.com
> > Cc: vladimir.kempik at gmail.com, riscv-port-dev at openjdk.org
> > Subject: Re: 答复: The usage of fence.i in openjdk
> >
> > On Fri, 29 Jul 2022 08:12:21 PDT (-0700), yadonn.wang at huawei.com wrote:
> > > Hi, Vladimir,
> > >
> > >> I believe Java’s threads can migrate to different hart at any moment, hence the use of fence.i is dangerous.
> > > Could you describe in detail why the use of fence.i is dangerous? I think it may be just inefficient but not dangerous.
> >
> > The issue here is that fence.i applies to the current hart, whereas
> > Linux userspace processes just know the current thread. Executing a
> > fence.i in userspace adds a bunch of orderings to the thread's state
> > that can only be moved to a new hart via another fence.i. Normally that
> > sort of thing isn't such a big deal (there's similar state for things
> > like fence and lr), but the SiFive chips implement fence.i by flushing
> > the instruction cache which is a slow operation to put on the scheduling
> > path.
> >
> > The only way for the kernel to avoid that fence.i on the scheduling path
> > is for it to know if one's been executed in userspace. There's no way
> > to trap on fence.i so instead the Linux uABI just requires userspace to
> > make a syscall (or a VDSO library call). If userspace directly executes
> > a fence.i then the kernel won't know and thus can't ensure the thread
> > state is adequately moved to the new hart during scheduling, which may
> > result in incorrect behavior.
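For concreteness, a minimal sketch of what that uABI path looks like from
C userspace, assuming glibc's __riscv_flush_icache() wrapper (declared in
<sys/cachectl.h> on riscv-linux); __builtin___clear_cache() ends up on the
same path on this target, and the helper name here is just illustrative:

    #include <sys/cachectl.h>   /* __riscv_flush_icache() on riscv-linux (glibc) */

    /* Make freshly written instructions in [begin, end) fetchable by every
     * thread of this process.  flags == 0 asks the kernel to cover all the
     * harts the process may run on, so no userspace fence.i is needed. */
    static void flush_code_region(void *begin, void *end)
    {
        __riscv_flush_icache(begin, end, 0 /* 0 = flush for all threads */);
        /* Roughly equivalent: __builtin___clear_cache((char *)begin, (char *)end); */
    }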
> >
> > We've known for a while that this will cause performance issues for JITs
> > on some implementations, but so far it's just not been a priority.  I
> > poked around the RISC-V OpenJDK port a few months ago and I think
> > there are some improvements that can be made there, but we're probably
> > also going to want some kernel support. Exactly how to fix it is
> > probably going to depend on the workloads and implementations, though,
> > and while I think I understand the OpenJDK part pretty well it's not
> > clear what the other fence.i implementations are doing.
> >
> > In the long run we're also going to need some ISA support for doing this
> > sanely, but that's sort of a different problem. I've been kind of
> > crossing my fingers and hoping that anyone who has a system where JIT
> > performance is important is also going to have some better write/fetch
> > ordering instructions, but given how long it's been maybe that's a bad
> > plan.
> >
> > That said, the direct fence.i is incorrect and it's likely that the
> > long-term solution involves making the VDSO call so it's probably best
> > to swap over. I remember having written a patch to do that at some
> > point, but I can't find it so maybe I just forgot to send it?
>
> Thanks for all those considerations about the design.  It's very helpful.
>
> > > To a certain extent, this code is a hangover from the aarch64 port, and we use fence.i to mimic isb.
> > >> Maybe there is a need to add __asm__ volatile ("fence":::"memory") at the beginning of this method.
> > > You're right.  It'd be better to place a full data fence before the syscall, because we cannot guarantee here that the syscall issues a data fence before the remote fence.i IPIs to other harts.
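A minimal sketch of that suggestion (the publish_code() helper is
hypothetical, and __riscv_flush_icache() from <sys/cachectl.h> stands in
for whatever wrapper the port uses); whether the explicit fence is actually
needed is exactly the question below:

    #include <sys/cachectl.h>

    /* Hypothetical helper: publish JIT-generated code in [begin, end).
     * The leading full fence orders the stores that wrote the code before
     * the flush request, in case the syscall path did not already do so. */
    static void publish_code(void *begin, void *end)
    {
        __asm__ volatile ("fence" ::: "memory");   /* full data fence */
        __riscv_flush_icache(begin, end, 0);       /* kernel-mediated icache flush */
    }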
> >
> > That's not necessary with the current implementation, but it's not
>
> But it looks like this is not reflected in the kernel function flush_icache_mm?
> I checked the code, and it looks to me that the data fence is issued on only one path:
>
> if (mm == current->active_mm && local) {
>         /*
>          * It's assumed that at least one strongly ordered operation is
>          * performed on this hart between setting a hart's cpumask bit
>          * and scheduling this MM context on that hart.  Sending an SBI
>          * remote message will do this, but in the case where no
>          * messages are sent we still need to order this hart's writes
>          * with flush_icache_deferred().
>          */
>         smp_mb();
> }
>
> I just want to make sure that the data fence is there in this syscall with the current implementation.
> But I am not familiar with the RISC-V Linux kernel code, so it would be appreciated if you could share more details.
That comment is trying to explain this, but essentially we're assuming
that sbi_remote_fence_i() is a full fence. That's probably not written
down, but it's true in practice as there's no direct remote fence.i
instruction, so this just interrupts the target hart, and there are a bunch
of fences on that path to make sure the right IPI message type
shows up remotely.
This data fence exists on the local path to handle the case where there
are other threads, but those threads are not currently running on any
hart. In that case we need the data fence because those threads could
start running on a hart at any time in the future, so without the local
data fence we'd just have the deferred fence.i on the remote hart, which
isn't sufficient on its own to make this hart's writes visible (the ISA
manual explicitly calls this out: fence.i only orders a hart's own fetches
against its own prior stores).
I don't know if that makes more or less sense though...
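To make the two paths concrete, here is a simplified paraphrase of
flush_icache_mm() from arch/riscv/mm/cacheflush.c, showing where the quoted
smp_mb() sits; this is not the exact kernel source, and locking/preemption
details are elided:

    /* Simplified paraphrase of flush_icache_mm(), not the exact source. */
    void flush_icache_mm(struct mm_struct *mm, bool local)
    {
            unsigned int cpu = smp_processor_id();
            cpumask_t others;

            /* Mark every other hart's icache as stale for this mm, then
             * flush our own icache right away. */
            cpumask_setall(&mm->context.icache_stale_mask);
            cpumask_clear_cpu(cpu, &mm->context.icache_stale_mask);
            local_flush_icache_all();               /* fence.i on this hart */

            cpumask_andnot(&others, mm_cpumask(mm), cpumask_of(cpu));
            if (cpumask_empty(&others) && local) {
                    /* No other hart is currently running this mm: the
                     * smp_mb() orders our stores against the deferred
                     * fence.i that flush_icache_deferred() will issue when
                     * another hart later schedules this mm. */
                    smp_mb();
            } else {
                    /* Other harts are running this mm: the SBI remote
                     * fence.i (an IPI in practice) is assumed to imply a
                     * full fence. */
                    sbi_remote_fence_i(&others);
            }
    }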
>
> Thanks,
> Fei