PING: RFR: 8164652: c1 port, code patching

Mon Oct 3 16:43:09 UTC 2016

On 03/10/16 17:20, Anton Kozlov wrote:
> Already replied
> 
> http://mail.openjdk.java.net/pipermail/aarch32-port-dev/2016-September/000490.html

That really is weird: it did not arrive in my inbox.  Apologies!

> ARM ARM for v7 states: 
> 
> The ARMv7 architecture limits the set of instructions that can be executed by one thread of execution as they are
> being modified by another thread of execution without requiring explicit synchronization.
> Except for the instructions identified in this section, the effect of the concurrent modification and execution of an
> instruction is UNPREDICTABLE .
> For the following instructions only, the architecture guarantees that, after modification of the instruction, behavior
> is consistent with execution of either:
> • The instruction originally fetched.
> • A fetch of the new instruction. That is, a fetch of the instruction that results from the modification.
> The instructions to which this guarantee applies are:
> In the ARM instruction set
> The B , BL , NOP , BKPT , SVC , HVC , and SMC instructions.
> For all other instructions, to avoid UNPREDICTABLE behavior, instruction modifications must be explicitly
> synchronized before they are executed. The required synchronization is as follows:
> 1 ...
> 2 ..once the modified instructions are observable, ...
> 
> We are patching branches only, that aren't require explicit
> synchronization. You cited step 2 from explicit synchronization
> sequence.

We're also patching the stubs to be branched to.

> > But we are not doing that on all PEs.  For strict correctness we
> > should execute an ISB on every PE after a stub has been written and
> > before it is executed.  In practice I don't think that it matters
> > because the instruction cache on the other PEs probably won't already
> > contain a stale copy of the stub.  But it is possible: with a long
> > instruction cache it may do if the line has already been fetched for
> > an adjacent stub.  Without an ISB the PE may not re-fetch its icache
> > line and may then execute the stale stub code.
> 
> I'm a bit confused with this. Could you clarify your thought about
> icache problem?

> ISB is pipeline flush.

The architecture spec does not say anything about pipelines.  It says
that

   ...once the modified instructions are observable, each PE that is
   executing the modified instructions *must* (my emphasis) issue the
   following instruction to ensure execution of the modified
   instructions:

         ISB;         Synchronize fetched instruction stream

As far as I know there is no reason that a compliant AArch64
implementation should not interpret this rule as meaning that the
icache in another thread would not be flushed until an ISB.  In that
case, you may be jumping to a stale icache line in the stub.  Sure,
you have flushed it on the writer size with clear_cache(), but the
reader hasn't done anything.

> Let's suppose ISB is required after branch patch.

> >From my point of view, then, number of executed instructions from
> >point of patching to point of execution of patched instructions
> >matter.

> ISB could be required by ARM, but in practice bad things will happen
> if second point is reached in less than (pipeline-length)
> instructions. And we are patching deeply in runtime, there are a lot
> of branching, mutex release,..

But another thread executing concurrently does not have to wait for
any of that.  A soon as you do the write it potentially becomes
visible to that other thread.

> But yes, in case ISB required it should be added.

The problem is that there is no ordering on a reader PE (one which is
not doing the patching) between the two writes of instruction memory.
I can't see any architectural reason why it should not see the write
of the jump before it sees the write of the stub.  I know there is a
clear_cache on the writer side, but AFAICS there is nothing on the
reader side to enforce any ordering.

My thought is that the ISB issue of this could be made unnecessary
(and the code entirely compliant) simply by changing

   mov rN, #const
   movk ...
   jmp back

to

   ldr rN, l0
   jmp back
l0: .data const

Having said that, I guess there would need to be some sort of memory
fence as well.  I'm aware that it would be a little bit slower, but
this is a patch stub.

Andrew.