PING: RFR: 8164652: c1 port, code patching

Mon Oct 3 18:03:07 UTC 2016

On Mon, 2016-10-03 at 17:43 +0100, Andrew Haley wrote:
> On 03/10/16 17:20, Anton Kozlov wrote:
> > 
> > Already replied
> > 
> > http://mail.openjdk.java.net/pipermail/aarch32-port-dev/2016-September/00049
> > 0.html
> 
> That really is weird: it did not arrive in my inbox.  Apologies!

Never mind

> 
> > 
> > 
> > We are patching branches only, that aren't require explicit
> > synchronization. You cited step 2 from explicit synchronization
> > sequence.
> 
> We're also patching the stubs to be branched to.

Agree, but it's never executed until branches patched. Аll "Concurrent
modification and execution of instructions" deals with another type of
instructions

    .. instructions that can be executed by one thread of execution as they are
    being modified by another thread ...

    ... effect of the concurrent modification and execution of an
    instruction is UNPREDICTABLE

The branches are of that type, but not stub code. Stub code could be fetched,
but only after branch is pointing to final destination.

> 
> > 
> > > 
> > > But we are not doing that on all PEs.  For strict correctness we
> > > should execute an ISB on every PE after a stub has been written and
> > > before it is executed.  In practice I don't think that it matters
> > > because the instruction cache on the other PEs probably won't already
> > > contain a stale copy of the stub.  But it is possible: with a long
> > > instruction cache it may do if the line has already been fetched for
> > > an adjacent stub.  Without an ISB the PE may not re-fetch its icache
> > > line and may then execute the stale stub code.
> > 
> > I'm a bit confused with this. Could you clarify your thought about
> > icache problem?
> 
> > 
> > ISB is pipeline flush.
> 
> The architecture spec does not say anything about pipelines.

Actually, ISB instruction description says

    An ISB instruction flushes the pipeline in the processor, so that all
    instructions that come after the ISB instruction in
    program order are fetched from cache or memory only after the ISB instruction
    has completed.

> As far as I know there is no reason that a compliant AArch64
> implementation should not interpret this rule as meaning that the
> icache in another thread would not be flushed until an ISB.  In that
> case, you may be jumping to a stale icache line in the stub.  Sure,
> you have flushed it on the writer size with clear_cache(), but the
> reader hasn't done anything.

I see same description of ISB in AArch64 manual. Can't say, what makes icache
flush on executing side

> 
> > 
> > Let's suppose ISB is required after branch patch.
> 
> > 
> > > 
> > > From my point of view, then, number of executed instructions from
> > > point of patching to point of execution of patched instructions
> > > matter.
> 
> > 
> > ISB could be required by ARM, but in practice bad things will happen
> > if second point is reached in less than (pipeline-length)
> > instructions. And we are patching deeply in runtime, there are a lot
> > of branching, mutex release,..
> 
> But another thread executing concurrently does not have to wait for
> any of that.  A soon as you do the write it potentially becomes
> visible to that other thread.
> 

And so we patch the stub first, then ensure write is completed (by icache flush)
and only then start to patch branches.

> > 
> > But yes, in case ISB required it should be added.
> 
> The problem is that there is no ordering on a reader PE (one which is
> not doing the patching) between the two writes of instruction memory.
> I can't see any architectural reason why it should not see the write
> of the jump before it sees the write of the stub.  I know there is a
> clear_cache on the writer side, but AFAICS there is nothing on the
> reader side to enforce any ordering.

Thats a good point. From anology with data cache, yes, reader should have some
fence. But I can't see any evidence to this except ISB use in case, which could
be interpreted another way

> 
> My thought is that the ISB issue of this could be made unnecessary
> (and the code entirely compliant) simply by changing
> 
>    mov rN, #const
>    movk ...
>    jmp back
> 
> to
> 
>    ldr rN, l0
>    jmp back
> l0: .data const
> 
> Having said that, I guess there would need to be some sort of memory
> fence as well.  I'm aware that it would be a little bit slower, but
> this is a patch stub.

That's an easy way, to bypass the problem by replacing it by one with
existing solution =). There are cases when it will not work as is, some stubs
are still requried to go to runtime patching, after stub code already adjusted.
However, this could make v6 generated code be a bit better. I need some time to
play with and decide.

Thanks,
Anton

> 
> Andrew.