[8u] RFR: (?) New c1 patching, in Re: PING: RFR: 8164652: c1 port, code patching

Wed Nov 30 13:13:41 UTC 2016

Hi,

it's took some time to implement (and test) method supposed by Andrew. 
Now every constant that may be changed concurrently placed in const 
section, and patchable method looks like

b stub
dmb
add rD, 0xff000
ldr rD, [rD, 0xfff]

First `b stub` replaced with nop after patching applied.

Dmb acts as LoadLoad barrier to order constant load with instruction 
modification `b stub` -> `nop`

Maximum offset is 0xffffff (1Mb) which is enough to reach whole nmethod, 
which is 1.1*4*64K (<300Kb) for C1.

 From my point of view, generated code looks more straightforward, and 
ARMv6 could benefit from smaller 32bit constant loading, add+ldr instead 
of mov+3*orr (not implemented yet).

In terms of performance, spec2008 shows minor improvement in average 
0.01 ops/m (~1-2%)

Also, number of relocations grows, now for every load there should be 
another relocation to fix offset when const section moves. But oop and 
metadata relocations are not holding respecting values now, so it's not 
a big loose.
Changing relocations implies changes in c1 shared code that deals with them.

Diff is in attach. Could you review it?

Thanks,
Anton

On 10/03/2016 09:03 PM, Anton Kozlov wrote:
> On Mon, 2016-10-03 at 17:43 +0100, Andrew Haley wrote:
>> On 03/10/16 17:20, Anton Kozlov wrote:
>>>
>>> Already replied
>>>
>>> http://mail.openjdk.java.net/pipermail/aarch32-port-dev/2016-September/00049
>>> 0.html
>>
>> That really is weird: it did not arrive in my inbox.  Apologies!
>
> Never mind
>
>>
>>>
>>>
>>> We are patching branches only, that aren't require explicit
>>> synchronization. You cited step 2 from explicit synchronization
>>> sequence.
>>
>> We're also patching the stubs to be branched to.
>
> Agree, but it's never executed until branches patched. Аll "Concurrent
> modification and execution of instructions" deals with another type of
> instructions
>
>     .. instructions that can be executed by one thread of execution as they are
>     being modified by another thread ...
>
>     ... effect of the concurrent modification and execution of an
>     instruction is UNPREDICTABLE
>
> The branches are of that type, but not stub code. Stub code could be fetched,
> but only after branch is pointing to final destination.
>
>>
>>>
>>>>
>>>> But we are not doing that on all PEs.  For strict correctness we
>>>> should execute an ISB on every PE after a stub has been written and
>>>> before it is executed.  In practice I don't think that it matters
>>>> because the instruction cache on the other PEs probably won't already
>>>> contain a stale copy of the stub.  But it is possible: with a long
>>>> instruction cache it may do if the line has already been fetched for
>>>> an adjacent stub.  Without an ISB the PE may not re-fetch its icache
>>>> line and may then execute the stale stub code.
>>>
>>> I'm a bit confused with this. Could you clarify your thought about
>>> icache problem?
>>
>>>
>>> ISB is pipeline flush.
>>
>> The architecture spec does not say anything about pipelines.
>
> Actually, ISB instruction description says
>
>     An ISB instruction flushes the pipeline in the processor, so that all
>     instructions that come after the ISB instruction in
>     program order are fetched from cache or memory only after the ISB instruction
>     has completed.
>
>> As far as I know there is no reason that a compliant AArch64
>> implementation should not interpret this rule as meaning that the
>> icache in another thread would not be flushed until an ISB.  In that
>> case, you may be jumping to a stale icache line in the stub.  Sure,
>> you have flushed it on the writer size with clear_cache(), but the
>> reader hasn't done anything.
>
> I see same description of ISB in AArch64 manual. Can't say, what makes icache
> flush on executing side
>
>>
>>>
>>> Let's suppose ISB is required after branch patch.
>>
>>>
>>>>
>>>> From my point of view, then, number of executed instructions from
>>>> point of patching to point of execution of patched instructions
>>>> matter.
>>
>>>
>>> ISB could be required by ARM, but in practice bad things will happen
>>> if second point is reached in less than (pipeline-length)
>>> instructions. And we are patching deeply in runtime, there are a lot
>>> of branching, mutex release,..
>>
>> But another thread executing concurrently does not have to wait for
>> any of that.  A soon as you do the write it potentially becomes
>> visible to that other thread.
>>
>
> And so we patch the stub first, then ensure write is completed (by icache flush)
> and only then start to patch branches.
>
>
>>>
>>> But yes, in case ISB required it should be added.
>>
>> The problem is that there is no ordering on a reader PE (one which is
>> not doing the patching) between the two writes of instruction memory.
>> I can't see any architectural reason why it should not see the write
>> of the jump before it sees the write of the stub.  I know there is a
>> clear_cache on the writer side, but AFAICS there is nothing on the
>> reader side to enforce any ordering.
>
> Thats a good point. From anology with data cache, yes, reader should have some
> fence. But I can't see any evidence to this except ISB use in case, which could
> be interpreted another way
>
>>
>> My thought is that the ISB issue of this could be made unnecessary
>> (and the code entirely compliant) simply by changing
>>
>>    mov rN, #const
>>    movk ...
>>    jmp back
>>
>> to
>>
>>    ldr rN, l0
>>    jmp back
>> l0: .data const
>>
>> Having said that, I guess there would need to be some sort of memory
>> fence as well.  I'm aware that it would be a little bit slower, but
>> this is a patch stub.
>
> That's an easy way, to bypass the problem by replacing it by one with
> existing solution =). There are cases when it will not work as is, some stubs
> are still requried to go to runtime patching, after stub code already adjusted.
> However, this could make v6 generated code be a bit better. I need some time to
> play with and decide.
>
> Thanks,
> Anton
>
>>
>> Andrew.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new_c1_load_patching.patch
Type: text/x-patch
Size: 33841 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/aarch32-port-dev/attachments/20161130/35ff4aa2/new_c1_load_patching-0001.patch>