[8u] RFR: (?) New c1 patching, in Re: PING: RFR: 8164652: c1 port, code patching

Thu Dec 1 09:50:31 UTC 2016

Hi,

On 11/30/2016 01:13 PM, Anton Kozlov wrote:
> Hi,
>
> it's took some time to implement (and test) method supposed by Andrew.
> Now every constant that may be changed concurrently placed in const
> section, and patchable method looks like
>
> b stub
> dmb
> add rD, 0xff000
> ldr rD, [rD, 0xfff]
>
> First `b stub` replaced with nop after patching applied.
>
> Dmb acts as LoadLoad barrier to order constant load with instruction
> modification `b stub` -> `nop`
>
> Maximum offset is 0xffffff (1Mb) which is enough to reach whole nmethod,
> which is 1.1*4*64K (<300Kb) for C1.
>
>  From my point of view, generated code looks more straightforward, and
> ARMv6 could benefit from smaller 32bit constant loading, add+ldr instead
> of mov+3*orr (not implemented yet).
>
> In terms of performance, spec2008 shows minor improvement in average
> 0.01 ops/m (~1-2%)
>
> Also, number of relocations grows, now for every load there should be
> another relocation to fix offset when const section moves. But oop and
> metadata relocations are not holding respecting values now, so it's not
> a big loose.
> Changing relocations implies changes in c1 shared code that deals with
> them.
>
> Diff is in attach. Could you review it?

Smoke-testing with build+bootstrap worked fine for me; citing response 
to this patch from Andrew Haley from internal list:

On 11/30/2016 04:25 PM, Andrew Haley wrote:
 > [...]  Looks OK.

>
> Thanks,
> Anton
>
> On 10/03/2016 09:03 PM, Anton Kozlov wrote:
>> On Mon, 2016-10-03 at 17:43 +0100, Andrew Haley wrote:
>>> On 03/10/16 17:20, Anton Kozlov wrote:
>>>>
>>>> Already replied
>>>>
>>>> http://mail.openjdk.java.net/pipermail/aarch32-port-dev/2016-September/00049
>>>>
>>>> 0.html
>>>
>>> That really is weird: it did not arrive in my inbox.  Apologies!
>>
>> Never mind
>>
>>>
>>>>
>>>>
>>>> We are patching branches only, that aren't require explicit
>>>> synchronization. You cited step 2 from explicit synchronization
>>>> sequence.
>>>
>>> We're also patching the stubs to be branched to.
>>
>> Agree, but it's never executed until branches patched. Аll "Concurrent
>> modification and execution of instructions" deals with another type of
>> instructions
>>
>>     .. instructions that can be executed by one thread of execution as
>> they are
>>     being modified by another thread ...
>>
>>     ... effect of the concurrent modification and execution of an
>>     instruction is UNPREDICTABLE
>>
>> The branches are of that type, but not stub code. Stub code could be
>> fetched,
>> but only after branch is pointing to final destination.
>>
>>>
>>>>
>>>>>
>>>>> But we are not doing that on all PEs.  For strict correctness we
>>>>> should execute an ISB on every PE after a stub has been written and
>>>>> before it is executed.  In practice I don't think that it matters
>>>>> because the instruction cache on the other PEs probably won't already
>>>>> contain a stale copy of the stub.  But it is possible: with a long
>>>>> instruction cache it may do if the line has already been fetched for
>>>>> an adjacent stub.  Without an ISB the PE may not re-fetch its icache
>>>>> line and may then execute the stale stub code.
>>>>
>>>> I'm a bit confused with this. Could you clarify your thought about
>>>> icache problem?
>>>
>>>>
>>>> ISB is pipeline flush.
>>>
>>> The architecture spec does not say anything about pipelines.
>>
>> Actually, ISB instruction description says
>>
>>     An ISB instruction flushes the pipeline in the processor, so that all
>>     instructions that come after the ISB instruction in
>>     program order are fetched from cache or memory only after the ISB
>> instruction
>>     has completed.
>>
>>> As far as I know there is no reason that a compliant AArch64
>>> implementation should not interpret this rule as meaning that the
>>> icache in another thread would not be flushed until an ISB.  In that
>>> case, you may be jumping to a stale icache line in the stub.  Sure,
>>> you have flushed it on the writer size with clear_cache(), but the
>>> reader hasn't done anything.
>>
>> I see same description of ISB in AArch64 manual. Can't say, what makes
>> icache
>> flush on executing side
>>
>>>
>>>>
>>>> Let's suppose ISB is required after branch patch.
>>>
>>>>
>>>>>
>>>>> From my point of view, then, number of executed instructions from
>>>>> point of patching to point of execution of patched instructions
>>>>> matter.
>>>
>>>>
>>>> ISB could be required by ARM, but in practice bad things will happen
>>>> if second point is reached in less than (pipeline-length)
>>>> instructions. And we are patching deeply in runtime, there are a lot
>>>> of branching, mutex release,..
>>>
>>> But another thread executing concurrently does not have to wait for
>>> any of that.  A soon as you do the write it potentially becomes
>>> visible to that other thread.
>>>
>>
>> And so we patch the stub first, then ensure write is completed (by
>> icache flush)
>> and only then start to patch branches.
>>
>>
>>>>
>>>> But yes, in case ISB required it should be added.
>>>
>>> The problem is that there is no ordering on a reader PE (one which is
>>> not doing the patching) between the two writes of instruction memory.
>>> I can't see any architectural reason why it should not see the write
>>> of the jump before it sees the write of the stub.  I know there is a
>>> clear_cache on the writer side, but AFAICS there is nothing on the
>>> reader side to enforce any ordering.
>>
>> Thats a good point. From anology with data cache, yes, reader should
>> have some
>> fence. But I can't see any evidence to this except ISB use in case,
>> which could
>> be interpreted another way
>>
>>>
>>> My thought is that the ISB issue of this could be made unnecessary
>>> (and the code entirely compliant) simply by changing
>>>
>>>    mov rN, #const
>>>    movk ...
>>>    jmp back
>>>
>>> to
>>>
>>>    ldr rN, l0
>>>    jmp back
>>> l0: .data const
>>>
>>> Having said that, I guess there would need to be some sort of memory
>>> fence as well.  I'm aware that it would be a little bit slower, but
>>> this is a patch stub.
>>
>> That's an easy way, to bypass the problem by replacing it by one with
>> existing solution =). There are cases when it will not work as is,
>> some stubs
>> are still requried to go to runtime patching, after stub code already
>> adjusted.
>> However, this could make v6 generated code be a bit better. I need
>> some time to
>> play with and decide.
>>
>> Thanks,
>> Anton
>>
>>>
>>> Andrew.

-- 
-Alex