RFR(m): 8220351: Cross-modifying code

Doerr, Martin martin.doerr at sap.com
Wed Mar 13 13:37:55 UTC 2019


Hi Robbin,

> I think maybe we should let the JavaThread disarm itself and check for
> a safepoint/handshake after the disarm, so that we don't disarm the next
> safepoint/handshake.
Yes, you understand what I mean.

I guess one additional C call for each thread which returns from native and has missed one or more safepoints should be affordable from a performance point of view.

If that were a concern, the VM thread could perform the disarm for simple handshakes, because oops in the instruction stream can only get patched when all Java threads are at a safepoint, AFAIK.
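
A minimal, self-contained sketch of that idea (the names, types and re-arm
policy below are invented for illustration and are not the actual HotSpot
code): the returning thread disarms its own poll, executes an instruction
stream serializing fence, and then re-checks whether a new safepoint or
handshake was requested in the meantime so that it does not swallow the
next arm.

#include <atomic>
#include <cstdio>

struct ModelThread {
  std::atomic<bool> poll_armed{false};     // models the per-thread poll
};

std::atomic<int> safepoint_epoch{0};       // bumped by the "VM thread" per safepoint

// Stand-in for an instruction stream barrier (e.g. CPUID on x86).
static void instruction_stream_fence() {
#if defined(__x86_64__)
  unsigned a = 0, b, c, d;
  __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "=c"(c), "=d"(d) : : "memory");
#else
  std::atomic_thread_fence(std::memory_order_seq_cst);  // placeholder only
#endif
}

// What a thread returning from native might do.
void return_from_native(ModelThread& t) {
  int epoch_seen = safepoint_epoch.load();
  if (t.poll_armed.exchange(false)) {      // disarm ourselves
    instruction_stream_fence();            // observe instructions patched while in native
    if (safepoint_epoch.load() != epoch_seen) {
      // A new safepoint/handshake started meanwhile: leave the poll armed so
      // the normal slow path handles it instead of us having disarmed it.
      t.poll_armed.store(true);
    }
  }
}

int main() {
  ModelThread t;
  t.poll_armed.store(true);                // pretend a safepoint armed us while in native
  return_from_native(t);
  std::printf("armed after return: %d\n", (int)t.poll_armed.load());
  return 0;
}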

Thank you very much for trying this.

Best regards,
Martin


-----Original Message-----
From: Robbin Ehn <robbin.ehn at oracle.com> 
Sent: Wednesday, 13 March 2019 12:42
To: Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net; David Holmes (david.holmes at oracle.com) <david.holmes at oracle.com>
Subject: Re: RFR(m): 8220351: Cross-modifying code

Hi Martin,

On 2019-03-13 12:03, Doerr, Martin wrote:
> Hi Robbin,
> 
> I'm not so much concerned about the performance impact on PPC if we have to add isync instructions. But it seems like the performance impact on x86 is significant enough that we should find a way to move the CPUID instruction out of the most performance-critical path.
> 
> Do you think it would be feasible to do the following?
> 1. Skip the disarm for threads running in_native (or in_native_trans). safepoint.cpp:480
> 2. Execute an instruction fence in JavaThread::check_special_condition_for_native_trans(_and_transition). We should probably skip the rest if the safepoint is already over.

I think maybe we should let the JavaThread disarm itself and check for
a safepoint/handshake after the disarm, so that we don't disarm the next
safepoint/handshake.

So only the first transition back from native after a safepoint/handshake is
'slow'. And only safepoints/handshakes that modify the instruction stream would
leave the polls armed for threads in native.

I think that is doable, I'll experiment with that, thanks!

/Robbin

> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Doerr, Martin
> Sent: Tuesday, 12 March 2019 15:58
> To: 'Robbin Ehn' <robbin.ehn at oracle.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net; David Holmes (david.holmes at oracle.com) <david.holmes at oracle.com>
> Subject: RE: RFR(m): 8220351: Cross-modifying code
> 
> Hi Robbin,
> 
>> In the transition we have a StoreLoad between storing the unsafe thread state
>> (e.g. native_trans) and loading the poll. But the instruction barrier must
>> happen after the poll. Otherwise an oop can be updated, and the poll disarmed,
>> after the StoreLoad+InstructionPipeline barrier but before the poll check.
> 
> Right, we'd need to make the disarm procedure more complicated to support this.
> E.g. the VM thread could let Java threads which were in_native or in_native_trans disarm themselves with the required barrier.
> 
> I just wanted to share my idea to reduce performance impact.
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Robbin Ehn <robbin.ehn at oracle.com>
> Sent: Tuesday, 12 March 2019 13:56
> To: Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net; David Holmes (david.holmes at oracle.com) <david.holmes at oracle.com>
> Subject: Re: RFR(m): 8220351: Cross-modifying code
> 
> Hi Martin,
> 
> On 3/12/19 12:47 PM, Doerr, Martin wrote:
>> Hi Robbin,
>>
>> For PPC, isync came to mind first (also mentioned by David). However, isync is only strictly required for a thread which performs the modifications and then wants to execute the modified instructions itself. We already have isync at the patch sites.
>>
>> The normal "fence" = "sync" does the job for cross-modification on PPC, so I don't think we need to execute any additional instructions.
>> The existing barriers look fine for PPC.
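
For reference, a hedged sketch of the two PPC instructions being discussed,
written as plain GCC inline assembly rather than HotSpot's OrderAccess code:
"sync" is the full fence that fence() maps to, and "isync" is the instruction
already present at the patch sites.

#include <cstdio>

#if defined(__powerpc__) || defined(__powerpc64__)
static inline void full_fence() { __asm__ volatile("sync"  : : : "memory"); }  // "fence" == sync
static inline void isync()      { __asm__ volatile("isync" : : : "memory"); }  // at patch sites
#endif

int main() {
#if defined(__powerpc__) || defined(__powerpc64__)
  full_fence();
  isync();
  std::printf("executed sync + isync\n");
#else
  std::printf("not a PPC build\n");
#endif
  return 0;
}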
>>
>> I believe other non-x86 platforms are also fine with existing fences (s390, haven't checked others).
>>
>> For x86, why don't we just "upgrade" the few existing barriers to include instruction stream synchronization?
>> E.g.
>> sharedRuntime_x86_64.cpp: 2564
>> templateInterpreterGenerator_x86.cpp: 1094
>> safepoint.cpp: 822
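
For illustration, an "upgraded" x86-64 barrier of this kind could look roughly
like the following GCC-style sketch (a sketch only, not the actual change): the
usual StoreLoad fence followed by CPUID, a serializing instruction of the kind
the Intel SDM says to execute before running cross-modified code.

#include <cstdint>
#include <cstdio>

// StoreLoad barrier plus instruction stream serialization on x86-64.
static inline void storeload_and_instruction_fence() {
#if defined(__x86_64__)
  uint32_t a = 0, b, c, d;
  __asm__ volatile("mfence" : : : "memory");                      // StoreLoad
  __asm__ volatile("cpuid"                                        // serializing instruction
                   : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                   :
                   : "memory");
#endif
}

int main() {
  storeload_and_instruction_fence();
  std::printf("fence executed\n");
  return 0;
}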
> 
> Since we don't at the moment have a way to ask for multiple fences in OrderAccess,
> it's only available as a separate method for now.
> 
> In the transition we have a StoreLoad between storing the unsafe thread state
> (e.g. native_trans) and loading the poll. But the instruction barrier must
> happen after the poll. Otherwise an oop can be updated, and the poll disarmed,
> after the StoreLoad+InstructionPipeline barrier but before the poll check.
> 
> JavaThread:                  | VMThread
> StoreLoad+InstructionPipeline|
>                              | update <immediate oop>
>                              | disarm
> load thread_poll             |
> 
> Thanks, Robbin
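
To make the ordering constraint above concrete, a stand-alone model (the names,
state values and helpers are invented for illustration; this is not the HotSpot
transition code):

#include <atomic>
#include <cstdio>

// Stubbed-out helpers for the sketch.
static void block_for_safepoint_or_handshake() { /* elided */ }
static void instruction_stream_fence()         { /* e.g. CPUID on x86, elided */ }

static std::atomic<int>  thread_state{0};   // models the thread-state word
static std::atomic<bool> poll_armed{false}; // models the per-thread poll

static void native_trans_poll_check() {
  thread_state.store(1 /* stands in for _thread_in_native_trans */);
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the existing StoreLoad barrier
  if (poll_armed.load()) {                               // the poll check
    block_for_safepoint_or_handshake();
  }
  // The instruction stream fence must come after the poll load. If it were
  // fused with the StoreLoad above, the VM thread could patch an oop and
  // disarm the poll in between, and this thread would then execute the stale
  // instructions without ever fencing again.
  instruction_stream_fence();
}

int main() {
  poll_armed.store(true);
  native_trans_poll_check();
  std::printf("transition modelled\n");
  return 0;
}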
> 
>> ...
>>
>> Am I missing anything?
>>
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: Robbin Ehn <robbin.ehn at oracle.com>
>> Sent: Tuesday, 12 March 2019 10:44
>> To: Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>
>> Hi Martin,
>>
>> On 3/11/19 6:43 PM, Doerr, Martin wrote:
>>> Hi Robbin,
>>>
>>>> They are only changed during a GC safepoint on archs with
>>>> mustIterateImmediateOopsInCode set to true, AFAIK.
>>>
>>> Only x86 uses Oops directly encoded into the instruction stream.
>>> But all platforms have compressed Oops encoded: loadConN nodes in ad files.
>>
>> Sorry, forgot about that. Was just looking at the immediate ones.
>>
>>>
>>> I wonder if there are already barriers in place which have the required effect for non-x86 CPUs.
>>> For example fence() and acquire() already include instruction stream synchronization on PPC.
>>> Is that why you implemented the new barrier empty for non-x86 CPUs?
>>
>> Partly because I wasn't sure that e.g. isync was what you wanted.
>> And maybe someone would choose to gamble and not take the performance hit.
>> Or there may be other considerations for that arch which I'm not aware of.
>> (I wasn't aware of Arm's need to sync the I-cache with the D-cache, I thought ISB was enough.)
>>
>> Thanks, Robbin
>>
>>>
>>> Best regards,
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: hotspot-dev <hotspot-dev-bounces at openjdk.java.net> On Behalf Of Robbin Ehn
>>> Sent: Monday, 11 March 2019 13:56
>>> To: Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>
>>> On 2019-03-11 11:26, Andrew Haley wrote:
>>>> On 3/8/19 3:24 PM, Robbin Ehn wrote:
>>>>
>>>>> Issue:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8220351
>>>>> Changeset:
>>>>> http://cr.openjdk.java.net/~rehn/8220351/webrev/
>>>>>
>>>>> After a JavaThread has been in a safepoint/handshake safe state
>>>>> it can start executing updated code. E.g. an oop in the instruction
>>>>> stream may have been updated.
>>>>
>>>> Hmm, interesting. I think it works on AArch64 at present because
>>>> the segfault trap we take is effectively a full synchronization
>>>> operation.
>>>>
>>>> An AArch64 ISB (instruction synchronization barrier) invalidates the
>>>> pipeline. It doesn't invalidate the instruction cache, which we'd need
>>>> to do to see a relocated OOP. If an instruction is modified by some
>>>> other thread we need to flush the local icache, but to do that we need
>>>> to know which instructions have been changed. We could create a
>>>> modification queue, but that seems rather elaborate.
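
To illustrate that distinction, a hedged sketch using GCC builtins and inline
assembly (nothing from the patch itself): ISB only resynchronizes the local
pipeline, whereas making modified instructions visible needs an explicit
address range, e.g. via the __builtin___clear_cache builtin, which is exactly
why some kind of modification queue would be needed.

#include <cstdio>

// Pipeline resynchronization only: does not invalidate the I-cache.
static inline void instruction_sync_barrier() {
#if defined(__aarch64__)
  __asm__ volatile("isb" : : : "memory");
#endif
}

// Making patched instructions visible needs the modified range.
static inline void flush_icache_range(char* begin, char* end) {
  __builtin___clear_cache(begin, end);   // GCC/Clang builtin
}

int main() {
  static char buf[64];                   // stands in for a patched code range
  flush_icache_range(buf, buf + sizeof(buf));
  instruction_sync_barrier();
  std::printf("isb plus I-cache flush for a known range\n");
  return 0;
}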
>>>>
>>>> What is it that changes OOPs in the instructions during a handshake?
>>>> Is it just garbage collection, or does it happen at other times too?
>>>
>>> They are only changed during a GC safepoint on archs with
>>> mustIterateImmediateOopsInCode set to true, AFAIK.
>>> Arm and aarch64 have it false.
>>>
>>> /Robbin
>>>
>>>>
>>>> We could simply move OOPs (and class metadata pointers?) into the
>>>> constant pool. That would have some performance impact on in-order
>>>> CPUs, but hopefully not much on out-of-order ones.
>>>>
>>>
>>>

