RFR(m): 8220351: Cross-modifying code

Thu Mar 14 09:02:16 UTC 2019

On 14/03/2019 6:24 pm, Robbin Ehn wrote:
> Hi David,
> 
> On 3/14/19 5:48 AM, David Holmes wrote:
>> I don't see how it is better to cover cases that do not need to be 
>> covered. _thread_blocked is safepoint-safe while _thread_in_VM is not, 
>> but you don't need to sync with the i-stream when unblocking and 
>> returning to the VM! Only transitions back to Java need to do this. 
>> You then want to optimize things so that you only sync when needed.
> 
> I see it the other way around.
> If you transition java->vm->java, you are never safe so you don't need 
> the barrier.

True. I think it would be interesting to flesh out the complete matrix 
to see which variation requires the fewest checks. There are obviously 
cases for each approach that don't need the barrier. But given 
semantically it is the execution of Java code that requires the barrier, 
it makes more sense to me to at least initially start with the 
-return-to-java states and then from there show there are optimisations 
possible that allow you place the barriers more strategically in some of 
the safe->unsafe transitions.

> Introducing more cost to the blocked state (and the orthogonal suspended 
> state) I don't see any problems with and don't need any optimizations.

If you're arguing that "we're blocked anyway so cost doesn't matter" 
that is not always true. If you think about thread-startup handshakes 
during VM initialization blocking longer than needed can add to startup 
time.

David

> While the java->vm->java would need such.
> 
> /Robbin
> 
>>
>> But lets see how this plays out after you disarm experiments
>>
>> Thanks,
>> David
>>
>>>>
>>>> On 9/03/2019 1:24 am, Robbin Ehn wrote:
>>>>> Hi all, please review.
>>>>>
>>>>> Issue:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8220351
>>>>> Changeset:
>>>>> http://cr.openjdk.java.net/~rehn/8220351/webrev/
>>>>
>>>> src/hotspot/cpu/x86/assembler_x86.hpp
>>>>
>>>> As discussed elsewhere mfence() is no longer the advised mechanism, 
>>>> but instead a serializing instruction like cpuid must be used.
>>>
>>> AMD64
>>> "Upon release, the target thread must then execute a serializing 
>>> instruction
>>> such as CPUID or MFENCE (a locked operation is not sufficient) before 
>>> proceeding
>>> to the modified code to avoid executing a stale view of the 
>>> instructions which
>>> may have been speculatively fetched."
>>>
>>> Intel
>>> "MFENCE does not serialize the instruction stream."
>>>
>>> CPUID seems to be common ground, changed.
>>>
>>>> src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>>>
>>>> MacroAssembler::debug32 seems quite bizarre. First, pre-existing:
>>>
>>> Reverted.
>>>
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
>>>>
>>>> 2562   __ lock();
>>>> 2563   __ addl(Address(r15_thread, 
>>>> JavaThread::thread_state_offset()), 1 /* _thread_in_native_trans */);
>>>>
>>>> This changes seems incorrect. You need a storeload barrier (at 
>>>> least) after the update to thread_state but your lock() prefix would 
>>>> place barriers before I think. In any case I find the existing code 
>>>> clearer and more obviously correct.
>>>
>>> We describe this as <fence> <addl> <release (StoreStore|StoreLoad)>.
>>> So it is correct.
>>>
>>> Reverted.
>>>
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
>>>> src/hotspot/os_cpu/windows_x86/orderAccess_windows_x86.hpp
>>>> src/hotspot/share/runtime/orderAccess.cpp
>>>> src/hotspot/share/runtime/stubRoutines.cpp
>>>> src/hotspot/share/runtime/stubRoutines.hpp
>>>>
>>>> I don't understand why we need a stub for this (or for fence() for 
>>>> that matter) ??
>>>
>>> Visual studio don't support inline assembly on x64.
>>> Changing to CPUID.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp
>>>>
>>>> 1090   __ lock();
>>>> 1091   __ addl(Address(thread, JavaThread::thread_state_offset()), 1 
>>>> /*_thread_in_native_trans*/);
>>>>
>>>> Same comment as above.
>>>
>>> Reverted.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/os_cpu/aix_ppc/orderAccess_aix_ppc.hpp
>>>> src/hotspot/os_cpu/linux_ppc/orderAccess_linux_ppc.hpp
>>>>
>>>> PPC has isync for instruction stream synchronization, so I'd expect 
>>>> to see that used in OrderAccess::instruction_pipeline()
>>>
>>> I'll let PPC folks see what they need to do, but yes that was my 
>>> thought also.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/os_cpu/bsd_x86/orderAccess_bsd_x86.hpp
>>>> src/hotspot/os_cpu/linux_x86/orderAccess_linux_x86.hpp
>>>> src/hotspot/os_cpu/solaris_x86/orderAccess_solaris_x86.hpp
>>>>
>>>> s/mfence/cpuid
>>>
>>> Fixed.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/os_cpu/linux_aarch64/orderAccess_linux_aarch64.hpp
>>>> src/hotspot/os_cpu/linux_arm/orderAccess_linux_arm.hpp
>>>>
>>>> Aarch64/ARM has ISB for instruction stream sync.
>>>
>>> As Haley said, arm do not have instruction cache coherency, so ISB is 
>>> not enough in all cases.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/os_cpu/linux_s390/orderAccess_linux_s390.hpp
>>>>
>>>> Need input from s390 folk.
>>>
>>> Yes.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/os_cpu/linux_sparc/orderAccess_linux_sparc.hpp
>>>> src/hotspot/os_cpu/solaris_sparc/orderAccess_solaris_sparc.hpp
>>>>
>>>> SPARC needs a FLUSH
>>>>
>>>> "H.1.6 Self-Modifying Code
>>>> If a program includes self-modifying code, it must issue a FLUSH 
>>>> instruction for each modified doubleword of instructions (or a call 
>>>> to supervisor software having an equivalent effect)." [SPARCV9 
>>>> Architecture manual]
>>>>
>>>
>>> Yes.
>>>
>>>> ---
>>>>
>>>> src/hotspot/share/classfile/classFileParser.cpp
>>>>
>>>> Change is unrelated to this bug.
>>>
>>> It's related that without it 32-bit don't compile (without 
>>> disable-errors was warning). So I need it for test compiling. Want me 
>>> to revert it?
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/share/prims/jvmtiEnv.cpp
>>>>
>>>> The changes seem unnecessary. As the code comment states we are 
>>>> going from one safepoint unsafe state to another. There is no way to 
>>>> execute any JIT'd code before returning to Java, and that return is 
>>>> where we should sync if needed.
>>>
>>> You are correct.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/share/prims/jvmtiExport.cpp
>>>>
>>>> The change is unnecessary as we are executing in the VMThread. Not 
>>>> only would it be the one doing the code changes at a safepoint, it 
>>>> never executes compiled Java code anyway.
>>>
>>> Yes.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/share/runtime/interfaceSupport.inline.hpp
>>>> src/hotspot/share/runtime/safepoint.cpp
>>>>
>>>> My initial thought was that the OrderAccess::instruction_pipeline() 
>>>> should be buried in the safepoint blocking code, but in some cases 
>>>> we may not have actually blocked at the safepoint and so 
>>>> (theoretically at least) may need to sync with the i-stream in case 
>>>> an intervening safepoint changed some code. Can't help but worry 
>>>> that unconditionally doing this on native transitions is going to 
>>>> hurt JNI performance. It also looks like we will double up on the 
>>>> instruction_pipeline() in some cases.
>>>
>>> Yes. I had an idea with passing around a token for it, but it just 
>>> clobbers the code more.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>>
>>>> 1839   OrderAccess::instruction_pipeline();
>>>>
>>>> Can't see why this is necessary here, we should sync when returning 
>>>> to _thread_in_Java.
>>>
>>> In this changeset we always do the barrier when leaving a safe state.
>>> When you are not on threads list you are safe, so you can go 
>>> new->vm->java just missing the safepoint.
>>>
>>>>
>>>> 2471   if (thread_state() != _thread_blocked) {
>>>> 2472     OrderAccess::instruction_pipeline();
>>>> 2473   }
>>>>
>>>> Again not sure why this is necessary. The call to java_suspend_self 
>>>> in the not-blocked case comes from 
>>>> JavaThread::handle_special_runtime_exit_condition, which itself is 
>>>> called from the transition code, so I'd expect this to be done in 
>>>> the transition code.
>>>
>>> java_suspend_self is executed in multiple different states.
>>> Since you are safe when your have suspend flag set, we must emit the 
>>> barrier
>>> when leaving that safe state. Transitions from unsafe to unsafe e.g. 
>>> vm to java
>>> do not emit the barrier.
>>>
>>>>
>>>>       thread->set_thread_state(_thread_blocked);
>>>>       thread->java_suspend_self();
>>>>       thread->set_thread_state(state);
>>>> +     OrderAccess::instruction_pipeline();
>>>>
>>>> Again seems unnecessary - especially given you have one in 
>>>> java_suspend_self() already! (Notwithstanding I think that one 
>>>> should be removed.)
>>>
>>> The above one is if we are not blocked, in this case we are blocked.
>>> The reason for this is that if we are blocked via a real transition, 
>>> the transition does the instruction barrier. If you don't mind an 
>>> extra barrier we can emit it unconditionally in java_suspend_self.
>>>
>>> I think set_thread_state should be private and just the 
>>> ThreadStateTransistion should be allowed to use set_thread_state() :)
>>>
>>> Posting update to RFR when completed.
>>>
>>> /Robbin
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>> After a JavaThread have been in a safepoint/(handshake) safe state 
>>>>> it can start
>>>>> executing updated code. E.g. an oop in the instruction stream can 
>>>>> have been
>>>>> updated.
>>>>>
>>>>> Most hardware's require a barrier or that the code cross modified 
>>>>> is far away to
>>>>> guarantee that the thread executing the updated instruction stream 
>>>>> sees the
>>>>> modification.
>>>>>
>>>>> What far away is and how far an update instruction stream is from a 
>>>>> safepoint
>>>>> safe state is not clear.
>>>>>
>>>>> To be compliant with those architectures an instruction stream 
>>>>> barrier must be
>>>>> added when leaving the safepoint safe state.
>>>>>
>>>>> There may be crashes today due to this missing barrier.
>>>>> A new CPU with deeper pipeline or changes to the VM which moves a 
>>>>> safepoint safe
>>>>> state closer to a nmethod can increase changes of a crash.
>>>>>
>>>>> A few benchmarks will see a performance regression with the extra 
>>>>> serializing,
>>>>> such as Octane-Crypto -5% (x86).
>>>>>
>>>>> With a much more elaborate fix, such as changing the local poll to 
>>>>> have more
>>>>> than two states (armed/disarmed), it would be possible to only emit 
>>>>> such
>>>>> instruction stream barrier when the poll have been armed while the 
>>>>> thread was
>>>>> safepoint safe.
>>>>>
>>>>> Passes t1-3 on our platforms, and test built what I can.
>>>>>
>>>>> Thanks, Robbin