RFR(m): 8220351: Cross-modifying code
Doerr, Martin
martin.doerr at sap.com
Tue Mar 19 14:30:20 UTC 2019
Hi Robbin,
the lazy disarm numbers look good. Looks like many safepoints don't make it slower.
Thanks for changing and for sharing these numbers.
I'm glad you can also use the lazy disarm for other purposes (removal of suspend flags).
Best regards,
Martin
-----Original Message-----
From: Robbin Ehn <robbin.ehn at oracle.com>
Sent: Dienstag, 19. März 2019 15:11
To: Doerr, Martin <martin.doerr at sap.com>; David Holmes <david.holmes at oracle.com>; Reingruber, Richard <richard.reingruber at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
Subject: Re: RFR(m): 8220351: Cross-modifying code
Hi Martin,
I'm running the final testing of v2 which includes the lazy disarm, since as you
say, is costly.
The numbers I'm seeing is (sps = safepoints per second):
Baseline 1sps
Trans.TotalMemory avgt 2 47.634 ns/op
Trans.nativeMethod avgt 2 19.755 ns/op
Baseline 100sps
Trans.TotalMemory avgt 2 48.099 ns/op
Trans.nativeMethod avgt 2 20.342 ns/op
Always CPUID 1sps
Trans.TotalMemory avgt 2 206.012 ns/op
Trans.nativeMethod avgt 2 131.647 ns/op
Always CPUID 100sps
Trans.TotalMemory avgt 2 211.741 ns/op
Trans.nativeMethod avgt 2 127.572 ns/op
Lazy disarm 1sps
Trans.TotalMemory avgt 2 49.111 ns/op
Trans.nativeMethod avgt 2 19.734 ns/op
Lazy disarm 100sps
Trans.TotalMemory avgt 2 50.568 ns/op
Trans.nativeMethod avgt 2 20.335 ns/op
So hopefully it will be pretty unnoticeable.
Thanks, Robbin
On 3/19/19 3:04 PM, Doerr, Martin wrote:
> Hi Robbin,
>
> I just tried a small stress test [1] with (+UseNewCode) and without added CPUID instruction in native wrapper [2].
>
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions StressJVMEntries
> result: 2113929216, duration: 40.1181962
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode StressJVMEntries
> result: 2113929216, duration: 76.0759127
>
> Executed on 40 core 2-socket Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz.
> So I guess nobody wants CPUID on the regular path if it can be avoided.
>
> Best regards,
> Martin
>
>
>
> [1] StressJVMEntries.java:
> public class StressJVMEntries{
>
> static long sum = 0;
> static final int loop_cnt = 10000000;
>
> public static void main(String args[]){
> Runtime rt = Runtime.getRuntime();
> long duration = System.nanoTime();
> for (int x=0; x<loop_cnt; x++) sum += rt.totalMemory();
> duration = System.nanoTime() - duration;
> System.out.println("result: " + sum/loop_cnt + ", duration: " + (double)duration/loop_cnt);
> }
> }
>
>
> [2] patch:
> diff -r 3827cd66e788 src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
> --- a/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp Mon Mar 18 16:04:23 2019 +0100
> +++ b/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp Tue Mar 19 15:02:31 2019 +0100
> @@ -2607,6 +2607,18 @@
> }
>
> __ bind(Continue);
> + if (UseNewCode) {
> + __ push(rax);
> + __ push(rbx);
> + __ push(rcx);
> + __ push(rdx);
> + __ xorl(rax, rax);
> + __ cpuid();
> + __ pop(rdx);
> + __ pop(rcx);
> + __ pop(rbx);
> + __ pop(rax);
> + }
> }
>
> // change thread state
>
>
>
> -----Original Message-----
> From: Robbin Ehn <robbin.ehn at oracle.com>
> Sent: Freitag, 15. März 2019 13:32
> To: David Holmes <david.holmes at oracle.com>; Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(m): 8220351: Cross-modifying code
>
>>>
>>> The last point was rather exploring the space of legal solutions than a
>>> suggestion to solve the issue. I'd favor your approach if you manage to do the
>>> serialization conditionally (BTW: a mechanism similar to
>>> JavaThread::*deopt_suspend*
>>> would be yet another way to do that).
>
> FYI: In the long run I'm removing the suspend-flags and turning suspend/deopt/..
> into handshakes instead. So we only have the poll to check.
>
> I'm testing a change where I removed the changes to the native wrappers,
> leaving native threads armed (as Martin suggested) and in the poll path notice
> that we are still armed (when returning from a safepoint/handshake we should be
> disarmed). Disarm yourself, re-check safepoint/handshake, execute instruction
> barrier.
> So we don't need any changes in any native wrapper, we handle the barrier in the
> polling code instead.
>
> I have just one issue to sort-out with that and benchmarking.
> The v2 patch (not yet published) didn't show to bad of a result when we always
> emit cpuid, but there is no super JNI heavy test in that suite.
>
> /Robbin
>
>>>
>>> Cheers, Richard.
>>>
>>> [1] 8.1.3 Handling Self- and Cross-Modifying Code
>>> in Intel(R) 64 and IA-32 ArchitecturesSoftware Developer's Manual. Volume 3
>>> (3A, 3B, 3C & 3D): System Programming Guide
>>>
>>> https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf
>>>
>>> on https://software.intel.com/en-us/articles/intel-sdm
>>>
>>> -----Original Message-----
>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>> Sent: Donnerstag, 14. März 2019 13:24
>>> To: Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin
>>> <martin.doerr at sap.com>; David Holmes <david.holmes at oracle.com>;
>>> hotspot-dev at openjdk.java.net
>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>
>>> Hi Richard,
>>>
>>> On 3/14/19 11:12 AM, Reingruber, Richard wrote:
>>>> Hi Robbin,
>>>>
>>>> > the reason I heard for not doing invalidate_range/word in nmethod::oops_do
>>>> > is that it was deemed that a safepoint is 'long enough'.
>>>>
>>>> This doesn't seem to be a very good reason. I did a little modification to your
>>>> initial figure:
>>>
>>> That's why we are adding the instruction barrier!
>>>
>>>>
>>>> JavaThread: | VMThread
>>>> StoreLoad |
>>>> | update <immediate oop>
>>>> | ? Global ICache Invalidate ?
>>>> | disarm
>>>> load thread_poll |
>>>>
>>>> Question would be: does an ICache flush happen after the immediate oop update?
>>>> I would consider it problematic if not. Even if the safepoint was very long.
>>>
>>> The icache flush assumes it can stop speculation and flush instruction cache in
>>> patchee thread.
>>> Correct me if anything is wrong:
>>> - sparc needs flush <adr> in all patchee threads.
>>> - arm/aarch64 needs isb + ___clear_cache(adr,..) in all patchee threads.
>>> - ppc needs isync in all patchee threads?
>>> - s390 needs zarch_sync all in patchee threads?
>>> - x86 needs cpuid in all in patchee threads.
>>>
>>> So a "Global ICache Invalidate" done by patcher thread (VMThread in this case)
>>> have no effect in patchee threads!?
>>>
>>> Adding a few words on x86 ICache::Invalidate/word.
>>> We have TSO, so if the patchee thread can see poll is disarmed all other stores
>>> are visible. I have seen no case ICache::Invalidate/word would make a difference
>>> from a StoreStore/compiler barrier (still talking about platform code for x86) ?
>>> The problem is if we prefect an instruction over the load of the poll.
>>> If poll is disarmed this perfected instruction can be fully executed.
>>> If that happens to contain an immediate oop, we don't know if it's the patched
>>> oop or not.
>>>
>>> /Robbin
>>>
>>>>
>>>> Cheers Richard.
>>>>
>>>> -----Original Message-----
>>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>>> Sent: Donnerstag, 14. März 2019 10:34
>>>> To: Doerr, Martin <martin.doerr at sap.com>; David Holmes
>>>> <david.holmes at oracle.com>; hotspot-dev at openjdk.java.net
>>>> Cc: Reingruber, Richard <richard.reingruber at sap.com>
>>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>>
>>>> Hi Martin,
>>>>
>>>> On 3/14/19 10:25 AM, Doerr, Martin wrote:
>>>>> Hi Robbin,
>>>>>
>>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>> Sure, we can keep the modification side discussion separate. This change is
>>>>> already difficult enough
>>>>>
>>>>>
>>>>> Btw. I haven't sent the ppc/s390 instruction cache barriers. Here they are:
>>>>>
>>>>> orderAccess_aix/linux_ppc.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_isync(); }
>>>>>
>>>>> orderAccess_linux_s390.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_zarch_sync(); }
>>>>>
>>>>
>>>> Thanks!
>>>>
>>>>>> nmethod::oops_do calls the closure f->do_oop(r->oop_addr()) for oops
>>>>>> encoded into the instruction stream.
>>>>>> I couldn't find a call to AbstractICache::invalidate_range.
>>>>
>>>> I think I miss-understood you here, the reason I heard for not doing
>>>> invalidate_range/word in nmethod::oops_do is that it was deemed that a
>>>> safepoint is 'long enough'.
>>>>
>>>> /Robbin
>>>>
>>>>>
>>>>> Look for ICache::invalidate_range, I see 11 calls on x64.
>>>>> And one call to AbstractICache/ICache::invalidate_word.
>>>>> Btw I have spent some time looking at x86/icache_x86.cpp...
>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>>
>>>>> I'll create issues for some of these topics, which if nothing else should be
>>>>> discussed.
>>>>>
>>>>> /Robbin
>>>>>
>>>>>>
>>>>>> Does anybody know where it is or if it's missing?
>>>>>>
>>>>>> Best regards,
>>>>>> Martin
>>>>>>
More information about the hotspot-dev
mailing list