RFR(m): 8220351: Cross-modifying code

Tue Mar 19 14:30:20 UTC 2019

Hi Robbin,

the lazy disarm numbers look good. Looks like many safepoints don't make it slower.
Thanks for changing and for sharing these numbers.

I'm glad you can also use the lazy disarm for other purposes (removal of suspend flags).

Best regards,
Martin

-----Original Message-----
From: Robbin Ehn <robbin.ehn at oracle.com> 
Sent: Dienstag, 19. März 2019 15:11
To: Doerr, Martin <martin.doerr at sap.com>; David Holmes <david.holmes at oracle.com>; Reingruber, Richard <richard.reingruber at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
Subject: Re: RFR(m): 8220351: Cross-modifying code

Hi Martin,

I'm running the final testing of v2 which includes the lazy disarm, since as you
say, is costly.
The numbers I'm seeing is (sps = safepoints per second):

Baseline 1sps
Trans.TotalMemory   avgt    2  47.634          ns/op
Trans.nativeMethod  avgt    2  19.755          ns/op

Baseline 100sps
Trans.TotalMemory   avgt    2  48.099          ns/op
Trans.nativeMethod  avgt    2  20.342          ns/op

Always CPUID 1sps
Trans.TotalMemory   avgt    2  206.012          ns/op
Trans.nativeMethod  avgt    2  131.647          ns/op

Always CPUID 100sps
Trans.TotalMemory   avgt    2  211.741          ns/op
Trans.nativeMethod  avgt    2  127.572          ns/op

Lazy disarm 1sps
Trans.TotalMemory   avgt    2  49.111          ns/op
Trans.nativeMethod  avgt    2  19.734          ns/op

Lazy disarm 100sps
Trans.TotalMemory   avgt    2  50.568          ns/op
Trans.nativeMethod  avgt    2  20.335          ns/op

So hopefully it will be pretty unnoticeable.

Thanks, Robbin

On 3/19/19 3:04 PM, Doerr, Martin wrote:
> Hi Robbin,
> 
> I just tried a small stress test [1] with (+UseNewCode) and without added CPUID instruction in native wrapper [2].
> 
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions StressJVMEntries
> result: 2113929216, duration: 40.1181962
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode StressJVMEntries
> result: 2113929216, duration: 76.0759127
> 
> Executed on 40 core 2-socket Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz.
> So I guess nobody wants CPUID on the regular path if it can be avoided.
> 
> Best regards,
> Martin
> 
> 
> 
> [1] StressJVMEntries.java:
> public class StressJVMEntries{
> 
>      static long sum = 0;
>      static final int loop_cnt = 10000000;
> 
>      public static void main(String args[]){
>          Runtime rt = Runtime.getRuntime();
>          long duration = System.nanoTime();
>          for (int x=0; x<loop_cnt; x++) sum += rt.totalMemory();
>          duration = System.nanoTime() - duration;
>          System.out.println("result: " + sum/loop_cnt + ", duration: " + (double)duration/loop_cnt);
>      }
> }
> 
> 
> [2] patch:
> diff -r 3827cd66e788 src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
> --- a/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Mon Mar 18 16:04:23 2019 +0100
> +++ b/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Tue Mar 19 15:02:31 2019 +0100
> @@ -2607,6 +2607,18 @@
>       }
> 
>       __ bind(Continue);
> +    if (UseNewCode) {
> +      __ push(rax);
> +      __ push(rbx);
> +      __ push(rcx);
> +      __ push(rdx);
> +      __ xorl(rax, rax);
> +      __ cpuid();
> +      __ pop(rdx);
> +      __ pop(rcx);
> +      __ pop(rbx);
> +      __ pop(rax);
> +    }
>     }
> 
>     // change thread state
> 
> 
> 
> -----Original Message-----
> From: Robbin Ehn <robbin.ehn at oracle.com>
> Sent: Freitag, 15. März 2019 13:32
> To: David Holmes <david.holmes at oracle.com>; Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(m): 8220351: Cross-modifying code
> 
>>>
>>> The last point was rather exploring the space of legal solutions than a
>>> suggestion to solve the issue. I'd favor your approach if you manage to do the
>>> serialization conditionally (BTW: a mechanism similar to
>>> JavaThread::*deopt_suspend*
>>> would be yet another way to do that).
> 
> FYI: In the long run I'm removing the suspend-flags and turning suspend/deopt/..
> into handshakes instead. So we only have the poll to check.
> 
> I'm testing a change where I removed the changes to the native wrappers,
> leaving native threads armed (as Martin suggested) and in the poll path notice
> that we are still armed (when returning from a safepoint/handshake we should be
> disarmed). Disarm yourself, re-check safepoint/handshake, execute instruction
> barrier.
> So we don't need any changes in any native wrapper, we handle the barrier in the
> polling code instead.
> 
> I have just one issue to sort-out with that and benchmarking.
> The v2 patch (not yet published) didn't show to bad of a result when we always
> emit cpuid, but there is no super JNI heavy test in that suite.
> 
> /Robbin
> 
>>>
>>> Cheers, Richard.
>>>
>>> [1] 8.1.3 Handling Self- and Cross-Modifying Code
>>>       in Intel(R) 64 and IA-32 ArchitecturesSoftware Developer's Manual. Volume 3
>>>       (3A, 3B, 3C & 3D): System Programming Guide
>>>       
>>> https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf
>>>
>>>       on https://software.intel.com/en-us/articles/intel-sdm
>>>
>>> -----Original Message-----
>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>> Sent: Donnerstag, 14. März 2019 13:24
>>> To: Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin
>>> <martin.doerr at sap.com>; David Holmes <david.holmes at oracle.com>;
>>> hotspot-dev at openjdk.java.net
>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>
>>> Hi Richard,
>>>
>>> On 3/14/19 11:12 AM, Reingruber, Richard wrote:
>>>> Hi Robbin,
>>>>
>>>>      > the reason I heard for not doing invalidate_range/word in nmethod::oops_do
>>>>      > is that it was deemed that a safepoint is 'long enough'.
>>>>
>>>> This doesn't seem to be a very good reason. I did a little modification to your
>>>> initial figure:
>>>
>>> That's why we are adding the instruction barrier!
>>>
>>>>
>>>> JavaThread:             | VMThread
>>>> StoreLoad               |
>>>>                            | update <immediate oop>
>>>>                            | ? Global ICache Invalidate ?
>>>>                            | disarm
>>>> load thread_poll        |
>>>>
>>>> Question would be: does an ICache flush happen after the immediate oop update?
>>>> I would consider it problematic if not. Even if the safepoint was very long.
>>>
>>> The icache flush assumes it can stop speculation and flush instruction cache in
>>> patchee thread.
>>> Correct me if anything is wrong:
>>> - sparc needs flush <adr> in all patchee threads.
>>> - arm/aarch64 needs isb + ___clear_cache(adr,..) in all patchee threads.
>>> - ppc needs isync in all patchee threads?
>>> - s390 needs zarch_sync all in patchee threads?
>>> - x86 needs cpuid in all in patchee threads.
>>>
>>> So a "Global ICache Invalidate" done by patcher thread (VMThread in this case)
>>> have no effect in patchee threads!?
>>>
>>> Adding a few words on x86 ICache::Invalidate/word.
>>> We have TSO, so if the patchee thread can see poll is disarmed all other stores
>>> are visible. I have seen no case ICache::Invalidate/word would make a difference
>>> from a StoreStore/compiler barrier (still talking about platform code for x86) ?
>>> The problem is if we prefect an instruction over the load of the poll.
>>> If poll is disarmed this perfected instruction can be fully executed.
>>> If that happens to contain an immediate oop, we don't know if it's the patched
>>> oop or not.
>>>
>>> /Robbin
>>>
>>>>
>>>> Cheers Richard.
>>>>
>>>> -----Original Message-----
>>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>>> Sent: Donnerstag, 14. März 2019 10:34
>>>> To: Doerr, Martin <martin.doerr at sap.com>; David Holmes
>>>> <david.holmes at oracle.com>; hotspot-dev at openjdk.java.net
>>>> Cc: Reingruber, Richard <richard.reingruber at sap.com>
>>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>>
>>>> Hi Martin,
>>>>
>>>> On 3/14/19 10:25 AM, Doerr, Martin wrote:
>>>>> Hi Robbin,
>>>>>
>>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>> Sure, we can keep the modification side discussion separate. This change is
>>>>> already difficult enough ��
>>>>>
>>>>>
>>>>> Btw. I haven't sent the ppc/s390 instruction cache barriers. Here they are:
>>>>>
>>>>> orderAccess_aix/linux_ppc.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_isync(); }
>>>>>
>>>>> orderAccess_linux_s390.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_zarch_sync(); }
>>>>>
>>>>
>>>> Thanks!
>>>>
>>>>>> nmethod::oops_do calls the closure f->do_oop(r->oop_addr()) for oops
>>>>>> encoded into the instruction stream.
>>>>>> I couldn't find a call to AbstractICache::invalidate_range.
>>>>
>>>> I think I miss-understood you here, the reason I heard for not doing
>>>> invalidate_range/word in nmethod::oops_do is that it was deemed that a
>>>> safepoint is 'long enough'.
>>>>
>>>> /Robbin
>>>>
>>>>>
>>>>> Look for ICache::invalidate_range, I see 11 calls on x64.
>>>>> And one call to AbstractICache/ICache::invalidate_word.
>>>>> Btw I have spent some time looking at x86/icache_x86.cpp...
>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>>
>>>>> I'll create issues for some of these topics, which if nothing else should be
>>>>> discussed.
>>>>>
>>>>> /Robbin
>>>>>
>>>>>>
>>>>>> Does anybody know where it is or if it's missing?
>>>>>>
>>>>>> Best regards,
>>>>>> Martin
>>>>>>