RFR(m): 8220351: Cross-modifying code

Tue Mar 19 14:11:29 UTC 2019

Hi Martin,

I'm running the final testing of v2 which includes the lazy disarm, since as you
say, is costly.
The numbers I'm seeing is (sps = safepoints per second):

Baseline 1sps
Trans.TotalMemory   avgt    2  47.634          ns/op
Trans.nativeMethod  avgt    2  19.755          ns/op

Baseline 100sps
Trans.TotalMemory   avgt    2  48.099          ns/op
Trans.nativeMethod  avgt    2  20.342          ns/op

Always CPUID 1sps
Trans.TotalMemory   avgt    2  206.012          ns/op
Trans.nativeMethod  avgt    2  131.647          ns/op

Always CPUID 100sps
Trans.TotalMemory   avgt    2  211.741          ns/op
Trans.nativeMethod  avgt    2  127.572          ns/op

Lazy disarm 1sps
Trans.TotalMemory   avgt    2  49.111          ns/op
Trans.nativeMethod  avgt    2  19.734          ns/op

Lazy disarm 100sps
Trans.TotalMemory   avgt    2  50.568          ns/op
Trans.nativeMethod  avgt    2  20.335          ns/op

So hopefully it will be pretty unnoticeable.

Thanks, Robbin

On 3/19/19 3:04 PM, Doerr, Martin wrote:
> Hi Robbin,
> 
> I just tried a small stress test [1] with (+UseNewCode) and without added CPUID instruction in native wrapper [2].
> 
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions StressJVMEntries
> result: 2113929216, duration: 40.1181962
> jdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode StressJVMEntries
> result: 2113929216, duration: 76.0759127
> 
> Executed on 40 core 2-socket Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz.
> So I guess nobody wants CPUID on the regular path if it can be avoided.
> 
> Best regards,
> Martin
> 
> 
> 
> [1] StressJVMEntries.java:
> public class StressJVMEntries{
> 
>      static long sum = 0;
>      static final int loop_cnt = 10000000;
> 
>      public static void main(String args[]){
>          Runtime rt = Runtime.getRuntime();
>          long duration = System.nanoTime();
>          for (int x=0; x<loop_cnt; x++) sum += rt.totalMemory();
>          duration = System.nanoTime() - duration;
>          System.out.println("result: " + sum/loop_cnt + ", duration: " + (double)duration/loop_cnt);
>      }
> }
> 
> 
> [2] patch:
> diff -r 3827cd66e788 src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
> --- a/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Mon Mar 18 16:04:23 2019 +0100
> +++ b/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Tue Mar 19 15:02:31 2019 +0100
> @@ -2607,6 +2607,18 @@
>       }
> 
>       __ bind(Continue);
> +    if (UseNewCode) {
> +      __ push(rax);
> +      __ push(rbx);
> +      __ push(rcx);
> +      __ push(rdx);
> +      __ xorl(rax, rax);
> +      __ cpuid();
> +      __ pop(rdx);
> +      __ pop(rcx);
> +      __ pop(rbx);
> +      __ pop(rax);
> +    }
>     }
> 
>     // change thread state
> 
> 
> 
> -----Original Message-----
> From: Robbin Ehn <robbin.ehn at oracle.com>
> Sent: Freitag, 15. März 2019 13:32
> To: David Holmes <david.holmes at oracle.com>; Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin <martin.doerr at sap.com>; Andrew Haley <aph at redhat.com>; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(m): 8220351: Cross-modifying code
> 
>>>
>>> The last point was rather exploring the space of legal solutions than a
>>> suggestion to solve the issue. I'd favor your approach if you manage to do the
>>> serialization conditionally (BTW: a mechanism similar to
>>> JavaThread::*deopt_suspend*
>>> would be yet another way to do that).
> 
> FYI: In the long run I'm removing the suspend-flags and turning suspend/deopt/..
> into handshakes instead. So we only have the poll to check.
> 
> I'm testing a change where I removed the changes to the native wrappers,
> leaving native threads armed (as Martin suggested) and in the poll path notice
> that we are still armed (when returning from a safepoint/handshake we should be
> disarmed). Disarm yourself, re-check safepoint/handshake, execute instruction
> barrier.
> So we don't need any changes in any native wrapper, we handle the barrier in the
> polling code instead.
> 
> I have just one issue to sort-out with that and benchmarking.
> The v2 patch (not yet published) didn't show to bad of a result when we always
> emit cpuid, but there is no super JNI heavy test in that suite.
> 
> /Robbin
> 
>>>
>>> Cheers, Richard.
>>>
>>> [1] 8.1.3 Handling Self- and Cross-Modifying Code
>>>       in Intel(R) 64 and IA-32 ArchitecturesSoftware Developer's Manual. Volume 3
>>>       (3A, 3B, 3C & 3D): System Programming Guide
>>>       
>>> https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf
>>>
>>>       on https://software.intel.com/en-us/articles/intel-sdm
>>>
>>> -----Original Message-----
>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>> Sent: Donnerstag, 14. März 2019 13:24
>>> To: Reingruber, Richard <richard.reingruber at sap.com>; Doerr, Martin
>>> <martin.doerr at sap.com>; David Holmes <david.holmes at oracle.com>;
>>> hotspot-dev at openjdk.java.net
>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>
>>> Hi Richard,
>>>
>>> On 3/14/19 11:12 AM, Reingruber, Richard wrote:
>>>> Hi Robbin,
>>>>
>>>>      > the reason I heard for not doing invalidate_range/word in nmethod::oops_do
>>>>      > is that it was deemed that a safepoint is 'long enough'.
>>>>
>>>> This doesn't seem to be a very good reason. I did a little modification to your
>>>> initial figure:
>>>
>>> That's why we are adding the instruction barrier!
>>>
>>>>
>>>> JavaThread:             | VMThread
>>>> StoreLoad               |
>>>>                            | update <immediate oop>
>>>>                            | ? Global ICache Invalidate ?
>>>>                            | disarm
>>>> load thread_poll        |
>>>>
>>>> Question would be: does an ICache flush happen after the immediate oop update?
>>>> I would consider it problematic if not. Even if the safepoint was very long.
>>>
>>> The icache flush assumes it can stop speculation and flush instruction cache in
>>> patchee thread.
>>> Correct me if anything is wrong:
>>> - sparc needs flush <adr> in all patchee threads.
>>> - arm/aarch64 needs isb + ___clear_cache(adr,..) in all patchee threads.
>>> - ppc needs isync in all patchee threads?
>>> - s390 needs zarch_sync all in patchee threads?
>>> - x86 needs cpuid in all in patchee threads.
>>>
>>> So a "Global ICache Invalidate" done by patcher thread (VMThread in this case)
>>> have no effect in patchee threads!?
>>>
>>> Adding a few words on x86 ICache::Invalidate/word.
>>> We have TSO, so if the patchee thread can see poll is disarmed all other stores
>>> are visible. I have seen no case ICache::Invalidate/word would make a difference
>>> from a StoreStore/compiler barrier (still talking about platform code for x86) ?
>>> The problem is if we prefect an instruction over the load of the poll.
>>> If poll is disarmed this perfected instruction can be fully executed.
>>> If that happens to contain an immediate oop, we don't know if it's the patched
>>> oop or not.
>>>
>>> /Robbin
>>>
>>>>
>>>> Cheers Richard.
>>>>
>>>> -----Original Message-----
>>>> From: Robbin Ehn <robbin.ehn at oracle.com>
>>>> Sent: Donnerstag, 14. März 2019 10:34
>>>> To: Doerr, Martin <martin.doerr at sap.com>; David Holmes
>>>> <david.holmes at oracle.com>; hotspot-dev at openjdk.java.net
>>>> Cc: Reingruber, Richard <richard.reingruber at sap.com>
>>>> Subject: Re: RFR(m): 8220351: Cross-modifying code
>>>>
>>>> Hi Martin,
>>>>
>>>> On 3/14/19 10:25 AM, Doerr, Martin wrote:
>>>>> Hi Robbin,
>>>>>
>>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>> Sure, we can keep the modification side discussion separate. This change is
>>>>> already difficult enough ��
>>>>>
>>>>>
>>>>> Btw. I haven't sent the ppc/s390 instruction cache barriers. Here they are:
>>>>>
>>>>> orderAccess_aix/linux_ppc.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_isync(); }
>>>>>
>>>>> orderAccess_linux_s390.hpp:
>>>>> inline void OrderAccess::instruction_pipeline() { inlasm_zarch_sync(); }
>>>>>
>>>>
>>>> Thanks!
>>>>
>>>>>> nmethod::oops_do calls the closure f->do_oop(r->oop_addr()) for oops
>>>>>> encoded into the instruction stream.
>>>>>> I couldn't find a call to AbstractICache::invalidate_range.
>>>>
>>>> I think I miss-understood you here, the reason I heard for not doing
>>>> invalidate_range/word in nmethod::oops_do is that it was deemed that a
>>>> safepoint is 'long enough'.
>>>>
>>>> /Robbin
>>>>
>>>>>
>>>>> Look for ICache::invalidate_range, I see 11 calls on x64.
>>>>> And one call to AbstractICache/ICache::invalidate_word.
>>>>> Btw I have spent some time looking at x86/icache_x86.cpp...
>>>>> But that is whole new subject and let's keep that separate from this :)
>>>>>
>>>>> I'll create issues for some of these topics, which if nothing else should be
>>>>> discussed.
>>>>>
>>>>> /Robbin
>>>>>
>>>>>>
>>>>>> Does anybody know where it is or if it's missing?
>>>>>>
>>>>>> Best regards,
>>>>>> Martin
>>>>>>