From Pengfei.Li at arm.com Wed Apr 1 02:05:04 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Wed, 1 Apr 2020 02:05:04 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> References: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> Message-ID: Hi Andrew, Thanks for review. > INSN(absr, 0, 0b100000101110, 1); // accepted arrangements: T8B, T16B, > T4H, T8H, T4S > - INSN(negr, 1, 0b100000101110, 2); // accepted arrangements: T8B, T16B, > T4H, T8H, T2S, T4S, T2D > > is actually related to some other work you are doing? This change is related to - if (accepted < 2) guarantee(T != T2S && T != T2D, "incorrect arrangement"); \ - if (accepted == 0) guarantee(T == T8B || T == T16B, "incorrect arrangement"); \ + if (accepted < 3) guarantee(T != T2D, "incorrect arrangement"); \ + if (accepted < 2) guarantee(T != T2S, "incorrect arrangement"); \ + if (accepted < 1) guarantee(T == T8B || T == T16B, "incorrect arrangement"); \ Before my patch, the candidate values of "accepted" are 0, 1 and 2 meaning different accepted arrangements as below: 0 - Only T8B and T16B are accepted 1 - All arrangements but T2S and T2D are accepted 2 - All arrangements are accepted In my patch, the newly added instruction UADDLP supports T2S but doesn't support T2D. So I changed the value range to 0 - 3, where 3 means all arrangements are accepted now. That's why the value for parameter "accepted" of NEGR is promoted from 2 to 3 now. 
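The 0 - 3 encoding of "accepted" described above can be modeled standalone. The following is a minimal sketch; the enum and function names are invented for illustration and this is not the actual HotSpot assembler code, only the guarantee chain quoted above:

```cpp
#include <cassert>

// Hypothetical standalone model of the "accepted" arrangement levels.
enum Arrangement { T8B, T16B, T4H, T8H, T2S, T4S, T2D };

// Mirrors the patched guarantee chain:
//   accepted < 3 -> T2D rejected
//   accepted < 2 -> T2S rejected
//   accepted < 1 -> only T8B/T16B allowed
bool arrangement_accepted(int accepted, Arrangement T) {
  if (accepted < 3 && T == T2D) return false;
  if (accepted < 2 && T == T2S) return false;
  if (accepted < 1 && !(T == T8B || T == T16B)) return false;
  return true;
}
```

Under this model, UADDLP would use accepted == 2 (T2S allowed, T2D rejected), and NEGR moves from 2 to 3 so that T2D remains legal.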
-- Thanks, Pengfei From richard.reingruber at sap.com Wed Apr 1 06:15:12 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 1 Apr 2020 06:15:12 +0000 Subject: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents In-Reply-To: References: <1f8a3c7a-fa0f-b5b2-4a8a-7d3d8dbbe1b5@oracle.com> <4b56a45c-a14c-6f74-2bfd-25deaabe8201@oracle.com> <5271429a-481d-ddb9-99dc-b3f6670fcc0b@oracle.com> Message-ID: Hi Martin, > thanks for addressing all my points. I've looked over webrev.5 and I'm satisfied with your changes. Thanks! > I had also promised to review the tests. Thanks++ I appreciate it very much, the tests are many lines of code. > test/jdk/com/sun/jdi/EATests.java > This is a substantial amount of tests which is appropriate for such a large change. Skipping some subtests with UseJVMCICompiler makes sense because it doesn't provide the necessary JVMTI functionality, yet. > Nice work! > I also like that you test with and without BiasedLocking. Your tests will still be fine after BiasedLocking deprecation. Hope so :) > Very minor nits: > - 2 typos in comment above EARelockingNestedInflatedTarget: "lockes are ommitted" (sounds funny) > - You sometimes write "graal" and sometimes "Graal". I guess the capital G is better. (Also in EATestsJVMCI.java.) > test/jdk/com/sun/jdi/EATestsJVMCI.java > EATests with Graal enabled. Nice that you support Graal to some extent. Maybe Graal folks want to enhance them in the future. I think this is a good starting point. Will change this in the next webrev. > Conclusion: Looks good and not trivial :-) > Now, you have one full review. I'd be ok with covering 2nd review by partial reviews. > Compiler and JVMTI parts are not too complicated IMHO. > Runtime part should get at least one additional careful review. Thanks a lot, Richard. -----Original Message----- From: Doerr, Martin Sent: Dienstag, 31.
März 2020 16:01 To: Reingruber, Richard ; 'Robbin Ehn' ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents Hi Richard, thanks for addressing all my points. I've looked over webrev.5 and I'm satisfied with your changes. I had also promised to review the tests. test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysisEnabled.java Thanks for updating the @summary comment. Looks good in webrev.5. test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnalysisEnabled.c JVMTI agent for object tagging and heap iteration. Good. test/jdk/com/sun/jdi/EATests.java This is a substantial amount of tests which is appropriate for such a large change. Skipping some subtests with UseJVMCICompiler makes sense because it doesn't provide the necessary JVMTI functionality, yet. Nice work! I also like that you test with and without BiasedLocking. Your tests will still be fine after BiasedLocking deprecation. Very minor nits: - 2 typos in comment above EARelockingNestedInflatedTarget: "lockes are ommitted" (sounds funny) - You sometimes write "graal" and sometimes "Graal". I guess the capital G is better. (Also in EATestsJVMCI.java.) test/jdk/com/sun/jdi/EATestsJVMCI.java EATests with Graal enabled. Nice that you support Graal to some extent. Maybe Graal folks want to enhance them in the future. I think this is a good starting point. Conclusion: Looks good and not trivial :-) Now, you have one full review. I'd be ok with covering 2nd review by partial reviews. Compiler and JVMTI parts are not too complicated IMHO. Runtime part should get at least one additional careful review. Best regards, Martin > -----Original Message----- > From: Reingruber, Richard > Sent: Montag, 30.
März 2020 10:32 > To: Doerr, Martin ; 'Robbin Ehn' > ; Lindenmaier, Goetz > ; David Holmes ; > Vladimir Kozlov (vladimir.kozlov at oracle.com) > ; serviceability-dev at openjdk.java.net; > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance > in the Presence of JVMTI Agents > > Hi, > > this is webrev.5 based on Robbin's feedback and Martin's review - thanks! :) > > The change affects jvmti, hotspot and c2. Partial reviews are very welcome > too. > > Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/ > Delta: > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5.inc/ > > Robbin, Martin, please let me know, if anything shouldn't be quite as you > wanted it. Also find my > comments on your feedback below. > > Robbin, can I count you as Reviewer for the runtime part? > > Thanks, Richard. > > -- > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > You can move both declaration and definition to that file, no need to > clobber > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > Done. > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its > own > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > I moved JvmtiDeferredUpdates to vframe_hp.hpp where preexisting > jvmtiDeferredLocalVariableSet is > declared. > > > src/hotspot/share/code/compiledMethod.cpp > > Nice cleanup! > > Thanks :) > > > src/hotspot/share/code/debugInfoRec.cpp > > src/hotspot/share/code/debugInfoRec.hpp > > Additional parameters. (Remark: I think "non_global_escape_in_scope" > would read better than "not_global_escape_in_scope", but your version is > consistent with existing code, so no change request from my side.) Ok. > > I've been thinking about this too and finally stayed with > not_global_escape_in_scope.
It's supposed > to mean an object whose escape state is not GlobalEscape is in scope. > > > src/hotspot/share/compiler/compileBroker.cpp > > src/hotspot/share/compiler/compileBroker.hpp > > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into > a follow up change together with the test in order to make this webrev > smaller, but since it is included, I'm reviewing everything at once. Not a big > deal.) Ok. > > Yes the change would be a little smaller. And if it helps I'll split it off. In > general I prefer > patches that bring along a suitable amount of tests. > > > src/hotspot/share/opto/c2compiler.cpp > > Make do_escape_analysis independent of JVMCI capabilities. Nice! > > It is the main goal of the enhancement. It is done for C2, but could be done > for JVMCI compilers > with just a small effort as well. > > > src/hotspot/share/opto/escape.cpp > > Annotation for MachSafePointNodes. Your added functionality looks > correct. > > But I'd prefer to move the bulky code out of the large function. > > I suggest to factor out something like has_not_global_escape and > has_arg_escape. So the code could look like this: > > SafePointNode* sfn = sfn_worklist.at(next); > > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > > if (sfn->is_CallJava()) { > > CallJavaNode* call = sfn->as_CallJava(); > > call->set_arg_escape(has_arg_escape(call)); > > } > > This would also allow us to get rid of the found_..._escape_in_args > variables making the loops better readable. > > Done. > > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems > to be the way to do it (there are more such places). So it's ok. > > Yeah. I copied the snippet. > > > src/hotspot/share/prims/jvmtiImpl.cpp > > src/hotspot/share/prims/jvmtiImpl.hpp > > The sequence is pretty complex: > > VM_GetOrSetLocal element initialization executes EscapeBarrier code > which suspends the target thread (extra VM Operation). 
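The escape.cpp refactoring suggested above (factoring out has_not_global_escape and has_arg_escape) can be sketched standalone. This is only an illustration of the predicates; the enum and the vector of escape states are invented stand-ins for C2's ConnectionGraph queries, not actual HotSpot code:

```cpp
#include <cassert>
#include <vector>

// Invented stand-in for C2's per-object escape state.
enum EscapeState { NoEscape, ArgEscape, GlobalEscape };

// True if any object in scope at the safepoint has an escape state other
// than GlobalEscape (i.e. it may need object deoptimization / relocking).
bool has_not_global_escape(const std::vector<EscapeState>& objs_in_scope) {
  for (EscapeState es : objs_in_scope) {
    if (es != GlobalEscape) return true;
  }
  return false;
}

// True if any argument of the call site is an ArgEscape object.
bool has_arg_escape(const std::vector<EscapeState>& call_args) {
  for (EscapeState es : call_args) {
    if (es == ArgEscape) return true;
  }
  return false;
}
```

With helpers like these, the annotation loop reduces to the two calls shown in the quoted snippet and the found_..._escape_in_args flag variables disappear.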
> > Note that the target threads have to be suspended already for > VM_GetOrSetLocal*. So it's mainly the > synchronization effect of EscapeBarrier::sync_and_suspend_one() that is > required here. Also no extra > _handshake_ is executed, since sync_and_suspend_one() will find the > target threads already > suspended. > > > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM > Thread to prepare VM Operation with frame deoptimization). > > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor > which resumes the target thread. > > But I don't have any improvement proposal. Performance is probably not a > concern, here. So it's ok. > > > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it > has non-globally escaping objects and other frames if they have arg escaping > ones. Good. > > It's not specifically the top frame, but the frame that is accessed. > > > src/hotspot/share/runtime/deoptimization.cpp > > Object deoptimization. I have more comments and proposals, here. > > First of all, handling recursive and waiting locks in relock_objects is tricky, > but looks correct. > > Comments are sufficient to understand why things are done as they are > implemented. > > > BiasedLocking related parts are complex, but we may get rid of them in the > future (with BiasedLocking removal). > > Anyway, looks correct, too. > > > Typo in comment: "regularily" => "regularly" > > > Deoptimization::fetch_unroll_info_helper is the only place where > _jvmti_deferred_updates get deallocated (except JavaThread destructor). > But I think we always go through it, so I can't see a memory leak or such kind > of issues. > > That's correct. The compiled frame for which deferred updates are allocated > is always deoptimized > before (see EscapeBarrier::deoptimize_objects()). This is also asserted in > compiledVFrame::update_deferred_value(). I've added the same assertion > to > Deoptimization::relock_objects(). 
So we can be sure that > _jvmti_deferred_updates are deallocated > again in fetch_unroll_info_helper(). > > > EscapeBarrier::deoptimize_objects: ResourceMark should use > calling_thread(). > > Sure, well spotted! > > > You can use MutexLocker and MonitorLocker with Thread* to save the > Thread::current() call. > > Right, good hint. This was recently introduced with 8235678. I even had to > resolve conflicts. Should > have done this then. > > > I'd make set_objs_are_deoptimized static and remove it from the > EscapeBarrier interface because I think it shouldn't be used outside of > EscapeBarrier::deoptimize_objects. > > Done. > > > Typo in comment: "we must only deoptimize" => "we only have to > deoptimize" > > Replaced with "[...] we deoptimize iff local objects are passed as args" > > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and > barrier_active() is redundant. Implementation can get moved to hpp file. > > Ok. Done. > > > I'll get back to suspend flags, later. > > > There are weird cases regarding _self_deoptimization_in_progress. > > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. > C can set _self_deoptimization_in_progress while A performs the handshake > for suspending C. I think this doesn't lead to errors, but it's probably not > desired. > > I think it would be better to use only one "wait" call in > sync_and_suspend_one and sync_and_suspend_all. > > You're right. We've discussed that face-to-face, but couldn't find a real issue. 
> But now, thinking again, I reckon I found one: > > 2808 // Sync with other threads that might be doing deoptimizations > 2809 { > 2810 // Need to switch to _thread_blocked for the wait() call > 2811 ThreadBlockInVM tbivm(_calling_thread); > 2812 MonitorLocker ml(EscapeBarrier_lock, > Mutex::_no_safepoint_check_flag); > 2813 while (_self_deoptimization_in_progress) { > 2814 ml.wait(); > 2815 } > 2816 > 2817 if (self_deopt()) { > 2818 _self_deoptimization_in_progress = true; > 2819 } > 2820 > 2821 while (_deoptee_thread->is_ea_obj_deopt_suspend()) { > 2822 ml.wait(); > 2823 } > 2824 > 2825 if (self_deopt()) { > 2826 return; > 2827 } > 2828 > 2829 // set suspend flag for target thread > 2830 _deoptee_thread->set_ea_obj_deopt_flag(); > 2831 } > > - A waits in 2822 > - C is suspended > - B notifies all in resume_one() > - A and C wake up > - C wins over A and sets _self_deoptimization_in_progress = true in 2818 > - C does the self deoptimization > - A executes 2830 _deoptee_thread->set_ea_obj_deopt_flag() > > C will self suspend at some undefined point. The resulting state is illegal. > > > I first thought it'd be better to move ThreadBlockInVM before wait() to > reduce thread state transitions, but that seems to be problematic because > ThreadBlockInVM destructor contains a safepoint check which we shouldn't > do while holding EscapeBarrier_lock. So no change request. > > Yes, would be nice to have the state change only if needed, but for the > reason you mentioned it is > not quite as easy as it seems to be. I experimented as well with a second > lock, but did not succeed. > > > Change in thread_added: > > I think the sequence would be more comprehensive if we waited for > deopt_all_threads in Thread::start and all other places where a new thread > can run into Java code (e.g. JVMTI attach). > > Your version makes new threads come up with suspend flag set. That looks > correct, too. Advantage is that you only have to change one place > (thread_added).
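The race above comes from using two consecutive wait loops, so another thread can interleave between them. The proposed "only one wait call" can be modeled standalone with a single combined predicate. Everything below (names, flags, structure) is an invented illustration using std::condition_variable, not actual EscapeBarrier code:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model: one wait loop re-checks BOTH conditions after every
// wakeup, so no thread can act between "self deopt finished" and
// "deoptee resumed" the way A does in the scenario above.
namespace eb_model {
std::mutex lock;
std::condition_variable cond;
bool self_deopt_in_progress = true;  // models _self_deoptimization_in_progress
bool deoptee_suspended      = true;  // models is_ea_obj_deopt_suspend()
bool suspend_flag_set       = false;

bool run() {
  std::thread barrier([] {
    std::unique_lock<std::mutex> lk(lock);
    // Single combined predicate instead of two consecutive wait loops.
    cond.wait(lk, [] { return !self_deopt_in_progress && !deoptee_suspended; });
    // Safe: both conditions hold atomically under the lock.
    suspend_flag_set = true;
  });
  { std::lock_guard<std::mutex> lk(lock); self_deopt_in_progress = false; }
  cond.notify_all();  // barrier wakes, re-checks, keeps waiting
  { std::lock_guard<std::mutex> lk(lock); deoptee_suspended = false; }
  cond.notify_all();  // now the predicate holds and the barrier proceeds
  barrier.join();
  return suspend_flag_set;
}
}  // namespace eb_model
```

The point of the single predicate is that a wakeup caused by one condition changing never lets the waiter advance past the other condition unchecked.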
It'll be interesting to see what it will look like when we use > async handshakes instead of suspend flags. > > For now, I'm ok with your version. > > I had a version that did what you are suggesting. The current version also has > the advantage that > there are fewer places where a thread has to wait for ongoing object > deoptimization. This means > fewer places where you have to worry about correct thread state > transitions, possible deadlocks, > and if all oops are properly Handle'ed. > > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt- > >is_hidden_from_external_view()). > > Done. > > > Having 4 different deoptimize_objects functions makes it a little hard to > keep an overview of which one is used for what. > > Maybe adding suffixes would help a little bit, but I can also live with what > you have. > > Implementation looks correct to me. > > 2 are internal. I added the suffix _internal to them. This leaves 2 to choose > from. > > > src/hotspot/share/runtime/deoptimization.hpp > > Escape barriers and object deoptimization functions. > > Typo in comment: "helt" => "held" > > Done in place already. > > > src/hotspot/share/runtime/interfaceSupport.cpp > > InterfaceSupport::deoptimizeAllObjects() is only used for > DeoptimizeObjectsALot = 1. > > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad > to have DeoptimizeObjectsALot = 1 in addition. Ok. > > I never used DeoptimizeObjectsALot = 1 that much. It could be more > deterministic in single threaded > scenarios. I wouldn't object to get rid of it though. > > > src/hotspot/share/runtime/stackValue.hpp > > Better reinitialization in StackValue. Good. > > StackValue::obj_is_scalar_replaced() should not return true after calling > set_obj().
> > > src/hotspot/share/runtime/thread.cpp > > src/hotspot/share/runtime/thread.hpp > > src/hotspot/share/runtime/thread.inline.hpp > > wait_for_object_deoptimization, suspend flag, deferred updates and test > feature to deoptimize objects. > > > In the long term, we want to get rid of suspend flags, so it's not so nice to > introduce a new one. But I agree with Götz that it should be acceptable as > temporary solution until async handshakes are available (which takes more > time). So I'm ok with your change. > > I'm keen to build the feature on async handshakes when they arrive. > > > You can use MutexLocker with Thread*. > > Done. > > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class > out of thread.hpp. > > Done. > > > src/hotspot/share/runtime/vframe.cpp > > Added support for entry frame to new_vframe. Ok. > > > > src/hotspot/share/runtime/vframe_hp.cpp > > src/hotspot/share/runtime/vframe_hp.hpp > > > I think code()->as_nmethod() in not_global_escape_in_scope() and > arg_escape() should better be under #ifdef ASSERT or inside the assert > statement (no need for code cache walking in product build). > > Done. > > > jvmtiDeferredLocalVariableSet::update_monitors: > > Please add a comment explaining that owner referenced by original info > may be scalar replaced, but it is deoptimized in the vframe. > > Done. > > -----Original Message----- > From: Doerr, Martin > Sent: Donnerstag, 12. März 2020 17:28 > To: Reingruber, Richard ; 'Robbin Ehn' > ; Lindenmaier, Goetz > ; David Holmes ; > Vladimir Kozlov (vladimir.kozlov at oracle.com) > ; serviceability-dev at openjdk.java.net; > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance > in the Presence of JVMTI Agents > > Hi Richard, > > > I managed to find time for an (almost) complete review of webrev.4. (I'll > review the tests separately.)
> > First of all, the change seems to be of pretty good quality for its significant > complexity. I couldn't find any real bugs. But I'd like to propose minor > improvements. > I'm convinced that it's mature because we did substantial testing. > > I like the new functionality for object deoptimization. It can possibly be > reused for future escape analysis based optimizations. So I appreciate having > it available in the code base. > In addition to that, your change makes the JVMTI implementation better > integrated into the VM. > > > Now to the details: > > > src/hotspot/share/c1/c1_IR.hpp > describe_scope parameters. Ok. > > > src/hotspot/share/ci/ciEnv.cpp > src/hotspot/share/ci/ciEnv.hpp > Fix for JvmtiExport::can_walk_any_space() capability. Ok. > > > src/hotspot/share/code/compiledMethod.cpp > Nice cleanup! > > > src/hotspot/share/code/debugInfoRec.cpp > src/hotspot/share/code/debugInfoRec.hpp > Additional parameters. (Remark: I think "non_global_escape_in_scope" > would read better than "not_global_escape_in_scope", but your version is > consistent with existing code, so no change request from my side.) Ok. > > > src/hotspot/share/code/nmethod.cpp > Nice cleanup! > > > src/hotspot/share/code/pcDesc.hpp > Additional parameters. Ok. > > > src/hotspot/share/code/scopeDesc.cpp > src/hotspot/share/code/scopeDesc.hpp > Improved implementation + additional parameters. Ok. > > > src/hotspot/share/compiler/compileBroker.cpp > src/hotspot/share/compiler/compileBroker.hpp > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a > follow up change together with the test in order to make this webrev > smaller, but since it is included, I'm reviewing everything at once. Not a big > deal.) Ok. > > > src/hotspot/share/jvmci/jvmciCodeInstaller.cpp > Additional parameters. Ok. > > > src/hotspot/share/opto/c2compiler.cpp > Make do_escape_analysis independent of JVMCI capabilities. Nice!
> > > src/hotspot/share/opto/callnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/escape.cpp > Annotation for MachSafePointNodes. Your added functionality looks correct. > But I'd prefer to move the bulky code out of the large function. > I suggest to factor out something like has_not_global_escape and > has_arg_escape. So the code could look like this: > SafePointNode* sfn = sfn_worklist.at(next); > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > if (sfn->is_CallJava()) { > CallJavaNode* call = sfn->as_CallJava(); > call->set_arg_escape(has_arg_escape(call)); > } > This would also allow us to get rid of the found_..._escape_in_args variables > making the loops better readable. > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems > to be the way to do it (there are more such places). So it's ok. > > > src/hotspot/share/opto/machnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/macro.cpp > Allow elimination of non-escaping allocations. Ok. > > > src/hotspot/share/opto/matcher.cpp > src/hotspot/share/opto/output.cpp > Copy attribute / pass parameters. Ok. > > > src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp > Nice cleanup! > > > src/hotspot/share/prims/jvmtiEnv.cpp > src/hotspot/share/prims/jvmtiEnvBase.cpp > Escape barriers + deoptimize objects for target thread. Good. > > > src/hotspot/share/prims/jvmtiImpl.cpp > src/hotspot/share/prims/jvmtiImpl.hpp > The sequence is pretty complex: > VM_GetOrSetLocal element initialization executes EscapeBarrier code which > suspends the target thread (extra VM Operation). > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM > Thread to prepare VM Operation with frame deoptimization). > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which > resumes the target thread. > But I don't have any improvement proposal. Performance is probably not a > concern, here. So it's ok. 
> > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has > non-globally escaping objects and other frames if they have arg escaping > ones. Good. > > > src/hotspot/share/prims/jvmtiTagMap.cpp > Escape barriers + deoptimize objects for all threads. Ok. > > > src/hotspot/share/prims/whitebox.cpp > Added WB_IsFrameDeoptimized to API. Ok. > > > src/hotspot/share/runtime/deoptimization.cpp > Object deoptimization. I have more comments and proposals, here. > First of all, handling recursive and waiting locks in relock_objects is tricky, but > looks correct. > Comments are sufficient to understand why things are done as they are > implemented. > > BiasedLocking related parts are complex, but we may get rid of them in the > future (with BiasedLocking removal). > Anyway, looks correct, too. > > Typo in comment: "regularily" => "regularly" > > Deoptimization::fetch_unroll_info_helper is the only place where > _jvmti_deferred_updates get deallocated (except JavaThread destructor). > But I think we always go through it, so I can't see a memory leak or such kind > of issues. > > EscapeBarrier::deoptimize_objects: ResourceMark should use > calling_thread(). > > You can use MutexLocker and MonitorLocker with Thread* to save the > Thread::current() call. > > I'd make set_objs_are_deoptimized static and remove it from the > EscapeBarrier interface because I think it shouldn't be used outside of > EscapeBarrier::deoptimize_objects. > > Typo in comment: "we must only deoptimize" => "we only have to > deoptimize" > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and > barrier_active() is redundant. Implementation can get moved to hpp file. > > I'll get back to suspend flags, later. > > There are weird cases regarding _self_deoptimization_in_progress. > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C > can set _self_deoptimization_in_progress while A performs the handshake > for suspending C. 
I think this doesn't lead to errors, but it's probably not > desired. > I think it would be better to use only one "wait" call in > sync_and_suspend_one and sync_and_suspend_all. > > I first thought it'd be better to move ThreadBlockInVM before wait() to > reduce thread state transitions, but that seems to be problematic because > ThreadBlockInVM destructor contains a safepoint check which we shouldn't > do while holding EscapeBarrier_lock. So no change request. > > Change in thread_added: > I think the sequence would be more comprehensive if we waited for > deopt_all_threads in Thread::start and all other places where a new thread > can run into Java code (e.g. JVMTI attach). > Your version makes new threads come up with suspend flag set. That looks > correct, too. Advantage is that you only have to change one place > (thread_added). It'll be interesting to see what it will look like when we use > async handshakes instead of suspend flags. > For now, I'm ok with your version. > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt- > >is_hidden_from_external_view()). > > Having 4 different deoptimize_objects functions makes it a little hard to keep > an overview of which one is used for what. > Maybe adding suffixes would help a little bit, but I can also live with what you > have. > Implementation looks correct to me. > > > src/hotspot/share/runtime/deoptimization.hpp > Escape barriers and object deoptimization functions. > Typo in comment: "helt" => "held" > > > src/hotspot/share/runtime/globals.hpp > Addition of develop flag DeoptimizeObjectsALotInterval. Ok. > > > src/hotspot/share/runtime/interfaceSupport.cpp > InterfaceSupport::deoptimizeAllObjects() is only used for > DeoptimizeObjectsALot = 1. > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad > to have DeoptimizeObjectsALot = 1 in addition. Ok. > > > src/hotspot/share/runtime/interfaceSupport.inline.hpp > Addition of deoptimizeAllObjects. Ok.
> > > src/hotspot/share/runtime/mutexLocker.cpp > src/hotspot/share/runtime/mutexLocker.hpp > Addition of EscapeBarrier_lock. Ok. > > > src/hotspot/share/runtime/objectMonitor.cpp > Make recursion count relock aware. Ok. > > > src/hotspot/share/runtime/stackValue.hpp > Better reinitialization in StackValue. Good. > > > src/hotspot/share/runtime/thread.cpp > src/hotspot/share/runtime/thread.hpp > src/hotspot/share/runtime/thread.inline.hpp > wait_for_object_deoptimization, suspend flag, deferred updates and test > feature to deoptimize objects. > > In the long term, we want to get rid of suspend flags, so it's not so nice to > introduce a new one. But I agree with Götz that it should be acceptable as > temporary solution until async handshakes are available (which takes more > time). So I'm ok with your change. > > You can use MutexLocker with Thread*. > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out > of thread.hpp. > > > src/hotspot/share/runtime/vframe.cpp > Added support for entry frame to new_vframe. Ok. > > > src/hotspot/share/runtime/vframe_hp.cpp > src/hotspot/share/runtime/vframe_hp.hpp > > I think code()->as_nmethod() in not_global_escape_in_scope() and > arg_escape() should better be under #ifdef ASSERT or inside the assert > statement (no need for code cache walking in product build). > > jvmtiDeferredLocalVariableSet::update_monitors: > Please add a comment explaining that owner referenced by original info may > be scalar replaced, but it is deoptimized in the vframe. > > > src/hotspot/share/utilities/macros.hpp > Addition of NOT_COMPILER2_OR_JVMCI_RETURN macros. Ok. > > > test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysi > sEnabled.java > test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnal > ysisEnabled.c > New test. Will review separately. > > > test/jdk/TEST.ROOT > Addition of vm.jvmci as required property. Ok.
> > > test/jdk/com/sun/jdi/EATests.java > test/jdk/com/sun/jdi/EATestsJVMCI.java > New test. Will review separately. > > > test/lib/sun/hotspot/WhiteBox.java > Added isFrameDeoptimized to API. Ok. > > > That was it. Best regards, > Martin > > > > -----Original Message----- > > From: hotspot-compiler-dev > bounces at openjdk.java.net> On Behalf Of Reingruber, Richard > > Sent: Dienstag, 3. März 2020 21:23 > > To: 'Robbin Ehn' ; Lindenmaier, Goetz > > ; David Holmes > ; > > Vladimir Kozlov (vladimir.kozlov at oracle.com) > > ; serviceability-dev at openjdk.java.net; > > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > > dev at openjdk.java.net > > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance in the Presence of JVMTI Agents > > > > Hi Robbin, > > > > > > I understand that Robbin proposed to replace the usage of > > > > _suspend_flag with handshakes. Apparently, async handshakes > > > > are needed to do so. We have been waiting a while for removal > > > > of the _suspend_flag / introduction of async handshakes [2]. > > > > What is the status here? > > > > > I have an old prototype which I would like to continue to work on. > > > So do not assume async handshakes will make 15. > > > Even if it would, I think there is a lot more investigative work to remove > > > _suspend_flag. > > > > Let us know if we can be of any help to you, be it only testing. > > > > > >> Full: > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ > > > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > > You can move both declaration and definition to that file, no need to > > clobber > > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > > > Will do. > > > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in > its > > own > > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > > > You are right. It shouldn't be declared in thread.hpp.
I will look into that. > > > > > Note that we also think we may have a bug in deopt: > > > https://bugs.openjdk.java.net/browse/JDK-8238237 > > > > > I think it would be best, if possible, to push after that is resolved. > > > > Sure. > > > > > Not even nearly a full review :) > > > > I know :) > > > > Anyways, thanks a lot, > > Richard. > > > > > > -----Original Message----- > > From: Robbin Ehn > > Sent: Monday, March 2, 2020 11:17 AM > > To: Lindenmaier, Goetz ; Reingruber, > Richard > > ; David Holmes > ; > > Vladimir Kozlov (vladimir.kozlov at oracle.com) > > ; serviceability-dev at openjdk.java.net; > > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > > dev at openjdk.java.net > > Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance > > in the Presence of JVMTI Agents > > > > Hi, > > > > On 2/24/20 5:39 PM, Lindenmaier, Goetz wrote: > > > Hi, > > > > > > I had a look at the progress of this change. Nothing > > > happened since Richard posted his update using more > > > handshakes [1]. > > > But we (SAP) would appreciate a lot if this change could > > > be successfully reviewed and pushed. > > > > > > I think there is basic understanding that this > > > change is helpful. It fixes a number of issues with JVMTI, > > > and will deliver the same performance benefits as EA > > > does in current production mode for debugging scenarios. > > > > > > This is important for us as we run our VMs prepared > > > for debugging in production mode. > > > > > > I understand that Robbin proposed to replace the usage of > > > _suspend_flag with handshakes. Apparently, async handshakes > > > are needed to do so. We have been waiting a while for removal > > > of the _suspend_flag / introduction of async handshakes [2]. > > > What is the status here? > > > > I have an old prototype which I would like to continue to work on. > > So do not assume asynch handshakes will make 15. 
> > Even if it would, I think there is a lot more investigative work to remove > > _suspend_flag. > > > > > > > > I think we should no longer wait, but proceed with > > > this change. We will look into removing the usage of > > > suspend_flag introduced here once it is possible to implement > > > it with handshakes. > > > > Yes, sure. > > > > >> Full: > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ > > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > You can move both declaration and definition to that file, no need to > clobber > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its > > own > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > > > Note that we also think we may have a bug in deopt: > > https://bugs.openjdk.java.net/browse/JDK-8238237 > > > > I think it would be best, if possible, to push after that is resolved. > > > > Not even nearly a full review :) > > > > Thanks, Robbin > > > > > > >> Incremental: > > >> > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4.inc/ > > >> > > >> I was not able to eliminate the additional suspend flag now. I'll take care > > of this > > >> as soon as the > > >> existing suspend-resume mechanism is reworked. > > >> > > >> Testing: > > >> > > >> Nightly tests @SAP: > > >> > > >> JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015, > > Renaissance > > >> Suite, SAP specific tests > > >> with fastdebug and release builds on all platforms > > >> > > >> Stress testing with DeoptimizeObjectsALot running SPECjvm2008 40x > > parallel > > >> for 24h > > >> > > >> Thanks, Richard. > > >> > > >> > > >> More details on the changes: > > >> > > >> * Hide DeoptimizeObjectsALotThread from external view. > > >> > > >> * Changed EscapeBarrier_lock to be a _safepoint_check_never lock. 
> > >> It used to be _safepoint_check_sometimes, which will be eliminated > > sooner or > > >> later. > > >> I added explicit thread state changes with ThreadBlockInVM to code > > paths > > >> where we can wait() > > >> on EscapeBarrier_lock to become safepoint safe. > > >> > > >> * Use handshake EscapeBarrierSuspendHandshake to suspend target > > threads > > >> instead of vm operation > > >> VM_ThreadSuspendAllForObjDeopt. > > >> > > >> * Removed uses of Threads_lock. When adding a new thread we > suspend > > it iff > > >> EA optimizations are > > >> being reverted. In the previous version we were waiting on > > Threads_lock > > >> while EA optimizations > > >> were reverted. See EscapeBarrier::thread_added(). > > >> > > >> * Made tests require Xmixed compilation mode. > > >> > > >> * Made tests agnostic regarding tiered compilation. > > >> I.e. tc isn't disabled anymore, and the tests can be run with tc enabled > or > > >> disabled. > > >> > > >> * Exercising EATests.java as well with stress test options > > >> DeoptimizeObjectsALot* > > >> Due to the non-deterministic deoptimizations some tests need to be > > skipped. > > >> We do this to prevent bit-rot of the stress test code. > > >> > > >> * Executing EATests.java as well with graal if available. Driver for this is > > >> EATestsJVMCI.java. Graal cannot pass all tests, because it does not > > provide all > > >> the new debug info > > >> (namely not_global_escape_in_scope and arg_escape in > > scopeDesc.hpp). > > >> And graal does not yet support the JVMTI operations force early > return > > and > > >> pop frame. > > >> > > >> * Removed tracing from new jdi tests in EATests.java. Too much trace > > output > > >> before the debugging > > >> connection is established can cause deadlock because output buffers > fill > > up. 
> > >> (See https://bugs.openjdk.java.net/browse/JDK-8173304) > > >> > > >> * Many copyright year changes and smaller clean-up changes of testing > > code > > >> (trailing white-space and > > >> the like). > > >> > > >> > > >> -----Original Message----- > > >> From: David Holmes > > >> Sent: Donnerstag, 19. Dezember 2019 03:12 > > >> To: Reingruber, Richard ; serviceability- > > >> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; > > hotspot- > > >> runtime-dev at openjdk.java.net; Vladimir Kozlov > > (vladimir.kozlov at oracle.com) > > >> > > >> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance in > > >> the Presence of JVMTI Agents > > >> > > >> Hi Richard, > > >> > > >> I think my issue is with the way EliminateNestedLocks works so I'm going > > >> to look into that more deeply. > > >> > > >> Thanks for the explanations. > > >> > > >> David > > >> > > >> On 18/12/2019 12:47 am, Reingruber, Richard wrote: > > >>> Hi David, > > >>> > > >>> > > > Some further queries/concerns: > > >>> > > > > > >>> > > > src/hotspot/share/runtime/objectMonitor.cpp > > >>> > > > > > >>> > > > Can you please explain the changes to ObjectMonitor::wait: > > >>> > > > > > >>> > > > ! _recursions = save // restore the old recursion count > > >>> > > > ! + jt->get_and_reset_relock_count_after_wait(); // > > >>> > > > increased by the deferred relock count > > >>> > > > > > >>> > > > what is the "deferred relock count"? I gather it relates to > > >>> > > > > > >>> > > > "The code was extended to be able to deoptimize objects of a > > >>> > > frame that > > >>> > > > is not the top frame and to let another thread than the > owning > > >>> > > thread do > > >>> > > > it." > > >>> > > > > >>> > > Yes, these relate. Currently EA based optimizations are reverted, > > when a > > >> compiled frame is > > >>> > > replaced with corresponding interpreter frames. Part of this is > > relocking > > >> objects with eliminated > > >>> > > locking. 
New with the enhancement is that we do this also just > > before > > >> object references are > > >>> > > acquired through JVMTI. In this case we deoptimize also the > > owning > > >> compiled frame C and we > > >>> > > register deoptimized objects as deferred updates. When control > > returns > > >> to C it gets deoptimized, > > >>> > > we notice that objects are already deoptimized (reallocated and > > >> relocked), so we don't do it again > > >>> > > (relocking twice would be incorrect of course). Deferred updates > > are > > >> copied into the new > > >>> > > interpreter frames. > > >>> > > > > >>> > > Problem: relocking is not possible if the target thread T is waiting > > on the > > >> monitor that needs to > > >>> > > be relocked. This happens only with non-local objects with > > >> EliminateNestedLocks. Instead relocking > > >>> > > is deferred until T owns the monitor again. This is what the piece > of > > >> code above does. > > >>> > > > >>> > Sorry I need some more detail here. How can you wait() on an > > object > > >>> > monitor if the object allocation and/or locking was optimised > away? > > And > > >>> > what is a "non-local object" in this context? Isn't EA restricted to > > >>> > thread-confined objects? > > >>> > > >>> "Non-local object" is an object that escapes its thread. The issue I'm > > >> addressing with the changes > > >>> in ObjectMonitor::wait are almost unrelated to EA. They are caused by > > >> EliminateNestedLocks, where C2 > > >>> eliminates recursive locking of an already owned lock. The lock owning > > object > > >> exists on the heap, it > > >>> is locked and you can call wait() on it. > > >>> > > >>> EliminateLocks is the C2 option that controls lock elimination based on > > EA. > > >> Both optimizations have > > >>> in common that objects with eliminated locking need to be relocked > > when > > >> deoptimizing a frame, > > >>> i.e. when replacing a compiled frame with equivalent interpreter > > >>> frames. 
Deoptimization::relock_objects does that job for /all/ > eliminated > > >> locks in scope. /All/ can > > >>> be a mix of eliminated nested locks and locks of not-escaping objects. > > >>> > > >>> New with the enhancement: I call relock_objects earlier, just before > > objects > > >> potentially > > >>> escape. But then later when the owning compiled frame gets > > deoptimized, I > > >> must not do it again: > > >>> > > >>> See call to EscapeBarrier::objs_are_deoptimized in > deoptimization.cpp: > > >>> > > >>> 373 if ((jvmci_enabled || ((DoEscapeAnalysis || > > EliminateNestedLocks) && > > >> EliminateLocks)) > > >>> 374 && !EscapeBarrier::objs_are_deoptimized(thread, > > deoptee.id())) { > > >>> 375 bool unused; > > >>> 376 eliminate_locks(thread, chunk, realloc_failures, deoptee, > > exec_mode, > > >> unused); > > >>> 377 } > > >>> > > >>> Now when calling relock_objects early it is quite possible that I have to > > relock > > >> an object the > > >>> target thread currently waits for. Obviously I cannot relock in this case, > > >> instead I chose to > > >>> introduce relock_count_after_wait to JavaThread. > > >>> > > >>> > Is it just that some of the locking gets optimized away e.g. > > >>> > > > >>> > synchronised(obj) { > > >>> > synchronised(obj) { > > >>> > synchronised(obj) { > > >>> > obj.wait(); > > >>> > } > > >>> > } > > >>> > } > > >>> > > > >>> > If this is reduced to a form as-if it were a single lock of the monitor > > >>> > (due to EA) and the wait() triggers a JVM TI event which leads to > the > > >>> > escape of "obj" then we need to reconstruct the true lock state, > and > > so > > >>> > when the wait() internally unblocks and reacquires the monitor it > > has to > > >>> > set the true recursion count to 3, not the 1 that it appeared to be > > when > > >>> > wait() was initially called. Is that the scenario? > > >>> > > >>> Kind of... 
except that the locking is not eliminated due to EA and there > is > > no > > >> JVM TI event > > >>> triggered by wait. > > >>> > > >>> Add > > >>> > > >>> LocalObject l1 = new LocalObject(); > > >>> > > >>> in front of the synchronized blocks and assume a JVM TI agent acquires > l1. > > This > > >> triggers the code in > > >>> question. > > >>> > > >>> See that relocking/reallocating is transactional. If it is done then for > /all/ > > >> objects in scope and it is > > >>> done at most once. It wouldn't be quite so easy to split this in relocking > > of > > >> nested/EA-based > > >>> eliminated locks. > > >>> > > >>> > If so I find this truly awful. Anyone using wait() in a realistic form > > >>> > requires a notification and so the object cannot be thread > confined. > > In > > >>> > > >>> It is not thread confined. > > >>> > > >>> > which case I would strongly argue that upon hitting the wait() the > > deopt > > >>> > should occur unconditionally and so the lock state is correct before > > we > > >>> > wait and so we don't need to mess with the recursion count > > internally > > >>> > when we reacquire the monitor. > > >>> > > > >>> > > > > >>> > > > which I don't like the sound of at all when it comes to > > ObjectMonitor > > >>> > > > state. So I'd like to understand in detail exactly what is going > on > > here > > >>> > > > and why. This is a very intrusive change that seems to badly > > break > > >>> > > > encapsulation and impacts future changes to ObjectMonitor > > that are > > >> under > > >>> > > > investigation. > > >>> > > > > >>> > > I would not regard this as breaking encapsulation. Certainly not > > badly. > > >>> > > > > >>> > > I've added a property relock_count_after_wait to JavaThread. > The > > >> property is well > > >>> > > encapsulated. Future ObjectMonitor implementations have to > deal > > with > > >> recursion too. They are free > > >>> > > in choosing a way to do that as long as that property is taken into > > >> account. 
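The nested-locking scenario discussed above can be reproduced at the Java level with a small standalone program. This is only an illustration of the monitor semantics (the class and method names are made up, and whether C2 actually eliminates the inner recursive acquisitions depends on EliminateNestedLocks and on the method getting compiled): when wait() returns, the recursion count must be 3 again, which is exactly what the deferred relock count restores when relocking had to be postponed.

```java
// Sketch of the nested-locking wait() scenario from the discussion.
// The waiter acquires the same monitor three times, then calls wait(),
// which releases the monitor completely and must reacquire it with
// recursion count 3 on return. A driver thread provides the required
// notification. All names here are hypothetical, for illustration only.
public class NestedWait {
    static final Object obj = new Object();
    static boolean waiting = false;

    // Runs the waiter/notifier pair; returns "ok" once the waiter has
    // reacquired and released the monitor after wait().
    static String runScenario() throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (obj) {
                synchronized (obj) {         // recursive acquisition 2
                    synchronized (obj) {     // recursive acquisition 3
                        try {
                            waiting = true;
                            obj.notifyAll(); // let the driver proceed
                            obj.wait();      // releases the monitor fully;
                                             // reacquires with count 3
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }
            }
        });
        waiter.start();
        synchronized (obj) {
            while (!waiting) {
                obj.wait();                  // until the waiter is in wait()
            }
            obj.notifyAll();                 // wake the waiter
        }
        waiter.join();
        return "ok";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runScenario()); // prints: ok
    }
}
```

Note the handshake via the `waiting` flag: the waiter cannot be notified before it is actually in wait(), because it sets the flag and calls wait() without releasing the monitor in between.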
This is hardly a > > >>> > > limitation. > > >>> > > >>> > I do think this badly breaks encapsulation as you have to add a > > callout > > >>> > from the guts of the ObjectMonitor code to reach into the thread > to > > get > > >>> > this lock count adjustment. I understand why you have had to do > > this but > > >>> > I would much rather see a change to the EA optimisation strategy > so > > that > > >>> > this is not needed. > > >>> > > > >>> > > Note also that the property is a straightforward extension of the > > >> existing concept of deferred > > >>> > > local updates. It is embedded into the structure holding them. So > > not > > >> even the footprint of a > > >>> > > JavaThread is enlarged if no deferred updates are generated. > > >>> > > > >>> > [...] > > >>> > > > >>> > > > > >>> > > I'm actually duplicating the existing external suspend mechanism, > > >> because a thread can be > > >>> > > suspended at most once. And hey, I don't like that either! But > it > > >> seems not unlikely that the > > >>> > > duplicate can be removed together with the original and the new > > type > > >> of handshakes that will be > > >>> > > used for thread suspend can be used for object deoptimization > > too. See > > >> today's discussion in > > >>> > > JDK-8227745 [2]. > > >>> > > > >>> > I hope that discussion bears some fruit, at the moment it seems > not > > to > > >>> > be possible to use handshakes here. :( > > >>> > > > >>> > The external suspend mechanism is a royal pain in the proverbial > > that we > > >>> > have to carefully live with. The idea that we're duplicating that for > > >>> > use in another fringe area of functionality does not thrill me at all. > > >>> > > > >>> > To be clear, I understand the problem that exists and that you > wish > > to > > >>> > solve, but for the runtime parts I balk at the complexity cost of > > >>> > solving it. > > >>> > > >>> I know it's complex, but by far no rocket science. 
> > >>> > > >>> Also I find it hard to imagine another fix for JDK-8233915 besides > > changing > > >> the JVM TI specification. > > >>> > > >>> Thanks, Richard. > > >>> > > >>> -----Original Message----- > > >>> From: David Holmes > > >>> Sent: Dienstag, 17. Dezember 2019 08:03 > > >>> To: Reingruber, Richard ; serviceability- > > >> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; > > hotspot- > > >> runtime-dev at openjdk.java.net; Vladimir Kozlov > > (vladimir.kozlov at oracle.com) > > >> > > >>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance > > >> in the Presence of JVMTI Agents > > >>> > > >>> > > >>> > > >>> David > > >>> > > >>> On 17/12/2019 4:57 pm, David Holmes wrote: > > >>>> Hi Richard, > > >>>> > > >>>> On 14/12/2019 5:01 am, Reingruber, Richard wrote: > > >>>>> Hi David, > > >>>>> > > >>>>> > Some further queries/concerns: > > >>>>> > > > >>>>> > src/hotspot/share/runtime/objectMonitor.cpp > > >>>>> > > > >>>>> > Can you please explain the changes to ObjectMonitor::wait: > > >>>>> > > > >>>>> > ! _recursions = save // restore the old recursion count > > >>>>> > ! + jt->get_and_reset_relock_count_after_wait(); // > > >>>>> > increased by the deferred relock count > > >>>>> > > > >>>>> > what is the "deferred relock count"? I gather it relates to > > >>>>> > > > >>>>> > "The code was extended to be able to deoptimize objects of a > > >>>>> frame that > > >>>>> > is not the top frame and to let another thread than the owning > > >>>>> thread do > > >>>>> > it." > > >>>>> > > >>>>> Yes, these relate. Currently EA based optimizations are reverted, > > when > > >>>>> a compiled frame is replaced > > >>>>> with corresponding interpreter frames. Part of this is relocking > > >>>>> objects with eliminated > > >>>>> locking. 
New with the enhancement is that we do this also just > before > > >>>>> object references are acquired > > >>>>> through JVMTI. In this case we deoptimize also the owning compiled > > >>>>> frame C and we register > > >>>>> deoptimized objects as deferred updates. When control returns to > C > > it > > >>>>> gets deoptimized, we notice > > >>>>> that objects are already deoptimized (reallocated and relocked), so > > we > > >>>>> don't do it again (relocking > > >>>>> twice would be incorrect of course). Deferred updates are copied > into > > >>>>> the new interpreter frames. > > >>>>> > > >>>>> Problem: relocking is not possible if the target thread T is waiting > > >>>>> on the monitor that needs to be > > >>>>> relocked. This happens only with non-local objects with > > >>>>> EliminateNestedLocks. Instead relocking is > > >>>>> deferred until T owns the monitor again. This is what the piece of > > >>>>> code above does. > > >>>> > > >>>> Sorry I need some more detail here. How can you wait() on an object > > >>>> monitor if the object allocation and/or locking was optimised away? > > And > > >>>> what is a "non-local object" in this context? Isn't EA restricted to > > >>>> thread-confined objects? > > >>>> > > >>>> Is it just that some of the locking gets optimized away e.g. > > >>>> > > >>>> synchronised(obj) { > > >>>> synchronised(obj) { > > >>>> synchronised(obj) { > > >>>> obj.wait(); > > >>>> } > > >>>> } > > >>>> } > > >>>> > > >>>> If this is reduced to a form as-if it were a single lock of the monitor > > >>>> (due to EA) and the wait() triggers a JVM TI event which leads to the > > >>>> escape of "obj" then we need to reconstruct the true lock state, and > so > > >>>> when the wait() internally unblocks and reacquires the monitor it has > to > > >>>> set the true recursion count to 3, not the 1 that it appeared to be > when > > >>>> wait() was initially called. Is that the scenario? > > >>>> > > >>>> If so I find this truly awful. 
Anyone using wait() in a realistic form > > >>>> requires a notification and so the object cannot be thread confined. > In > > >>>> which case I would strongly argue that upon hitting the wait() the > > deopt > > >>>> should occur unconditionally and so the lock state is correct before > we > > >>>> wait and so we don't need to mess with the recursion count internally > > >>>> when we reacquire the monitor. > > >>>> > > >>>>> > > >>>>> > which I don't like the sound of at all when it comes to > > >>>>> ObjectMonitor > > >>>>> > state. So I'd like to understand in detail exactly what is going > > >>>>> on here > > >>>>> > and why. This is a very intrusive change that seems to badly > > break > > >>>>> > encapsulation and impacts future changes to ObjectMonitor > that > > >>>>> are under > > >>>>> > investigation. > > >>>>> > > >>>>> I would not regard this as breaking encapsulation. Certainly not > badly. > > >>>>> > > >>>>> I've added a property relock_count_after_wait to JavaThread. The > > >>>>> property is well > > >>>>> encapsulated. Future ObjectMonitor implementations have to deal > > with > > >>>>> recursion too. They are free in > > >>>>> choosing a way to do that as long as that property is taken into > > >>>>> account. This is hardly a > > >>>>> limitation. > > >>>> > > >>>> I do think this badly breaks encapsulation as you have to add a callout > > >>>> from the guts of the ObjectMonitor code to reach into the thread to > > get > > >>>> this lock count adjustment. I understand why you have had to do this > > but > > >>>> I would much rather see a change to the EA optimisation strategy so > > that > > >>>> this is not needed. > > >>>> > > >>>>> Note also that the property is a straightforward extension of the > > >>>>> existing concept of deferred > > >>>>> local updates. It is embedded into the structure holding them. So > not > > >>>>> even the footprint of a > > >>>>> JavaThread is enlarged if no deferred updates are generated. 
> > >>>>> > > >>>>> > --- > > >>>>> > > > >>>>> > src/hotspot/share/runtime/thread.cpp > > >>>>> > > > >>>>> > Can you please explain why > > >>>>> JavaThread::wait_for_object_deoptimization > > >>>>> > has to be handcrafted in this way rather than using proper > > >>>>> transitions. > > >>>>> > > > >>>>> > > >>>>> I wrote wait_for_object_deoptimization taking > > >>>>> JavaThread::java_suspend_self_with_safepoint_check > > >>>>> as template. So in short: for the same reasons :) > > >>>>> > > >>>>> Threads reach both methods as part of thread state transitions, > > >>>>> therefore special handling is > > >>>>> required to change thread state on top of ongoing transitions. > > >>>>> > > >>>>> > We got rid of "deopt suspend" some time ago and it is > disturbing > > >>>>> to see > > >>>>> > it being added back (effectively). This seems like it may be > > >>>>> something > > >>>>> > that handshakes could be used for. > > >>>>> > > >>>>> Deopt suspend used to be something rather different with a similar > > >>>>> name[1]. It is not being added back. > > >>>> > > >>>> I stand corrected. Despite comments in the code to the contrary > > >>>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot > of > > >>>> cleanup in this area 13 years ago :) > > >>>> > > >>>>> > > >>>>> I'm actually duplicating the existing external suspend mechanism, > > >>>>> because a thread can be suspended > > >>>>> at most once. And hey, I don't like that either! But it seems not > > >>>>> unlikely that the duplicate can > > >>>>> be removed together with the original and the new type of > > handshakes > > >>>>> that will be used for > > >>>>> thread suspend can be used for object deoptimization too. See > > today's > > >>>>> discussion in JDK-8227745 [2]. > > >>>> > > >>>> I hope that discussion bears some fruit, at the moment it seems not > to > > >>>> be possible to use handshakes here. 
:( > > >>>> > > >>>> The external suspend mechanism is a royal pain in the proverbial that > > we > > >>>> have to carefully live with. The idea that we're duplicating that for > > >>>> use in another fringe area of functionality does not thrill me at all. > > >>>> > > >>>> To be clear, I understand the problem that exists and that you wish to > > >>>> solve, but for the runtime parts I balk at the complexity cost of > > >>>> solving it. > > >>>> > > >>>> Thanks, > > >>>> David > > >>>> ----- > > >>>> > > >>>>> Thanks, Richard. > > >>>>> > > >>>>> [1] Deopt suspend was something like an async. handshake for > > >>>>> architectures with register windows, > > >>>>> where patching the return pc for deoptimization of a compiled > > >>>>> frame was racy if the owner thread > > >>>>> was in native code. Instead a "deopt" suspend flag was set on > > >>>>> which the thread patched its own > > >>>>> frame upon return from native. So no thread was suspended. It > > got > > >>>>> its name only from the name of > > >>>>> the flags. > > >>>>> > > >>>>> [2] Discussion about using handshakes to sync. with the target > thread: > > >>>>> > > >>>>> https://bugs.openjdk.java.net/browse/JDK- > > >> > > > 8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.syst > > e > > >> m.issuetabpanels:comment-tabpanel#comment-14306727 > > >>>>> > > >>>>> > > >>>>> -----Original Message----- > > >>>>> From: David Holmes > > >>>>> Sent: Freitag, 13. 
Dezember 2019 00:56 > > >>>>> To: Reingruber, Richard ; > > >>>>> serviceability-dev at openjdk.java.net; > > >>>>> hotspot-compiler-dev at openjdk.java.net; > > >>>>> hotspot-runtime-dev at openjdk.java.net > > >>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > >>>>> Performance in the Presence of JVMTI Agents > > >>>>> > > >>>>> Hi Richard, > > >>>>> > > >>>>> Some further queries/concerns: > > >>>>> > > >>>>> src/hotspot/share/runtime/objectMonitor.cpp > > >>>>> > > >>>>> Can you please explain the changes to ObjectMonitor::wait: > > >>>>> > > >>>>> ! _recursions = save // restore the old recursion count > > >>>>> ! + jt->get_and_reset_relock_count_after_wait(); // > > >>>>> increased by the deferred relock count > > >>>>> > > >>>>> what is the "deferred relock count"? I gather it relates to > > >>>>> > > >>>>> "The code was extended to be able to deoptimize objects of a > frame > > that > > >>>>> is not the top frame and to let another thread than the owning > thread > > do > > >>>>> it." > > >>>>> > > >>>>> which I don't like the sound of at all when it comes to ObjectMonitor > > >>>>> state. So I'd like to understand in detail exactly what is going on here > > >>>>> and why. This is a very intrusive change that seems to badly break > > >>>>> encapsulation and impacts future changes to ObjectMonitor that > are > > under > > >>>>> investigation. > > >>>>> > > >>>>> --- > > >>>>> > > >>>>> src/hotspot/share/runtime/thread.cpp > > >>>>> > > >>>>> Can you please explain why > > JavaThread::wait_for_object_deoptimization > > >>>>> has to be handcrafted in this way rather than using proper > transitions. > > >>>>> > > >>>>> We got rid of "deopt suspend" some time ago and it is disturbing to > > see > > >>>>> it being added back (effectively). This seems like it may be > something > > >>>>> that handshakes could be used for. 
> > >>>>> > > >>>>> Thanks, > > >>>>> David > > >>>>> ----- > > >>>>> > > >>>>> On 12/12/2019 7:02 am, David Holmes wrote: > > >>>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote: > > >>>>>>> Hi David, > > >>>>>>> > > >>>>>>> > Most of the details here are in areas I can comment on in > > detail, > > >>>>>>> but I > > >>>>>>> > did take an initial general look at things. > > >>>>>>> > > >>>>>>> Thanks for taking the time! > > >>>>>> > > >>>>>> Apologies the above should read: > > >>>>>> > > >>>>>> "Most of the details here are in areas I *can't* comment on in > detail > > >>>>>> ..." > > >>>>>> > > >>>>>> David > > >>>>>> > > >>>>>>> > The only thing that jumped out at me is that I think the > > >>>>>>> > DeoptimizeObjectsALotThread should be a hidden thread. > > >>>>>>> > > > >>>>>>> > + bool is_hidden_from_external_view() const { return true; > } > > >>>>>>> > > >>>>>>> Yes, it should. Will add the method like above. > > >>>>>>> > > >>>>>>> > Also I don't see any testing of the > > DeoptimizeObjectsALotThread. > > >>>>>>> Without > > >>>>>>> > active testing this will just bit-rot. > > >>>>>>> > > >>>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger > > >>>>>>> workload. I will add a minimal test > > >>>>>>> to keep it fresh. > > >>>>>>> > > >>>>>>> > Also on the tests I don't understand your @requires clause: > > >>>>>>> > > > >>>>>>> > @requires ((vm.compMode != "Xcomp") & > > vm.compiler2.enabled > > >> & > > >>>>>>> > (vm.opt.TieredCompilation != true)) > > >>>>>>> > > > >>>>>>> > This seems to require that TieredCompilation is disabled, but > > >>>>>>> tiered is > > >>>>>>> > our normal mode of operation. ?? > > >>>>>>> > > > >>>>>>> > > >>>>>>> I removed the clause. 
I guess I wanted to target the tests towards > > the > > >>>>>>> code they are supposed to > > >>>>>>> test, and it's easier to analyze failures w/o tiered compilation and > > >>>>>>> with just one compiler thread. > > >>>>>>> > > >>>>>>> Additionally I will make use of > > >>>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the > tests. > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Richard. > > >>>>>>> > > >>>>>>> -----Original Message----- > > >>>>>>> From: David Holmes > > >>>>>>> Sent: Mittwoch, 11. Dezember 2019 08:03 > > >>>>>>> To: Reingruber, Richard ; > > >>>>>>> serviceability-dev at openjdk.java.net; > > >>>>>>> hotspot-compiler-dev at openjdk.java.net; > > >>>>>>> hotspot-runtime-dev at openjdk.java.net > > >>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > >>>>>>> Performance in the Presence of JVMTI Agents > > >>>>>>> > > >>>>>>> Hi Richard, > > >>>>>>> > > >>>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote: > > >>>>>>>> Hi, > > >>>>>>>> > > >>>>>>>> I would like to get reviews please for > > >>>>>>>> > > >>>>>>>> > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/ > > >>>>>>>> > > >>>>>>>> Corresponding RFE: > > >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745 > > >>>>>>>> > > >>>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915 > > >>>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK- > > 8214584 [1] > > >>>>>>>> > > >>>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing > > without > > >>>>>>>> issues (thanks!). In addition the > > >>>>>>>> change is being tested at SAP since I posted the first RFR some > > >>>>>>>> months ago. > > >>>>>>>> > > >>>>>>>> The intention of this enhancement is to benefit performance > wise > > from > > >>>>>>>> escape analysis even if JVMTI > > >>>>>>>> agents request capabilities that allow them to access local > variable > > >>>>>>>> values. E.g. 
if you start-up > > >>>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, > > then > > >>>>>>>> escape analysis is disabled right > > >>>>>>>> from the beginning, well before a debugger attaches -- if ever > one > > >>>>>>>> should do so. With the > > >>>>>>>> enhancement, escape analysis will remain enabled until and > after > > a > > >>>>>>>> debugger attaches. EA based > > >>>>>>>> optimizations are reverted just before an agent acquires the > > >>>>>>>> reference to an object. In the JBS item > > >>>>>>>> you'll find more details. > > >>>>>>> > > >>>>>>> Most of the details here are in areas I can comment on in detail, > but > > I > > >>>>>>> did take an initial general look at things. > > >>>>>>> > > >>>>>>> The only thing that jumped out at me is that I think the > > >>>>>>> DeoptimizeObjectsALotThread should be a hidden thread. > > >>>>>>> > > >>>>>>> + bool is_hidden_from_external_view() const { return true; } > > >>>>>>> > > >>>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread. > > >>>>>>> Without > > >>>>>>> active testing this will just bit-rot. > > >>>>>>> > > >>>>>>> Also on the tests I don't understand your @requires clause: > > >>>>>>> > > >>>>>>> @requires ((vm.compMode != "Xcomp") & > > vm.compiler2.enabled & > > >>>>>>> (vm.opt.TieredCompilation != true)) > > >>>>>>> > > >>>>>>> This seems to require that TieredCompilation is disabled, but > tiered > > is > > >>>>>>> our normal mode of operation. ?? > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> David > > >>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Richard. 
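The kind of code that benefits from keeping escape analysis enabled under a JVMTI agent is, for example, synchronization on an object that never escapes its method. The sketch below is only an illustration (class and method names are made up, and whether C2 actually scalar-replaces the object and elides the lock depends on DoEscapeAnalysis/EliminateLocks and on the method getting compiled):

```java
// A lock on a non-escaping object: with escape analysis C2 can scalar
// replace 'box' and eliminate the synchronization entirely. Before the
// enhancement discussed here, merely starting the VM with a JDWP agent
// disabled the optimization; with it, the optimization stays enabled
// and is only reverted when an agent acquires a reference to such an
// object. All names are hypothetical, for illustration only.
public class LocalLock {
    static int sum(int[] a) {
        Object box = new Object();   // never escapes sum()
        int s = 0;
        synchronized (box) {         // candidate for lock elimination
            for (int v : a) {
                s += v;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4})); // prints: 10
    }
}
```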
> > >>>>>>>> > > >>>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745 > > >>>>>>>> > > >> > > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.pa > > tc > > >> h > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> From richard.reingruber at sap.com Wed Apr 1 06:19:10 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 1 Apr 2020 06:19:10 +0000 Subject: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents In-Reply-To: <0a07f87e-ede1-edbd-c754-e7df884e0545@oracle.com> References: <1f8a3c7a-fa0f-b5b2-4a8a-7d3d8dbbe1b5@oracle.com> <4b56a45c-a14c-6f74-2bfd-25deaabe8201@oracle.com> <5271429a-481d-ddb9-99dc-b3f6670fcc0b@oracle.com> <0a07f87e-ede1-edbd-c754-e7df884e0545@oracle.com> Message-ID: > Thanks for cleaning up thread.hpp! Thanks for providing the feedback! I just noticed that the forward declaration of class jvmtiDeferredLocalVariableSet is not required anymore. Will remove it in the next webrev. Hope to get some more (partial) reviews. Thanks, Richard. -----Original Message----- From: Robbin Ehn Sent: Dienstag, 31. März 2020 16:21 To: Reingruber, Richard ; Doerr, Martin ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents Thanks for cleaning up thread.hpp! /Robbin On 2020-03-30 10:31, Reingruber, Richard wrote: > Hi, > > this is webrev.5 based on Robbin's feedback and Martin's review - thanks! :) > > The change affects jvmti, hotspot and c2. Partial reviews are very welcome too. > > Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/ > Delta: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5.inc/ > > Robbin, Martin, please let me know, if anything shouldn't be quite as you wanted it. 
Also find my > comments on your feedback below. > > Robbin, can I count you as Reviewer for the runtime part? > > Thanks, Richard. > > -- > >> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >> You can move both declaration and definition to that file, no need to clobber >> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > Done. > >> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its own >> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > I moved JvmtiDeferredUpdates to vframe_hp.hpp where preexisting jvmtiDeferredLocalVariableSet is > declared. > >> src/hotspot/share/code/compiledMethod.cpp >> Nice cleanup! > > Thanks :) > >> src/hotspot/share/code/debugInfoRec.cpp >> src/hotspot/share/code/debugInfoRec.hpp >> Additional parameters. (Remark: I think "non_global_escape_in_scope" would read better than "not_global_escape_in_scope", but your version is consistent with existing code, so no change request from my side.) Ok. > > I've been thinking about this too and finally stayed with not_global_escape_in_scope. It's supposed > to mean an object whose escape state is not GlobalEscape is in scope. > >> src/hotspot/share/compiler/compileBroker.cpp >> src/hotspot/share/compiler/compileBroker.hpp >> Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a follow up change together with the test in order to make this webrev smaller, but since it is included, I'm reviewing everything at once. Not a big deal.) Ok. > > Yes, the change would be a little smaller. And if it helps I'll split it off. In general I prefer > patches that bring along a suitable amount of tests. > >> src/hotspot/share/opto/c2compiler.cpp >> Make do_escape_analysis independent of JVMCI capabilities. Nice! > > It is the main goal of the enhancement. It is done for C2, but could be done for JVMCI compilers > with just a small effort as well. 
> >> src/hotspot/share/opto/escape.cpp >> Annotation for MachSafePointNodes. Your added functionality looks correct. >> But I'd prefer to move the bulky code out of the large function. >> I suggest to factor out something like has_not_global_escape and has_arg_escape. So the code could look like this: >> SafePointNode* sfn = sfn_worklist.at(next); >> sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); >> if (sfn->is_CallJava()) { >> CallJavaNode* call = sfn->as_CallJava(); >> call->set_arg_escape(has_arg_escape(call)); >> } >> This would also allow us to get rid of the found_..._escape_in_args variables making the loops better readable. > > Done. > >> It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to be the way to do it (there are more such places). So it's ok. > > Yeah. I copied the snippet. > >> src/hotspot/share/prims/jvmtiImpl.cpp >> src/hotspot/share/prims/jvmtiImpl.hpp >> The sequence is pretty complex: >> VM_GetOrSetLocal element initialization executes EscapeBarrier code which suspends the target thread (extra VM Operation). > > Note that the target threads have to be suspended already for VM_GetOrSetLocal*. So it's mainly the > synchronization effect of EscapeBarrier::sync_and_suspend_one() that is required here. Also no extra > _handshake_ is executed, since sync_and_suspend_one() will find the target threads already > suspended. > >> VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to prepare VM Operation with frame deoptimization). >> VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which resumes the target thread. >> But I don't have any improvement proposal. Performance is probably not a concern, here. So it's ok. > >> VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has non-globally escaping objects and other frames if they have arg escaping ones. Good. > > It's not specifically the top frame, but the frame that is accessed. 
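[Editorial note] The VM_GetOrSetLocal sequence discussed above — suspend the target in the EscapeBarrier constructor, deoptimize its objects in doit_prologue, resume in the destructor — is essentially an RAII pattern. The following is a toy Java model of just that ordering guarantee; all names are invented for illustration, this is not HotSpot code:

```java
// Models the lifecycle only: suspend the target thread on construction,
// deoptimize its objects while it is stopped, resume it when the scope
// is left. The StringBuilder log records the ordering.
class EscapeBarrierScope implements AutoCloseable {
    private final StringBuilder log;

    EscapeBarrierScope(StringBuilder log) {
        this.log = log;
        log.append("suspend;");   // models sync_and_suspend_one()
    }

    void deoptimizeObjects() {
        log.append("deopt;");     // models the work done in doit_prologue()
    }

    @Override
    public void close() {
        log.append("resume;");    // models resume_one() in the destructor
    }
}
```

Used with try-with-resources, the resume happens even if the deoptimization step fails, which mirrors the guarantee the C++ destructor provides.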
> >> src/hotspot/share/runtime/deoptimization.cpp >> Object deoptimization. I have more comments and proposals, here. >> First of all, handling recursive and waiting locks in relock_objects is tricky, but looks correct. >> Comments are sufficient to understand why things are done as they are implemented. > >> BiasedLocking related parts are complex, but we may get rid of them in the future (with BiasedLocking removal). >> Anyway, looks correct, too. > >> Typo in comment: "regularily" => "regularly" > >> Deoptimization::fetch_unroll_info_helper is the only place where _jvmti_deferred_updates get deallocated (except JavaThread destructor). But I think we always go through it, so I can't see a memory leak or such kind of issues. > > That's correct. The compiled frame for which deferred updates are allocated is always deoptimized > before (see EscapeBarrier::deoptimize_objects()). This is also asserted in > compiledVFrame::update_deferred_value(). I've added the same assertion to > Deoptimization::relock_objects(). So we can be sure that _jvmti_deferred_updates are deallocated > again in fetch_unroll_info_helper(). > >> EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread(). > > Sure, well spotted! > >> You can use MutexLocker and MonitorLocker with Thread* to save the Thread::current() call. > > Right, good hint. This was recently introduced with 8235678. I even had to resolve conflicts. Should > have done this then. > >> I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier interface because I think it shouldn't be used outside of EscapeBarrier::deoptimize_objects. > > Done. > >> Typo in comment: "we must only deoptimize" => "we only have to deoptimize" > > Replaced with "[...] we deoptimize iff local objects are passed as args" > >> "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and barrier_active() is redundant. Implementation can get moved to hpp file. > > Ok. Done. 
> >> I'll get back to suspend flags, later. > >> There are weird cases regarding _self_deoptimization_in_progress. >> Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can set _self_deoptimization_in_progress while A performs the handshake for suspending C. I think this doesn't lead to errors, but it's probably not desired. >> I think it would be better to use only one "wait" call in sync_and_suspend_one and sync_and_suspend_all. > > You're right. We've discussed that face-to-face, but couldn't find a real issue. But now, thinking again, I reckon I found one: > > 2808 // Sync with other threads that might be doing deoptimizations > 2809 { > 2810 // Need to switch to _thread_blocked for the wait() call > 2811 ThreadBlockInVM tbivm(_calling_thread); > 2812 MonitorLocker ml(EscapeBarrier_lock, Mutex::_no_safepoint_check_flag); > 2813 while (_self_deoptimization_in_progress) { > 2814 ml.wait(); > 2815 } > 2816 > 2817 if (self_deopt()) { > 2818 _self_deoptimization_in_progress = true; > 2819 } > 2820 > 2821 while (_deoptee_thread->is_ea_obj_deopt_suspend()) { > 2822 ml.wait(); > 2823 } > 2824 > 2825 if (self_deopt()) { > 2826 return; > 2827 } > 2828 > 2829 // set suspend flag for target thread > 2830 _deoptee_thread->set_ea_obj_deopt_flag(); > 2831 } > > - A waits in 2822 > - C is suspended > - B notifies all in resume_one() > - A and C wake up > - C wins over A and sets _self_deoptimization_in_progress = true in 2818 > - C does the self deoptimization > - A executes 2830 _deoptee_thread->set_ea_obj_deopt_flag() > > C will self suspend at some undefined point. The resulting state is illegal. > >> I first thought it'd be better to move ThreadBlockInVM before wait() to reduce thread state transitions, but that seems to be problematic because ThreadBlockInVM destructor contains a safepoint check which we shouldn't do while holding EscapeBarrier_lock. So no change request. 
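[Editorial note] The single-wait fix Martin proposes can be modeled outside HotSpot with plain Java monitors (class and field names here are invented; the real code uses a MonitorLocker on EscapeBarrier_lock). The point is that a woken thread re-checks both conditions in one loop before it either claims the self-deoptimization flag or sets the suspend flag, so the interleaving above — C claiming the self-deopt while A still goes on to set the suspend flag — cannot occur:

```java
// Toy model (not HotSpot code) of merging the two wait loops into one.
class EscapeBarrierModel {
    private boolean selfDeoptInProgress = false;
    private boolean deopteeSuspended = false;

    synchronized void enter(boolean selfDeopt) {
        // Single wait loop: after notifyAll() a woken thread cannot act
        // on a stale observation of either flag.
        while (selfDeoptInProgress || deopteeSuspended) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // ignore in this toy model
            }
        }
        if (selfDeopt) {
            selfDeoptInProgress = true;
        } else {
            deopteeSuspended = true;   // corresponds to set_ea_obj_deopt_flag()
        }
    }

    synchronized void leave(boolean selfDeopt) {
        if (selfDeopt) {
            selfDeoptInProgress = false;
        } else {
            deopteeSuspended = false;
        }
        notifyAll();                   // corresponds to resume_one()
    }

    synchronized boolean invariantHolds() {
        // The buggy interleaving ends with both flags effectively set;
        // with a single wait loop that state is unreachable.
        return !(selfDeoptInProgress && deopteeSuspended);
    }
}
```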
> > Yes, would be nice to have the state change only if needed, but for the reason you mentioned it is > not quite as easy as it seems to be. I experimented as well with a second lock, but did not succeed. > >> Change in thread_added: >> I think the sequence would be more comprehensible if we waited for deopt_all_threads in Thread::start and all other places where a new thread can run into Java code (e.g. JVMTI attach). >> Your version makes new threads come up with suspend flag set. That looks correct, too. Advantage is that you only have to change one place (thread_added). It'll be interesting to see what it will look like when we use async handshakes instead of suspend flags. >> For now, I'm ok with your version. > > I had a version that did what you are suggesting. The current version also has the advantage that > there are fewer places where a thread has to wait for ongoing object deoptimization. This means > fewer places where you have to worry about correct thread state transitions, possible deadlocks, > and if all oops are properly Handle'ed. > >> I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt->is_hidden_from_external_view()). > > Done. > >> Having 4 different deoptimize_objects functions makes it a little hard to keep an overview of which one is used for what. >> Maybe adding suffixes would help a little bit, but I can also live with what you have. >> Implementation looks correct to me. > > 2 are internal. I added the suffix _internal to them. This leaves 2 to choose from. > >> src/hotspot/share/runtime/deoptimization.hpp >> Escape barriers and object deoptimization functions. >> Typo in comment: "helt" => "held" > > Done in place already. > >> src/hotspot/share/runtime/interfaceSupport.cpp >> InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot = 1. >> I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad to have DeoptimizeObjectsALot = 1 in addition. Ok. 
> > I never used DeoptimizeObjectsALot = 1 that much. It could be more deterministic in single threaded > scenarios. I wouldn't object to get rid of it though. > >> src/hotspot/share/runtime/stackValue.hpp >> Better reinitialization in StackValue. Good. > > StackValue::obj_is_scalar_replaced() should not return true after calling set_obj(). > >> src/hotspot/share/runtime/thread.cpp >> src/hotspot/share/runtime/thread.hpp >> src/hotspot/share/runtime/thread.inline.hpp >> wait_for_object_deoptimization, suspend flag, deferred updates and test feature to deoptimize objects. > >> In the long term, we want to get rid of suspend flags, so it's not so nice to introduce a new one. But I agree with Götz that it should be acceptable as a temporary solution until async handshakes are available (which takes more time). So I'm ok with your change. > > I'm keen to build the feature on async handshakes when they arrive. > >> You can use MutexLocker with Thread*. > > Done. > >> JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out of thread.hpp. > > Done. > >> src/hotspot/share/runtime/vframe.cpp >> Added support for entry frame to new_vframe. Ok. > > >> src/hotspot/share/runtime/vframe_hp.cpp >> src/hotspot/share/runtime/vframe_hp.hpp > >> I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() should better be under #ifdef ASSERT or inside the assert statement (no need for code cache walking in product build). > > Done. > >> jvmtiDeferredLocalVariableSet::update_monitors: >> Please add a comment explaining that owner referenced by original info may be scalar replaced, but it is deoptimized in the vframe. > > Done. > > -----Original Message----- > From: Doerr, Martin > Sent: Donnerstag, 12. 
März 2020 17:28 > To: Reingruber, Richard ; 'Robbin Ehn' ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents > > Hi Richard, > > > I managed to find time for an (almost) complete review of webrev.4. (I'll review the tests separately.) > > First of all, the change seems to be of pretty good quality for its significant complexity. I couldn't find any real bugs. But I'd like to propose minor improvements. > I'm convinced that it's mature because we did substantial testing. > > I like the new functionality for object deoptimization. It can possibly be reused for future escape analysis based optimizations. So I appreciate having it available in the code base. > In addition to that, your change makes the JVMTI implementation better integrated into the VM. > > > Now to the details: > > > src/hotspot/share/c1/c1_IR.hpp > describe_scope parameters. Ok. > > > src/hotspot/share/ci/ciEnv.cpp > src/hotspot/share/ci/ciEnv.hpp > Fix for JvmtiExport::can_walk_any_space() capability. Ok. > > > src/hotspot/share/code/compiledMethod.cpp > Nice cleanup! > > > src/hotspot/share/code/debugInfoRec.cpp > src/hotspot/share/code/debugInfoRec.hpp > Additional parameters. (Remark: I think "non_global_escape_in_scope" would read better than "not_global_escape_in_scope", but your version is consistent with existing code, so no change request from my side.) Ok. > > > src/hotspot/share/code/nmethod.cpp > Nice cleanup! > > > src/hotspot/share/code/pcDesc.hpp > Additional parameters. Ok. > > > src/hotspot/share/code/scopeDesc.cpp > src/hotspot/share/code/scopeDesc.hpp > Improved implementation + additional parameters. Ok. 
> > > src/hotspot/share/compiler/compileBroker.cpp > src/hotspot/share/compiler/compileBroker.hpp > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a follow up change together with the test in order to make this webrev smaller, but since it is included, I'm reviewing everything at once. Not a big deal.) Ok. > > > src/hotspot/share/jvmci/jvmciCodeInstaller.cpp > Additional parameters. Ok. > > > src/hotspot/share/opto/c2compiler.cpp > Make do_escape_analysis independent of JVMCI capabilities. Nice! > > > src/hotspot/share/opto/callnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/escape.cpp > Annotation for MachSafePointNodes. Your added functionality looks correct. > But I'd prefer to move the bulky code out of the large function. > I suggest to factor out something like has_not_global_escape and has_arg_escape. So the code could look like this: > SafePointNode* sfn = sfn_worklist.at(next); > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > if (sfn->is_CallJava()) { > CallJavaNode* call = sfn->as_CallJava(); > call->set_arg_escape(has_arg_escape(call)); > } > This would also allow us to get rid of the found_..._escape_in_args variables making the loops better readable. > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to be the way to do it (there are more such places). So it's ok. > > > src/hotspot/share/opto/machnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/macro.cpp > Allow elimination of non-escaping allocations. Ok. > > > src/hotspot/share/opto/matcher.cpp > src/hotspot/share/opto/output.cpp > Copy attribute / pass parameters. Ok. > > > src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp > Nice cleanup! > > > src/hotspot/share/prims/jvmtiEnv.cpp > src/hotspot/share/prims/jvmtiEnvBase.cpp > Escape barriers + deoptimize objects for target thread. Good. 
> > > src/hotspot/share/prims/jvmtiImpl.cpp > src/hotspot/share/prims/jvmtiImpl.hpp > The sequence is pretty complex: > VM_GetOrSetLocal element initialization executes EscapeBarrier code which suspends the target thread (extra VM Operation). > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to prepare VM Operation with frame deoptimization). > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which resumes the target thread. > But I don't have any improvement proposal. Performance is probably not a concern, here. So it's ok. > > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has non-globally escaping objects and other frames if they have arg escaping ones. Good. > > > src/hotspot/share/prims/jvmtiTagMap.cpp > Escape barriers + deoptimize objects for all threads. Ok. > > > src/hotspot/share/prims/whitebox.cpp > Added WB_IsFrameDeoptimized to API. Ok. > > > src/hotspot/share/runtime/deoptimization.cpp > Object deoptimization. I have more comments and proposals, here. > First of all, handling recursive and waiting locks in relock_objects is tricky, but looks correct. > Comments are sufficient to understand why things are done as they are implemented. > > BiasedLocking related parts are complex, but we may get rid of them in the future (with BiasedLocking removal). > Anyway, looks correct, too. > > Typo in comment: "regularily" => "regularly" > > Deoptimization::fetch_unroll_info_helper is the only place where _jvmti_deferred_updates get deallocated (except JavaThread destructor). But I think we always go through it, so I can't see a memory leak or such kind of issues. > > EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread(). > > You can use MutexLocker and MonitorLocker with Thread* to save the Thread::current() call. 
> > I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier interface because I think it shouldn't be used outside of EscapeBarrier::deoptimize_objects. > > Typo in comment: "we must only deoptimize" => "we only have to deoptimize" > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and barrier_active() is redundant. Implementation can get moved to hpp file. > > I'll get back to suspend flags, later. > > There are weird cases regarding _self_deoptimization_in_progress. > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can set _self_deoptimization_in_progress while A performs the handshake for suspending C. I think this doesn't lead to errors, but it's probably not desired. > I think it would be better to use only one "wait" call in sync_and_suspend_one and sync_and_suspend_all. > > I first thought it'd be better to move ThreadBlockInVM before wait() to reduce thread state transitions, but that seems to be problematic because ThreadBlockInVM destructor contains a safepoint check which we shouldn't do while holding EscapeBarrier_lock. So no change request. > > Change in thread_added: > I think the sequence would be more comprehensible if we waited for deopt_all_threads in Thread::start and all other places where a new thread can run into Java code (e.g. JVMTI attach). > Your version makes new threads come up with suspend flag set. That looks correct, too. Advantage is that you only have to change one place (thread_added). It'll be interesting to see what it will look like when we use async handshakes instead of suspend flags. > For now, I'm ok with your version. > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt->is_hidden_from_external_view()). > > Having 4 different deoptimize_objects functions makes it a little hard to keep an overview of which one is used for what. > Maybe adding suffixes would help a little bit, but I can also live with what you have. 
> Implementation looks correct to me. > > > src/hotspot/share/runtime/deoptimization.hpp > Escape barriers and object deoptimization functions. > Typo in comment: "helt" => "held" > > > src/hotspot/share/runtime/globals.hpp > Addition of develop flag DeoptimizeObjectsALotInterval. Ok. > > > src/hotspot/share/runtime/interfaceSupport.cpp > InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot = 1. > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad to have DeoptimizeObjectsALot = 1 in addition. Ok. > > > src/hotspot/share/runtime/interfaceSupport.inline.hpp > Addition of deoptimizeAllObjects. Ok. > > > src/hotspot/share/runtime/mutexLocker.cpp > src/hotspot/share/runtime/mutexLocker.hpp > Addition of EscapeBarrier_lock. Ok. > > > src/hotspot/share/runtime/objectMonitor.cpp > Make recursion count relock aware. Ok. > > > src/hotspot/share/runtime/stackValue.hpp > Better reinitialization in StackValue. Good. > > > src/hotspot/share/runtime/thread.cpp > src/hotspot/share/runtime/thread.hpp > src/hotspot/share/runtime/thread.inline.hpp > wait_for_object_deoptimization, suspend flag, deferred updates and test feature to deoptimize objects. > > In the long term, we want to get rid of suspend flags, so it's not so nice to introduce a new one. But I agree with Götz that it should be acceptable as a temporary solution until async handshakes are available (which takes more time). So I'm ok with your change. > > You can use MutexLocker with Thread*. > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out of thread.hpp. > > > src/hotspot/share/runtime/vframe.cpp > Added support for entry frame to new_vframe. Ok. > > > src/hotspot/share/runtime/vframe_hp.cpp > src/hotspot/share/runtime/vframe_hp.hpp > > I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() should better be under #ifdef ASSERT or inside the assert statement (no need for code cache walking in product build). 
> > jvmtiDeferredLocalVariableSet::update_monitors: > Please add a comment explaining that owner referenced by original info may be scalar replaced, but it is deoptimized in the vframe. > > > src/hotspot/share/utilities/macros.hpp > Addition of NOT_COMPILER2_OR_JVMCI_RETURN macros. Ok. > > > test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysisEnabled.java > test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnalysisEnabled.c > New test. Will review separately. > > > test/jdk/TEST.ROOT > Addition of vm.jvmci as required property. Ok. > > > test/jdk/com/sun/jdi/EATests.java > test/jdk/com/sun/jdi/EATestsJVMCI.java > New test. Will review separately. > > > test/lib/sun/hotspot/WhiteBox.java > Added isFrameDeoptimized to API. Ok. > > > That was it. Best regards, > Martin > > >> -----Original Message----- >> From: hotspot-compiler-dev > bounces at openjdk.java.net> On Behalf Of Reingruber, Richard >> Sent: Dienstag, 3. März 2020 21:23 >> To: 'Robbin Ehn' ; Lindenmaier, Goetz >> ; David Holmes ; >> Vladimir Kozlov (vladimir.kozlov at oracle.com) >> ; serviceability-dev at openjdk.java.net; >> hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- >> dev at openjdk.java.net >> Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance in the Presence of JVMTI Agents >> >> Hi Robbin, >> >>>> I understand that Robbin proposed to replace the usage of >>>> _suspend_flag with handshakes. Apparently, async handshakes >>>> are needed to do so. We have been waiting a while for removal >>>> of the _suspend_flag / introduction of async handshakes [2]. >>>> What is the status here? >> >>> I have an old prototype which I would like to continue to work on. >>> So do not assume asynch handshakes will make 15. >>> Even if it would, I think there is a lot more investigative work to remove >>> _suspend_flag. >> >> Let us know if we can be of any help to you, be it only testing. 
>> >>>>> Full: >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ >> >>> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >>> You can move both declaration and definition to that file, no need to >> clobber >>> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) >> >> Will do. >> >>> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in it's >> own >>> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. >> >> You are right. It shouldn't be declared in thread.hpp. I will look into that. >> >>> Note that we also think we may have a bug in deopt: >>> https://bugs.openjdk.java.net/browse/JDK-8238237 >> >>> I think it would be best, if possible, to push after that is resolved. >> >> Sure. >> >>> Not even nearly a full review :) >> >> I know :) >> >> Anyways, thanks a lot, >> Richard. >> >> >> -----Original Message----- >> From: Robbin Ehn >> Sent: Monday, March 2, 2020 11:17 AM >> To: Lindenmaier, Goetz ; Reingruber, Richard >> ; David Holmes ; >> Vladimir Kozlov (vladimir.kozlov at oracle.com) >> ; serviceability-dev at openjdk.java.net; >> hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- >> dev at openjdk.java.net >> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance >> in the Presence of JVMTI Agents >> >> Hi, >> >> On 2/24/20 5:39 PM, Lindenmaier, Goetz wrote: >>> Hi, >>> >>> I had a look at the progress of this change. Nothing >>> happened since Richard posted his update using more >>> handshakes [1]. >>> But we (SAP) would appreciate a lot if this change could >>> be successfully reviewed and pushed. >>> >>> I think there is basic understanding that this >>> change is helpful. It fixes a number of issues with JVMTI, >>> and will deliver the same performance benefits as EA >>> does in current production mode for debugging scenarios. >>> >>> This is important for us as we run our VMs prepared >>> for debugging in production mode. 
>>> >>> I understand that Robbin proposed to replace the usage of >>> _suspend_flag with handshakes. Apparently, async handshakes >>> are needed to do so. We have been waiting a while for removal >>> of the _suspend_flag / introduction of async handshakes [2]. >>> What is the status here? >> >> I have an old prototype which I would like to continue to work on. >> So do not assume asynch handshakes will make 15. >> Even if it would, I think there is a lot more investigative work to remove >> _suspend_flag. >>> >>> I think we should no longer wait, but proceed with >>> this change. We will look into removing the usage of >>> suspend_flag introduced here once it is possible to implement >>> it with handshakes. >> >> Yes, sure. >> >>>> Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ >> >> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >> You can move both declaration and definition to that file, no need to clobber >> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) >> >> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its >> own >> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. >> >> Note that we also think we may have a bug in deopt: >> https://bugs.openjdk.java.net/browse/JDK-8238237 >> >> I think it would be best, if possible, to push after that is resolved. >> >> Not even nearly a full review :) >> >> Thanks, Robbin >> >> >>>> Incremental: >>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4.inc/ >>>> >>>> I was not able to eliminate the additional suspend flag now. I'll take care >> of this >>>> as soon as the >>>> existing suspend-resume-mechanism is reworked. 
>>>> >>>> Testing: >>>> >>>> Nightly tests @SAP: >>>> >>>> JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015, >> Renaissance >>>> Suite, SAP specific tests >>>> with fastdebug and release builds on all platforms >>>> >>>> Stress testing with DeoptimizeObjectsALot running SPECjvm2008 40x >> parallel >>>> for 24h >>>> >>>> Thanks, Richard. >>>> >>>> >>>> More details on the changes: >>>> >>>> * Hide DeoptimizeObjectsALotThread from external view. >>>> >>>> * Changed EscapeBarrier_lock to be a _safepoint_check_never lock. >>>> It used to be _safepoint_check_sometimes, which will be eliminated >> sooner or >>>> later. >>>> I added explicit thread state changes with ThreadBlockInVM to code >> paths >>>> where we can wait() >>>> on EscapeBarrier_lock to become safepoint safe. >>>> >>>> * Use handshake EscapeBarrierSuspendHandshake to suspend target >> threads >>>> instead of vm operation >>>> VM_ThreadSuspendAllForObjDeopt. >>>> >>>> * Removed uses of Threads_lock. When adding a new thread we suspend >> it iff >>>> EA optimizations are >>>> being reverted. In the previous version we were waiting on >> Threads_lock >>>> while EA optimizations >>>> were reverted. See EscapeBarrier::thread_added(). >>>> >>>> * Made tests require Xmixed compilation mode. >>>> >>>> * Made tests agnostic regarding tiered compilation. >>>> I.e. tc isn't disabled anymore, and the tests can be run with tc enabled or >>>> disabled. >>>> >>>> * Exercising EATests.java as well with stress test options >>>> DeoptimizeObjectsALot* >>>> Due to the non-deterministic deoptimizations some tests need to be >> skipped. >>>> We do this to prevent bit-rot of the stress test code. >>>> >>>> * Executing EATests.java as well with graal if available. Driver for this is >>>> EATestsJVMCI.java. Graal cannot pass all tests, because it does not >> provide all >>>> the new debug info >>>> (namely not_global_escape_in_scope and arg_escape in >> scopeDesc.hpp). 
>>>> And graal does not yet support the JVMTI operations force early return >> and >>>> pop frame. >>>> >>>> * Removed tracing from new jdi tests in EATests.java. Too much trace >> output >>>> before the debugging >>>> connection is established can cause deadlock because output buffers fill >> up. >>>> (See https://bugs.openjdk.java.net/browse/JDK-8173304) >>>> >>>> * Many copyright year changes and smaller clean-up changes of testing >> code >>>> (trailing white-space and >>>> the like). >>>> >>>> >>>> -----Original Message----- >>>> From: David Holmes >>>> Sent: Donnerstag, 19. Dezember 2019 03:12 >>>> To: Reingruber, Richard ; serviceability- >>>> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; >> hotspot- >>>> runtime-dev at openjdk.java.net; Vladimir Kozlov >> (vladimir.kozlov at oracle.com) >>>> >>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance in >>>> the Presence of JVMTI Agents >>>> >>>> Hi Richard, >>>> >>>> I think my issue is with the way EliminateNestedLocks works so I'm going >>>> to look into that more deeply. >>>> >>>> Thanks for the explanations. >>>> >>>> David >>>> >>>> On 18/12/2019 12:47 am, Reingruber, Richard wrote: >>>>> Hi David, >>>>> >>>>> > > > Some further queries/concerns: >>>>> > > > >>>>> > > > src/hotspot/share/runtime/objectMonitor.cpp >>>>> > > > >>>>> > > > Can you please explain the changes to ObjectMonitor::wait: >>>>> > > > >>>>> > > > ! _recursions = save // restore the old recursion count >>>>> > > > ! + jt->get_and_reset_relock_count_after_wait(); // >>>>> > > > increased by the deferred relock count >>>>> > > > >>>>> > > > what is the "deferred relock count"? I gather it relates to >>>>> > > > >>>>> > > > "The code was extended to be able to deoptimize objects of a >>>>> > > frame that >>>>> > > > is not the top frame and to let another thread than the owning >>>>> > > thread do >>>>> > > > it." >>>>> > > >>>>> > > Yes, these relate. 
Currently EA based optimizations are reverted, >> when a >>>> compiled frame is >>>>> > > replaced with corresponding interpreter frames. Part of this is >> relocking >>>> objects with eliminated >>>>> > > locking. New with the enhancement is that we do this also just >> before >>>> object references are >>>>> > > acquired through JVMTI. In this case we deoptimize also the >> owning >>>> compiled frame C and we >>>>> > > register deoptimized objects as deferred updates. When control >> returns >>>> to C it gets deoptimized, >>>>> > > we notice that objects are already deoptimized (reallocated and >>>> relocked), so we don't do it again >>>>> > > (relocking twice would be incorrect of course). Deferred updates >> are >>>> copied into the new >>>>> > > interpreter frames. >>>>> > > >>>>> > > Problem: relocking is not possible if the target thread T is waiting >> on the >>>> monitor that needs to >>>>> > > be relocked. This happens only with non-local objects with >>>> EliminateNestedLocks. Instead relocking >>>>> > > is deferred until T owns the monitor again. This is what the piece of >>>> code above does. >>>>> > >>>>> > Sorry I need some more detail here. How can you wait() on an >> object >>>>> > monitor if the object allocation and/or locking was optimised away? >> And >>>>> > what is a "non-local object" in this context? Isn't EA restricted to >>>>> > thread-confined objects? >>>>> >>>>> "Non-local object" is an object that escapes its thread. The issue I'm >>>> addressing with the changes >>>>> in ObjectMonitor::wait are almost unrelated to EA. They are caused by >>>> EliminateNestedLocks, where C2 >>>>> eliminates recursive locking of an already owned lock. The lock owning >> object >>>> exists on the heap, it >>>>> is locked and you can call wait() on it. >>>>> >>>>> EliminateLocks is the C2 option that controls lock elimination based on >> EA. 
>>>> Both optimizations have >>>>> in common that objects with eliminated locking need to be relocked >> when >>>> deoptimizing a frame, >>>>> i.e. when replacing a compiled frame with equivalent interpreter >>>>> frames. Deoptimization::relock_objects does that job for /all/ eliminated >>>> locks in scope. /All/ can >>>>> be a mix of eliminated nested locks and locks of not-escaping objects. >>>>> >>>>> New with the enhancement: I call relock_objects earlier, just before >> objects >>>> potentially >>>>> escape. But then later when the owning compiled frame gets >> deoptimized, I >>>> must not do it again: >>>>> >>>>> See call to EscapeBarrier::objs_are_deoptimized in deoptimization.cpp: >>>>> >>>>> 373 if ((jvmci_enabled || ((DoEscapeAnalysis || >> EliminateNestedLocks) && >>>> EliminateLocks)) >>>>> 374 && !EscapeBarrier::objs_are_deoptimized(thread, >> deoptee.id())) { >>>>> 375 bool unused; >>>>> 376 eliminate_locks(thread, chunk, realloc_failures, deoptee, >> exec_mode, >>>> unused); >>>>> 377 } >>>>> >>>>> Now when calling relock_objects early it is quite possible that I have to >> relock >>>> an object the >>>>> target thread currently waits for. Obviously I cannot relock in this case, >>>> instead I chose to >>>>> introduce relock_count_after_wait to JavaThread. >>>>> > Is it just that some of the locking gets optimized away e.g. >>>>> > >>>>> > synchronised(obj) { >>>>> > synchronised(obj) { >>>>> > synchronised(obj) { >>>>> > obj.wait(); >>>>> > } >>>>> > } >>>>> > } >>>>> > >>>>> > If this is reduced to a form as-if it were a single lock of the monitor >>>>> > (due to EA) and the wait() triggers a JVM TI event which leads to the >>>>> > escape of "obj" then we need to reconstruct the true lock state, and >> so >>>>> > when the wait() internally unblocks and reacquires the monitor it >> has to >>>>> > set the true recursion count to 3, not the 1 that it appeared to be >> when >>>>> > wait() was initially called. Is that the scenario?
>>>>> >>>>> Kind of... except that the locking is not eliminated due to EA and there is >> no >>>> JVM TI event >>>>> triggered by wait. >>>>> >>>>> Add >>>>> >>>>> LocalObject l1 = new LocalObject(); >>>>> >>>>> in front of the synchronized blocks and assume a JVM TI agent acquires l1. >> This >>>> triggers the code in >>>>> question. >>>>> >>>>> See that relocking/reallocating is transactional. If it is done, it is done for /all/ >>>> objects in scope, and at most once. It wouldn't be quite so easy to split this into relocking >> of >>>> nested/EA-based >>>>> eliminated locks. >>>>> >>>>> > If so I find this truly awful. Anyone using wait() in a realistic form >>>>> > requires a notification and so the object cannot be thread confined. >> In >>>>> >>>>> It is not thread confined. >>>>> >>>>> > which case I would strongly argue that upon hitting the wait() the >> deopt >>>>> > should occur unconditionally and so the lock state is correct before >> we >>>>> > wait and so we don't need to mess with the recursion count >> internally >>>>> > when we reacquire the monitor. >>>>> > >>>>> > > >>>>> > > > which I don't like the sound of at all when it comes to >> ObjectMonitor >>>>> > > > state. So I'd like to understand in detail exactly what is going on >> here >>>>> > > > and why. This is a very intrusive change that seems to badly >> break >>>>> > > > encapsulation and impacts future changes to ObjectMonitor >> that are >>>> under >>>>> > > > investigation. >>>>> > > >>>>> > > I would not regard this as breaking encapsulation. Certainly not >> badly. >>>>> > > >>>>> > > I've added a property relock_count_after_wait to JavaThread. The >>>> property is well >>>>> > > encapsulated. Future ObjectMonitor implementations have to deal >> with >>>> recursion too. They are free >>>>> > > in choosing a way to do that as long as that property is taken into >>>> account. This is hardly a >>>>> > > limitation.
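For readers following the thread, the scenario under discussion can be sketched as a runnable example. This is illustrative only: the class and thread names are invented, and whether C2 actually eliminates the inner locks depends on how the method is compiled. With EliminateNestedLocks the two inner, recursive acquisitions can be elided in compiled code, so after a deoptimization the true recursion count of 3 must be restored before wait() reacquires the monitor.

```java
// Illustrative sketch only: names are invented; the point is the monitor
// recursion structure, not HotSpot internals.
public class NestedWait {
    public static void main(String[] args) throws InterruptedException {
        final Object obj = new Object();
        // The notifier cannot enter the monitor until wait() releases it
        // below, so the notification cannot be lost.
        Thread notifier = new Thread(() -> {
            synchronized (obj) {
                obj.notify();
            }
        });
        synchronized (obj) {             // recursion count 1
            synchronized (obj) {         // count 2 -- candidate for elimination
                synchronized (obj) {     // count 3 -- candidate for elimination
                    notifier.start();
                    obj.wait();          // releases the monitor completely,
                                         // must reacquire with count 3
                }
            }
        }
        notifier.join();
        System.out.println(Thread.holdsLock(obj));
    }
}
```

Running it simply demonstrates that wait() gives up the monitor regardless of the recursion depth and that all three blocks unwind normally afterwards.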
>>>>> > >>>>> > I do think this badly breaks encapsulation as you have to add a >> callout >>>>> > from the guts of the ObjectMonitor code to reach into the thread to >> get >>>>> > this lock count adjustment. I understand why you have had to do >> this but >>>>> > I would much rather see a change to the EA optimisation strategy so >> that >>>>> > this is not needed. >>>>> > >>>>> > > Note also that the property is a straightforward extension of the >>>> existing concept of deferred >>>>> > > local updates. It is embedded into the structure holding them. So >> not >>>> even the footprint of a >>>>> > > JavaThread is enlarged if no deferred updates are generated. >>>>> > >>>>> > [...] >>>>> > >>>>> > > >>>>> > > I'm actually duplicating the existing external suspend mechanism, >>>> because a thread can be >>>>> > > suspended at most once. And hey, I don't like that either! But it >>>> seems not unlikely that the >>>>> > > duplicate can be removed together with the original and the new >> type >>>> of handshakes that will be >>>>> > > used for thread suspend can be used for object deoptimization >> too. See >>>> today's discussion in >>>>> > > JDK-8227745 [2]. >>>>> > >>>>> > I hope that discussion bears some fruit, at the moment it seems not >> to >>>>> > be possible to use handshakes here. :( >>>>> > >>>>> > The external suspend mechanism is a royal pain in the proverbial >> that we >>>>> > have to carefully live with. The idea that we're duplicating that for >>>>> > use in another fringe area of functionality does not thrill me at all. >>>>> > >>>>> > To be clear, I understand the problem that exists and that you wish >> to >>>>> > solve, but for the runtime parts I balk at the complexity cost of >>>>> > solving it. >>>>> >>>>> I know it's complex, but by far no rocket science. >>>>> >>>>> Also I find it hard to imagine another fix for JDK-8233915 besides >> changing >>>> the JVM TI specification. >>>>> >>>>> Thanks, Richard.
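A minimal sketch of the get-and-reset idiom behind the relock_count_after_wait property may help here. It is written in Java with invented names; the real field lives in HotSpot's C++ JavaThread and is consumed in ObjectMonitor::wait when the monitor is reacquired.

```java
// Sketch with invented names: the deferred relock count is accumulated
// while the target thread waits, and consumed exactly once on reacquire.
class RelockBookkeeping {
    private int relockCountAfterWait = 0;

    // Called on behalf of a target thread that is currently waiting on the
    // monitor being relocked, so the relock must be deferred.
    void deferRelock(int eliminatedRecursions) {
        relockCountAfterWait += eliminatedRecursions;
    }

    // Called once when wait() reacquires the monitor; resetting the count
    // ensures a later wakeup does not apply the adjustment twice.
    int getAndResetRelockCountAfterWait() {
        int n = relockCountAfterWait;
        relockCountAfterWait = 0;
        return n;
    }

    public static void main(String[] args) {
        RelockBookkeeping jt = new RelockBookkeeping();
        jt.deferRelock(2);              // two eliminated recursive locks
        int save = 1;                   // recursion count observed at wait()
        int recursions = save + jt.getAndResetRelockCountAfterWait();
        System.out.println(recursions); // true recursion count to restore
        System.out.println(jt.getAndResetRelockCountAfterWait());
    }
}
```

The second read returning zero is the property David's review is probing: the adjustment is a one-shot transfer, not persistent state inside the monitor.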
>>>>> >>>>> -----Original Message----- >>>>> From: David Holmes >>>>> Sent: Dienstag, 17. Dezember 2019 08:03 >>>>> To: Reingruber, Richard ; serviceability- >>>> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; >> hotspot- >>>> runtime-dev at openjdk.java.net; Vladimir Kozlov >> (vladimir.kozlov at oracle.com) >>>> >>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance >>>> in the Presence of JVMTI Agents >>>>> >>>>> >>>>> >>>>> David >>>>> >>>>> On 17/12/2019 4:57 pm, David Holmes wrote: >>>>>> Hi Richard, >>>>>> >>>>>> On 14/12/2019 5:01 am, Reingruber, Richard wrote: >>>>>>> Hi David, >>>>>>> >>>>>>> > Some further queries/concerns: >>>>>>> > >>>>>>> > src/hotspot/share/runtime/objectMonitor.cpp >>>>>>> > >>>>>>> > Can you please explain the changes to ObjectMonitor::wait: >>>>>>> > >>>>>>> > ! _recursions = save // restore the old recursion count >>>>>>> > ! + jt->get_and_reset_relock_count_after_wait(); // >>>>>>> > increased by the deferred relock count >>>>>>> > >>>>>>> > what is the "deferred relock count"? I gather it relates to >>>>>>> > >>>>>>> > "The code was extended to be able to deoptimize objects of a >>>>>>> frame that >>>>>>> > is not the top frame and to let another thread than the owning >>>>>>> thread do >>>>>>> > it." >>>>>>> >>>>>>> Yes, these relate. Currently EA based optimizations are reverted, >> when >>>>>>> a compiled frame is replaced >>>>>>> with corresponding interpreter frames. Part of this is relocking >>>>>>> objects with eliminated >>>>>>> locking. New with the enhancement is that we do this also just before >>>>>>> object references are acquired >>>>>>> through JVMTI. In this case we deoptimize also the owning compiled >>>>>>> frame C and we register >>>>>>> deoptimized objects as deferred updates.
When control returns to C >> it >>>>>>> gets deoptimized, we notice >>>>>>> that objects are already deoptimized (reallocated and relocked), so >> we >>>>>>> don't do it again (relocking >>>>>>> twice would be incorrect of course). Deferred updates are copied into >>>>>>> the new interpreter frames. >>>>>>> >>>>>>> Problem: relocking is not possible if the target thread T is waiting >>>>>>> on the monitor that needs to be >>>>>>> relocked. This happens only with non-local objects with >>>>>>> EliminateNestedLocks. Instead relocking is >>>>>>> deferred until T owns the monitor again. This is what the piece of >>>>>>> code above does. >>>>>> >>>>>> Sorry I need some more detail here. How can you wait() on an object >>>>>> monitor if the object allocation and/or locking was optimised away? >> And >>>>>> what is a "non-local object" in this context? Isn't EA restricted to >>>>>> thread-confined objects? >>>>>> >>>>>> Is it just that some of the locking gets optimized away e.g. >>>>>> >>>>>> synchronised(obj) { >>>>>> synchronised(obj) { >>>>>> synchronised(obj) { >>>>>> obj.wait(); >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> If this is reduced to a form as-if it were a single lock of the monitor >>>>>> (due to EA) and the wait() triggers a JVM TI event which leads to the >>>>>> escape of "obj" then we need to reconstruct the true lock state, and so >>>>>> when the wait() internally unblocks and reacquires the monitor it has to >>>>>> set the true recursion count to 3, not the 1 that it appeared to be when >>>>>> wait() was initially called. Is that the scenario? >>>>>> >>>>>> If so I find this truly awful. Anyone using wait() in a realistic form >>>>>> requires a notification and so the object cannot be thread confined.
In >>>>>> which case I would strongly argue that upon hitting the wait() the >> deopt >>>>>> should occur unconditionally and so the lock state is correct before we >>>>>> wait and so we don't need to mess with the recursion count internally >>>>>> when we reacquire the monitor. >>>>>> >>>>>>> >>>>>>> > which I don't like the sound of at all when it comes to >>>>>>> ObjectMonitor >>>>>>> > state. So I'd like to understand in detail exactly what is going >>>>>>> on here >>>>>>> > and why. This is a very intrusive change that seems to badly >> break >>>>>>> > encapsulation and impacts future changes to ObjectMonitor that >>>>>>> are under >>>>>>> > investigation. >>>>>>> >>>>>>> I would not regard this as breaking encapsulation. Certainly not badly. >>>>>>> >>>>>>> I've added a property relock_count_after_wait to JavaThread. The >>>>>>> property is well >>>>>>> encapsulated. Future ObjectMonitor implementations have to deal >> with >>>>>>> recursion too. They are free in >>>>>>> choosing a way to do that as long as that property is taken into >>>>>>> account. This is hardly a >>>>>>> limitation. >>>>>> >>>>>> I do think this badly breaks encapsulation as you have to add a callout >>>>>> from the guts of the ObjectMonitor code to reach into the thread to >> get >>>>>> this lock count adjustment. I understand why you have had to do this >> but >>>>>> I would much rather see a change to the EA optimisation strategy so >> that >>>>>> this is not needed. >>>>>> >>>>>>> Note also that the property is a straightforward extension of the >>>>>>> existing concept of deferred >>>>>>> local updates. It is embedded into the structure holding them. So not >>>>>>> even the footprint of a >>>>>>> JavaThread is enlarged if no deferred updates are generated. >>>>>>> >>>>>>> > --- >>>>>>> > >>>>>>> > src/hotspot/share/runtime/thread.cpp >>>>>>> > >>>>>>> > Can you please explain why >>>>>>> JavaThread::wait_for_object_deoptimization >>>>>>>
> has to be handcrafted in this way rather than using proper >>>>>>> transitions. >>>>>>> > >>>>>>> >>>>>>> I wrote wait_for_object_deoptimization taking >>>>>>> JavaThread::java_suspend_self_with_safepoint_check >>>>>>> as template. So in short: for the same reasons :) >>>>>>> >>>>>>> Threads reach both methods as part of thread state transitions, >>>>>>> therefore special handling is >>>>>>> required to change thread state on top of ongoing transitions. >>>>>>> >>>>>>> > We got rid of "deopt suspend" some time ago and it is disturbing >>>>>>> to see >>>>>>> > it being added back (effectively). This seems like it may be >>>>>>> something >>>>>>> > that handshakes could be used for. >>>>>>> >>>>>>> Deopt suspend used to be something rather different with a similar >>>>>>> name[1]. It is not being added back. >>>>>> >>>>>> I stand corrected. Despite comments in the code to the contrary >>>>>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot of >>>>>> cleanup in this area 13 years ago :) >>>>>> >>>>>>> >>>>>>> I'm actually duplicating the existing external suspend mechanism, >>>>>>> because a thread can be suspended >>>>>>> at most once. And hey, I don't like that either! But it seems not >>>>>>> unlikely that the duplicate can >>>>>>> be removed together with the original and the new type of >> handshakes >>>>>>> that will be used for >>>>>>> thread suspend can be used for object deoptimization too. See >> today's >>>>>>> discussion in JDK-8227745 [2]. >>>>>> >>>>>> I hope that discussion bears some fruit, at the moment it seems not to >>>>>> be possible to use handshakes here. :( >>>>>> >>>>>> The external suspend mechanism is a royal pain in the proverbial that >> we >>>>>> have to carefully live with. The idea that we're duplicating that for >>>>>> use in another fringe area of functionality does not thrill me at all.
>>>>>> >>>>>> To be clear, I understand the problem that exists and that you wish to >>>>>> solve, but for the runtime parts I balk at the complexity cost of >>>>>> solving it. >>>>>> >>>>>> Thanks, >>>>>> David >>>>>> ----- >>>>>> >>>>>>> Thanks, Richard. >>>>>>> >>>>>>> [1] Deopt suspend was something like an async. handshake for >>>>>>> architectures with register windows, >>>>>>> where patching the return pc for deoptimization of a compiled >>>>>>> frame was racy if the owner thread >>>>>>> was in native code. Instead a "deopt" suspend flag was set on >>>>>>> which the thread patched its own >>>>>>> frame upon return from native. So no thread was suspended. It >> got >>>>>>> its name only from the name of >>>>>>> the flags. >>>>>>> >>>>>>> [2] Discussion about using handshakes to sync. with the target thread: >>>>>>> >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14306727 >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: David Holmes >>>>>>> Sent: Freitag, 13. Dezember 2019 00:56 >>>>>>> To: Reingruber, Richard ; >>>>>>> serviceability-dev at openjdk.java.net; >>>>>>> hotspot-compiler-dev at openjdk.java.net; >>>>>>> hotspot-runtime-dev at openjdk.java.net >>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >>>>>>> Performance in the Presence of JVMTI Agents >>>>>>> >>>>>>> Hi Richard, >>>>>>> >>>>>>> Some further queries/concerns: >>>>>>> >>>>>>> src/hotspot/share/runtime/objectMonitor.cpp >>>>>>> >>>>>>> Can you please explain the changes to ObjectMonitor::wait: >>>>>>> >>>>>>> ! _recursions = save // restore the old recursion count >>>>>>> ! + jt->get_and_reset_relock_count_after_wait(); // >>>>>>> increased by the deferred relock count >>>>>>> >>>>>>> what is the "deferred relock count"?
I gather it relates to >>>>>>> >>>>>>> "The code was extended to be able to deoptimize objects of a frame >> that >>>>>>> is not the top frame and to let another thread than the owning thread >> do >>>>>>> it." >>>>>>> >>>>>>> which I don't like the sound of at all when it comes to ObjectMonitor >>>>>>> state. So I'd like to understand in detail exactly what is going on here >>>>>>> and why. This is a very intrusive change that seems to badly break >>>>>>> encapsulation and impacts future changes to ObjectMonitor that are >> under >>>>>>> investigation. >>>>>>> >>>>>>> --- >>>>>>> >>>>>>> src/hotspot/share/runtime/thread.cpp >>>>>>> >>>>>>> Can you please explain why >> JavaThread::wait_for_object_deoptimization >>>>>>> has to be handcrafted in this way rather than using proper transitions. >>>>>>> >>>>>>> We got rid of "deopt suspend" some time ago and it is disturbing to >> see >>>>>>> it being added back (effectively). This seems like it may be something >>>>>>> that handshakes could be used for. >>>>>>> >>>>>>> Thanks, >>>>>>> David >>>>>>> ----- >>>>>>> >>>>>>> On 12/12/2019 7:02 am, David Holmes wrote: >>>>>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote: >>>>>>>>> Hi David, >>>>>>>>> >>>>>>>>> > Most of the details here are in areas I can comment on in >> detail, >>>>>>>>> but I >>>>>>>>> > did take an initial general look at things. >>>>>>>>> >>>>>>>>> Thanks for taking the time! >>>>>>>> >>>>>>>> Apologies the above should read: >>>>>>>> >>>>>>>> "Most of the details here are in areas I *can't* comment on in detail >>>>>>>> ..." >>>>>>>> >>>>>>>> David >>>>>>>> >>>>>>>>> > The only thing that jumped out at me is that I think the >>>>>>>>> > DeoptimizeObjectsALotThread should be a hidden thread. >>>>>>>>> > >>>>>>>>> > + bool is_hidden_from_external_view() const { return true; } >>>>>>>>> >>>>>>>>> Yes, it should. Will add the method like above. >>>>>>>>>
> Also I don't see any testing of the >> DeoptimizeObjectsALotThread. >>>>>>>>> Without >>>>>>>>> > active testing this will just bit-rot. >>>>>>>>> >>>>>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger >>>>>>>>> workload. I will add a minimal test >>>>>>>>> to keep it fresh. >>>>>>>>> >>>>>>>>> > Also on the tests I don't understand your @requires clause: >>>>>>>>> > >>>>>>>>> > @requires ((vm.compMode != "Xcomp") & >> vm.compiler2.enabled >>>> & >>>>>>>>> > (vm.opt.TieredCompilation != true)) >>>>>>>>> > >>>>>>>>> > This seems to require that TieredCompilation is disabled, but >>>>>>>>> tiered is >>>>>>>>> > our normal mode of operation. ?? >>>>>>>>> > >>>>>>>>> >>>>>>>>> I removed the clause. I guess I wanted to target the tests towards >> the >>>>>>>>> code they are supposed to >>>>>>>>> test, and it's easier to analyze failures w/o tiered compilation and >>>>>>>>> with just one compiler thread. >>>>>>>>> >>>>>>>>> Additionally I will make use of >>>>>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Richard. >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: David Holmes >>>>>>>>> Sent: Mittwoch, 11.
Dezember 2019 08:03 >>>>>>>>> To: Reingruber, Richard ; >>>>>>>>> serviceability-dev at openjdk.java.net; >>>>>>>>> hotspot-compiler-dev at openjdk.java.net; >>>>>>>>> hotspot-runtime-dev at openjdk.java.net >>>>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >>>>>>>>> Performance in the Presence of JVMTI Agents >>>>>>>>> >>>>>>>>> Hi Richard, >>>>>>>>> >>>>>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I would like to get reviews please for >>>>>>>>>> >>>>>>>>>> >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/ >>>>>>>>>> >>>>>>>>>> Corresponding RFE: >>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745 >>>>>>>>>> >>>>>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915 >>>>>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK- >> 8214584 [1] >>>>>>>>>> >>>>>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing >> without >>>>>>>>>> issues (thanks!). In addition the >>>>>>>>>> change is being tested at SAP since I posted the first RFR some >>>>>>>>>> months ago. >>>>>>>>>> >>>>>>>>>> The intention of this enhancement is to benefit performance wise >> from >>>>>>>>>> escape analysis even if JVMTI >>>>>>>>>> agents request capabilities that allow them to access local variable >>>>>>>>>> values. E.g. if you start-up >>>>>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, >> then >>>>>>>>>> escape analysis is disabled right >>>>>>>>>> from the beginning, well before a debugger attaches -- if ever one >>>>>>>>>> should do so. With the >>>>>>>>>> enhancement, escape analysis will remain enabled until and after >> a >>>>>>>>>> debugger attaches. EA based >>>>>>>>>> optimizations are reverted just before an agent acquires the >>>>>>>>>> reference to an object. In the JBS item >>>>>>>>>> you'll find more details. 
>>>>>>>>> >>>>>>>>> Most of the details here are in areas I can comment on in detail, but >> I >>>>>>>>> did take an initial general look at things. >>>>>>>>> >>>>>>>>> The only thing that jumped out at me is that I think the >>>>>>>>> DeoptimizeObjectsALotThread should be a hidden thread. >>>>>>>>> >>>>>>>>> + bool is_hidden_from_external_view() const { return true; } >>>>>>>>> >>>>>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread. >>>>>>>>> Without >>>>>>>>> active testing this will just bit-rot. >>>>>>>>> >>>>>>>>> Also on the tests I don't understand your @requires clause: >>>>>>>>> >>>>>>>>> @requires ((vm.compMode != "Xcomp") & >> vm.compiler2.enabled & >>>>>>>>> (vm.opt.TieredCompilation != true)) >>>>>>>>> >>>>>>>>> This seems to require that TieredCompilation is disabled, but tiered >> is >>>>>>>>> our normal mode of operation. ?? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> David >>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Richard. >>>>>>>>>> >>>>>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745 >>>>>>>>>> >>>> >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.patch >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> From tobias.hartmann at oracle.com Wed Apr 1 06:26:47 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 1 Apr 2020 08:26:47 +0200 Subject: [15] RFR(S): 8241909: Remove useless code cache lookup in frame::patch_pc In-Reply-To: <76b44f19-0c1d-3efb-e922-4a108b136b52@oracle.com> References: <39c001b5-e39e-8e3a-c74a-cd2d35dabf5c@oracle.com> <76b44f19-0c1d-3efb-e922-4a108b136b52@oracle.com> Message-ID: <6c0db3fe-e14d-3693-46dc-a187f264dc47@oracle.com> Vladimir, Dean, thanks for the review! Best regards, Tobias On 31.03.20 21:36, Dean Long wrote: > +1 > > dl > > On 3/31/20 10:42 AM, Vladimir Kozlov wrote: >> Good.
>> >> thanks, >> Vladimir >> >> On 3/31/20 1:45 AM, Tobias Hartmann wrote: >>> Hi, >>> >>> please review the following patch: >>> https://bugs.openjdk.java.net/browse/JDK-8241909 >>> http://cr.openjdk.java.net/~thartmann/8241909/webrev.00/ >>> >>> The code cache lookup in frame::patch_pc [1] is useless because the method is only called from >>> frame::deoptimize and vframeArrayElement::unpack_on_stack where pc is always part of _cb. >>> >>> If the method is called from frame::deoptimize [2], pc is either _cb->deopt_mh_handler_begin() or >>> _cb->deopt_handler_begin(). Both are part of _cb. >>> >>> If the method is called from vframeArrayElement::unpack_on_stack [3], _frame is an interpreter frame >>> and therefore _frame._cb is the interpreter buffer blob. pc is only set in this method and always >>> points to an interpreter entry which is part of the interpreter buffer blob. >>> >>> Thanks, >>> Tobias >>> >>> [1] http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/cpu/x86/frame_x86.cpp#l265 >>> [2] http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/share/runtime/frame.cpp#l287 >>> [3] >>> http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/share/runtime/vframeArray.cpp#l303 >>> > From aph at redhat.com Wed Apr 1 08:54:52 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 1 Apr 2020 09:54:52 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> Message-ID: On 4/1/20 3:05 AM, Pengfei Li wrote: > In my patch, the newly added instruction UADDLP supports T2S but doesn't support T2D. So I changed the value range to 0 - 3, where 3 means all arrangements are accepted now. That's why the value for parameter "accepted" of NEGR is promoted from 2 to 3 now. I see. OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Wed Apr 1 10:22:23 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 1 Apr 2020 11:22:23 +0100 Subject: [8u] RFR: 8237951: CTW: C2 compilation fails with "malformed control flow" In-Reply-To: <871rp8ek1x.fsf@redhat.com> References: <871rp8ek1x.fsf@redhat.com> Message-ID: <51b56814-c654-beaf-f4d3-0e952ff337fa@redhat.com> On 3/31/20 2:22 PM, Roland Westrelin wrote: > The patch from the fix applies cleanly but it relies on > Node::find_out_with() that's missing from 8. The backport below cherry > picks that method from 8066312 (Add new Node* Node::find_out(int opc) > method). > > http://cr.openjdk.java.net/~roland/8237951.8u/webrev.00/ OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From erik.osterlund at oracle.com Wed Apr 1 10:24:20 2020 From: erik.osterlund at oracle.com (Erik Österlund) Date: Wed, 1 Apr 2020 12:24:20 +0200 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> Message-ID: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Hi Vladimir, On 2020-03-30 21:14, Vladimir Kozlov wrote: > But you at least can do static check at the beginning of method: > > int MachNode::pd_alignment_required() const { > if (VM_Version::has_intel_jcc_erratum()) { > PhaseOutput* output = Compile::current()->output(); > Block* block = output->block(); > int index = output->index(); > assert(output->mach() == this, "incorrect iterator state in > PhaseOutput"); >
if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { > // Conservatively add worst case padding. We assume that > relocInfo::addr_unit() is 1 on x86. > return IntelJccErratum::largest_jcc_size() + 1; > } > } > return 1; > } That is equivalent to the compiler. I verified that by disassembling the release bits before and after your suggestion, and it is instruction by instruction the same. In both cases it first checks if VM_Version::has_intel_jcc_erratum(), and if not, returns before even building a frame. I'd rather keep the not nested variant because it is equivalent, yet easier to read. >> >>> In compute_padding() reads done under check so I have less concerns >>> about it. But I also don't get why you use saved _mach instead of >>> using MachNode 'this'. >> >> Good point. I changed to this + an assert checking that they are >> indeed the same. > > Why do you need Output._mach at all if you use it only in this assert? > Even logically it looks strange. In what case it could be different? It should never be different; that was the point. The index and mach node exposed by the iterator are related and refer to the same entity. So if you use the exposed index in code in a mach node, you must know that this mach node is the same mach node that the index refers to, and it is. The assert was meant to enforce it so that if you were to call either the alignment or padding function in a new context, for whatever reason, and don't happen to know that you can't do that without having a consistent iteration state, you would immediately catch that in the assertions, instead of getting strange silent logic errors.
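Erik's observation that the flattened guard compiles to the same early-exit code as the nested form can be illustrated outside HotSpot with a small, self-contained example. This is an illustrative Java sketch with invented names and constants (the real code is the C++ MachNode::pd_alignment_required discussed above); it only demonstrates that the two shapes are behaviorally identical for every input.

```java
// Illustrative only: LARGEST_JCC_SIZE stands in for
// IntelJccErratum::largest_jcc_size(); the two methods encode the nested
// and the flattened variant of the same guard.
public class GuardShapes {
    static final int LARGEST_JCC_SIZE = 6; // invented constant

    static int nested(boolean erratum, boolean jccBranch) {
        if (erratum) {
            if (jccBranch) {
                return LARGEST_JCC_SIZE + 1; // worst-case padding
            }
        }
        return 1;
    }

    static int flattened(boolean erratum, boolean jccBranch) {
        if (!erratum) {
            return 1; // static check up front, cheap early exit
        }
        return jccBranch ? LARGEST_JCC_SIZE + 1 : 1;
    }

    public static void main(String[] args) {
        boolean allAgree = true;
        for (boolean e : new boolean[] {false, true}) {
            for (boolean b : new boolean[] {false, true}) {
                allAgree &= nested(e, b) == flattened(e, b);
            }
        }
        System.out.println(allAgree);
    }
}
```

Since both shapes return the same value for all inputs, the choice between them is purely one of readability, which is the basis of Erik's preference for the non-nested variant.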
Having said that, I am okay with removing _mach if you prefer having one seat belt less, it is up to you: http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Incremental: http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ Thanks, /Erik > Thanks, > Vladimir > >> >> Here is an updated webrev with your concerns and Vladimir Ivanov's >> concerns addressed: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >> >> Incremental: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >> >> Thanks, >> /Erik >> >>> Thanks, >>> Vladimir >>> >>>> >>>>> In pd_alignment_required() you implicitly use knowledge that >>>>> relocInfo::addr_unit() on x86 is 1. >>>>> At least add comment about that. >>>> >>>> I can add a comment about that. >>>> >>>> New webrev: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 3/23/20 6:09 AM, Erik Österlund wrote: >>>>>> Hi, >>>>>> >>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>> with the IntelJccErratum mitigation, >>>>>> which is ifdef:ed in shared code. It should move to >>>>>> platform-specific code. >>>>>> >>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>> allows hiding the Intel-specific code >>>>>> completely in x86-specific files.
>>>>>> >>>>>> Webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>> >>>>>> Bug: >>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>> >> From jatin.bhateja at intel.com Wed Apr 1 18:23:29 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Wed, 1 Apr 2020 18:23:29 +0000 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com>, Message-ID: Hi Vladimir, Please find an updated unified patch at the following link. http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ This removes Optimized NotV handling for AVX3, as suggested it will be brought via vectorIntrinsics branch. Thanks for your help in shaping up this patch, please let me know if there are other comments. Best Regards, Jatin ________________________________________ From: Bhateja, Jatin Sent: Wednesday, March 25, 2020 12:14 PM To: Vladimir Ivanov Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction Hi Vladimir, I have placed updated patch at following links:- 1) Optimized NotV handling: http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ 2) Changes for MacroLogic opt: http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ Kindly review and let me know your feedback. 
Thanks, Jatin > -----Original Message----- > From: Vladimir Ivanov > Sent: Wednesday, March 25, 2020 12:33 AM > To: Bhateja, Jatin > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > > Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Jatin, > > I tried to submit the patches for testing, but windows-x64 build failed with the > following errors: > > src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not > evaluate to a constant > src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read > of a variable outside its lifetime > src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' > src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int > ['function']' is not assignable > > Best regards, > Vladimir Ivanov > > On 24.03.2020 10:34, Bhateja, Jatin wrote: > > Hi Vladimir, > > > > Thanks for your comments , I have split the original patch into two sub- > patches. > > > > 1) Optimized NotV handling: > > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > > > 2) Changes for MacroLogic opt: > > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ > > > > Added a new flag "UseVectorMacroLogic" which guards MacroLogic > optimization. > > > > Kindly review and let me know your feedback. > > > > Best Regards, > > Jatin > > > >> -----Original Message----- > >> From: Vladimir Ivanov > >> Sent: Tuesday, March 17, 2020 4:31 PM > >> To: Bhateja, Jatin ; hotspot-compiler- > >> dev at openjdk.java.net > >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >> Instruction > >> > >> > >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ > >> > >> Very nice contribution, Jatin! > >> > >> Some comments after a brief review pass: > >> > >> * Please, contribute NotV part separately. > >> > >> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) > >> transformation during GVN instead? 
> >> > >> * As of now, vector nodes are only produced by SuperWord > >> analysis. It makes sense to limit new optimization pass to SuperWord > >> pass only (probably, introduce a new dedicated Phase ). Once Vector > >> API is available, it can be extended to cases when vector nodes are > >> present > >> (C->max_vector_size() > 0). > >> > >> * There are more efficient ways to produce a vector of all-1s [1] [2]. > >> > >> Best regards, > >> Vladimir Ivanov > >> > >> [1] > >> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 > >> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc > >> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ > >> 1-efficiently > >> > >> [2] > >> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 > >> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI > >> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ > >> value-to-all-one-bits > >> > >>> > >>> A new optimization pass has been added post Auto-Vectorization which > >> folds expression tree involving vector boolean logic operations > >> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. > >>> Optimization pass has following stages: > >>> > >>> 1. Collection stage : > >>> * This performs a DFS traversal over Ideal Graph and collects the root > >> nodes of all vector logic expression trees. > >>> 2. Processing stage: > >>> * Performs a bottom up traversal over expression tree and > >> simultaneously folds specific DAG patterns involving Boolean logic > >> parent and child nodes. > >>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. > >>> * Folding is performed under a constraint on the total number of > inputs > >> which a MacroLogic node can have, in this case it's 3. > >>> * A partition is created around a DAG pattern involving logic parent > and > >> one or two logic child node, it encapsulate the nodes in post-order fashion. 
> >>> * This partition is then evaluated by traversing over the nodes, > assigning > >> boolean values to its inputs and performing operations over them > >> based on its Opcode. Node along with its computed result is stored in > >> a map which is accessed during the evaluation of its user/parent node. > >>> * Post-evaluation a MacroLogic node is created which is equivalent to > a > >> three input truth-table. Expression tree leaf level inputs along with > >> result of its evaluation are the inputs fed to this new node. > >>> * Entire expression tree is eventually subsumed/replaced by newly > >> create MacroLogic node. > >>> > >>> > >>> Following are the JMH benchmarks results with and without changes. > >>> > >>> Without Changes: > >>> > >>> Benchmark (VECLEN) Mode Cnt Score Error Units > >>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s > >>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s > >>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s > >>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s > >>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s > >>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s > >>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s > >>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s > >>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s > >>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s > >>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s > >>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s > >>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s > >>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s > >>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s > >>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s > >>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s > >>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s > >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s > >>> MacroLogicOpt.workload3_caller 
2048 thrpt 75.086 ops/s > >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s > >>> > >>> With Changes: > >>> > >>> Benchmark (VECLEN) Mode Cnt Score Error Units > >>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s > >>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s > >>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s > >>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s > >>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s > >>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s > >>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s > >>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s > >>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s > >>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s > >>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s > >>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s > >>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s > >>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s > >>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s > >>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s > >>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s > >>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s > >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s > >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s > >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s > >>> > >>> Please review the patch. > >>> > >>> Best Regards, > >>> Jatin > >>> > >>> [1] Section 17.7 : > >>> https://urldefense.com/v3/__https://software.intel.com/sites/default > >>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG > >>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ > >>> architectures-optimization-manual.pdf > >>> From daniel.daugherty at oracle.com Wed Apr 1 18:45:48 2020 From: daniel.daugherty at oracle.com (Daniel D. 
Daugherty) Date: Wed, 1 Apr 2020 14:45:48 -0400 Subject: RFR: 8241234: Unify monitor enter/exit runtime entries In-Reply-To: References: <222D2846-F6AE-4D5B-B41F-F976D90E329C@oracle.com> <91eeada8-e05f-bc73-b029-94e169216a56@oracle.com> <534b8cf7-cd8c-565b-5163-09a216d4f94e@oracle.com> <904faf68-4fff-f1b8-2fb8-48d65f282fa2@oracle.com> Message-ID: <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> Hi Yudi, I grabbed a copy of this patch: http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/open.patch pushed it into my jdk-15+16 baseline and ran it thru a single cycle of my regular stress kit (~24 hours). There were no failures which matches my jdk-15+16 baseline stress testing (~72 hours, no failures). I also ran it through my ObjectMonitor inflation stress kit for ~24 hours and there were no failures there either. Dan On 3/30/20 10:20 AM, Daniel D. Daugherty wrote: > On 3/30/20 10:15 AM, Yudi Zheng wrote: >> Hi Daniel, >> >> Thanks for the review! I have uploaded a new version with your >> comments addressed: >> http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/ >> >>> src/hotspot/share/runtime/sharedRuntime.hpp >>> Please don't forget to update the copyright year before you push. >> Fixed. >> >>> src/hotspot/share/runtime/sharedRuntime.cpp >>> L2104: ObjectSynchronizer::exit(obj, lock, THREAD); >>> The use of 'THREAD' here and 'TRAPS' in the function itself >>> stand out more now, but that's something for me to clean up. >> Also, I noticed that C2 was using CHECK >>> ObjectSynchronizer::enter(h_obj, lock, CHECK); >> While C1 and JVMCI were using THREAD: >>> ObjectSynchronizer::enter(h_obj, lock->lock(), THREAD); >> I have no idea when to use what, and hope unifying to the C2 entries >> would help. >> Let me know if there is something I should address in this patch. >> Otherwise, I would >> rather leave it to the expert, i.e., you ;) > > Yes, please leave it for me to clean up. 
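On the CHECK vs. THREAD question raised above: in HotSpot's exception macros, both forms pass the current thread to a TRAPS callee; the difference is that CHECK additionally expands into an immediate return from the caller when the callee left a pending exception, while THREAD leaves that check to the caller. A simplified, self-contained illustration follows — the types and helpers below are stand-ins for exposition, not the real macros from utilities/exceptions.hpp:

```cpp
// Stand-in for a JavaThread with a pending-exception flag.
struct FakeThread {
  bool pending_exception = false;
};

// A callee in TRAPS style: it may leave a pending exception on the thread.
void may_throw(int v, FakeThread* thread) {
  if (v < 0) {
    thread->pending_exception = true;
  }
}

// "THREAD" style call site: execution continues after the callee returns,
// so the caller is responsible for testing the pending-exception state.
int call_with_thread(int v, FakeThread* thread) {
  may_throw(v, thread);
  int result = 42;  // this line runs even if an exception is pending
  return thread->pending_exception ? -1 : result;
}

// "CHECK" style call site: the macro expands to pass THREAD and then
// return from the caller immediately when an exception is pending.
int call_with_check(int v, FakeThread* thread) {
  may_throw(v, thread);
  if (thread->pending_exception) {  // this test is what CHECK inserts
    return -1;
  }
  return 42;
}
```

Both call sites behave the same here because the "THREAD" variant checks explicitly; the bug risk Dan alludes to is a "THREAD" caller that forgets the explicit check.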
> > >>> src/hotspot/share/c1/c1_Runtime1.cpp >>> old L718: assert(thread == JavaThread::current(), "threads >>> must correspond"); >>> Removed in favor of the assert in >>> SharedRuntime::monitor_enter_helper(). >>> Okay that makes sense. >>> >>> old L721: EXCEPTION_MARK; >>> Removed in favor of the same in >>> SharedRuntime::monitor_enter_helper(). >>> Okay that makes sense. >>> >>> src/hotspot/share/jvmci/jvmciRuntime.cpp >>> old L403: assert(thread == JavaThread::current(), "threads >>> must correspond"); >>> old L406: EXCEPTION_MARK; >>> Same as for c1_Runtime1.cpp >> I assume I don't need to do anything regarding the comments above. > > Correct. Just observations on the old code. > > >>> L390: TRACE_jvmci_3("%s: entered locking slow case with >>> obj="... >>> L394: TRACE_jvmci_3("%s: exiting locking slow with obj=" >>> L417: TRACE_jvmci_3("%s: exited locking slow case with obj=" >>> But this is no longer the "slow" case so I'm a bit confused. >>> >>> Update: I see there's a comment about the tracing being >>> removed. >>> I have no opinion on that since it is JVM/CI code, but the >>> word >>> "slow" needs to be adjusted if you keep it. >> I removed all the tracing code. > > Thanks for cleaning that up. > > Dan > >> >> Many thanks, >> Yudi > From igor.ignatyev at oracle.com Wed Apr 1 19:13:09 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 1 Apr 2020 12:13:09 -0700 Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: Hi Evgeny, (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) overall that looks good to me. 
one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. > The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? // you can use tierN jobs which include both jdk and hs parts. Thanks, -- Igor > On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: > > > Hi, > > > Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 > > Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ > > > The bug had been created as a request to simplify investigation for compiler control tests failures. > I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. > The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. > > The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > > Please review, > /Evgeny Nikitin. From tom.rodriguez at oracle.com Wed Apr 1 19:56:54 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Wed, 1 Apr 2020 12:56:54 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log Message-ID: http://cr.openjdk.java.net/~never/8191930/webrev https://bugs.openjdk.java.net/browse/JDK-8191930 This was something that was fixed in 8 but never made it into 9+ I think because the code moved after 8. Tested by forcing a bailout with the problematic string and inspecting the resulting xml. 
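For context on 8191930: the compile log is an XML document, so a bailout message containing characters such as '<' or '&' makes the log unparseable unless the message is escaped before being written. A hedged sketch of the kind of escaping involved — this helper is illustrative only and is not the code from the webrev, which uses HotSpot's own string handling:

```cpp
#include <string>

// Illustrative helper, not the actual fix: escape the five XML special
// characters so a bailout message can be embedded in a compile-log
// attribute without producing unparseable output.
static std::string xml_escape(const std::string& s) {
  std::string out;
  for (char ch : s) {
    switch (ch) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&apos;"; break;
      default:   out += ch;       break;
    }
  }
  return out;
}
```

Escaping '&' must happen in a single pass as above (or first, if done with repeated substitutions), otherwise already-emitted entities would be double-escaped.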
From nils.eliasson at oracle.com Wed Apr 1 20:06:28 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Wed, 1 Apr 2020 22:06:28 +0200 Subject: RFR(S): 8241556: Memory leak if -XX:CompileCommand is set In-Reply-To: References: Message-ID: Hi Man, Your fix looks good. Thanks for fixing! Reviewed. Best regards, Nils Eliasson On 2020-03-25 00:21, Man Cao wrote: > Hi all, > > Could I have reviews for this fix for a memory leak? This memory leak is > pretty significant in production, and it took us weeks to identify the root > cause. > Webrev: https://cr.openjdk.java.net/~manc/8241556/webrev.00/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8241556 > > A more elegant fix would be to use automatic allocation/deallocation on the > char*. Unfortunately std::string and std::unique_ptr are both unavailable in > HotSpot. > > -Man From yudi.zheng at oracle.com Wed Apr 1 20:23:02 2020 From: yudi.zheng at oracle.com (Yudi Zheng) Date: Wed, 1 Apr 2020 22:23:02 +0200 Subject: RFR: 8241234: Unify monitor enter/exit runtime entries In-Reply-To: <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> References: <222D2846-F6AE-4D5B-B41F-F976D90E329C@oracle.com> <91eeada8-e05f-bc73-b029-94e169216a56@oracle.com> <534b8cf7-cd8c-565b-5163-09a216d4f94e@oracle.com> <904faf68-4fff-f1b8-2fb8-48d65f282fa2@oracle.com> <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> Message-ID: Hi Dan, Thanks a lot for stress testing this patch! I will push this as soon as I get green lights from the mach5 tests. Best regards, -Yudi > On 1 Apr 2020, at 20:45, Daniel D. Daugherty wrote: > > Hi Yudi, > > I grabbed a copy of this patch: > > http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/open.patch > > pushed it into my jdk-15+16 baseline and ran it thru a single cycle of > my regular stress kit (~24 hours). There were no failures which matches > my jdk-15+16 baseline stress testing (~72 hours, no failures). 
> > I also ran it through my ObjectMonitor inflation stress kit for ~24 > hours and there were no failures there either. > > Dan > > > On 3/30/20 10:20 AM, Daniel D. Daugherty wrote: >> On 3/30/20 10:15 AM, Yudi Zheng wrote: >>> Hi Daniel, >>> >>> Thanks for the review! I have uploaded a new version with your comments addressed: >>> http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/ >>> >>>> src/hotspot/share/runtime/sharedRuntime.hpp >>>> Please don't forget to update the copyright year before you push. >>> Fixed. >>> >>>> src/hotspot/share/runtime/sharedRuntime.cpp >>>> L2104: ObjectSynchronizer::exit(obj, lock, THREAD); >>>> The use of 'THREAD' here and 'TRAPS' in the function itself >>>> stand out more now, but that's something for me to clean up. >>> Also, I noticed that C2 was using CHECK >>>> ObjectSynchronizer::enter(h_obj, lock, CHECK); >>> While C1 and JVMCI were using THREAD: >>>> ObjectSynchronizer::enter(h_obj, lock->lock(), THREAD); >>> I have no idea when to use what, and hope unifying to the C2 entries would help. >>> Let me know if there is something I should address in this patch. Otherwise, I would >>> rather leave it to the expert, i.e., you ;) >> >> Yes, please leave it for me to clean up. >> >> >>>> src/hotspot/share/c1/c1_Runtime1.cpp >>>> old L718: assert(thread == JavaThread::current(), "threads must correspond"); >>>> Removed in favor of the assert in SharedRuntime::monitor_enter_helper(). >>>> Okay that makes sense. >>>> >>>> old L721: EXCEPTION_MARK; >>>> Removed in favor of the same in SharedRuntime::monitor_enter_helper(). >>>> Okay that makes sense. >>>> >>>> src/hotspot/share/jvmci/jvmciRuntime.cpp >>>> old L403: assert(thread == JavaThread::current(), "threads must correspond"); >>>> old L406: EXCEPTION_MARK; >>>> Same as for c1_Runtime1.cpp >>> I assume I don't need to do anything regarding the comments above. >> >> Correct. Just observations on the old code. 
>> >> >>>> L390: TRACE_jvmci_3("%s: entered locking slow case with obj="... >>>> L394: TRACE_jvmci_3("%s: exiting locking slow with obj=" >>>> L417: TRACE_jvmci_3("%s: exited locking slow case with obj=" >>>> But this is no longer the "slow" case so I'm a bit confused. >>>> >>>> Update: I see there's a comment about the tracing being removed. >>>> I have no opinion on that since it is JVM/CI code, but the word >>>> "slow" needs to be adjusted if you keep it. >>> I removed all the tracing code. >> >> Thanks for cleaning that up. >> >> Dan >> >>> >>> Many thanks, >>> Yudi >> > From vladimir.x.ivanov at oracle.com Wed Apr 1 20:25:48 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 1 Apr 2020 23:25:48 +0300 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: Hi Jatin, > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ Looks good. I'll submit it for testing. FTR, in the longer term I'd like to see the dedicated pass go away and the optimization migrated to GVN. I don't see any special requirements which justify additional complexity from a separate pass. Best regards, Vladimir Ivanov > This removes Optimized NotV handling for AVX3, as suggested it will be > brought via vectorIntrinsics branch. > > Thanks for your help in shaping up this patch, please let me know if there > are other comments. 
> > Best Regards, > Jatin > ________________________________________ > From: Bhateja, Jatin > Sent: Wednesday, March 25, 2020 12:14 PM > To: Vladimir Ivanov > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Vladimir, > > I have placed updated patch at following links:- > > 1) Optimized NotV handling: > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > 2) Changes for MacroLogic opt: > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > > Kindly review and let me know your feedback. > > Thanks, > Jatin > >> -----Original Message----- >> From: Vladimir Ivanov >> Sent: Wednesday, March 25, 2020 12:33 AM >> To: Bhateja, Jatin >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction >> >> Hi Jatin, >> >> I tried to submit the patches for testing, but windows-x64 build failed with the >> following errors: >> >> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not >> evaluate to a constant >> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read >> of a variable outside its lifetime >> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >> ['function']' is not assignable >> >> Best regards, >> Vladimir Ivanov >> >> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>> Hi Vladimir, >>> >>> Thanks for your comments , I have split the original patch into two sub- >> patches. >>> >>> 1) Optimized NotV handling: >>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>> >>> 2) Changes for MacroLogic opt: >>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>> >>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >> optimization. 
>>> >>> Kindly review and let me know your feedback. >>> >>> Best Regards, >>> Jatin >>> >>>> -----Original Message----- >>>> From: Vladimir Ivanov >>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>> To: Bhateja, Jatin ; hotspot-compiler- >>>> dev at openjdk.java.net >>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>> Instruction >>>> >>>> >>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>> >>>> Very nice contribution, Jatin! >>>> >>>> Some comments after a brief review pass: >>>> >>>> * Please, contribute NotV part separately. >>>> >>>> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>> transformation during GVN instead? >>>> >>>> * As of now, vector nodes are only produced by SuperWord >>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>> API is available, it can be extended to cases when vector nodes are >>>> present >>>> (C->max_vector_size() > 0). >>>> >>>> * There are more efficient ways to produce a vector of all-1s [1] [2]. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> [1] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>> 1-efficiently >>>> >>>> [2] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>> value-to-all-one-bits >>>> >>>>> >>>>> A new optimization pass has been added post Auto-Vectorization which >>>> folds expression tree involving vector boolean logic operations >>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>> Optimization pass has following stages: >>>>> >>>>> 1. 
Collection stage : >>>>> * This performs a DFS traversal over Ideal Graph and collects the root >>>> nodes of all vector logic expression trees. >>>>> 2. Processing stage: >>>>> * Performs a bottom up traversal over expression tree and >>>> simultaneously folds specific DAG patterns involving Boolean logic >>>> parent and child nodes. >>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. >>>>> * Folding is performed under a constraint on the total number of >> inputs >>>> which a MacroLogic node can have, in this case it's 3. >>>>> * A partition is created around a DAG pattern involving logic parent >> and >>>> one or two logic child node, it encapsulate the nodes in post-order fashion. >>>>> * This partition is then evaluated by traversing over the nodes, >> assigning >>>> boolean values to its inputs and performing operations over them >>>> based on its Opcode. Node along with its computed result is stored in >>>> a map which is accessed during the evaluation of its user/parent node. >>>>> * Post-evaluation a MacroLogic node is created which is equivalent to >> a >>>> three input truth-table. Expression tree leaf level inputs along with >>>> result of its evaluation are the inputs fed to this new node. >>>>> * Entire expression tree is eventually subsumed/replaced by newly >>>> create MacroLogic node. >>>>> >>>>> >>>>> Following are the JMH benchmarks results with and without changes. 
>>>>> >>>>> Without Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s >>>>> >>>>> With Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 
323.925 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s >>>>> >>>>> Please review the patch. >>>>> >>>>> Best Regards, >>>>> Jatin >>>>> >>>>> [1] Section 17.7 : >>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>> architectures-optimization-manual.pdf >>>>> From ioi.lam at oracle.com Thu Apr 2 00:00:21 2020 From: ioi.lam at oracle.com (Ioi Lam) Date: Wed, 1 Apr 2020 17:00:21 -0700 Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: <3bbe30fd-aae0-f55f-15f4-6a92ef918617@oracle.com> On 4/1/20 12:13 PM, Igor Ignatyev wrote: > Hi Evgeny, > > (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) > > overall that looks good to me. one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. 
> FYI, We've been doing a similar thing with all the CDS tests -- all the logs from ProcessTools are saved, and we print out the name of stdout/stderr files in the .jtr files. It's been very valuable in diagnosing failures. Command line: [/home/iklam/jdk/bld/fre-fastdebug/images/jdk/bin/java -cp /jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d:/jdk2/fre/open/test/hotspot/jtreg/runtime/cds/appcds:/jdk2/tmp/jtreg/work/classes/13/test/lib:/jdk/tools/jtreg/5.0-b01/lib/javatest.jar:/jdk/tools/jtreg/5.0-b01/lib/jtreg.jar -XX:MaxRAM=8g -cp /jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d/hello.jar -Xshare:dump -Xlog:cds -XX:SharedArchiveFile=/jdk2/tmp/jtreg/work/scratch/2/appcds-23h24m40s432.jsa -XX:ExtraSharedClassListFile=/jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d/HelloTest-test.classlist ] [2020-04-01T06:24:40.530164Z] Gathering output for process 22666 [ELAPSED: 3068 ms] [logging stdout to HelloTest-0000-dump.stdout] [logging stderr to HelloTest-0000-dump.stderr] Thanks - Ioi >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? > // you can use tierN jobs which include both jdk and hs parts. > > Thanks, > -- Igor > >> On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: >> >> >> Hi, >> >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 >> >> Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ >> >> >> The bug had been created as a request to simplify investigation for compiler control tests failures. >> I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. >> The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. 
>> >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). >> >> Please review, >> /Evgeny Nikitin. From david.holmes at oracle.com Thu Apr 2 00:07:31 2020 From: david.holmes at oracle.com (David Holmes) Date: Wed, 1 Apr 2020 17:07:31 -0700 (PDT) Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: <70147008-45b8-0b7f-6691-50f8429c5369@oracle.com> Thanks for sharing this Igor! I'm not at all sure this is generally what we want for every single test that uses ProcessTools! But I'm willing to see it trialed. Evgeny: Please run full tier testing at least to tier 6 and ideally beyond before pushing this. There are potential implications for temporary (and more permanent) disk usage as well as additional time needed to write files out to disk. (Hopefully these are generally small enough that this doesn't make a noticeable difference.) Thanks, David On 2/04/2020 5:13 am, Igor Ignatyev wrote: > Hi Evgeny, > > (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) > > overall that looks good to me. one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. > >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? > // you can use tierN jobs which include both jdk and hs parts. 
> > Thanks, > -- Igor > >> On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: >> >> >> Hi, >> >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 >> >> Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ >> >> >> The bug had been created as a request to simplify investigation for compiler control tests failures. >> I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. >> The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. >> >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). >> >> Please review, >> /Evgeny Nikitin. > From vladimir.kozlov at oracle.com Thu Apr 2 02:57:24 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 1 Apr 2020 19:57:24 -0700 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: On 4/1/20 3:24 AM, Erik Österlund wrote: > Hi Vladimir, > > On 2020-03-30 21:14, Vladimir Kozlov wrote: >> But you at least can do static check at the beginning of method: >> >> int MachNode::pd_alignment_required() const { >> if (VM_Version::has_intel_jcc_erratum()) { >> PhaseOutput* output = Compile::current()->output(); >> Block* block = output->block(); >> int index = output->index(); >> assert(output->mach() == this, "incorrect iterator state in PhaseOutput"); >>
if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { >> // Conservatively add worst case padding. We assume that relocInfo::addr_unit() is 1 on x86. >> return IntelJccErratum::largest_jcc_size() + 1; >> } >> } >> return 1; >> } > > That is equivalent to the compiler. I verified that by disassembling the release bits before > and after your suggestion, and it is instruction by instruction the same. In both cases it > first checks if VM_Version::has_intel_jcc_erratum(), and if not, returns before even building > a frame. I'd rather keep the not nested variant because it is equivalent, yet easier to read. I have reservations about this statement, which may not be true for all C++ compilers we use, but I will not insist on refactoring it. > >>> >>>> In compute_padding() reads are done under the check so I have less concerns about it. But I also don't get why you use saved >>>> _mach instead of using MachNode 'this'. >>> >>> Good point. I changed to this + an assert checking that they are indeed the same. >> >> Why do you need Output._mach at all if you use it only in this assert? Even logically it looks strange. In what case >> it could be different? > > It should never be different; that was the point. The index and mach node exposed by the > iterator are related and refer to the same entity. So if you use the exposed index in code > in a mach node, you must know that this mach node is the same mach node that the index refers > to, and it is. The assert was meant to enforce it so that if you were to call either the > alignment or padding function in a new context, for whatever reason, and don't happen to know > that you can't do that without having a consistent iteration state, you would immediately catch > that in the assertions, instead of getting strange silent logic errors. > > Having said that, I am okay with removing _mach if you prefer having one seat belt less, it is up to you: > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Okay. 
Good. Thanks, Vladimir > > Incremental: > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ > > Thanks, > /Erik > >> Thanks, >> Vladimir >> >>> >>> Here is an updated webrev with your concerns and Vladimir Ivanov's concerns addressed: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>> >>> Incremental: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>> >>> Thanks, >>> /Erik >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> In pd_alignment_required() you implicitly use knowledge that relocInfo::addr_unit() on x86 is 1. >>>>>> At least add comment about that. >>>>> >>>>> I can add a comment about that. >>>>> >>>>> New webrev: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>> >>>>> Incremental: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>> >>>>> Thanks, >>>>> /Erik >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> There is some platform-specific code in PhaseOutput that deals with the IntelJccErratum mitigation, >>>>>>> which is ifdef:ed in shared code. It should move to platform-specific code. >>>>>>> >>>>>>> This patch exposes the iteration state of PhaseOutput, which allows hiding the Intel-specific code >>>>>>> completely in x86-specific files. >>>>>>> >>>>>>> Webrev: >>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>> >>>>>>> Bug: >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>> >>>>>>> Thanks, >>>>>>> /Erik >>>>> >>> > From vladimir.kozlov at oracle.com Thu Apr 2 03:37:51 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 1 Apr 2020 20:37:51 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log In-Reply-To: References: Message-ID: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> Looks good. 
Thanks, Vladimir On 4/1/20 12:56 PM, Tom Rodriguez wrote: > http://cr.openjdk.java.net/~never/8191930/webrev > https://bugs.openjdk.java.net/browse/JDK-8191930 > > This was something that was fixed in 8 but never made it into 9+, I think because the code moved after 8. Tested by > forcing a bailout with the problematic string and inspecting the resulting xml. > > From nils.eliasson at oracle.com Thu Apr 2 09:28:45 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Thu, 2 Apr 2020 11:28:45 +0200 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: <869882d4-eb5a-d765-92d9-49cd389e3366@oracle.com> Hi Jatin, The patch is nice and clean. Reviewed. Best regards Nils Eliasson On 2020-04-01 20:23, Bhateja, Jatin wrote: > Hi Vladimir, > > Please find an updated unified patch at the following link. > > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > This removes the optimized NotV handling for AVX3; as suggested, it will be > brought in via the vectorIntrinsics branch. > > Thanks for your help in shaping up this patch; please let me know if there > are other comments. > > Best Regards, > Jatin > ________________________________________ > From: Bhateja, Jatin > Sent: Wednesday, March 25, 2020 12:14 PM > To: Vladimir Ivanov > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Vladimir, > > I have placed the updated patch at the following links: > > 1) Optimized NotV handling: > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > 2) Changes for MacroLogic opt: > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > > Kindly review and let me know your feedback. 
> > Thanks, > Jatin > >> -----Original Message----- >> From: Vladimir Ivanov >> Sent: Wednesday, March 25, 2020 12:33 AM >> To: Bhateja, Jatin >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction >> >> Hi Jatin, >> >> I tried to submit the patches for testing, but windows-x64 build failed with the >> following errors: >> >> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not >> evaluate to a constant >> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read >> of a variable outside its lifetime >> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >> ['function']' is not assignable >> >> Best regards, >> Vladimir Ivanov >> >> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>> Hi Vladimir, >>> >>> Thanks for your comments , I have split the original patch into two sub- >> patches. >>> 1) Optimized NotV handling: >>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>> >>> 2) Changes for MacroLogic opt: >>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>> >>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >> optimization. >>> Kindly review and let me know your feedback. >>> >>> Best Regards, >>> Jatin >>> >>>> -----Original Message----- >>>> From: Vladimir Ivanov >>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>> To: Bhateja, Jatin ; hotspot-compiler- >>>> dev at openjdk.java.net >>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>> Instruction >>>> >>>> >>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>> Very nice contribution, Jatin! >>>> >>>> Some comments after a brief review pass: >>>> >>>> * Please, contribute NotV part separately. >>>> >>>> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>> transformation during GVN instead? 
>>>> >>>> * As of now, vector nodes are only produced by SuperWord >>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>> API is available, it can be extended to cases when vector nodes are >>>> present >>>> (C->max_vector_size() > 0). >>>> >>>> * There are more efficient ways to produce a vector of all-1s [1] [2]. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> [1] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>> 1-efficiently >>>> >>>> [2] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>> value-to-all-one-bits >>>> >>>>> A new optimization pass has been added post Auto-Vectorization which >>>> folds expression tree involving vector boolean logic operations >>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>> Optimization pass has following stages: >>>>> >>>>> 1. Collection stage : >>>>> * This performs a DFS traversal over Ideal Graph and collects the root >>>> nodes of all vector logic expression trees. >>>>> 2. Processing stage: >>>>> * Performs a bottom up traversal over expression tree and >>>> simultaneously folds specific DAG patterns involving Boolean logic >>>> parent and child nodes. >>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. >>>>> * Folding is performed under a constraint on the total number of >> inputs >>>> which a MacroLogic node can have, in this case it's 3. >>>>> * A partition is created around a DAG pattern involving logic parent >> and >>>> one or two logic child node, it encapsulate the nodes in post-order fashion. 
>>>>> * This partition is then evaluated by traversing over the nodes, >> assigning >>>> boolean values to its inputs and performing operations over them >>>> based on its Opcode. Node along with its computed result is stored in >>>> a map which is accessed during the evaluation of its user/parent node. >>>>> * Post-evaluation a MacroLogic node is created which is equivalent to >> a >>>> three input truth-table. Expression tree leaf level inputs along with >>>> result of its evaluation are the inputs fed to this new node. >>>>> * Entire expression tree is eventually subsumed/replaced by newly >>>> create MacroLogic node. >>>>> >>>>> Following are the JMH benchmarks results with and without changes. >>>>> >>>>> Without Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 
thrpt 75.086 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s >>>>> >>>>> With Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s >>>>> >>>>> Please review the patch. 
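The three-input truth table that a MacroLogic node encapsulates (the imm8 operand of the AVX-512 vpternlog instruction) can be computed by evaluating the boolean expression bitwise over the standard input patterns 0xF0, 0xCC and 0xAA. A minimal sketch, with illustrative names that are not from the patch:

```java
public class MacroLogicTruthTable {
    // Standard per-bit input patterns: bit i of each constant gives the
    // value of inputs A, B, C for input combination i (0..7).
    static final int A = 0xF0, B = 0xCC, C = 0xAA;

    // imm8 encoding the function (A & B) | C, evaluated bitwise
    // over all 8 input combinations at once.
    static int immediateForAandBorC() {
        return ((A & B) | C) & 0xFF;
    }

    // The (XorV v, -1) -> (NotV v) rewrite relies on this scalar identity:
    static boolean xorAllOnesIsNot(int v) {
        return (v ^ -1) == ~v;
    }

    public static void main(String[] args) {
        System.out.printf("imm8 = 0x%02X%n", immediateForAandBorC()); // imm8 = 0xEA
        System.out.println(xorAllOnesIsNot(0x12345678));              // true
    }
}
```

The same scheme extends to any expression over three leaf inputs: substitute the pattern constants for the leaves, fold the operators, and the low 8 bits of the result are the truth-table immediate.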
>>>>> >>>>> Best Regards, >>>>> Jatin >>>>> >>>>> [1] Section 17.7 : >>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>> architectures-optimization-manual.pdf >>>>> From vladimir.x.ivanov at oracle.com Thu Apr 2 09:31:10 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 2 Apr 2020 12:31:10 +0300 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Looks good. Best regards, Vladimir Ivanov >>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>> concerns addressed: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>> >>> Incremental: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>> >>> Thanks, >>> /Erik >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>> At least add comment about that. >>>>> >>>>> I can add a comment about that. 
>>>>> >>>>> New webrev: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>> >>>>> Incremental: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>> >>>>> Thanks, >>>>> /Erik >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>> with the IntelJccErratum mitigation, >>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>> platform-specific code. >>>>>>> >>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>> allows hiding the Intel-specific code >>>>>>> completely in x86-specific files. >>>>>>> >>>>>>> Webrev: >>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>> >>>>>>> Bug: >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>> >>>>>>> Thanks, >>>>>>> /Erik >>>>> >>> > From erik.osterlund at oracle.com Thu Apr 2 09:36:57 2020 From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=) Date: Thu, 2 Apr 2020 11:36:57 +0200 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: <9d6a7ba1-3d0c-fdb2-1e79-01ae0e8058cf@oracle.com> Hi Vladimir, Thanks for the review. /Erik On 2020-04-02 04:57, Vladimir Kozlov wrote: > On 4/1/20 3:24 AM, Erik ?sterlund wrote: >> Hi Vladimir, >> >> On 2020-03-30 21:14, Vladimir Kozlov wrote: >>> But you at least can do static check at the beginning of method: >>> >>> int MachNode::pd_alignment_required() const { >>> ? if (VM_Version::has_intel_jcc_erratum()) { >>> ??? PhaseOutput* output = Compile::current()->output(); >>> ??? 
Block* block = output->block(); >>> ??? int index = output->index(); >>> ??? assert(output->mach() == this, "incorrect iterator state in >>> PhaseOutput"); >>> ??? if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { >>> ????? // Conservatively add worst case padding. We assume that >>> relocInfo::addr_unit() is 1 on x86. >>> ????? return IntelJccErratum::largest_jcc_size() + 1; >>> ??? } >>> ? } >>> ? return 1; >>> } >> >> That is equivalent to the compiler. I verified that by disassembling >> the release bits before >> and after your suggestion, and it is instruction by instruction the >> same. In both cases it >> first checks ifVM_Version::has_intel_jcc_erratum(), and if not, >> returns before even building >> a frame. I'd rather keep the not nested variant because it is >> equivalent, yet easier to read. > > I have reservation about this statement which may not true for all C++ > compilers we use but I will not insist on refactoring it. > >> >>>> >>>>> In compute_padding() reads done under check so I have less >>>>> concerns about it. But I also don't get why you use saved _mach >>>>> instead of using MachNode 'this'. >>>> >>>> Good point. I changed to this + an assert checking that they are >>>> indeed the same. >>> >>> Why do you need Output._mach at all if you use it only in this >>> assert? Even logically it looks strange. In what case it could be >>> different? >> >> It should never be different; that was the point. The index and mach >> node exposed by the >> iterator are related and refer to the same entity. So if you use the >> exposed index in code >> in a mach node, you must know that this mach node is the same mach >> node that the index refers >> to, and it is. 
The assert was meant to enforce it so that if you were >> to call either the >> alignment or padding function in a new context, for whatever reason, >> and don't happen to know >> that you can't do that without having a consistent iteration state, >> you would immediately catch >> that in the assertions, instead of getting strange silent logic errors. >> >> Having said that, I am okay with removing _mach if you prefer having >> one seat belt less, it is up to you: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ > > Okay. Good. > > Thanks, > Vladimir > >> >> Incremental: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ >> >> Thanks, >> /Erik >> >>> Thanks, >>> Vladimir >>> >>>> >>>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>>> concerns addressed: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>>> >>>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>>> At least add comment about that. >>>>>> >>>>>> I can add a comment about that. >>>>>> >>>>>> New webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>>> >>>>>> Incremental: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>>>> >>>>>>> Thanks, >>>>>>> Vladimir >>>>>>> >>>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>>> with the IntelJccErratum mitigation, >>>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>>> platform-specific code. >>>>>>>> >>>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>>> allows hiding the Intel-specific code >>>>>>>> completely in x86-specific files. 
>>>>>>>> >>>>>>>> Webrev: >>>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>>> >>>>>>>> Bug: >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> /Erik >>>>>> >>>> >> From erik.osterlund at oracle.com Thu Apr 2 09:37:09 2020 From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=) Date: Thu, 2 Apr 2020 09:37:09 +0000 (UTC) Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> Message-ID: <66550012-d164-855f-7d45-087a1151cc0a@oracle.com> Hi Vladimir, Thanks for the review. /Erik On 2020-04-02 11:31, Vladimir Ivanov wrote: > >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ > > Looks good. > > Best regards, > Vladimir Ivanov > >>>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>>> concerns addressed: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>>> >>>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>>> At least add comment about that. >>>>>> >>>>>> I can add a comment about that. 
>>>>>> >>>>>> New webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>>> >>>>>> Incremental: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>>>> >>>>>>> Thanks, >>>>>>> Vladimir >>>>>>> >>>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>>> with the IntelJccErratum mitigation, >>>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>>> platform-specific code. >>>>>>>> >>>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>>> allows hiding the Intel-specific code >>>>>>>> completely in x86-specific files. >>>>>>>> >>>>>>>> Webrev: >>>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>>> >>>>>>>> Bug: >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> /Erik >>>>>> >>>> >> From vladimir.x.ivanov at oracle.com Thu Apr 2 10:14:53 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 2 Apr 2020 13:14:53 +0300 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> >> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > Looks good. I'll submit it for testing. Test results are clean. Best regards, Vladimir Ivanov >> This removes Optimized NotV handling for AVX3, as suggested it will be >> brought via vectorIntrinsics branch. >> >> Thanks for your help in shaping up this patch, please let me know if >> there >> are other comments. 
>> >> Best Regards, >> Jatin >> ________________________________________ >> From: Bhateja, Jatin >> Sent: Wednesday, March 25, 2020 12:14 PM >> To: Vladimir Ivanov >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >> Instruction >> >> Hi Vladimir, >> >> I have placed updated patch at following links:- >> >> ? 1)? Optimized NotV handling: >> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >> >> ? 2)? Changes for MacroLogic opt: >> ? http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ >> >> Kindly review and let me know your feedback. >> >> Thanks, >> Jatin >> >>> -----Original Message----- >>> From: Vladimir Ivanov >>> Sent: Wednesday, March 25, 2020 12:33 AM >>> To: Bhateja, Jatin >>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >>> >>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>> Instruction >>> >>> Hi Jatin, >>> >>> I tried to submit the patches for testing, but windows-x64 build >>> failed with the >>> following errors: >>> >>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did >>> not >>> evaluate to a constant >>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by >>> a read >>> of a variable outside its lifetime >>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >>> ['function']' is not assignable >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>>> Hi Vladimir, >>>> >>>> Thanks for your comments , I have split the original patch into two >>>> sub- >>> patches. >>>> >>>> 1)? Optimized NotV handling: >>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>>> >>>> 2)? 
Changes for MacroLogic opt: >>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>>> >>>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >>> optimization. >>>> >>>> Kindly review and let me know your feedback. >>>> >>>> Best Regards, >>>> Jatin >>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>>> To: Bhateja, Jatin ; hotspot-compiler- >>>>> dev at openjdk.java.net >>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>>> Instruction >>>>> >>>>> >>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>>> >>>>> Very nice contribution, Jatin! >>>>> >>>>> Some comments after a brief review pass: >>>>> >>>>> ???? * Please, contribute NotV part separately. >>>>> >>>>> ???? * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>>> transformation during GVN instead? >>>>> >>>>> ???? * As of now, vector nodes are only produced by SuperWord >>>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>>> API is available, it can be extended to cases when vector nodes are >>>>> present >>>>> (C->max_vector_size() > 0). >>>>> >>>>> ???? * There are more efficient ways to produce a vector of all-1s >>>>> [1] [2]. 
>>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> [1] >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>>> 1-efficiently >>>>> >>>>> [2] >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>>> value-to-all-one-bits >>>>> >>>>>> >>>>>> A new optimization pass has been added post Auto-Vectorization which >>>>> folds expression tree involving vector boolean logic operations >>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>>> Optimization pass has following stages: >>>>>> >>>>>> ???? 1.? Collection stage : >>>>>> ??????? *?? This performs a DFS traversal over Ideal Graph and >>>>>> collects the root >>>>> nodes of all vector logic expression trees. >>>>>> ???? 2.? Processing stage: >>>>>> ??????? *?? Performs a bottom up traversal over expression tree and >>>>> simultaneously folds specific DAG patterns involving Boolean logic >>>>> parent and child nodes. >>>>>> ??????? *?? Transforms (XORV INP , -1) -> (NOTV INP) to promote >>>>>> logic folding. >>>>>> ??????? *?? Folding is performed under a constraint on the total >>>>>> number of >>> inputs >>>>> which a MacroLogic node can have, in this case it's 3. >>>>>> ??????? *?? A partition is created around a DAG pattern involving >>>>>> logic parent >>> and >>>>> one or two logic child node, it encapsulate the nodes in post-order >>>>> fashion. >>>>>> ??????? *?? This partition is then evaluated by traversing over >>>>>> the nodes, >>> assigning >>>>> boolean values to its inputs and performing operations over them >>>>> based on its Opcode. Node along with its computed result is stored in >>>>> a map which is accessed during the evaluation of its user/parent node. >>>>>> ??????? *?? 
Post-evaluation a MacroLogic node is created which is >>>>>> equivalent to >>> a >>>>> three input truth-table. Expression tree leaf level inputs along with >>>>> result of its evaluation are the inputs fed to this new node. >>>>>> ??????? *?? Entire expression tree is eventually subsumed/replaced >>>>>> by newly >>>>> create MacroLogic node. >>>>>> >>>>>> >>>>>> Following are the JMH benchmarks results with and without changes. >>>>>> >>>>>> Without Changes: >>>>>> >>>>>> Benchmark??????????????????????????? (VECLEN)?? Mode? Cnt >>>>>> Score?? Error? Units >>>>>> MacroLogicOpt.workload1_caller???????????? 64? thrpt >>>>>> 2904.480????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 128? thrpt >>>>>> 2219.252????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 256? thrpt >>>>>> 1507.267????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 512? thrpt >>>>>> 860.926????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 1024? thrpt >>>>>> 470.163????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 2048? thrpt >>>>>> 246.608????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 4096? thrpt >>>>>> 108.031????????? ops/s >>>>>> MacroLogicOpt.workload2_caller???????????? 64? thrpt >>>>>> 344.633????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 128? thrpt >>>>>> 209.818????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 256? thrpt >>>>>> 111.678????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 512? thrpt >>>>>> 53.360????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 1024? thrpt >>>>>> 27.888????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 2048? thrpt >>>>>> 12.103????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 4096? thrpt >>>>>> 6.018????????? ops/s >>>>>> MacroLogicOpt.workload3_caller???????????? 64? thrpt >>>>>> 3110.669????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 128? thrpt >>>>>> 1996.861????????? 
ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 256? thrpt >>>>>> 870.166????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 512? thrpt >>>>>> 389.629????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 1024? thrpt >>>>>> 151.203????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 2048? thrpt >>>>>> 75.086????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 4096? thrpt >>>>>> 37.576????????? ops/s >>>>>> >>>>>> With Changes: >>>>>> >>>>>> Benchmark??????????????????????????? (VECLEN)?? Mode? Cnt >>>>>> Score?? Error? Units >>>>>> MacroLogicOpt.workload1_caller???????????? 64? thrpt >>>>>> 3306.670????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 128? thrpt >>>>>> 2936.851????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 256? thrpt >>>>>> 2413.827????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 512? thrpt >>>>>> 1440.291????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 1024? thrpt >>>>>> 707.576????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 2048? thrpt >>>>>> 384.863????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 4096? thrpt >>>>>> 132.753????????? ops/s >>>>>> MacroLogicOpt.workload2_caller???????????? 64? thrpt >>>>>> 450.856????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 128? thrpt >>>>>> 323.925????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 256? thrpt >>>>>> 135.191????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 512? thrpt >>>>>> 69.424????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 1024? thrpt >>>>>> 35.744????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 2048? thrpt >>>>>> 14.168????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 4096? thrpt >>>>>> 7.245????????? ops/s >>>>>> MacroLogicOpt.workload3_caller???????????? 64? thrpt >>>>>> 3333.550????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 128? 
thrpt >>>>>> 2269.428????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 256? thrpt >>>>>> 995.691????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 512? thrpt >>>>>> 412.452????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 1024? thrpt >>>>>> 151.157????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 2048? thrpt >>>>>> 75.079????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 4096? thrpt >>>>>> 37.158????????? ops/s >>>>>> >>>>>> Please review the patch. >>>>>> >>>>>> Best Regards, >>>>>> Jatin >>>>>> >>>>>> [1] Section 17.7 : >>>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>>> architectures-optimization-manual.pdf >>>>>> From rwestrel at redhat.com Thu Apr 2 14:14:04 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:14:04 +0200 Subject: RFR(S): 8239072: subtype check macro node causes node budget to be exhausted In-Reply-To: <736f1832-b44c-162d-35fb-fbad07a84c39@oracle.com> References: <87d09llldp.fsf@redhat.com> <62ef48e0-fae8-38cc-7a48-2deb0f054cdd@oracle.com> <87v9nbjilg.fsf@redhat.com> <63a6f167-1d5b-7624-b4e6-0f2b89707b00@oracle.com> <3d3cc6d7-6b6c-ebbc-d28e-7350c50c5f58@oracle.com> <875zekewj7.fsf@redhat.com> <736f1832-b44c-162d-35fb-fbad07a84c39@oracle.com> Message-ID: <87mu7uc6vn.fsf@redhat.com> Thanks for the reviews, Vladimir & Vladimir. Roland. 
From HORIE at jp.ibm.com Thu Apr 2 14:27:10 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Thu, 2 Apr 2020 23:27:10 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> References: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, I'm not a reviewer, but I can run your benchmark in my local P9 node if you share it. Best regards, Michihiro ----- Original message ----- From: Corey Ashford Sent by: "hotspot-compiler-dev" To: hotspot-compiler-dev at openjdk.java.net Cc: Subject: [EXTERNAL] RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Tue, Mar 31, 2020 7:52 AM Hello, This is my first OpenJDK patch for review. It increases the performance of byte reversal for Integer.reverseBytes() and Long.reverseBytes() on Power9 via its VSX xxbrw and xxbrd vector instructions. https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.openjdk.java.net_browse_JDK-2D8241874&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=oecsIpYF-cifqq2i1JEH0Q&m=Q0ug0imG7nRw-N8m1U0RobPS3M9D2mmT8nY3GnID3io&s=TXqhnYzhTVyILKGJBOpWSmqe-iP6ixmCAqwxYT19K8E&e= https://urldefense.proofpoint.com/v2/url?u=http-3A__cr.openjdk.java.net_-7Egromero_8241874_v1_&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=oecsIpYF-cifqq2i1JEH0Q&m=Q0ug0imG7nRw-N8m1U0RobPS3M9D2mmT8nY3GnID3io&s=1elFXKQoR_CB9mG6g4TM0z5-Da27XveB77RBXKwQi3I&e= I have tested on Power9 and see a 38%+ performance improvement on Long.reverseBytes() and 15%+ on Integer.reverseBytes(). (I add the + because the benchmark code has a fair amount of fixed overhead). Testing on Power8 reveals no regressions. I believe the patch itself is pretty self-explanatory. It adds definitions for four instructions that are needed to get the data in and out of the vector registers, and to perform the reversal operation, and it adds the instructs to use them.
Also VM_Version::initialize() autodetects that the instructions are available, and warns for trying to set the UseVectorByteReverseInstructionsPPC64 flag on earlier Power processors that don't possess these PowerISA 3.0 instructions. Thanks to Michihiro Horie, Jose Ricardo Ziviani, and Gustav Romero for their help! Please review this patch. Thanks for your consideration, Corey Ashford From rwestrel at redhat.com Thu Apr 2 14:35:42 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:35:42 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: <87tv2ef536.fsf@redhat.com> References: <87tv2ef536.fsf@redhat.com> Message-ID: <87k12yc5vl.fsf@redhat.com> > http://cr.openjdk.java.net/~roland/8241041/webrev.00/ Anyone else for this? Roland. From rwestrel at redhat.com Thu Apr 2 14:36:29 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:36:29 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: <874kubfked.fsf@redhat.com> References: <874kubfked.fsf@redhat.com> Message-ID: <87h7y2c5ua.fsf@redhat.com> > This is required to backport 8237086 (assert(is_MachReturn()) running > CTW with fix for JDK-8231291). > > Original bug: > https://bugs.openjdk.java.net/browse/JDK-8217230 > http://hg.openjdk.java.net/jdk/jdk12/rev/1b292ae4eb50 > > Original patch does not apply cleanly to 11u because context changed in > compile.hpp. Patch is otherwise identical. > > 11u webrev: > http://cr.openjdk.java.net/~roland/8217230.11u/webrev.00/ > > Testing: x86_64 build, tier1 + tier2 Anyone for this review? Roland. 
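A side note on the reverseBytes thread above: the Power9 xxbrd instruction performs the same byte reversal that Long.reverseBytes() specifies, which can also be written out with plain shifts and masks. The sketch below is illustrative only (the class name is made up); it is not the intrinsic or the proposed patch, just the scalar dataflow the single vector instruction replaces:

```java
class ReverseBytesSketch {
    // Byte-swap a 64-bit value with shifts and masks (what xxbrd does in one instruction).
    static long reverse64(long v) {
        // Swap adjacent bytes within each 16-bit unit.
        v = ((v & 0x00FF00FF00FF00FFL) << 8)  | ((v >>> 8)  & 0x00FF00FF00FF00FFL);
        // Swap 16-bit units within each 32-bit word.
        v = ((v & 0x0000FFFF0000FFFFL) << 16) | ((v >>> 16) & 0x0000FFFF0000FFFFL);
        // Swap the two 32-bit halves.
        return (v << 32) | (v >>> 32);
    }

    public static void main(String[] args) {
        long x = 0x1122334455667788L;
        System.out.printf("%016x%n", ReverseBytesSketch.reverse64(x));      // prints 8877665544332211
        System.out.println(ReverseBytesSketch.reverse64(x) == Long.reverseBytes(x)); // prints true
    }
}
```

Each of the three steps halves the distance bytes still have to travel — adjacent bytes, then 16-bit units, then 32-bit halves — which is the log2(8)-step shuffle that a dedicated byte-reverse instruction collapses into one operation.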
From jatin.bhateja at intel.com Thu Apr 2 17:09:16 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Thu, 2 Apr 2020 17:09:16 +0000 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> Message-ID: Thanks Nils , Vladimir. Changes have been pushed. http://hg.openjdk.java.net/jdk/jdk/rev/29d878d3af35 Best Regards, Jatin > -----Original Message----- > From: Vladimir Ivanov > Sent: Thursday, April 2, 2020 3:45 PM > To: Bhateja, Jatin > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > > Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > > >> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > > > Looks good. I'll submit it for testing. > > Test results are clean. > > Best regards, > Vladimir Ivanov > > >> This removes Optimized NotV handling for AVX3, as suggested it will > >> be brought via vectorIntrinsics branch. > >> > >> Thanks for your help in shaping up this patch, please let me know if > >> there are other comments. > >> > >> Best Regards, > >> Jatin > >> ________________________________________ > >> From: Bhateja, Jatin > >> Sent: Wednesday, March 25, 2020 12:14 PM > >> To: Vladimir Ivanov > >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > >> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >> Instruction > >> > >> Hi Vladimir, > >> > >> I have placed updated patch at following links:- > >> > >> ? 1)? Optimized NotV handling: > >> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > >> > >> ? 2)? Changes for MacroLogic opt: > >> ? http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > >> > >> Kindly review and let me know your feedback. 
> >> > >> Thanks, > >> Jatin > >> > >>> -----Original Message----- > >>> From: Vladimir Ivanov > >>> Sent: Wednesday, March 25, 2020 12:33 AM > >>> To: Bhateja, Jatin > >>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > >>> > >>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >>> Instruction > >>> > >>> Hi Jatin, > >>> > >>> I tried to submit the patches for testing, but windows-x64 build > >>> failed with the following errors: > >>> > >>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression > >>> did not evaluate to a constant > >>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused > >>> by a read of a variable outside its lifetime > >>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' > >>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type > >>> 'int ['function']' is not assignable > >>> > >>> Best regards, > >>> Vladimir Ivanov > >>> > >>> On 24.03.2020 10:34, Bhateja, Jatin wrote: > >>>> Hi Vladimir, > >>>> > >>>> Thanks for your comments , I have split the original patch into two > >>>> sub- > >>> patches. > >>>> > >>>> 1)? Optimized NotV handling: > >>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > >>>> > >>>> 2)? Changes for MacroLogic opt: > >>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ > >>>> > >>>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic > >>> optimization. > >>>> > >>>> Kindly review and let me know your feedback. > >>>> > >>>> Best Regards, > >>>> Jatin > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Ivanov > >>>>> Sent: Tuesday, March 17, 2020 4:31 PM > >>>>> To: Bhateja, Jatin ; hotspot-compiler- > >>>>> dev at openjdk.java.net > >>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >>>>> Instruction > >>>>> > >>>>> > >>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ > >>>>> > >>>>> Very nice contribution, Jatin! 
> >>>>> > >>>>> Some comments after a brief review pass: > >>>>> > >>>>> ???? * Please, contribute NotV part separately. > >>>>> > >>>>> ???? * Why don't you perform (XorV v 0xFF..FF) => (NotV v) > >>>>> transformation during GVN instead? > >>>>> > >>>>> ???? * As of now, vector nodes are only produced by SuperWord > >>>>> analysis. It makes sense to limit new optimization pass to > >>>>> SuperWord pass only (probably, introduce a new dedicated Phase ). > >>>>> Once Vector API is available, it can be extended to cases when > >>>>> vector nodes are present > >>>>> (C->max_vector_size() > 0). > >>>>> > >>>>> ???? * There are more efficient ways to produce a vector of all-1s > >>>>> [1] [2]. > >>>>> > >>>>> Best regards, > >>>>> Vladimir Ivanov > >>>>> > >>>>> [1] > >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45 > >>>>> 105 > >>>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3Dg > >>>>> Jgc qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ > >>>>> 1-efficiently > >>>>> > >>>>> [2] > >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37 > >>>>> 469 > >>>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG > >>>>> QTI _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ > >>>>> value-to-all-one-bits > >>>>> > >>>>>> > >>>>>> A new optimization pass has been added post Auto-Vectorization > >>>>>> which > >>>>> folds expression tree involving vector boolean logic operations > >>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. > >>>>>> Optimization pass has following stages: > >>>>>> > >>>>>> ???? 1.? Collection stage : > >>>>>> ??????? *?? This performs a DFS traversal over Ideal Graph and > >>>>>> collects the root > >>>>> nodes of all vector logic expression trees. > >>>>>> ???? 2.? Processing stage: > >>>>>> ??????? *?? 
Performs a bottom up traversal over expression tree and simultaneously folds specific DAG patterns involving Boolean logic parent and child nodes.
> >>>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding.
> >>>>>> * Folding is performed under a constraint on the total number of inputs which a MacroLogic node can have, in this case it's 3.
> >>>>>> * A partition is created around a DAG pattern involving logic parent and one or two logic child node, it encapsulate the nodes in post-order fashion.
> >>>>>> * This partition is then evaluated by traversing over the nodes, assigning boolean values to its inputs and performing operations over them based on its Opcode. Node along with its computed result is stored in a map which is accessed during the evaluation of its user/parent node.
> >>>>>> * Post-evaluation a MacroLogic node is created which is equivalent to a three input truth-table. Expression tree leaf level inputs along with result of its evaluation are the inputs fed to this new node.
> >>>>>> * Entire expression tree is eventually subsumed/replaced by newly create MacroLogic node.
> >>>>>>
> >>>>>> Following are the JMH benchmarks results with and without changes.
> >>>>>>
> >>>>>> Without Changes:
> >>>>>>
> >>>>>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller             64  thrpt        2904.480  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            128  thrpt        2219.252  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            256  thrpt        1507.267  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            512  thrpt         860.926  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           1024  thrpt         470.163  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           2048  thrpt         246.608  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           4096  thrpt         108.031  ops/s
> >>>>>> MacroLogicOpt.workload2_caller             64  thrpt         344.633  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            128  thrpt         209.818  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            256  thrpt         111.678  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            512  thrpt          53.360  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           1024  thrpt          27.888  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           2048  thrpt          12.103  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           4096  thrpt           6.018  ops/s
> >>>>>> MacroLogicOpt.workload3_caller             64  thrpt        3110.669  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            128  thrpt        1996.861  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            256  thrpt         870.166  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            512  thrpt         389.629  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           1024  thrpt         151.203  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           2048  thrpt          75.086  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           4096  thrpt          37.576  ops/s
> >>>>>>
> >>>>>> With Changes:
> >>>>>>
> >>>>>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller             64  thrpt        3306.670  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            128  thrpt        2936.851  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            256  thrpt        2413.827  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            512  thrpt        1440.291  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           1024  thrpt         707.576  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           2048  thrpt         384.863  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           4096  thrpt         132.753  ops/s
> >>>>>> MacroLogicOpt.workload2_caller             64  thrpt         450.856  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            128  thrpt         323.925  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            256  thrpt         135.191  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            512  thrpt          69.424  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           1024  thrpt          35.744  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           2048  thrpt          14.168  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           4096  thrpt           7.245  ops/s
> >>>>>> MacroLogicOpt.workload3_caller             64  thrpt        3333.550  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            128  thrpt        2269.428  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            256  thrpt         995.691  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            512  thrpt         412.452  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           1024  thrpt         151.157  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           2048  thrpt          75.079  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           4096  thrpt          37.158  ops/s
> >>>>>>
> >>>>>> Please review the patch.
> >>>>>> > >>>>>> Best Regards, > >>>>>> Jatin > >>>>>> > >>>>>> [1] Section 17.7 : > >>>>>> https://urldefense.com/v3/__https://software.intel.com/sites/defa > >>>>>> ult > >>>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqf > >>>>>> llG QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ > >>>>>> architectures-optimization-manual.pdf > >>>>>> From tom.rodriguez at oracle.com Thu Apr 2 17:58:09 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Thu, 2 Apr 2020 10:58:09 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log In-Reply-To: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> References: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> Message-ID: <51121a1b-8c2a-dbf5-286f-a7815fac064b@oracle.com> Thanks! tom Vladimir Kozlov wrote on 4/1/20 8:37 PM: > Looks good. > > Thanks, > Vladimir > > On 4/1/20 12:56 PM, Tom Rodriguez wrote: >> http://cr.openjdk.java.net/~never/8191930/webrev >> https://bugs.openjdk.java.net/browse/JDK-8191930 >> >> This was something that was fixed in 8 but never made it into 9+ I >> think because the code moved after 8.? Tested by forcing a bailout >> with the problematic string and inspecting the resulting xml. >> >> From tom.rodriguez at oracle.com Thu Apr 2 19:12:39 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Thu, 2 Apr 2020 12:12:39 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives Message-ID: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> http://cr.openjdk.java.net/~never/8231756/webrev https://bugs.openjdk.java.net/browse/JDK-8231756 This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new unit tests exercise the deoptimization. mach5 testing is in progress. 
tom From nils.eliasson at oracle.com Thu Apr 2 19:37:56 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Thu, 2 Apr 2020 21:37:56 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: <87k12yc5vl.fsf@redhat.com> References: <87tv2ef536.fsf@redhat.com> <87k12yc5vl.fsf@redhat.com> Message-ID: Looks good! Review. Best regards, Nils Eliasson On 2020-04-02 16:35, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8241041/webrev.00/ > Anyone else for this? > > Roland. > From cjashfor at linux.ibm.com Thu Apr 2 23:07:31 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Thu, 2 Apr 2020 16:07:31 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine.

Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong
{
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) + " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt
{
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) + " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From ningsheng.jian at arm.com Fri Apr 3 02:41:04 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 10:41:04 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: Message-ID: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Hi Pengfei, On 3/31/20 5:32 PM, Pengfei Li wrote: > Hi, > > Please help review this another missing node support for AArch64.
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 > Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ > Just took a close look before pushing your code, and I think this line can be removed? + effect(TEMP_DEF dst); Thanks, Ningsheng From Pengfei.Li at arm.com Fri Apr 3 05:48:05 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Fri, 3 Apr 2020 05:48:05 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <110347ce-0629-c5ff-d072-080094570f09@arm.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: Hi, > Just took a close look before pushing your code, and I think this line can be > removed? > > + effect(TEMP_DEF dst); Yes, thanks for pointing out. It is redundant since I don't use temps this time. I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ -- Thanks, Pengfei From shade at redhat.com Fri Apr 3 07:30:24 2020 From: shade at redhat.com (Aleksey Shipilev) Date: Fri, 3 Apr 2020 09:30:24 +0200 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 Message-ID: Build bug: https://bugs.openjdk.java.net/browse/JDK-8242073 immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from x86_64.ad helps. 
Matched the operand block order and internal format of x86_32.ad with this patch:

diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad
--- a/src/hotspot/cpu/x86/x86_32.ad	Fri Apr 03 07:27:53 2020 +0100
+++ b/src/hotspot/cpu/x86/x86_32.ad	Fri Apr 03 09:29:33 2020 +0200
@@ -3367,10 +3367,19 @@
   op_cost(5);
   format %{ %}
   interface(CONST_INTER);
 %}
 
+operand immU8() %{
+  predicate((0 <= n->get_int()) && (n->get_int() <= 255));
+  match(ConI);
+
+  op_cost(5);
+  format %{ %}
+  interface(CONST_INTER);
+%}
+
 operand immI16() %{
   predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767));
   match(ConI);
 
   op_cost(10);

Testing: x86_32 build -- Thanks, -Aleksey From rwestrel at redhat.com Fri Apr 3 07:51:37 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 03 Apr 2020 09:51:37 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: References: <87tv2ef536.fsf@redhat.com> <87k12yc5vl.fsf@redhat.com> Message-ID: <87eet5c8hi.fsf@redhat.com> Thanks for the review, Nils! Roland. From manc at google.com Fri Apr 3 08:42:53 2020 From: manc at google.com (Man Cao) Date: Fri, 3 Apr 2020 01:42:53 -0700 Subject: RFR(S): 8241556: Memory leak if -XX:CompileCommand is set In-Reply-To: References: Message-ID: Thanks for the reviews! -Man From rwestrel at redhat.com Fri Apr 3 08:55:10 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 03 Apr 2020 10:55:10 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost Message-ID: <878sjdc5jl.fsf@redhat.com> http://cr.openjdk.java.net/~roland/8241900/webrev.00/ When a loop is unswitched, the now redundant test in the loop bodies is changed so it always fails or succeeds. Data nodes that are control dependent on the test become control dependent on the dominating control. In the test case: 1) the loop is unswitched once.
The test that's hoisted is:

if (o3 != null) {

2) the loop is unswitched a second time. This time, the hoisted test is:

if (o != null) {

3) that test has a control dependent CastPP. That CastPP becomes dependent on the dominating test:

if (o2 == null) {

that test never fails so it's compiled as a test + uncommon trap

4) partial peeling is applied

The chain of tests is now:

if (array[1] != null) { // hoisted o3 != null by unswitching
  if (objectField != null) { // hoisted o != null by unswitching
    if (array[1] != null) { // peeled o2 == null
      // CastPP on objectField is here

5) because the 3rd test is identical to the first one this becomes:

if (array[1] != null) { // hoisted o3 != null by unswitching
  // CastPP on objectField is here
  if (objectField != null) { // hoisted o != null by unswitching

So the CastPP bypasses the null check on its input and so a dependent load can flow above the null check. The fix I propose is to keep the dependence on the hoisted test on loop unswitching by using dominated_by() instead of short_circuit_if(). This way on steps 2) and 3) above, the CastPP is made dependent on the hoisted test so reordering of the CastPP with its null check can't happen. Roland. From aph at redhat.com Fri Apr 3 08:56:34 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 3 Apr 2020 09:56:34 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: On 4/3/20 6:48 AM, Pengfei Li wrote: > Yes, thanks for pointing out. It is redundant since I don't use temps this time. > > I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ Please push. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From ningsheng.jian at arm.com Fri Apr 3 09:11:15 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 17:11:15 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: <6c0bcfbd-118c-3fa7-96f7-7e832314a05c@arm.com> On 4/3/20 4:56 PM, Andrew Haley wrote: > On 4/3/20 6:48 AM, Pengfei Li wrote: >> Yes, thanks for pointing out. It is redundant since I don't use temps this time. >> >> I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ > > Please push. > Pushed. Thanks, Ningsheng From adinn at redhat.com Fri Apr 3 09:13:40 2020 From: adinn at redhat.com (Andrew Dinn) Date: Fri, 3 Apr 2020 10:13:40 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <110347ce-0629-c5ff-d072-080094570f09@arm.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: On 03/04/2020 03:41, Ningsheng Jian wrote: > Hi Pengfei, > > On 3/31/20 5:32 PM, Pengfei Li wrote: >> Hi, >> >> Please help review this another missing node support for AArch64. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >> > > Just took a close look before pushing your code, and I think this line > can be removed? > > +? effect(TEMP_DEF dst); Strictly, I think this is correct but I don't think it matters. I believe this usage is meant to identify a case where a generated multi-instruction sequence uses the output register (i.e. dst = target of Set) both as an output in the final instruction and as an intermediate scratch register in intervening instructions. That is the case for both these rules. 
The only way that might make a difference is if the back end were able to interleave instructions in other generated sequences with the instructions generated by this rule during instruction scheduling (or, say, via peephole rules). However, I don't believe that can happen given the current adlc code and AArch64 rules. n.b. there are several other examples of TEMP_DEF use in aarch64.ad. I am not sure that they are the only ones where a dst register is used as both output and intermediary (we will only find out by carefully eyeballing every rule). regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill From aph at redhat.com Fri Apr 3 09:22:30 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 3 Apr 2020 10:22:30 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> On 4/3/20 10:13 AM, Andrew Dinn wrote: > On 03/04/2020 03:41, Ningsheng Jian wrote: >> Hi Pengfei, >> >> On 3/31/20 5:32 PM, Pengfei Li wrote: >>> Hi, >>> >>> Please help review this another missing node support for AArch64. >>> >>> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >>> >> >> Just took a close look before pushing your code, and I think this line >> can be removed? >> >> +  effect(TEMP_DEF dst); > Strictly, I think this is correct but I don't think it matters. > > I believe this usage is meant to identify a case where a generated > multi-instruction sequence uses the output register (i.e. dst = target > of Set) both as an output in the final instruction and as an > intermediate scratch register in intervening instructions. That is the > case for both these rules.
More simply, it prevents the situation where the same register is used as both an output and an input. With these patterns that doesn't matter. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vladimir.x.ivanov at oracle.com Fri Apr 3 09:27:02 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 3 Apr 2020 12:27:02 +0300 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 In-Reply-To: References: Message-ID: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> Looks good and trivial. Best regards, Vladimir Ivanov On 03.04.2020 10:30, Aleksey Shipilev wrote:
> Build bug:
> https://bugs.openjdk.java.net/browse/JDK-8242073
>
> immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from
> x86_64.ad helps. Matched the operand block order and internal format of x86_32.ad with this patch:
>
> diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad
> --- a/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 07:27:53 2020 +0100
> +++ b/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 09:29:33 2020 +0200
> @@ -3367,10 +3367,19 @@
>    op_cost(5);
>    format %{ %}
>    interface(CONST_INTER);
>  %}
>
> +operand immU8() %{
> +  predicate((0 <= n->get_int()) && (n->get_int() <= 255));
> +  match(ConI);
> +
> +  op_cost(5);
> +  format %{ %}
> +  interface(CONST_INTER);
> +%}
> +
>  operand immI16() %{
>    predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767));
>    match(ConI);
>
>    op_cost(10);
>
> Testing: x86_32 build
>
From shade at redhat.com Fri Apr 3 09:43:20 2020 From: shade at redhat.com (Aleksey Shipilev) Date: Fri, 3 Apr 2020 11:43:20 +0200 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 In-Reply-To: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> References: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> Message-ID: <8d690500-67db-8c9d-424d-a836f9d49a61@redhat.com> Thanks, pushed.
On 4/3/20 11:27 AM, Vladimir Ivanov wrote: > Looks good and trivial. > > Best regards, > Vladimir Ivanov > > On 03.04.2020 10:30, Aleksey Shipilev wrote: >> Build bug: >> https://bugs.openjdk.java.net/browse/JDK-8242073 >> >> immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from >> x86_64.ad helps. Matched the operand block order and internal format of x86_32.ad with this patch: >> >> diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad >> --- a/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 07:27:53 2020 +0100 >> +++ b/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 09:29:33 2020 +0200 >> @@ -3367,10 +3367,19 @@ >> op_cost(5); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> +operand immU8() %{ >> + predicate((0 <= n->get_int()) && (n->get_int() <= 255)); >> + match(ConI); >> + >> + op_cost(5); >> + format %{ %} >> + interface(CONST_INTER); >> +%} >> + >> operand immI16() %{ >> predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767)); >> match(ConI); >> >> op_cost(10); >> >> Testing: x86_32 build >> > -- Thanks, -Aleksey From ningsheng.jian at arm.com Fri Apr 3 10:00:38 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 18:00:38 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> Message-ID: <34dcff53-5afc-29c2-6086-e0d66882026c@arm.com> On 4/3/20 5:22 PM, Andrew Haley wrote: > On 4/3/20 10:13 AM, Andrew Dinn wrote: >> On 03/04/2020 03:41, Ningsheng Jian wrote: >>> Hi Pengfei, >>> >>> On 3/31/20 5:32 PM, Pengfei Li wrote: >>>> Hi, >>>> >>>> Please help review this another missing node support for AArch64. 
>>>> >>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >>>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >>>> >>> >>> Just took a close look before pushing your code, and I think this line >>> can be removed? >>> >>> + effect(TEMP_DEF dst); >> Strictly, I think this is correct but I don't think it matters. >> >> I believe this usage is meant to identify a case where a generated >> multi-instruction sequence uses the output register (i.e. dst = target >> of Set) both as an output in the final instruction and as an >> intermediate scratch register in intervening instructions. That is the >> case for both these rules. > > More simply, it prevents the situation where the same register is used as both > an output and an input. With these patterns that doesn't matter. > Yeah, in this code block dst and src don't need to be different regs. Thanks, Ningsheng From Yang.Zhang at arm.com Fri Apr 3 10:49:06 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 3 Apr 2020 10:49:06 +0000 Subject: RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I Message-ID: Hi, Could you please help to review this patch? In the original reduce_add2I, dst may be the same as tmp2, which may produce an incorrect result. The code format of some reduction operation instructs is also cleaned up.
JBS: https://bugs.openjdk.java.net/browse/JDK-8241911 Webrev: http://cr.openjdk.java.net/~yzhang/8241911/webrev.00/ Regards Yang From tobias.hartmann at oracle.com Fri Apr 3 11:21:25 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 13:21:25 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR Message-ID: Hi, please review the following patch that removes some dead code: https://bugs.openjdk.java.net/browse/JDK-8242090 http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ Thanks, Tobias From claes.redestad at oracle.com Fri Apr 3 11:54:50 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Fri, 3 Apr 2020 13:54:50 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: Message-ID: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> Looks good to me! /Claes On 2020-04-03 13:21, Tobias Hartmann wrote: > Hi, > > please review the following patch that removes some dead code: > https://bugs.openjdk.java.net/browse/JDK-8242090 > http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ > > Thanks, > Tobias > From tobias.hartmann at oracle.com Fri Apr 3 11:59:36 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 13:59:36 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> Message-ID: <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Thanks Claes! Best regards, Tobias On 03.04.20 13:54, Claes Redestad wrote: > Looks good to me! 
> > /Claes > > On 2020-04-03 13:21, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch that removes some dead code: >> https://bugs.openjdk.java.net/browse/JDK-8242090 >> http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ >> >> Thanks, >> Tobias >> From tobias.hartmann at oracle.com Fri Apr 3 13:41:29 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 15:41:29 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 Message-ID: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8241997 http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ When merging the fix for JDK-8238759 [1] into the Valhalla repo, we've noticed that some of our tests started to fail because their C2 IR matching rules detected that cloned, non-escaping array allocations are no longer scalar replaced (for example, [2]). The problem is that the scalar replacement code still expects ArrayCopyNode::Dest to be an AddPNode.
From tobias.hartmann at oracle.com Fri Apr 3 14:08:28 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 16:08:28 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Message-ID: Claes pointed out that lir_word_align is unused as well: http://cr.openjdk.java.net/~thartmann/8242090/webrev.01/ Thanks, Tobias On 03.04.20 13:59, Tobias Hartmann wrote: > Thanks Claes! > > Best regards, > Tobias > > On 03.04.20 13:54, Claes Redestad wrote: >> Looks good to me! >> >> /Claes >> >> On 2020-04-03 13:21, Tobias Hartmann wrote: >>> Hi, >>> >>> please review the following patch that removes some dead code: >>> https://bugs.openjdk.java.net/browse/JDK-8242090 >>> http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ >>> >>> Thanks, >>> Tobias >>> From tobias.hartmann at oracle.com Fri Apr 3 14:08:49 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 16:08:49 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <875zegd5r0.fsf@redhat.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <875zegd5r0.fsf@redhat.com> Message-ID: <0e1907ea-db37-d354-ee76-f6cf8ca0af0a@oracle.com> Thanks Roland! Best regards, Tobias On 03.04.20 16:05, Roland Westrelin wrote: > >> http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ > > Looks good to me. > > Roland. 
> From claes.redestad at oracle.com Fri Apr 3 14:15:22 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Fri, 3 Apr 2020 16:15:22 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Message-ID: <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> On 2020-04-03 16:08, Tobias Hartmann wrote: > Claes pointed out that lir_word_align is unused as well: > http://cr.openjdk.java.net/~thartmann/8242090/webrev.01/ Looks good, lir_fpop_raw also looked unused, but seems to be used on x86_32 only. I'm not sure it's worth the trouble guarding its use with X86 && NOT_LP64..? /Claes From nils.eliasson at oracle.com Fri Apr 3 15:29:07 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Fri, 3 Apr 2020 17:29:07 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> Message-ID: <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Hi, Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What are the additional changes in PhaseMacroExpand::process_users_of_allocation and PhaseMacroExpand::can_eliminate_allocation motivated by? Regards, Nils On 2020-04-03 15:41, Tobias Hartmann wrote: > Hi, > > please review the following patch: > https://bugs.openjdk.java.net/browse/JDK-8241997 > http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ > > When merging the fix for JDK-8238759 [1] into the Valhalla repo, we've noticed that some of our test > started to fail because their C2 IR matching rules detected that cloned, non-escaping array > allocations are no longer scalar replaced (for example, [2]). > > The problem is that the scalar replacement code still expects ArrayCopyNode::Dest to be an AddPNode. 
> I've verified that my fix re-enables scalar replacement. The related Valhalla tests now pass. > > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8238759 > [2] > http://hg.openjdk.java.net/valhalla/valhalla/file/00010b44d679/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestArrays.java#l672 From vladimir.kozlov at oracle.com Fri Apr 3 17:31:32 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 3 Apr 2020 10:31:32 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> Message-ID: <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> Hi Tom, I looked at the testing results and one test fails consistently: compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java Vladimir K On 4/2/20 12:12 PM, Tom Rodriguez wrote: > http://cr.openjdk.java.net/~never/8231756/webrev > https://bugs.openjdk.java.net/browse/JDK-8231756 > > This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way > that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new unit > tests exercise the deoptimization. mach5 testing is in progress.
> > tom From tom.rodriguez at oracle.com Fri Apr 3 19:37:49 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Fri, 3 Apr 2020 12:37:49 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> Message-ID: <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> Vladimir Kozlov wrote on 4/3/20 10:31 AM: > Hi Tom, > > I looked on testing results and one test fails consistently: > > compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem unrelated to me. tom > > > Vladimir K > > On 4/2/20 12:12 PM, Tom Rodriguez wrote: >> http://cr.openjdk.java.net/~never/8231756/webrev >> https://bugs.openjdk.java.net/browse/JDK-8231756 >> >> This adds support for deoptimizing with non-byte primitive values >> stored on top of a byte array, similarly to the way that a double or >> long can be stored on top of 2 int fields. More detail is provided in >> the bug report and new unit tests exercise the deoptimization. mach5 >> testing is in progress.
>> tom From vladimir.x.ivanov at oracle.com Fri Apr 3 23:12:30 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Sat, 4 Apr 2020 02:12:30 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes Message-ID: Hi, Following up on review requests of API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API: http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.) Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in C2 JIT-compiler. As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fallback behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)." The rest of JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values. It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available.
Vector API talk from JVMLS'18 [5] contains a brief overview of the JVM implementation and some details. Complete implementation resides in vector-unstable branch of panama/dev repository [6]. Now to gory details (the patch is split in multiple "sub-webrevs"): =========================================================== (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ Ideal vector nodes for new operations introduced by Vector API. (Platform-specific back end support will be posted for review separately). =========================================================== (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ JVM Java interface (VectorSupport) and intrinsic support in C2. Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before EA pass. Vectors have 2-level on-heap representation: a primitive array is used as the backing storage for the vector value and it is encapsulated in a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains an int[8] instance which is used to store the vector value). Unless VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with VectorBox node and using it during expansion. Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over. =========================================================== (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].)
The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or are spilled on the stack and it allows avoiding boxing/unboxing when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata and all the heavy-lifting happens only when rematerialization takes place. The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point thus enabling original instance to go away.) =========================================================== (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ HotSpot changes for jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-module jdk.incubator.vector" on the command line to be able to use it. When user does that, JVM automatically enables Vector API support. It improves usability (user doesn't need to separately "open" the API and enable JVM support) while minimizing risks of destabilization from new code when the API is not used. That's it! Will be happy to answer any questions. And thanks in advance for any feedback!
Best regards, Vladimir Ivanov [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html [2] https://openjdk.java.net/jeps/338 [3] https://openjdk.java.net/projects/panama/ [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From forax at univ-mlv.fr Fri Apr 3 23:31:11 2020 From: forax at univ-mlv.fr (Remi Forax) Date: Sat, 4 Apr 2020 01:31:11 +0200 (CEST) Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: Message-ID: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> [...] > (4) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ > > HotSpot changes for jdk.incubator.vector module. Vector support is > makred experimental and turned off by default. JEP 338 proposes the API > to be released as an incubator module, so a user has to specify > "--add-module jdk.incubator.vector" on the command line to be able to > use it. Typo, it's --add-modules > When user does that, JVM automatically enables Vector API support. > It improves usability (user doesn't need to separately "open" the API > and enable JVM support) while minimizing risks of destabilitzation from > new code when the API is not used. Question: what if I declare a module-info that requires "jdk.incubator.vector"? In that case, I don't have to add --add-modules jdk.incubator.vector on the command line, but will the VM still enable the Vector API support?
regards, Rémi > > Best regards, > Vladimir Ivanov > > [0] > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html > > [1] > https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html > > [2] https://openjdk.java.net/jeps/338 > > [3] https://openjdk.java.net/projects/panama/ > > [4] > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html > > [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf > > [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 > > $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.x.ivanov at oracle.com Fri Apr 3 23:52:03 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Sat, 4 Apr 2020 02:52:03 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> Message-ID: <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> > Typo, it's --add-modules Good catch, Remi. Thanks for the correction. > >> When user does that, JVM automatically enables Vector API support. >> It improves usability (user doesn't need to separately "open" the API >> and enable JVM support) while minimizing risks of destabilitzation from >> new code when the API is not used. > > Question, what if i declare a module-info that requires "jdk.incubator.vector", because in that case, i don't have to add --add-modules jdk.incubator.vector on the command line, but does the VM will enable the Vector API support ? Good point. JEP 11: "Incubator Modules" [1] states the following: "Applications on the class path must use the --add-modules command-line option to request that an incubator module be resolved.
Applications developed as modules can specify requires or requires transitive dependences upon an incubator module directly." Current implementation doesn't distinguish whether the module is resolved for an application on the class path or by another module, so JVM support will be enabled by default in both cases. Do you see any problems with that? Best regards, Vladimir Ivanov [1] https://openjdk.java.net/jeps/11 >> That's it! Will be happy to answer any questions. >> >> And thanks in advance for any feedback! > > regards, > Rémi > >> >> Best regards, >> Vladimir Ivanov >> >> [0] >> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >> >> [1] >> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >> >> [2] https://openjdk.java.net/jeps/338 >> >> [3] https://openjdk.java.net/projects/panama/ >> >> [4] >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >> >> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >> >> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >> >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From sandhya.viswanathan at intel.com Sat Apr 4 00:16:57 2020 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Sat, 4 Apr 2020 00:16:57 +0000 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes Message-ID: Hi, Following up on review requests of API [0], Java implementation [1] and General Hotspot changes [2] for Vector API, here's a request for review of x86 backend changes required for supporting the API: JEP: https://openjdk.java.net/jeps/338 JBS: https://bugs.openjdk.java.net/browse/JDK-8223347 Webrev: http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/ Complete implementation resides in vector-unstable branch of panama/dev repository [3].
Looking forward to your feedback. Best Regards, Sandhya [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html [1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html [3] https://openjdk.java.net/projects/panama/ $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.kozlov at oracle.com Sat Apr 4 00:41:46 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 3 Apr 2020 17:41:46 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> Message-ID: <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> I think new code in deoptimize.cpp should be JVMCI specific. I filed 8242150 for the serviceability test failures in testing. It seems caused by recent changes. It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: #ifdef _LP64 jlong res = (jlong) *((jlong *) &val); #else #ifdef SPARC // For SPARC we have to swap high and low words. We haven't supported such a configuration in eons. I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it should be guarded by #if. Thanks, Vladimir On 4/3/20 12:37 PM, Tom Rodriguez wrote: > > > Vladimir Kozlov wrote on 4/3/20 10:31 AM: >> Hi Tom, >> >> I looked on testing results and one test fails consistently: >> >> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java > > Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem > unrelated to me.
> > tom > >> >> >> Vladimir K >> >> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>> http://cr.openjdk.java.net/~never/8231756/webrev >>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>> >>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way >>> that a double or long can be stored on top of 2 int fields.? More detail is provided in the bug report and new unit >>> tests exercise the deoptimization.? mach5 testing is in progress. >>> >>> tom From forax at univ-mlv.fr Sat Apr 4 12:18:34 2020 From: forax at univ-mlv.fr (forax at univ-mlv.fr) Date: Sat, 4 Apr 2020 14:18:34 +0200 (CEST) Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> Message-ID: <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> ----- Mail original ----- > De: "Vladimir Ivanov" > ?: "Remi Forax" > Cc: "hotspot-dev" , "hotspot compiler" , > "panama-dev at openjdk.java.net'" > Envoy?: Samedi 4 Avril 2020 01:52:03 > Objet: Re: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes >> Typo, it's --add-modules > > Good catch, Remi. Thanks for the correction. > >> >>> When user does that, JVM automatically enables Vector API support. >>> It improves usability (user doesn't need to separately "open" the API >>> and enable JVM support) while minimizing risks of destabilitzation from >>> new code when the API is not used. >> >> Question, what if i declare a module-info that requires "jdk.incubator.vector", >> because in that case, i don't have to add --add-modules jdk.incubator.vector on >> the command line, but does the VM will enable the Vector API support ? > > Good point. 
JEP 11: "Incubator Modules" [1] states the following: > > "Applications on the class path must use the --add-modules command-line > option to request that an incubator module be resolved. Applications > developed as modules can specify requires or requires transitive > dependences upon an incubator module directly." > > Current implementation doesn't distinguish whether the module is > resolved for an application on the class path or by another module, so > JVM support will be enabled by default in both cases. Do you see any > problems with that? So the VM support is enabled either because there is an explicit --add-modules or because the module is transitively reachable from the root modules. It means that it doesn't work if the module jdk.incubator.vector is loaded using a ModuleLayer. Users have to use -XX:+EnableVectorSupport in that case. regards, Rémi > > Best regards, > Vladimir Ivanov > > [1] https://openjdk.java.net/jeps/11 > >>> That's it! Will be happy to answer any questions. >>> >>> And thanks in advance for any feedback!
>> >> regards, >> R?mi >> >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> [0] >>> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >>> >>> [1] >>> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >>> >>> [2] https://openjdk.java.net/jeps/338 >>> >>> [3] https://openjdk.java.net/projects/panama/ >>> >>> [4] >>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >>> >>> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >>> >>> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >>> > >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From Alan.Bateman at oracle.com Sat Apr 4 12:37:29 2020 From: Alan.Bateman at oracle.com (Alan Bateman) Date: Sat, 4 Apr 2020 13:37:29 +0100 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> Message-ID: On 04/04/2020 13:18, forax at univ-mlv.fr wrote: > : > So the VM supports is enabled either because there is an explicit --add-modules or because the module is transitively reachable from the root modules. > It means that it doesn't work if the module jdk.incubator.vector is loaded using a ModuleLayer. Users has to use XX:+EnableVectorSupport in that case. > Is jdk.incubator.vector is mapped to the boot loader? If so then it can't be loaded into a child layer. 
-Alan From tobias.hartmann at oracle.com Mon Apr 6 06:10:54 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:10:54 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> Message-ID: <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> On 03.04.20 16:15, Claes Redestad wrote: > lir_fpop_raw also looked unused, but seems to be used on x86_32 only. > I'm not sure it's worth the trouble guarding its use with X86 && > NOT_LP64..? I gave it a quick try but I don't think it's worth sprinkling additional #ifdefs into the enum and the shared code in c1_LinearScan.cpp. I've simply removed the unused fpop_raw() method: http://cr.openjdk.java.net/~thartmann/8242090/webrev.02/ Best regards, Tobias From tobias.hartmann at oracle.com Mon Apr 6 06:23:40 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:23:40 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Message-ID: Hi Nils, thanks for the review! On 03.04.20 17:29, Nils Eliasson wrote: > Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What > are the additional changes in PhaseMacroExpand::process_users_of_allocation and > PhaseMacroExpand::can_eliminate_allocation motivated by? Changes in 'can_eliminate_allocation' - line 675: Check is always false since an allocation result is not connected to a clonebasic through an AddP anymore. - line 686: Instead, clonebasic is now directly connected to the allocation through the ArrayCopyNode::Dest input. 
Changes to 'process_users_of_allocation': - line 970: This is a bit hard to follow in the webrev. I've moved the clonebasic handling from the use->is_AddP() branch to the use->is_ArrayCopy() branch, again because the clonebasic is now directly connected through the result cast and not indirectly through an AddP. Best regards, Tobias From tobias.hartmann at oracle.com Mon Apr 6 06:34:45 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:34:45 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: <87h7y2c5ua.fsf@redhat.com> References: <874kubfked.fsf@redhat.com> <87h7y2c5ua.fsf@redhat.com> Message-ID: Hi Roland, looks good. Best regards, Tobias On 02.04.20 16:36, Roland Westrelin wrote: > >> This is required to backport 8237086 (assert(is_MachReturn()) running >> CTW with fix for JDK-8231291). >> >> Original bug: >> https://bugs.openjdk.java.net/browse/JDK-8217230 >> http://hg.openjdk.java.net/jdk/jdk12/rev/1b292ae4eb50 >> >> Original patch does not apply cleanly to 11u because context changed in >> compile.hpp. Patch is otherwise identical. >> >> 11u webrev: >> http://cr.openjdk.java.net/~roland/8217230.11u/webrev.00/ >> >> Testing: x86_64 build, tier1 + tier2 > > Anyone for this review? > > Roland. > From rwestrel at redhat.com Mon Apr 6 07:17:15 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Mon, 06 Apr 2020 09:17:15 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: References: <874kubfked.fsf@redhat.com> <87h7y2c5ua.fsf@redhat.com> Message-ID: <87369hccck.fsf@redhat.com> Thanks for the review. Roland. 
From nils.eliasson at oracle.com Mon Apr 6 07:23:50 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Mon, 6 Apr 2020 09:23:50 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Message-ID: <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> Thanks for the explanation. I think there will be more opportunities for cleaning up cloning optimizations. The array-clone should just be the special case of an acopy where the full array is copied and which can't fault on an index or type check. Your change fixes a performance issue I have seen, but I didn't understand that I had caused it :) Best regards, // Nils On 2020-04-06 08:23, Tobias Hartmann wrote: > Hi Nils, > > thanks for the review! > > On 03.04.20 17:29, Nils Eliasson wrote: >> Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What >> are the additional changes in PhaseMacroExpand::process_users_of_allocation and >> PhaseMacroExpand::can_eliminate_allocation motivated by? > Changes in 'can_eliminate_allocation' > - line 675: Check is always false since an allocation result is not connected to a clonebasic > through an AddP anymore. > - line 686: Instead, clonebasic is now directly connected to the allocation through the > ArrayCopyNode::Dest input.
> > Best regards, > Tobias From tobias.hartmann at oracle.com Mon Apr 6 07:31:25 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 09:31:25 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> Message-ID: <9105edc0-5f4f-bb4d-40db-e610828b204a@oracle.com> Hi Nils, On 06.04.20 09:23, Nils Eliasson wrote: > I think there will be more opportunities for cleaning up cloning optimizations. The array-clone > should just be the special case of an acopy where the full array is copied and which can't fault on > an index or type check. Yes, we should try to get rid of most of the remaining is_clonebasic special-casing. > Your change fixes a performance issue I have seen, but I didn't understand that I had caused it :) Okay, great! :) Thanks, Tobias From tobias.hartmann at oracle.com Mon Apr 6 07:48:50 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 09:48:50 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <878sjdc5jl.fsf@redhat.com> References: <878sjdc5jl.fsf@redhat.com> Message-ID: <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> Hi Roland, On 03.04.20 10:55, Roland Westrelin wrote: > The fix I propose is to keep the dependence on the hoisted test on loop > unswitching by using dominated_by() instead of short_circuit_if(). This > way on step 2) 3) above, the CastPP is made dependent on the hoisted > test so reordering of the CastPP with its null check can't happen. This seems reasonable but I'm wondering if that doesn't enable incorrect re-ordering of dependent data nodes with other tests in-between the original and the hoisted test?
I.e., without your fix, data nodes are made dependent on the test "just above" the unswitched test. With your fix, they are dependent on the hoisted test outside of the loop body. Please add the appropriate affects versions to the bug. Also, please add a link to the JBS bug to your RFRs. Best regards, Tobias From vladimir.x.ivanov at oracle.com Mon Apr 6 08:02:10 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 6 Apr 2020 11:02:10 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> Message-ID: >> So the VM support is enabled either because there is an explicit >> --add-modules or because the module is transitively reachable from the >> root modules. >> It means that it doesn't work if the module jdk.incubator.vector is >> loaded using a ModuleLayer. Users have to use -XX:+EnableVectorSupport >> in that case. >> > Is jdk.incubator.vector mapped to the boot loader? If so then it > can't be loaded into a child layer. Yes, jdk.incubator.vector is a boot module. The reason to put it there is so that the JVM can trust final instance fields. Since the module extensively uses VM annotations, it has to be either a boot or a platform module in order to have access to them, but in the case of a platform module the existing logic for trusting final instance fields doesn't work and all such fields would have to be marked as @Stable instead.
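The boot-loader point above can be checked from plain Java. A small sketch, using java.base as a stand-in because jdk.incubator.vector may not be present in a given JDK build: modules defined to the boot loader report a null class loader and are resolved into the boot layer, never a child layer.

```java
public class BootModuleCheck {
    public static void main(String[] args) {
        // java.base stands in for jdk.incubator.vector here; both are
        // boot modules, so the same checks apply.
        Module base = Object.class.getModule();
        System.out.println(base.getName());                // java.base
        System.out.println(base.getClassLoader() == null); // true: defined to the boot loader
        // Boot modules live in the boot layer.
        System.out.println(ModuleLayer.boot().findModule("java.base").isPresent()); // true
    }
}
```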
Best regards, Vladimir Ivanov From rwestrel at redhat.com Mon Apr 6 08:34:42 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Mon, 06 Apr 2020 10:34:42 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> Message-ID: <87zhbpau71.fsf@redhat.com> Hi Tobias, Thanks for looking at this. > This seems reasonable but I'm wondering if that doesn't enable incorrect re-ordering of dependent > data nodes with other tests in-between the original and the hoisted test? I.e., without your fix, > data nodes are made dependent on the test "just above" the unswitched test. With your fix, they are > dependent on the hoisted test outside of the loop body. I've been wondering about that too but couldn't find a scenario where it would go wrong. dominated_by() is what's used when an if is replaced by a dominating if with the same condition in PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: we add a dominating if, and then remove the loop copies because they are redundant. > Please add the appropriate affects versions to the bug. Also, please add a link to the JBS bug to > your RFRs. Sorry about that, I keep forgetting. Roland. From tobias.hartmann at oracle.com Mon Apr 6 08:51:53 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 10:51:53 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <87zhbpau71.fsf@redhat.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> Message-ID: <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> On 06.04.20 10:34, Roland Westrelin wrote: > I've been wondering about that too but couldn't find a scenario where it > would go wrong.
dominated_by() is what's used when an if is replaced by a > dominating if with the same condition in > PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: > we add a dominating if, and then remove the loop copies because they are > redundant. Right, I couldn't find such a scenario either and, as you've pointed out, the same problem would exist at other places as well. Looks good. Best regards, Tobias From claes.redestad at oracle.com Mon Apr 6 10:08:58 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Mon, 6 Apr 2020 12:08:58 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> Message-ID: On 2020-04-06 08:10, Tobias Hartmann wrote: > > > On 03.04.20 16:15, Claes Redestad wrote: >> lir_fpop_raw also looked unused, but seems to be used on x86_32 only. >> I'm not sure it's worth the trouble guarding its use with X86 && >> NOT_LP64..? > > I gave it a quick try but I don't think it's worth sprinkling additional #ifdefs into the enum and > the shared code in c1_LinearScan.cpp. I've simply removed the unused fpop_raw() method: > http://cr.openjdk.java.net/~thartmann/8242090/webrev.02/ Still looks good (and trivial).
/Claes > > Best regards, > Tobias > From tobias.hartmann at oracle.com Mon Apr 6 10:10:20 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 12:10:20 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> Message-ID: <2af9fa0b-9a40-c83f-7736-a33f16d76483@oracle.com> On 06.04.20 12:08, Claes Redestad wrote: > Still looks good (and trivial). Thanks again! Pushed. Best regards, Tobias From vladimir.x.ivanov at oracle.com Mon Apr 6 13:38:12 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 6 Apr 2020 16:38:12 +0300 Subject: Polymorphic Guarded Inlining in C2 In-Reply-To: <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com> <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> Message-ID: <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> I see 2 directions (mostly independent) to proceed: (1) use existing profiling info only; and (2) when more profile info is available. I suggest exploring them independently. There's enough profiling data available to introduce a polymorphic case with 2 major receivers ("2-poly"), and it'll complete the matrix of possible shapes. Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic shapes: "N-morphic" and "N-poly". The only difference between them is what happens on the fallback path - deopt / uncommon trap or a virtual call. Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which should be extended to 2 cases, leading to 2 parameters: an aggregated major receiver percentage and a minimum individual percentage. Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which aligns 2-poly with the bimorphic case.
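The shapes being enumerated here correspond to how many distinct receiver types the interpreter records at a call site (up to TypeProfileWidth). A hedged Java sketch of a site that profiles as bimorphic; the type names are invented for illustration, not taken from the patch:

```java
// A call site with exactly two hot receiver types: with the default
// TypeProfileWidth=2 this profiles as bimorphic, and C2 can guard and
// inline both targets (UseBimorphicInlining). Add a third receiver
// type and the profile overflows, making the site megamorphic with a
// virtual-call fallback.
interface Shape {
    double area();
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

final class Circle implements Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    public double area() { return Math.PI * radius * radius; }
}

public class MorphismDemo {
    static double total(Shape[] shapes) {
        double t = 0;
        for (Shape s : shapes) {
            t += s.area(); // bimorphic invokeinterface site
        }
        return t;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2.0), new Square(3.0), new Circle(1.0) };
        System.out.println(total(shapes)); // 13.0 + Math.PI
    }
}
```

Note the call is an invokeinterface, which is the case the last paragraph below suggests treating separately from invokevirtual.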
And, as I mentioned before, IMO it's promising to distinguish invokevirtual and invokeinterface cases. So, additional flag to control that would be useful. Regarding N-poly/N-morphic case, they can be generalized from 2-poly/bi-morphic cases. I believe experiments on 2-poly will provide useful insights on N-poly/N-morphic, so it makes sense to start with 2-poly first. Best regards, Vladimir Ivanov On 01.04.2020 01:29, Vladimir Kozlov wrote: > Looks like graphs were stripped from email. I put them on GitHub: > > > > > > > > > Also Vladimir Ivanov forwarded me data he collected. > > His next data shows that profiling is not "free". Vladimir I. limited to > tier3 (-XX:TieredStopAtLevel=3, C1 compilation with profiling code) to > show that profiling code with TPW=8 is slower. Note, with 4 tiers this > may not visible because execution will be switched to C2 compiled code > (without profiling code). > > > > > > > Next data collected for proposed patch. Vladimir I. collected data for > several flags configurations. > Next graphs are for one of settings:' -XX:+UsePolymorphicInlining > -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4' > > > > > > > It has mixed data but most benchmarks are not affected. Which means we > need to spend more time on proposed changes. > > Vladimir K > > On 3/31/20 10:39 AM, Vladimir Kozlov wrote: >> I start loking on it. >> >> I think ideally TypeProfileWidth should be per call site and not per >> method - and it will require more complicated implementation (an other >> RFE). But for experiments I think setting it to 8 (or higher) for all >> methods is okay. >> >> Note, more profiling lines per each call site is cost few Mb in >> CodeCache (overestimation 20K nmethods * 10 call sites * 6 * 8 bytes) >> vs very complicated code to have dynamic number of lines. 
>> >> I think we should first investigate best heuristics for inlining vs >> direct call vs vcall vs uncommmont traps for polymorphic cases and >> worry about memory and time consumption during profiling later. >> >> I did some performance runs with latest JDK 15 for TypeProfileWidth=8 >> vs =2 and don't see much difference for spec benchmarks (see attached >> graph - grey dots mean no significant difference). But there are >> regressions (red dots) for Renessance which includes some modern >> benchmarks. >> >> I will work his week to get similar data with Ludovic's patch. >> >> I am for incremental approach. I think we can start/push based on what >> Ludovic is currently suggesting (do more processing for TPW > 2) while >> preserving current default behaviour (for TPW <= 2). But only if it >> gives improvements in these benchmarks. We use these benchmarks as >> criteria for JDK releases. >> >> Regards, >> Vladimir >> >> On 3/20/20 4:52 PM, Ludovic Henry wrote: >>> Hi Vladimir, >>> >>> As requested offline, please find following the latest version of the >>> patch. Contrary to what was discussed >>> initially, I haven't done the work to support per-method >>> TypeProfileWidth, as that requires to extend the >>> existing CompilerDirectives to be available to the Interpreter. For >>> me to achieve that work, I would need >>> guidance on how to approach the problem, and what your expectations are. >>> >>> Thank you, >>> >>> -- >>> Ludovic >>> >>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> index 4ed93169c7..bad9cddf20 100644 >>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> @@ -1731,7 +1731,7 @@ void >>> InterpreterMacroAssembler::record_item_in_profile_helper(Register >>> item, Reg >>> ??????????? Label found_null; >>> ??????????? jccb(Assembler::zero, found_null); >>> ??????????? // Item did not match any saved item and there is no >>> empty row for it. 
>>> -????????? // Increment total counter to indicate polymorphic case. >>> +????????? // Increment total counter to indicate megamorphic case. >>> ??????????? increment_mdp_data_at(mdp, non_profiled_offset); >>> ??????????? jmp(done); >>> ??????????? bind(found_null); >>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>> b/src/hotspot/share/ci/ciCallProfile.hpp >>> index 73854806ed..c5030149bf 100644 >>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>> @@ -38,7 +38,8 @@ private: >>> ??? friend class ciMethod; >>> ??? friend class ciMethodHandle; >>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>> +? bool _is_megamorphic;????????? // whether the call site is >>> megamorphic >>> ??? int? _limit;??????????????? // number of receivers have been >>> determined >>> ??? int? _morphism;???????????? // determined call site's morphism >>> ??? int? _count;??????????????? // # times has this call been executed >>> @@ -47,6 +48,8 @@ private: >>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>> ??? ciCallProfile() { >>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>> can't be smaller than TypeProfileWidth"); >>> +??? _is_megamorphic = false; >>> ????? _limit = 0; >>> ????? _morphism??? = 0; >>> ????? _count = -1; >>> @@ -58,6 +61,8 @@ private: >>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>> ? public: >>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>> + >>> ??? // Note:? The following predicates return false for invalid >>> profiles: >>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>> ??? int?????? morphism() const????????? 
{ return _morphism; } >>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>> b/src/hotspot/share/ci/ciMethod.cpp >>> index d771be8dac..c190919708 100644 >>> --- a/src/hotspot/share/ci/ciMethod.cpp >>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int >>> bci) { >>> ??????????? // If we extend profiling to record methods, >>> ??????????? // we will set result._method also. >>> ????????? } >>> -??????? // Determine call site's morphism. >>> +??????? // Determine call site's megamorphism. >>> ????????? // The call site count is 0 with known morphism (only 1 or >>> 2 receivers) >>> ????????? // or < 0 in the case of a type check failure for >>> checkcast, aastore, instanceof. >>> -??????? // The call site count is > 0 in the case of a polymorphic >>> virtual call. >>> +??????? // The call site count is > 0 in the case of a megamorphic >>> virtual call. >>> ????????? if (morphism > 0 && morphism == result._limit) { >>> ???????????? // The morphism <= MorphismLimit. >>> -?????????? if ((morphism >> -?????????????? (morphism == ciCallProfile::MorphismLimit && count == >>> 0)) { >>> +?????????? if ((morphism >> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>> ? #ifdef ASSERT >>> ?????????????? if (count > 0) { >>> ???????????????? this->print_short_name(tty); >>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>> ???????????????? this->print_codes(); >>> -?????????????? assert(false, "this call site should not be >>> polymorphic"); >>> +?????????????? assert(false, "this call site should not be >>> megamorphic"); >>> ?????????????? } >>> ? #endif >>> -???????????? result._morphism = morphism; >>> +?????????? } else { >>> +????????????? result._is_megamorphic = true; >>> ???????????? } >>> ????????? } >>> +??????? result._morphism = morphism; >>> ????????? // Make the count consistent if this is a call profile. If >>> count is >>> ????????? 
// zero or less, presume that this is a typecheck profile and >>> ????????? // do nothing.? Otherwise, increase count to be the sum of all >>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* >>> receiver, int receiver_count) { >>> ??? } >>> ??? _receiver[i] = receiver; >>> ??? _receiver_count[i] = receiver_count; >>> -? if (_limit < MorphismLimit) _limit++; >>> +? if (_limit < TypeProfileWidth) _limit++; >>> ? } >>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>> b/src/hotspot/share/opto/c2_globals.hpp >>> index d605bdb7bd..e4a5e7ea8b 100644 >>> --- a/src/hotspot/share/opto/c2_globals.hpp >>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>> @@ -389,9 +389,16 @@ >>> ??? product(bool, UseBimorphicInlining, >>> true,???????????????????????????????? \ >>> ??????????? "Profiling based inlining for two >>> receivers")???????????????????? \ >>> >>> \ >>> +? product(bool, UsePolymorphicInlining, >>> true,?????????????????????????????? \ >>> +????????? "Profiling based inlining for two or more >>> receivers")???????????? \ >>> + >>> \ >>> ??? product(bool, UseOnlyInlinedBimorphic, >>> true,????????????????????????????? \ >>> ??????????? "Don't use BimorphicInlining if can't inline a second >>> method")??? \ >>> >>> \ >>> +? product(bool, UseOnlyInlinedPolymorphic, >>> true,??????????????????????????? \ >>> +????????? "Don't use PolymorphicInlining if can't inline a secondary >>> "????? \ >>> + >>> "method")???????????????????????????????????????????????????????? \ >>> + >>> \ >>> ??? product(bool, InsertMemBarAfterArraycopy, >>> true,?????????????????????????? \ >>> ??????????? "Insert memory barrier after arraycopy >>> call")???????????????????? \ >>> >>> \ >>> @@ -645,6 +652,10 @@ >>> ??????????? "% of major receiver type to all profiled >>> receivers")???????????? \ >>> ??????????? range(0, >>> 100)???????????????????????????????????????????????????? \ >>> >>> \ >>> +? product(intx, TypeProfileMinimumReceiverPercent, >>> 20,????????????????????? 
\ >>> +????????? "minimum % of receiver type to all profiled >>> receivers")?????????? \ >>> +????????? range(0, >>> 100)???????????????????????????????????????????????????? \ >>> + >>> \ >>> ??? diagnostic(bool, PrintIntrinsics, >>> false,????????????????????????????????? \ >>> ??????????? "prints attempted and successful inlining of >>> intrinsics")???????? \ >>> >>> \ >>> diff --git a/src/hotspot/share/opto/doCall.cpp >>> b/src/hotspot/share/opto/doCall.cpp >>> index 44ab387ac8..dba2b114c6 100644 >>> --- a/src/hotspot/share/opto/doCall.cpp >>> +++ b/src/hotspot/share/opto/doCall.cpp >>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* >>> callee, int vtable_index, bool >>> ??? // See how many times this site has been invoked. >>> ??? int site_count = profile.count(); >>> -? int receiver_count = -1; >>> -? if (call_does_dispatch && UseTypeProfile && >>> profile.has_receiver(0)) { >>> -??? // Receivers in the profile structure are ordered by call counts >>> -??? // so that the most called (major) receiver is profile.receiver(0). >>> -??? receiver_count = profile.receiver_count(0); >>> -? } >>> ??? CompileLog* log = this->log(); >>> ??? if (log != NULL) { >>> -??? int rid = (receiver_count >= 0)? >>> log->identify(profile.receiver(0)): -1; >>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>> log->identify(profile.receiver(1)):-1; >>> +??? int* rids; >>> +??? if (call_does_dispatch) { >>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>> +????? for (int i = 0; i < TypeProfileWidth && >>> profile.has_receiver(i); i++) { >>> +??????? rids[i] = log->identify(profile.receiver(i)); >>> +????? } >>> +??? } >>> ????? log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>> ????????????????????? log->identify(callee), site_count, prof_factor); >>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>> ????? if (allow_inline)???? log->print(" inline='1'"); >>> -??? if (receiver_count >= 0) { >>> -????? 
log->print(" receiver='%d' receiver_count='%d'", rid, >>> receiver_count); >>> -????? if (profile.has_receiver(1)) { >>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, >>> profile.receiver_count(1)); >>> +??? if (call_does_dispatch) { >>> +????? log->print(" virtual='1'"); >>> +????? for (int i = 0; i < TypeProfileWidth && >>> profile.has_receiver(i); i++) { >>> +??????? if (i == 0) { >>> +????????? log->print(" receiver='%d' receiver_count='%d' >>> receiver_prob='%f'", rids[i], profile.receiver_count(i), >>> profile.receiver_prob(i)); >>> +??????? } else { >>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' >>> receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>> +??????? } >>> ??????? } >>> ????? } >>> ????? if (callee->is_method_handle_intrinsic()) { >>> @@ -205,92 +207,112 @@ CallGenerator* >>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>> ??????? // The major receiver's count >= >>> TypeProfileMajorReceiverPercent of site_count. >>> ??????? bool have_major_receiver = profile.has_receiver(0) && >>> (100.*profile.receiver_prob(0) >= >>> (float)TypeProfileMajorReceiverPercent); >>> -????? ciMethod* receiver_method = NULL; >>> ??????? int morphism = profile.morphism(); >>> + >>> +????? int width = morphism > 0 ? morphism : 1; >>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, >>> width); >>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>> +????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, >>> width); >>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>> + >>> ??????? if (speculative_receiver_type != NULL) { >>> ????????? if (!too_many_traps_or_recompiles(caller, bci, >>> Deoptimization::Reason_speculate_class_check)) { >>> ??????????? // We have a speculative type, we should be able to resolve >>> ??????????? 
// the call. We do that before looking at the profiling at >>> -????????? // this invoke because it may lead to bimorphic inlining >>> which >>> +????????? // this invoke because it may lead to polymorphic inlining >>> which >>> ??????????? // a speculative type should help us avoid. >>> -????????? receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> speculative_receiver_type); >>> -????????? if (receiver_method == NULL) { >>> +????????? receiver_methods[0] = >>> callee->resolve_invoke(jvms->method()->holder(), >>> + >>> speculative_receiver_type); >>> +????????? if (receiver_methods[0] == NULL) { >>> ????????????? speculative_receiver_type = NULL; >>> ??????????? } else { >>> ????????????? morphism = 1; >>> ??????????? } >>> ????????? } else { >>> ??????????? // speculation failed before. Use profiling at the call >>> -????????? // (could allow bimorphic inlining for instance). >>> +????????? // (could allow polymorphic inlining for instance). >>> ??????????? speculative_receiver_type = NULL; >>> ????????? } >>> ??????? } >>> -????? if (receiver_method == NULL && >>> -????????? (have_major_receiver || morphism == 1 || >>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>> -??????? // receiver_method = profile.method(); >>> -??????? // Profiles do not suggest methods now.? Look it up in the >>> major receiver. >>> -??????? receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> profile.receiver(0)); >>> -????? } >>> -????? if (receiver_method != NULL) { >>> -??????? // The single majority receiver sufficiently outweighs the >>> minority. >>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, >>> prof_factor); >>> -??????? if (hit_cg != NULL) { >>> -????????? // Look up second receiver. >>> -????????? CallGenerator* next_hit_cg = NULL; >>> -????????? ciMethod* next_receiver_method = NULL; >>> -????????? 
if (morphism == 2 && UseBimorphicInlining) { >>> -??????????? next_receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> profile.receiver(1)); >>> -??????????? if (next_receiver_method != NULL) { >>> -????????????? next_hit_cg = this->call_generator(next_receiver_method, >>> -????????????????????????????????? vtable_index, !call_does_dispatch, >>> jvms, >>> -????????????????????????????????? allow_inline, prof_factor); >>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>> -????????????????? // Skip if we can't inline second receiver's method >>> -????????????????? next_hit_cg = NULL; >>> -????????????? } >>> -??????????? } >>> -????????? } >>> -????????? CallGenerator* miss_cg; >>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>> -?????????????????????????????????????????????? ? >>> Deoptimization::Reason_bimorphic >>> -?????????????????????????????????????????????? : >>> Deoptimization::reason_class_check(speculative_receiver_type != NULL)); >>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != >>> NULL)) && >>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>> -???????????? ) { >>> -??????????? // Generate uncommon trap for class check failure path >>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>> +????? bool removed_cgs = false; >>> +????? // Look up receivers. >>> +????? for (int i = 0; i < morphism; i++) { >>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && >>> !UsePolymorphicInlining)) { >>> +????????? break; >>> +??????? } >>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>> +????????? 
receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                     profile.receiver(i));
>>> +        }
>>> +        if (receiver_methods[i] != NULL) {
>>> +          bool allow_inline;
>>> +          if (speculative_receiver_type != NULL) {
>>> +            allow_inline = true;
>>>           } else {
>>> -            // Generate virtual call for class check failure path
>>> -            // in case of polymorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +            allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>>>           }
>>> -          if (miss_cg != NULL) {
>>> -            if (next_hit_cg != NULL) {
>>> -              assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>> -              // We don't need to record dependency on a receiver here and below.
>>> -              // Whenever we inline, the dependency is added by Parse::Parse().
>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>> -            }
>>> -            if (miss_cg != NULL) {
>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>> -              if (cg != NULL)  return cg;
>>> +          hit_cgs[i] = this->call_generator(receiver_methods[i],
>>> +                                  vtable_index, !call_does_dispatch, jvms,
>>> +                                  allow_inline, prof_factor);
>>> +          if (hit_cgs[i] != NULL) {
>>> +            if (speculative_receiver_type != NULL) {
>>> +              // Do nothing if it's a speculative type
>>> +            } else if (bytecode == Bytecodes::_invokeinterface) {
>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>> +            } else if (!have_major_receiver) {
>>> +              // Do nothing if there is no major receiver
>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>> +              // Skip if we can't inline receiver's method
>>> +              hit_cgs[i] = NULL;
>>> +              removed_cgs = true;
>>>               }
>>>             }
>>>           }
>>>         }
>>> +
>>> +      // Generate the fallback path
>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>> +                                            ? Deoptimization::Reason_polymorphic
>>> +                                            : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>> +      if (log != NULL) {
>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>> +                      log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>> +      }
>>> +      CallGenerator* miss_cg;
>>> +      if (!disable_trap) {
>>> +        // Generate uncommon trap for class check failure path
>>> +        // in case of polymorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> +                    Deoptimization::Action_maybe_recompile);
>>> +      } else {
>>> +        // Generate virtual call for class check failure path
>>> +        // in case of megamorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +      }
>>> +
>>> +      // Generate the guards
>>> +      CallGenerator* cg = NULL;
>>> +      if (speculative_receiver_type != NULL) {
>>> +        if (hit_cgs[0] != NULL) {
>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>> +          // We don't need to record dependency on a receiver here and below.
>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>> +        }
>>> +      } else {
>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>> +          if (hit_cgs[i] != NULL) {
>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>> +          }
>>> +        }
>>> +        cg = miss_cg;
>>> +      }
>>> +      if (cg != NULL)  return cg;
>>>     }
>>>
>>>     // If there is only one implementor of this interface then we
>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>> index 11df15e004..2d14b52854 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>     "class_check",
>>>     "array_check",
>>>     "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>     "profile_predicate",
>>>     "unloaded",
>>>     "uninitialized",
>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>> index 1cfff5394e..c1eb998aba 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>       Reason_class_check,           // saw unexpected object class (@bci)
>>>       Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>       Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>   #if INCLUDE_JVMCI
>>>       Reason_unreached0             = Reason_null_assert,
>>>       Reason_type_checked_inlining  = Reason_intrinsic,
>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>   #endif
>>>       Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>> index 94b544824e..ee761626c4 100644
>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass>  KlassHashtableEntry;
>>>      declare_constant(Deoptimization::Reason_class_check)                     \
>>>      declare_constant(Deoptimization::Reason_array_check)                     \
>>>      declare_constant(Deoptimization::Reason_intrinsic)                       \
>>> -    declare_constant(Deoptimization::Reason_bimorphic)                       \
>>> +    declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>      declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>      declare_constant(Deoptimization::Reason_unloaded)                        \
>>>      declare_constant(Deoptimization::Reason_uninitialized)                   \
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>>> various TypeProfileWidth values. The results are:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The main thing I observe is that there isn't a linear (or even any apparent)
>>> correlation between the number of guards generated (guided by
>>> TypeProfileWidth) and the time taken.
>>>
>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>> to 1 and 8.
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Ludovic Henry
>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> I did a rerun of the following benchmark with various configurations:
>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>
>>> The results are as follows:
>>>
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The HotSpot logs (with generated assembly) are available at:
>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>
>>> The main takeaway from that experiment is that direct calls w/o inlining
>>> are faster than indirect calls for icalls but slower for vcalls, and that
>>> inlining is always faster than direct calls.
>>>
>>> (I fully understand this applies mainly to this microbenchmark, and we
>>> need to validate on larger benchmarks. I'm working on that next. However,
>>> it clearly shows gains on a pathological case.)
>>>
>>> Next, I want to figure out at how many guards the direct-call regresses
>>> compared to the indirect-call in the vcall case, and I want to run larger
>>> benchmarks. Any particular ones you would like to see run? I am planning
>>> on doing SPECjbb2015 first.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>> Sent: Monday, March 2, 2020 4:20 PM
>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> Sorry for the long delay in response, I was at multiple conferences over
>>> the past few weeks. I'm back at the office now and fully focused on
>>> making progress on this.
>>>
>>>>> Possible avenues of improvement I can see are:
>>>>>     - Gather all the types in an unbounded list so we can know which
>>>>> ones are the most frequent. It is unlikely to help with Java as, in
>>>>> the general case, there are only a few types present at call-sites.
>>>>> It could, however, be particularly helpful for languages that tend to
>>>>> have many types at call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>> numbers.
>>>
>>> I agree that it isn't very practical. It can be useful in the case where
>>> there are many types at a call-site, and the first ones end up not being
>>> frequent enough to mandate a guard. This is clearly an edge-case, and I
>>> don't think we should optimize for it.
>>>
>>>>> In what we have today, some of the worst-case scenarios are the
>>>>> following:
>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>>> first and second types are types A and B, and the other type(s) is(are)
>>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>>> used in the initialization path (i.e. only a few times) and the other
>>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>>> never considered for inlining - because it was never recorded during
>>>>> profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically free some space by removing elements with lower
>>>> frequencies) and give new types a chance to be profiled?
>>>
>>> Doing that reliably relies on the assumption that we know what the shape
>>> of the workload is going to be in future iterations. Otherwise, how could
>>> you guarantee that a type that's not currently frequent will not be in
>>> the future, and that the information that you remove now will not be
>>> important later. This is an assumption that, IMO, is worse than missing
>>> types which are hot later in the execution, for two reasons: 1. it's no
>>> better, and 2. it's a lot less intuitive and harder to debug/understand
>>> than a straightforward "overflow".
>>>
>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>>> have the first type A with 49% probability, the second type B with 49%
>>>>> probability, and the other types with 2% probability. Even though A and
>>>>> B are the two hottest paths, it does not generate guards because
>>>>> neither is a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>
>>> It will not necessarily cause twice as much inlining because of
>>> late-inlining. Like you point out later, it will generate a direct call
>>> in case there isn't room for more inlinable code.
>>>
>>>> Also, does it make sense to increase the morphism factor even if
>>>> inlining doesn't happen?
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else if (recv.klass == C2) { // >>0%
>>>>       m2(); // direct call
>>>>    } else { // >0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> vs
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else { // >>0%
>>>>       m(); // virtual call
>>>>    }
>>>
>>> There is the advantage that modern CPUs are better at predicting
>>> instruction-branches than data-branches. These guards will then allow the
>>> CPU to make better decisions, allowing for better superscalar execution,
>>> memory prefetching, etc.
>>>
>>> This, IMO, makes sense for warm calls, especially since the cost is a
>>> guard + a call, which is much lower than an inlined method, but brings
>>> benefits over an indirect call.
>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
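
For readers who want to see the two code shapes quoted above side by side as runnable code, here is a Java-level sketch. It is only an analogy: the class names A1/A2/A3 and the method foo are invented, and C2's real guards compare the receiver's klass word for exact equality rather than using instanceof as below.

```java
// Illustrative sketch of the two dispatch shapes discussed in the thread.
// A1/A2/A3 and foo() are hypothetical stand-ins; the real guards are
// emitted by C2 against the receiver's klass word, not via instanceof.
interface A { int foo(); }
class A1 implements A { public int foo() { return 1; } }
class A2 implements A { public int foo() { return 2; } }
class A3 implements A { public int foo() { return 3; } }

public class GuardShapes {
    // Shape 1: two guards - the hottest receiver is inlined, the second
    // becomes a direct call, everything else falls back to a virtual call.
    static int callBimorphic(A recv) {
        if (recv instanceof A1) {
            return 1;                    // body of A1.foo() inlined
        } else if (recv instanceof A2) {
            return ((A2) recv).foo();    // devirtualized direct call
        } else {
            return recv.foo();           // virtual call fallback
        }
    }

    // Shape 2: single guard for the major receiver, virtual call otherwise.
    static int callMonomorphic(A recv) {
        if (recv instanceof A1) {
            return 1;                    // body of A1.foo() inlined
        } else {
            return recv.foo();           // virtual call fallback
        }
    }

    public static void main(String[] args) {
        A[] objs = { new A1(), new A2(), new A3() };
        for (A o : objs) {
            // Both shapes must preserve the semantics of a plain virtual call.
            if (callBimorphic(o) != o.foo() || callMonomorphic(o) != o.foo())
                throw new AssertionError("guarded dispatch changed semantics");
        }
    }
}
```

The two shapes return identical results for every receiver; they differ only in how many receivers are handled by a guard before reaching the fallback, which is exactly the trade-off the morphism-factor discussion is about.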
>>>
>>> TypeProfileMajorReceiverPercent is only used today when you have a
>>> megamorphic call-site (aka more types than TypeProfileWidth) but still
>>> one type receiving more than N% of the calls. By reducing the value, you
>>> would not increase the number of guards, but lower the threshold at
>>> which you generate the 1st guard in a megamorphic case.
>>>
>>>>>>         - for the N-morphic case, what's the negative effect
>>>>>> (quantitative) of the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a
>>>>> limited and stable set of types in the early stages of the Tiered
>>>>> Compilation pipeline (making us generate N-morphic guards), and we
>>>>> suddenly observe a new type. AFAIU, this is precisely what deopt is
>>>>> for.
>>>>
>>>> I should have added "... compared to the N-polymorphic case". My
>>>> intuition is the higher the morphism factor, the fewer the benefits of
>>>> deopt (compared to a call) are. It would be very good to validate it
>>>> with some benchmarks (both micro- and larger ones).
>>>
>>> I agree that what you are describing makes sense as well. To reduce the
>>> cost of deopt here, having a TypeProfileMinimumReceiverPercent helps.
>>> That is because if any type is seen less than this specific frequency,
>>> then it won't generate a guard, leading to an indirect call in the
>>> fallback case.
>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>>> it as soon as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>> It turns out the guard is only generated once, meaning that if we ever
>>> hit it then we generate an indirect call.
>>>
>>> We also only generate the trap iff all the guards are hot (inlined) or
>>> warm (direct call), so any of the following cases triggers the creation
>>> of an indirect call over a trap:
>>>   - we hit the trap once before
>>>   - one or more guards are cold (aka not inlinable even with
>>> late-inlining)
>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is the callee should benefit much
>>>> more from inlining than the caller it is inlined into (the caller sees
>>>> multiple callee candidates and has to merge the results while each
>>>> callee observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what
>>>> kind of code patterns should we look for to guide inlining decisions?
>>>
>>> Any pattern that would benefit from other optimizations (escape
>>> analysis, dead code elimination, constant propagation, etc.) is good,
>>> but short of shadowing statically what all these optimizations do, I
>>> can't see an easy way to do it.
>>>
>>> That is where late-inlining, or more advanced dynamic heuristics like
>>> the one you can find in Graal EE, is worthwhile.
>>>
>>>> Regarding experiments to try first, here are some ideas I find
>>>> promising:
>>>>
>>>>      * measure the cost of additional profiling
>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>
>>> I am running the following JMH microbenchmark:
>>>
>>>      public final static int N = 100_000_000;
>>>
>>>      @State(Scope.Benchmark)
>>>      public static class TypeProfileWidthOverheadBenchmarkState {
>>>          public A[] objs = new A[N];
>>>
>>>          @Setup
>>>          public void setup() throws Exception {
>>>              for (int i = 0; i < objs.length; ++i) {
>>>                  switch (i % 8) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  case 2: objs[i] = new A3(); break;
>>>                  case 3: objs[i] = new A4(); break;
>>>                  case 4: objs[i] = new A5(); break;
>>>                  case 5: objs[i] = new A6(); break;
>>>                  case 6: objs[i] = new A7(); break;
>>>                  case 7: objs[i] = new A8(); break;
>>>                  }
>>>              }
>>>          }
>>>      }
>>>
>>>      @Benchmark @OperationsPerInvocation(N)
>>>      public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>          A[] objs = state.objs;
>>>          for (int i = 0; i < objs.length; ++i) {
>>>              objs[i].foo(i, blackhole);
>>>          }
>>>      }
>>>
>>> And I am running with the following JVM parameters:
>>>
>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000
>>> -XX:Tier3CompileThreshold=200000000
>>> -XX:Tier3InvocationThreshold=200000000
>>> -XX:Tier3BackEdgeThreshold=200000000
>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000
>>> -XX:Tier3CompileThreshold=200000000
>>> -XX:Tier3InvocationThreshold=200000000
>>> -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> I observe no statistically significant difference in ops/s between
>>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no
>>> significant difference in the resulting analysis using Intel VTune.
>>>
>>> I verified that the benchmark never goes beyond Tier-0 with
>>> -XX:+PrintCompilation.
>>>
>>>>      * N-morphic vs N-polymorphic (N>=2):
>>>>        - how much does deopt help compared to a virtual call on the
>>>> fallback path?
>>>
>>> I have done the following microbenchmark, but I am not sure that it's
>>> going to fully answer the question you are raising here.
>>>
>>>      public final static int N = 100_000_000;
>>>
>>>      @State(Scope.Benchmark)
>>>      public static class PolymorphicDeoptBenchmarkState {
>>>          public A[] objs = new A[N];
>>>
>>>          @Setup
>>>          public void setup() throws Exception {
>>>              int cutoff1 = (int)(objs.length * .90);
>>>              int cutoff2 = (int)(objs.length * .95);
>>>              for (int i = 0; i < cutoff1; ++i) {
>>>                  switch (i % 2) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  }
>>>              }
>>>              for (int i = cutoff1; i < cutoff2; ++i) {
>>>                  switch (i % 4) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  case 2:
>>>                  case 3: objs[i] = new A3(); break;
>>>                  }
>>>              }
>>>              for (int i = cutoff2; i < objs.length; ++i) {
>>>                  switch (i % 4) {
>>>                  case 0:
>>>                  case 1: objs[i] = new A3(); break;
>>>                  case 2:
>>>                  case 3: objs[i] = new A4(); break;
>>>                  }
>>>              }
>>>          }
>>>      }
>>>
>>>      @Benchmark @OperationsPerInvocation(N)
>>>      public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>          A[] objs = state.objs;
>>>          for (int i = 0; i < objs.length; ++i) {
>>>              objs[i].foo(i, blackhole);
>>>          }
>>>      }
>>>
>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>> -XX:-PolyGuardDisableTrap to force enable/disable the trap in the
>>> fallback.
>>>
>>> For that kind of case, a visitor pattern is what I expect to
>>> profit/suffer the most from a deopt or virtual call in the fallback
>>> path. Would you know of such a benchmark that heavily relies on this
>>> pattern, and that I could readily reuse?
>>>
>>>>      * inlining vs devirtualization
>>>>        - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>        - measure separately the effects of devirtualization and
>>>> inlining
>>>
>>> For that one, I reused the first microbenchmark I mentioned above, and
>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>> direct call or inline.
>>>
>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining
>>> (aka inlined) vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining
>>> (aka direct call).
>>>
>>> This benchmark hasn't been run in the best possible conditions (on my
>>> dev machine, in WSL), but it gives a strong indication that even a
>>> direct call has a non-negligible impact, and that inlining leads to
>>> better results (again, in this microbenchmark).
>>>
>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find
>>> anything that would be readily available from the Interpreter. Would you
>>> have any pointers to a pre-existing feature that required this specific
>>> kind of plumbing? I would otherwise find myself in need of making
>>> CompilerDirectives available from the Interpreter, and that is something
>>> outside of my current expertise (always happy to learn, but I will need
>>> some pointers!).
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov
>>> Sent: Thursday, February 20, 2020 9:00 AM
>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> [...]
>>>
>>>> Thanks for this explanation, it makes it a lot clearer what the cases
>>>> and your concerns are. To rephrase in my own words, what you are
>>>> interested in is not this change in particular, but more the
>>>> possibility that this change provides and how to take it to the next
>>>> step, correct?
>>>
>>> Yes, it's a good summary.
>>>
>>> [...]
>>>
>>>>>         - affects profiling strategy: majority of receivers vs
>>>>> complete list of receiver types observed;
>>>> Today, we only use the N first receivers when the number of types does
>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>> Possible avenues of improvement I can see are:
>>>>     - Gather all the types in an unbounded list so we can know which
>>>> ones are the most frequent. It is unlikely to help with Java as, in the
>>>> general case, there are only a few types present at call-sites. It
>>>> could, however, be particularly helpful for languages that tend to have
>>>> many types at call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>> numbers.
>>>
>>>>    - Use the existing types to generate guards for those types we know
>>>> are common enough. Then use the types which are hot or warm, even in
>>>> case of a megamorphic call-site. It would be a simple iteration of what
>>>> we have nowadays.
>>>
>>>> In what we have today, some of the worst-case scenarios are the
>>>> following:
>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>> first and second types are types A and B, and the other type(s) is(are)
>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>> used in the initialization path (i.e. only a few times) and the other
>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>> never considered for inlining - because it was never recorded during
>>>> profiling.
>>>
>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>> periodically free some space by removing elements with lower
>>> frequencies) and give new types a chance to be profiled?
>>>
>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>> have the first type A with 49% probability, the second type B with 49%
>>>> probability, and the other types with 2% probability. Even though A and
>>>> B are the two hottest paths, it does not generate guards because
>>>> neither is a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>>
>>> Also, does it make sense to increase the morphism factor even if
>>> inlining doesn't happen?
>>>
>>>    if (recv.klass == C1) {  // >>0%
>>>       ... inlined ...
>>>    } else if (recv.klass == C2) { // >>0%
>>>       m2(); // direct call
>>>    } else { // >0%
>>>       m(); // virtual call
>>>    }
>>>
>>> vs
>>>
>>>    if (recv.klass == C1) {  // >>0%
>>>       ... inlined ...
>>>    } else { // >>0%
>>>       m(); // virtual call
>>>    }
>>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>>
>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>> code shapes comparison: how much does devirtualization help?
>>>
>>> Otherwise, enabling the 2-polymorphic shape becomes feasible only if
>>> both cases are inlined.
>>>
>>>>>         - for the N-morphic case, what's the negative effect
>>>>> (quantitative) of the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a
>>>> limited and stable set of types in the early stages of the Tiered
>>>> Compilation pipeline (making us generate N-morphic guards), and we
>>>> suddenly observe a new type. AFAIU, this is precisely what deopt is
>>>> for.
>>>
>>> I should have added "... compared to the N-polymorphic case". My
>>> intuition is the higher the morphism factor, the fewer the benefits of
>>> deopt (compared to a call) are. It would be very good to validate it
>>> with some benchmarks (both micro- and larger ones).
>>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>> it as soon as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>>>>      * invokevirtual vs invokeinterface call sites
>>>>>        - different cost models;
>>>>>        - interfaces are harder to optimize, but opportunities for
>>>>> strength-reduction from interface to virtual calls exist;
>>>> From the profiling information and the inlining mechanism point of
>>>> view, whether it is an invokevirtual or an invokeinterface doesn't
>>>> change anything.
>>>>
>>>> Are you saying that we have more to gain from generating a guard for
>>>> invokeinterface over invokevirtual because the fall-back of the
>>>> invokeinterface is much more expensive?
>>>
>>> Yes, that's the question: if we see an improvement, how much does
>>> devirtualization contribute to that?
>>>
>>> (If we add a type-guarded direct call, but there's no inlining
>>> happening, the inline cache effectively strength-reduces a virtual call
>>> to a direct call.)
>>>
>>> Considering the current implementation of virtual and interface calls
>>> (vtables vs itables), the cost model is very different.
>>>
>>> For vtable calls, it doesn't look too appealing to introduce large
>>> inline caches for individual receiver types since a call through a
>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>> address).
>>>
>>> For itable calls it can be a big win in some situations: itable lookup
>>> iterates over the Klass::_secondary_supers array and it can become quite
>>> costly. For example, some Scala workloads experience significant
>>> overheads from megamorphic calls.
>>>
>>> If we see an improvement on some benchmark, it would be very useful to
>>> be able to determine (quantitatively) how much inlining and
>>> devirtualization each contribute.
>>>
>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>> implementation [4] which brings interface calls close to virtual calls.
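
To make the vtable/itable asymmetry described above concrete, here is a toy Java model; every name in it is invented for illustration and none of it matches HotSpot's actual metadata layout. The point is only that a virtual dispatch indexes straight into a per-class table, while an interface dispatch first has to search for the right interface table, akin to the Klass::_secondary_supers walk mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Toy model of vtable vs itable lookup cost. All names are invented
// stand-ins and do not reflect HotSpot's real metadata structures.
public class DispatchModel {
    static class InterfaceTable {
        final Class<?> iface;
        final IntUnaryOperator[] methods;
        InterfaceTable(Class<?> iface, IntUnaryOperator[] methods) {
            this.iface = iface;
            this.methods = methods;
        }
    }

    static class KlassMeta {
        IntUnaryOperator[] vtable = new IntUnaryOperator[0];
        List<InterfaceTable> itables = new ArrayList<>();

        // "Virtual" dispatch: a constant-time index into the table,
        // modeling the short chain of dependent loads of a vtable stub.
        IntUnaryOperator vcall(int index) {
            return vtable[index];
        }

        // "Interface" dispatch: a linear scan to locate the interface
        // before indexing, modeling the itable-stub search.
        IntUnaryOperator icall(Class<?> iface, int index) {
            for (InterfaceTable t : itables) {
                if (t.iface == iface) return t.methods[index];
            }
            throw new IncompatibleClassChangeError("class does not implement " + iface);
        }
    }

    public static void main(String[] args) {
        KlassMeta k = new KlassMeta();
        k.vtable = new IntUnaryOperator[] { x -> x + 1 };
        k.itables.add(new InterfaceTable(Comparable.class,
                new IntUnaryOperator[] { x -> x * 2 }));
        // Same call semantics, different lookup cost.
        if (k.vcall(0).applyAsInt(41) != 42) throw new AssertionError();
        if (k.icall(Comparable.class, 0).applyAsInt(21) != 42) throw new AssertionError();
    }
}
```

In this model the itable scan cost grows with the number of interface tables on the class, which is why devirtualizing megamorphic interface calls can pay off even without inlining, as the thread discusses.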
>>> So, if it turns out that devirtualization (and not inlining) of >>> interface calls is what contributes the most, then speeding up >>> megamorphic interface calls becomes a more attractive alternative. >>> >>>>> ???? * inlining heuristics >>>>> ??????? - devirtualization vs inlining >>>>> ????????? - how much benefit from expanding a call site >>>>> (devirtualize more >>>>> cases) without inlining? should differ for virtual & interface cases >>>> I'm also writing a JMH benchmark for this case, and I'll share it as >>>> soon >>>> as I have it reliably reproducing the issue you describe. >>> >>> Also, I think it's important to have a knob to control it (inline vs >>> devirtualize). It'll enable experiments with larger benchmarks. >>> >>>>> ??????? - diminishing returns with increase in number of cases >>>>> ??????? - expanding a single call site leads to more code, but >>>>> frequencies >>>>> stay the same => colder code >>>>> ??????? - based on profiling info (types + frequencies), dynamically >>>>> choose morphism factor on per-call site basis? >>>> That is where I propose to have a lower receiver probability at >>>> which we'll >>>> stop adding more guards. I am experimenting with a global flag with >>>> a default >>>> value of 10%. >>>>> ??????? - what optimization opportunities to look for? it looks >>>>> like in >>>>> general callees should benefit more than the caller (due to merges >>>>> after >>>>> the call site) >>>> Could you please expand your concern or provide an example. >>> >>> It was more about opportunities for future explorations. I don't think >>> we have to act on it right away. >>> >>> As with "deopt vs call", my guess is callee should benefit much more >>> from inlining than the caller it is inlined into (caller sees multiple >>> callee candidates and has to merge the results while each callee >>> observes the full context and can benefit from it). 
>>> >>> If we can run some sort of static analysis on callee bytecode, what kind >>> of code patterns should we look for to guide inlining decisions? >>> >>> >>> ? >> What's your take on it? Any other ideas? >>> ? > >>> ? > We don't know what we don't know. We need first to improve the >>> logging and >>> ? > debugging output of uncommon traps for polymorphic call-sites. >>> Then, we >>> ? > need to gather data about the different cases you talked about. >>> ? > >>> ? > We also need to have some microbenchmarks to validate some of the >>> questions >>> ? > you are raising, and verify what level of gains we can expect >>> from this >>> ? > optimization. Further validation will be needed on larger >>> benchmarks and >>> ? > real-world applications as well, and that's where, I think, we need >>> to develop >>> ? > logging and debugging for this feature. >>> >>> Yes, sounds good. >>> >>> Regaring experiments to try first, here are some ideas I find promising: >>> >>> ???? * measure the cost of additional profiling >>> ???????? -XX:TypeProfileWidth=N without changing compilers >>> >>> ???? * N-morphic vs N-polymorphic (N>=2): >>> ?????? - how much deopt helps compared to a virtual call on fallback >>> path? >>> >>> ???? * inlining vs devirtualization >>> ?????? - a knob to control inlining in N-morphic/N-polymorphic cases >>> ?????? 
- measure separately the effects of devirtualization and inlining
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>
>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov
>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>> To: Ludovic Henry ; John Rose
>>>> ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> I fully agree that it's premature to discuss how default behavior
should
>>>> be changed since much more data is needed to be able to proceed with the
>>>> decision. But considering the ultimate goal is to actually improve
>>>> relevant heuristics (and effectively change the default behavior), it's
>>>> the right time to discuss what kind of experiments are needed to gather
>>>> enough data for further analysis.
>>>>
>>>> Though different shapes do look very similar at first, the shape of the
>>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>>> cases are distinct: uncommon traps are effectively exits and can
>>>> significantly simplify the CFG, while calls can return and have to be
>>>> merged back.
>>>>
>>>> The polymorphic shape is stable (no deopts/recompiles involved), but
>>>> doesn't simplify the CFG around the call site.
>>>>
>>>> The monomorphic shape gives more optimization opportunities, but deopts are
>>>> highly undesirable due to associated costs.
>>>>
>>>> For example:
>>>>
>>>>     if (recv.klass != C) { deopt(); }
>>>>     C.m(recv);
>>>>
>>>>     // recv.klass == C - exact type
>>>>     // return value == C.m(recv)
>>>>
>>>> vs
>>>>
>>>>     if (recv.klass == C) {
>>>>       C.m(recv);
>>>>     } else {
>>>>       I.m(recv);
>>>>     }
>>>>
>>>>     // recv.klass <: I - subtype
>>>>     // return value is a phi merging C.m() & I.m() where I.m() is
>>>> completely opaque.
>>>>
>>>> The monomorphic shape can degenerate into the polymorphic one (too many
>>>> recompiles), but that's a forced move to stabilize the behavior and avoid a
>>>> vicious recompilation cycle (which is *very* expensive). (Another alternative
>>>> is to leave the deopt as is - set the deopt action to "none" - but that's
>>>> usually a much worse decision.)
>>>>
>>>> And that's the reason why the monomorphic shape requires a unique receiver
>>>> type in the profile while the polymorphic shape works with the major receiver
>>>> type and probabilities.
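[Editor's note] The two compiled shapes above can be mimicked in plain Java to make the trade-off concrete. This is a hedged sketch, not HotSpot output: `deopt()` is modeled as a thrown exception, and `C`, `D`, `I` are stand-ins for the profiled receiver classes in the example.

```java
// Hand-written analogue of the two compiled shapes discussed above.
interface I { int m(); }
class C implements I { public int m() { return 1; } }
class D implements I { public int m() { return 2; } }

class GuardShapes {
    // Monomorphic shape: guard + deopt. Past the guard the receiver
    // type is exact, so there is nothing to merge after the call.
    static int monomorphic(I recv) {
        if (!(recv instanceof C)) {
            throw new IllegalStateException("deopt"); // invalidate + reinterpret
        }
        return ((C) recv).m(); // recv.klass == C - exact type
    }

    // Polymorphic shape: guard + virtual fallback. Stable (no deopt),
    // but the result is a phi merging the inlined path with an opaque call.
    static int polymorphic(I recv) {
        if (recv instanceof C) {
            return ((C) recv).m(); // exact type on this path: inlinable
        } else {
            return recv.m();       // opaque virtual call, merged back
        }
    }
}
```

The monomorphic variant trades stability for precision: any receiver other than `C` pays the full cost of the "deopt", while the polymorphic variant always completes but forces a merge of both results.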
>>>> >>>> >>>> Considering further steps, IMO for experimental purposes a single knob >>>> won't cut it: there are multiple degrees of freedom which may play >>>> important role in building accurate performance model. I'm not yet >>>> convinced it's all about inlining and narrowing the scope of discussion >>>> specifically to type profile width doesn't help. >>>> >>>> I'd like to see more knobs introduced before we start conducting >>>> extensive experiments. So, let's discuss what other information we can >>>> benefit from. >>>> >>>> I mentioned some possible options in the previous email. I find the >>>> following aspects important for future discussion: >>>> >>>> ???? * shape of fallback path >>>> ??????? - what to generalize: 2- to N-morphic vs 1- to N-polymorphic; >>>> ??????? - affects profiling strategy: majority of receivers vs complete >>>> list of receiver types observed; >>>> ??????? - for N-morphic case what's the negative effect >>>> (quantitative) of >>>> the deopt? >>>> >>>> ???? * invokevirtual vs invokeinterface call sites >>>> ??????? - different cost models; >>>> ??????? - interfaces are harder to optimize, but opportunities for >>>> strength-reduction from interface to virtual calls exist; >>>> >>>> ???? * inlining heuristics >>>> ??????? - devirtualization vs inlining >>>> ????????? - how much benefit from expanding a call site >>>> (devirtualize more >>>> cases) without inlining? should differ for virtual & interface cases >>>> ??????? - diminishing returns with increase in number of cases >>>> ??????? - expanding a single call site leads to more code, but >>>> frequencies >>>> stay the same => colder code >>>> ??????? - based on profiling info (types + frequencies), dynamically >>>> choose morphism factor on per-call site basis? >>>> ??????? - what optimization opportunities to look for? it looks like in >>>> general callees should benefit more than the caller (due to merges >>>> after >>>> the call site) >>>> >>>> What's your take on it? 
Any other ideas? >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 11.02.2020 02:42, Ludovic Henry wrote: >>>>> Hello, >>>>> Thank you very much, John and Vladimir, for your feedback. >>>>> First, I want to stress out that this patch does not change the >>>>> default. It is still bi-morphic guarded inlining by default. This >>>>> patch, however, provides you the ability to configure the JVM to go >>>>> for N-morphic guarded inlining, with N being controlled by the >>>>> -XX:TypeProfileWidth configuration knob. I understand there are >>>>> shortcomings with the specifics of this approach so I'll work on >>>>> fixing those. However, I would want this discussion to focus on >>>>> this *configurable* feature and not on changing the default. The >>>>> latter, I think, should be discussed as part of another, more >>>>> extended running discussion, since, as you pointed out, it has far >>>>> more reaching consequences that are merely improving a >>>>> micro-benchmark. >>>>> >>>>> Now to answer some of your specific questions. >>>>> >>>>>> >>>>>> I haven't looked through the patch in details, but here are some >>>>>> thoughts. >>>>>> >>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It >>>>>> seems you try to generalize (b) which becomes: >>>>>> >>>>>> ????? if (recv.klass == K1) { >>>>> m1(...); // either inline or a direct call >>>>>> ????? } else if (recv.klass == K2) { >>>>> m2(...); // either inline or a direct call >>>>>> ????? ... >>>>>> ????? } else if (recv.klass == Kn) { >>>>> mn(...); // either inline or a direct call >>>>>> ????? } else { >>>>> deopt(); // invalidate + reinterpret >>>>>> ????? } >>>>> >>>>> The general shape that exist currently in tip is: >>>>> >>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>> if (recv.klass == K1) { >>>>> ???? 
m1(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>> UseBimorphicInlining && !is_cold >>>>> else if (recv.klass == K2) { >>>>> ???? m2(.); // either inline or a direct call >>>>> } >>>>> else { >>>>> ???? // if (!too_many_traps_or_deopt()) >>>>> ???? deopt(); // invalidate + reinterpret >>>>> ???? // else >>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>> } >>>>> There is no particular distinction between Bimorphic, Polymorphic, >>>>> and Megamorphic. The latter relates more to the fallback rather >>>>> than the guards. What this change brings is more guards for >>>>> N-morphic call-sites with N > 2. But it doesn't change why and how >>>>> these guards are generated (or at least, that is not the intention). >>>>> The general shape that this change proposes is: >>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>> if (recv.klass == K1) { >>>>> ???? m1(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>> (UseBimorphicInlining || UsePolymorphicInling) >>>>> && !is_cold >>>>> else if (recv.klass == K2) { >>>>> ???? m2(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && >>>>> UsePolymorphicInling && !is_cold >>>>> else if (recv.klass == K3) { >>>>> ???? m3(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && >>>>> UsePolymorphicInling && !is_cold >>>>> else if (recv.klass == K4) { >>>>> ???? m4(.); // either inline or a direct call >>>>> } >>>>> else { >>>>> ???? // if (!too_many_traps_or_deopt()) >>>>> ???? deopt(); // invalidate + reinterpret >>>>> ???? // else >>>>> ???? 
invokeinterface A.foo(.); // virtual call with Inline Cache >>>>> } >>>>> You can observe that the condition to create the guards is no >>>>> different; only the total number increases based on >>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>> Question #1: what if you generalize polymorphic shape instead and >>>>>> allow multiple major receivers? Deoptimizing (and then >>>>>> recompiling) look less beneficial the higher morphism is >>>>>> (especially considering the inlining on all paths becomes less >>>>>> likely as well). So, having a virtual call (which becomes less >>>>>> likely due to lower frequency) on the fallback path may be a >>>>>> better option. >>>>> I agree with this statement in the general sense. However, in >>>>> practice, it depends on the specifics of each application. That is >>>>> why the degree of polymorphism needs to rely on a configuration >>>>> knob, and not pre-determined on a set of benchmarks. I agree with >>>>> the proposal to have this knob as a per-method knob, instead of a >>>>> global knob. >>>>> As for the impact of a higher morphism, I expect deoptimizations to >>>>> happen less often as more guards are generated, leading to a lower >>>>> probability of reaching the fallback path, leading to less uncommon >>>>> trap/deoptimizations. Moreover, the fallback is already going to be >>>>> a virtual call in case we hit the uncommon trap too often (using >>>>> too_many_traps_or_recompiles). >>>>>> Question #2: it would be very interesting to understand what >>>>>> exactly contributes the most to performance improvements? Is it >>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual >>>>>> call)? How much come from optimizing interface calls (itable vs >>>>>> vtable stubs)? >>>>> Devirtualization in itself (direct vs. indirect call) is not the >>>>> *primary* source of the gain. 
The gain comes from the additional >>>>> optimizations that are applied by C2 when increasing the scope/size >>>>> of the code compiled via inlining. >>>>> In the case of warm code that's not inlined as part of incremental >>>>> inlining, the call is a direct call rather than an indirect call. I >>>>> haven't measured it, but I expect performance to be positively >>>>> impacted because of the better ability of modern CPUs to correctly >>>>> predict instruction branches (a direct call) rather than data >>>>> branches (an indirect call). >>>>>> Deciding how to spend inlining budget on multiple targets with >>>>>> moderate frequency can be hard, so it makes sense to consider >>>>>> expanding 3/4/mega-morphic call sites in post-parse phase (during >>>>>> incremental inlining). >>>>> Incremental inlining is already integrated with the existing >>>>> solution. In the case of a hot or warm call, in case of failure to >>>>> inline, it generates a direct call. You still have the guards, >>>>> reducing the cost of an indirect call, but without the cost of the >>>>> inlined code. >>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>> I'll come back to you with some results. >>>>>> Getting answers to those (and similar) questions should give us >>>>>> much more insights what is actually happening in practice. >>>>>> >>>>>> Speaking of the first deliverables, it would be good to introduce >>>>>> a new experimental mode to be able to easily conduct such >>>>>> experiments with product binaries and I'd like to see the patch >>>>>> evolving in that direction. It'll enable us to gather important >>>>>> data to guide our decisions about how to enhance the heuristics in >>>>>> the product. >>>>> This patch does not change the default shape of the generated code >>>>> with bimorphic guarded inlining, because the default value of >>>>> TypeProfileWidth is 2. 
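[Editor's note] The role TypeProfileWidth plays here, and the cost question raised in Question #3, are easier to reason about with a toy model of the per-call-site receiver profile. The sketch below is a deliberate simplification, not HotSpot's actual `ReceiverTypeData` layout: each call site records at most `width` (receiver class, count) rows, and receivers seen after the table fills are only counted in aggregate, which is exactly the type information the compiler loses when the width is too small, and the extra footprint that grows as it is raised.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy per-call-site receiver-type profile with a configurable width.
class TypeProfile {
    final int width;                          // analogue of TypeProfileWidth
    final Map<Class<?>, Long> rows = new LinkedHashMap<>();
    long overflow;                            // receivers seen after the table filled

    TypeProfile(int width) { this.width = width; }

    void record(Object recv) {
        Class<?> k = recv.getClass();
        Long count = rows.get(k);
        if (count != null) {
            rows.put(k, count + 1);           // known receiver: bump its row
        } else if (rows.size() < width) {
            rows.put(k, 1L);                  // free row: start tracking this type
        } else {
            overflow++;                       // table full: type identity is lost
        }
    }

    int morphism() { return rows.size(); }
}
```

With `width = 2`, a site that actually sees three receiver types reports a morphism of 2 and a non-zero overflow, so a compiler reading this profile could not emit a guard for the third type even if it is frequent.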
If your concern is that TypeProfileWidth is >>>>> used for other purposes and that I should add a dedicated knob to >>>>> control the maximum morphism of these guards, then I agree. I am >>>>> using TypeProfileWidth because it's the available and more >>>>> straightforward knob today. >>>>> Overall, this change does not propose to go from bimorphic to >>>>> N-morphic by default (with N between 0 and 8). This change focuses >>>>> on using an existing knob (TypeProfileWidth) to open the >>>>> possibility for N-morphic guarded inlining. I would want the >>>>> discussion to change the default to be part of a separate RFR, to >>>>> separate the feature change discussion from the default change >>>>> discussion. >>>>>> Such optimizations are usually not unqualified wins because of >>>>>> highly "non-linear" or "non-local" effects, where a local change >>>>>> in one direction might couple to nearby change in a different >>>>>> direction, with a net change that's "wrong", due to side effects >>>>>> rolling out from the "good" change. (I'm talking about side >>>>>> effects in our IR graph shaping heuristics, not memory side effects.) >>>>>> >>>>>> One out of many such "wrong" changes is a local optimization which >>>>>> expands code on a medium-hot path, which has the side effect of >>>>>> making a containing block of code larger than convenient.? Three >>>>>> ways of being "larger than convenient" are a. the object code of >>>>>> some containing loop doesn't fit as well in the instruction >>>>>> memory, b. the total IR size tips over some budgetary limit which >>>>>> causes further IR creation to be throttled (or the whole graph to >>>>>> be thrown away!), or c. some loop gains additional branch >>>>>> structure that impedes the optimization of the loop, where an out >>>>>> of line call would not. 
>>>>>> >>>>>> My overall point here is that an eager expansion of IR that is >>>>>> locally "better" (we might even say "optimal") with respect to the >>>>>> specific path under consideration hurts the optimization of nearby >>>>>> paths which are more important. >>>>> I generally agree with this statement and explanation. Again, it is >>>>> not the intention of this patch to change the default number of >>>>> guards for polymorphic call-sites, but it is to give users the >>>>> ability to optimize the code generation of their JVM to their >>>>> application. >>>>> Since I am relying on the existing inlining infrastructure, late >>>>> inlining and hot/warm/cold call generators allows to have a >>>>> "best-of-both-world" approach: it inlines code in the hot guards, >>>>> it direct calls or inline (if inlining thresholds permits) the >>>>> method in the warm guards, and it doesn't even generate the guard >>>>> in the cold guards. The question here is, then how do you define >>>>> hot, warm, and cold. As discussed above, I want to explore using a >>>>> low-threshold even to try to generate a guard (at least 10% of >>>>> calls are to this specific receiver). >>>>> On the overhead of adding more guards, I see this change as >>>>> beneficial because it removes an arbitrary limit on what code can >>>>> be inlined. For example, if you have a call-site with 3 types, each >>>>> with a hit probability of 30%, then with a maximum limit of 2 types >>>>> (with bimorphic guarded inlining), only the first 2 types are >>>>> guarded and inlined. That is despite an apparent gain in guarding >>>>> and inlining against the 3 types. >>>>> I agree we want to have guardrails to avoid worst-case >>>>> degradations. It is my understanding that the existing inlining >>>>> infrastructure (with late inlining, for example) provides many >>>>> safeguards already, and it is up to this change not to abuse these. 
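[Editor's note] The 3-receivers-at-30% example above reduces to simple arithmetic: the share of calls still taking the fallback path is one minus the probability mass covered by emitted guards. A hedged sketch of that calculation follows; the per-receiver probability cutoff models the 10% threshold proposed in the thread, not an existing HotSpot flag.

```java
// Fraction of calls left on the virtual-call fallback path, given
// per-receiver probabilities, a guard-count limit, and a minimum
// per-receiver probability below which no guard is emitted.
class FallbackMass {
    static double fallback(double[] probs, int maxGuards, double minProb) {
        // probs must be sorted by descending frequency, as in the profile.
        double covered = 0.0;
        int guards = 0;
        for (double p : probs) {
            if (guards == maxGuards || p < minProb) break; // stop emitting guards
            covered += p;
            guards++;
        }
        return 1.0 - covered; // probability mass that reaches the fallback
    }
}
```

With probabilities {0.30, 0.30, 0.30}, a bimorphic limit leaves 40% of calls on the virtual fallback, while allowing a third guard leaves only the residual 10%, which is the gain the example argues for.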
>>>>>> (It clearly doesn't work to tell an impacted customer, well, you >>>>>> may get a 5% loss, but the micro created to test this thing shows >>>>>> a 20% gain, and all the functional tests pass.) >>>>>> >>>>>> This leads me to the following suggestion:? Your code is a very >>>>>> good POC, and deserves more work, and the next step in that work >>>>>> is probably looking for and thinking about performance >>>>>> regressions, and figuring out how to throttle this thing. >>>>> Here again, I want that feature to be behind a configuration knob, >>>>> and then discuss in a future RFR to change the default. >>>>>> A specific next step would be to make the throttling of this >>>>>> feature be controllable. MorphismLimit should be a global on its >>>>>> own.? And it should be configurable through the CompilerOracle per >>>>>> method.? (See similar code for similar throttles.)? And it should >>>>>> be more sensitive to the hotness of the overall call and of the >>>>>> various slices of the call's profile.? (I notice with suspicion >>>>>> that the comment "The single majority receiver sufficiently >>>>>> outweighs the minority" is missing in the changed code.)? And, if >>>>>> the change is as disruptive to heuristics as I suspect it *might* >>>>>> be, the call site itself *might* need some kind of dynamic >>>>>> feedback which says, after some deopt or reprofiling, "take it >>>>>> easy here, try plan B." That last point is just speculation, but I >>>>>> threw it in to show the kinds of measures we *sometimes* have to >>>>>> take in avoiding "side effects" to our locally pleasant >>>>>> optimizations. >>>>> I'll add this per-method knob on the CompilerOracle in the next >>>>> iteration of this patch. >>>>>> But, let me repeat: I'm glad to see this experiment. And very, >>>>>> very glad to see all the cool stuff that is coming out of your >>>>>> work-group.? Welcome to the adventure! 
>>>>> For future improvements, I will keep focusing on inlining as I see >>>>> it as the door opener to many more optimizations in C2. I am still >>>>> learning at what can be done to reduce the size of the inlined code >>>>> by, for example, applying specific optimizations that simplify the >>>>> CG (like dead-code elimination or constant propagation) before >>>>> inlining the code. As you said, we are not short of ideas on *how* >>>>> to improve it, but we have to be very wary of *what impact* it'll >>>>> have on real-world applications. We're working with internal >>>>> customers to figure that out, and we'll share them as soon as we >>>>> are ready with benchmarks for those use-case patterns. >>>>> What I am working on now is: >>>>> ??? - Add a per-method flag through CompilerOracle >>>>> ??? - Add a threshold on the probability of a receiver to generate >>>>> a guard (I am thinking of 10%, i.e., if a receiver is observed less >>>>> than 1 in every 10 calls, then don't generate a guard and use the >>>>> fallback) >>>>> ??? - Check the overhead of increasing TypeProfileWidth on >>>>> profiling speed (in the interpreter and level #3 code) >>>>> Thank you, and looking forward to the next review (I expect to post >>>>> the next iteration of the patch today or tomorrow). >>>>> -- >>>>> Ludovic >>>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>> To: Ludovic Henry ; >>>>> hotspot-compiler-dev at openjdk.java.net >>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>> >>>>> Very interesting results, Ludovic! 
>>>>>
>>>>>> The image can be found at
>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>
>>>>>
>>>>> Can you elaborate on the experiment itself, please? In particular, what
>>>>> does PERCENTILES actually mean?
>>>>>
>>>>> I haven't looked through the patch in detail, but here are some
>>>>> thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems
>>>>> you try to generalize (b) which becomes:
>>>>>
>>>>>       if (recv.klass == K1) {
>>>>>          m1(...); // either inline or a direct call
>>>>>       } else if (recv.klass == K2) {
>>>>>          m2(...); // either inline or a direct call
>>>>>       ...
>>>>>       } else if (recv.klass == Kn) {
>>>>>          mn(...); // either inline or a direct call
>>>>>       } else {
>>>>>          deopt(); // invalidate + reinterpret
>>>>>       }
>>>>>
>>>>> Question #1: what if you generalize the polymorphic shape instead and allow
>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less
>>>>> beneficial the higher the morphism is (especially considering the inlining
>>>>> on all paths becomes less likely as well).
So, having a virtual call >>>>> (which becomes less likely due to lower frequency) on the fallback >>>>> path >>>>> may be a better option. >>>>> >>>>> >>>>> Question #2: it would be very interesting to understand what exactly >>>>> contributes the most to performance improvements? Is it inlining? Or >>>>> maybe devirtualization (avoid the cost of virtual call)? How much come >>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>> >>>>> Deciding how to spend inlining budget on multiple targets with >>>>> moderate >>>>> frequency can be hard, so it makes sense to consider expanding >>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental >>>>> inlining). >>>>> >>>>> >>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>> (interpreter and level #3 code) and dynamic footprint? >>>>> >>>>> >>>>> Getting answers to those (and similar) questions should give us much >>>>> more insights what is actually happening in practice. >>>>> >>>>> Speaking of the first deliverables, it would be good to introduce a >>>>> new >>>>> experimental mode to be able to easily conduct such experiments with >>>>> product binaries and I'd like to see the patch evolving in that >>>>> direction. It'll enable us to gather important data to guide our >>>>> decisions about how to enhance the heuristics in the product. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> [1] (a) Monomorphic: >>>>> ????? if (recv.klass == K1) { >>>>> ???????? m1(...); // either inline or a direct call >>>>> ????? } else { >>>>> ???????? deopt(); // invalidate + reinterpret >>>>> ????? } >>>>> >>>>> ????? (b) Bimorphic: >>>>> ????? if (recv.klass == K1) { >>>>> ???????? m1(...); // either inline or a direct call >>>>> ????? } else if (recv.klass == K2) { >>>>> ???????? m2(...); // either inline or a direct call >>>>> ????? } else { >>>>> ???????? deopt(); // invalidate + reinterpret >>>>> ????? } >>>>> >>>>> ????? (c) Polymorphic: >>>>> ????? 
if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>>          m1(...); // either inline or a direct call
>>>>>       } else {
>>>>>          K.m(); // virtual call
>>>>>       }
>>>>>
>>>>>       (d) Megamorphic:
>>>>>       K.m(); // virtual (K is either concrete or interface class)
>>>>>
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: hotspot-compiler-dev
>>>>>> On Behalf Of
>>>>>> Ludovic Henry
>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In our ongoing search for better performance, I've looked at
>>>>>> inlining and, more specifically, at polymorphic guarded inlining.
>>>>>> Today in HotSpot, the maximum number of type guards at any
>>>>>> call site is two - with bimorphic guarded inlining. However, Graal
>>>>>> and Zing have observed great results from increasing that limit.
>>>>>>
>>>>>> You'll find below a patch that makes the number of type guards
>>>>>> configurable with the `TypeProfileWidth` global.
>>>>>>
>>>>>> Testing:
>>>>>> Passing tier1 on Linux and Windows, plus other large applications
>>>>>> (through the Adopt testing scripts)
>>>>>>
>>>>>> Benchmarking:
>>>>>> To get data, we ran a benchmark against Apache Pinot and observed
>>>>>> the following results:
>>>>>>
>>>>>> [benchmark chart attachment not preserved in the archive]
>>>>>>
>>>>>> We observe close to 20% improvement on this sample benchmark with
>>>>>> a morphism (=width) of 3 or 4. We are currently validating these
>>>>>> numbers on a more extensive set of benchmarks and platforms, and
>>>>>> I'll share them as soon as we have them.
>>>>>>
>>>>>> I am happy to provide more information, just let me know if you
>>>>>> have any questions.
>>>>>> >>>>>> Thank you, >>>>>> >>>>>> -- >>>>>> Ludovic >>>>>> >>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> index 73854806ed..845070fbe1 100644 >>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> @@ -38,7 +38,7 @@ private: >>>>>> ?????? friend class ciMethod; >>>>>> ?????? friend class ciMethodHandle; >>>>>> >>>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care >>>>>> about >>>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care >>>>>> about >>>>>> ?????? int? _limit;??????????????? // number of receivers have >>>>>> been determined >>>>>> ?????? int? _morphism;???????????? // determined call site's morphism >>>>>> ?????? int? _count;??????????????? // # times has this call been >>>>>> executed >>>>>> @@ -47,6 +47,7 @@ private: >>>>>> ?????? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>>>> >>>>>> ?????? ciCallProfile() { >>>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>>> can't be smaller than TypeProfileWidth"); >>>>>> ???????? _limit = 0; >>>>>> ???????? _morphism??? = 0; >>>>>> ???????? _count = -1; >>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>> @@ -496,9 +496,7 @@ ciCallProfile >>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>> ?????????? // Every profiled call site has a counter. >>>>>> ?????????? int count = >>>>>> check_overflow(data->as_CounterData()->count(), >>>>>> java_code_at_bci(bci)); >>>>>> >>>>>> -????? if (!data->is_ReceiverTypeData()) { >>>>>> -??????? result._receiver_count[0] = 0;? // that's a definite zero >>>>>> -????? } else { // ReceiverTypeData is a subclass of CounterData >>>>>> +????? 
if (data->is_ReceiverTypeData()) {
>>>>>>             ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>>             // In addition, virtual call sites have receiver type information
>>>>>>             int receivers_count_total = 0;
>>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>>>               // is recorded or an associated counter is incremented, but not both. With
>>>>>>               // tiered compilation, however, both can happen due to the interpreter and
>>>>>>               // C1 profiling invocations differently. Address that inconsistency here.
>>>>>> -          if (morphism == 1 && count > 0) {
>>>>>> +          if (morphism >= 1 && count > 0) {
>>>>>>               epsilon = count;
>>>>>>               count = 0;
>>>>>>             }
>>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>>>             // If we extend profiling to record methods,
>>>>>>             // we will set result._method also.
>>>>>>           }
>>>>>> +        result._morphism = morphism;
>>>>>>           // Determine call site's morphism.
>>>>>>           // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>>>>           // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>>>>>           // The call site count is > 0 in the case of a polymorphic virtual call.
>>>>>> -        if (morphism > 0 && morphism == result._limit) {
>>>>>> -           // The morphism <= MorphismLimit.
>>>>>> -           if ((morphism <  ciCallProfile::MorphismLimit) ||
>>>>>> -               (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>>>>> +        assert(result._morphism == result._limit, "");
>>>>>> #ifdef ASSERT
>>>>>> +        if (result._morphism > 0) {
>>>>>> +           // The morphism <= TypeProfileWidth.
>>>>>> +           if ((result._morphism <  TypeProfileWidth) ||
>>>>>> +               (result._morphism == TypeProfileWidth && count == 0)) {
>>>>>>               if (count > 0) {
>>>>>>                 this->print_short_name(tty);
>>>>>>                 tty->print_cr(" @ bci:%d", bci);
>>>>>>                 this->print_codes();
>>>>>>                 assert(false, "this call site should not be polymorphic");
>>>>>>               }
>>>>>> -#endif
>>>>>> -            result._morphism = morphism;
>>>>>>             }
>>>>>>           }
>>>>>> +#endif
>>>>>>           // Make the count consistent if this is a call profile. If count is
>>>>>>           // zero or less, presume that this is a typecheck profile and
>>>>>>           // do nothing.  Otherwise, increase count to be the sum of all
>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>>>>     }
>>>>>>     _receiver[i] = receiver;
>>>>>>     _receiver_count[i] = receiver_count;
>>>>>> -  if (_limit < MorphismLimit) _limit++;
>>>>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>>>> }
>>>>>>
>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> @@ -389,9 +389,16 @@
>>>>>>     product(bool, UseBimorphicInlining, true,                                \
>>>>>>             "Profiling based inlining for two receivers")                    \
>>>>>>                                                                              \
>>>>>> +  product(bool, UsePolymorphicInlining, true,                              \
>>>>>> +          "Profiling based inlining for two or more receivers")            \
>>>>>> +                                                                           \
>>>>>>     product(bool, UseOnlyInlinedBimorphic, true,                             \
>>>>>>             "Don't use BimorphicInlining if can't inline a second method")   \
>>>>>>                                                                              \
>>>>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                           \
>>>>>> +          "Don't use PolymorphicInlining if can't inline a non-major "     \
>>>>>> +          "receiver's method")                                             \
>>>>>> +                                                                           \
>>>>>>     product(bool, InsertMemBarAfterArraycopy, true,                          \
>>>>>>             "Insert memory barrier after arraycopy call")                    \
>>>>>>                                                                              \
>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>
>>>>>>     // See how many times this site has been invoked.
>>>>>>     int site_count = profile.count();
>>>>>> -  int receiver_count = -1;
>>>>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>>>>> -    // Receivers in the profile structure are ordered by call counts
>>>>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>>>>> -    receiver_count = profile.receiver_count(0);
>>>>>> -  }
>>>>>>
>>>>>>     CompileLog* log = this->log();
>>>>>>     if (log != NULL) {
>>>>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>>>>> +    ResourceMark rm;
>>>>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>> +      rids[i] = log->identify(profile.receiver(i));
>>>>>> +    }
>>>>>>     log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>>>                     log->identify(callee), site_count, prof_factor);
>>>>>>     if (call_does_dispatch)  log->print(" virtual='1'");
>>>>>>     if (allow_inline)     log->print(" inline='1'");
>>>>>> -    if (receiver_count >= 0) {
>>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>>>>> -      if (profile.has_receiver(1)) {
>>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>> +      if (i == 0) {
>>>>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>>>>> +      } else {
>>>>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>>       }
>>>>>>     }
>>>>>>     if (callee->is_method_handle_intrinsic()) {
>>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>     if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>>>       // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>>>>       bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>>>> -      ciMethod* receiver_method = NULL;
>>>>>>
>>>>>>       int morphism = profile.morphism();
>>>>>> +
>>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>>>>> +
>>>>>>       if (speculative_receiver_type != NULL) {
>>>>>>         if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>>>>           // We have a speculative type, we should be able to resolve
>>>>>>           // the call. We do that before looking at the profiling at
>>>>>> -          // this invoke because it may lead to bimorphic inlining which
>>>>>> +          // this invoke because it may lead to polymorphic inlining which
>>>>>>           // a speculative type should help us avoid.
>>>>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                   speculative_receiver_type);
>>>>>> -          if (receiver_method == NULL) {
>>>>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                       speculative_receiver_type);
>>>>>> +          if (receiver_methods[0] == NULL) {
>>>>>>             speculative_receiver_type = NULL;
>>>>>>           } else {
>>>>>>             morphism = 1;
>>>>>>           }
>>>>>>         } else {
>>>>>>           // speculation failed before. Use profiling at the call
>>>>>> -          // (could allow bimorphic inlining for instance).
>>>>>> +          // (could allow polymorphic inlining for instance).
>>>>>>           speculative_receiver_type = NULL;
>>>>>>         }
>>>>>>       }
>>>>>> -      if (receiver_method == NULL &&
>>>>>> +      if (receiver_methods[0] == NULL &&
>>>>>>           (have_major_receiver || morphism == 1 ||
>>>>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>>>>> -        // receiver_method = profile.method();
>>>>>> +           (morphism == 2 && UseBimorphicInlining) ||
>>>>>> +           (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>> +        assert(profile.has_receiver(0), "no receiver at 0");
>>>>>> +        // receiver_methods[0] = profile.method();
>>>>>>         // Profiles do not suggest methods now.  Look it up in the major receiver.
>>>>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                 profile.receiver(0));
>>>>>> +        receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                     profile.receiver(0));
>>>>>>       }
>>>>>> -      if (receiver_method != NULL) {
>>>>>> -        // The single majority receiver sufficiently outweighs the minority.
>>>>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>>> -        if (hit_cg != NULL) {
>>>>>> -          // Look up second receiver.
>>>>>> -          CallGenerator* next_hit_cg = NULL;
>>>>>> -          ciMethod* next_receiver_method = NULL;
>>>>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>>>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                          profile.receiver(1));
>>>>>> -            if (next_receiver_method != NULL) {
>>>>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>>>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>>>>> -                                  allow_inline, prof_factor);
>>>>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>> -                  // Skip if we can't inline second receiver's method
>>>>>> -                  next_hit_cg = NULL;
>>>>>> +      if (receiver_methods[0] != NULL) {
>>>>>> +        CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>> +        memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>>> +
>>>>>> +        hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>> +                            vtable_index, !call_does_dispatch, jvms,
>>>>>> +                            allow_inline, prof_factor);
>>>>>> +        if (hit_cgs[0] != NULL) {
>>>>>> +          if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>> +            for (int i = 1; i < morphism; i++) {
>>>>>> +              assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>>> +              receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                           profile.receiver(i));
>>>>>> +              if (receiver_methods[i] != NULL) {
>>>>>> +                hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>>> +                                      vtable_index, !call_does_dispatch, jvms,
>>>>>> +                                      allow_inline, prof_factor);
>>>>>> +                if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>> +                    ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>> +                  // Skip if we can't inline non-major receiver's method
>>>>>> +                  hit_cgs[i] = NULL;
>>>>>> +                }
>>>>>>               }
>>>>>>             }
>>>>>>           }
>>>>>>           CallGenerator* miss_cg;
>>>>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>>>>> -                                               ? Deoptimization::Reason_bimorphic
>>>>>> +          Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>> +                                               ? Deoptimization::Reason_polymorphic
>>>>>>                                                 : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>> -             ) {
>>>>>> +          if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>>             // Generate uncommon trap for class check failure path
>>>>>> -            // in case of monomorphic or bimorphic virtual call site.
>>>>>> +            // in case of polymorphic virtual call site.
>>>>>>             miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>>>                         Deoptimization::Action_maybe_recompile);
>>>>>>           } else {
>>>>>>             // Generate virtual call for class check failure path
>>>>>> -            // in case of polymorphic virtual call site.
>>>>>> +            // in case of megamorphic virtual call site.
>>>>>>             miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>>>           }
>>>>>> -          if (miss_cg != NULL) {
>>>>>> -            if (next_hit_cg != NULL) {
>>>>>> +          for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>>> +            if (hit_cgs[i] != NULL) {
>>>>>>               assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>>> +              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>>>               // We don't need to record dependency on a receiver here and below.
>>>>>>               // Whenever we inline, the dependency is added by Parse::Parse().
>>>>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>>> -            }
>>>>>> -            if (miss_cg != NULL) {
>>>>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>> -              if (cg != NULL)  return cg;
>>>>>> +              miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>>>             }
>>>>>>           }
>>>>>> +          if (miss_cg != NULL) {
>>>>>> +            ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>>> +            float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>> +            CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>> +            if (cg != NULL)  return cg;
>>>>>> +          }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> index 11df15e004..2d14b52854 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>>>     "class_check",
>>>>>>     "array_check",
>>>>>>     "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>     "profile_predicate",
>>>>>>     "unloaded",
>>>>>>     "uninitialized",
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>>     Reason_class_check,           // saw unexpected object class (@bci)
>>>>>>     Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>>>>     Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>>>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>>>>
>>>>>> #if INCLUDE_JVMCI
>>>>>>     Reason_unreached0             = Reason_null_assert,
>>>>>>     Reason_type_checked_inlining  = Reason_intrinsic,
>>>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>>> #endif
>>>>>>
>>>>>>     Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> index 94b544824e..ee761626c4 100644
>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>>>     declare_constant(Deoptimization::Reason_class_check)                     \
>>>>>>     declare_constant(Deoptimization::Reason_array_check)                     \
>>>>>>     declare_constant(Deoptimization::Reason_intrinsic)                       \
>>>>>> -   declare_constant(Deoptimization::Reason_bimorphic)                       \
>>>>>> +   declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>>>>     declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>>>>     declare_constant(Deoptimization::Reason_unloaded)                        \
>>>>>>     declare_constant(Deoptimization::Reason_uninitialized)                   \

From viv.desh at gmail.com  Mon Apr  6 18:55:05 2020
From: viv.desh at gmail.com (Vivek Deshpande)
Date: Mon, 6 Apr 2020 11:55:05 -0700
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes
In-Reply-To: 
References: 
Message-ID: 

Hi Sandhya,

I looked at the patch over the weekend. It looks good to me and a lot of work is involved. I have a question: is this patch intended for panama/dev or mainline jdk?

Nit: macroAssembler_x86.cpp has an extra line at 115.

Regards,
Vivek
OpenJDK id: vdeshpande

On Fri, Apr 3, 2020 at 5:18 PM Viswanathan, Sandhya <
sandhya.viswanathan at intel.com> wrote:

> Hi,
>
> Following up on review requests of API [0], Java implementation [1] and
> General HotSpot changes [2] for Vector API, here's a request for review
> of x86 backend changes required for supporting the API:
>
> JEP: https://openjdk.java.net/jeps/338
> JBS: https://bugs.openjdk.java.net/browse/JDK-8223347
> Webrev:
> http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/
>
> Complete implementation resides in vector-unstable branch of
> panama/dev repository [3].
>
> Looking forward to your feedback.
>
> Best Regards,
> Sandhya
>
> [0]
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html
>
> [1]
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html
>
> [2]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html
>
> [3] https://openjdk.java.net/projects/panama/
>
>     $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable
>

--
Thanks and Regards,
Vivek Deshpande
viv.desh at gmail.com

From sandhya.viswanathan at intel.com  Mon Apr  6 19:01:17 2020
From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya)
Date: Mon, 6 Apr 2020 19:01:17 +0000
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes
In-Reply-To: 
References: 
Message-ID: 

Hi Vivek,

Thanks for the feedback. This patch is for mainline jdk.

Best Regards,
Sandhya

From: Vivek Deshpande
Sent: Monday, April 06, 2020 11:55 AM
To: Viswanathan, Sandhya
Cc: hotspot-compiler-dev at openjdk.java.net; core-libs-dev at openjdk.java.net; hotspot-dev
Subject: Re: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes

Hi Sandhya,

I looked at the patch over the weekend. It looks good to me and a lot of work is involved. I have a question: is this patch intended for panama/dev or mainline jdk?

Nit: macroAssembler_x86.cpp has an extra line at 115.

Regards,
Vivek
OpenJDK id: vdeshpande

On Fri, Apr 3, 2020 at 5:18 PM Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:

Hi,

Following up on review requests of API [0], Java implementation [1] and
General HotSpot changes [2] for Vector API, here's a request for review
of x86 backend changes required for supporting the API:

JEP: https://openjdk.java.net/jeps/338
JBS: https://bugs.openjdk.java.net/browse/JDK-8223347
Webrev: http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/

Complete implementation resides in vector-unstable branch of
panama/dev repository [3].

Looking forward to your feedback.
Best Regards,
Sandhya

[0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html

[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html

[2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html

[3] https://openjdk.java.net/projects/panama/

    $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable

--
Thanks and Regards,
Vivek Deshpande
viv.desh at gmail.com

From ekaterina.pavlova at oracle.com  Tue Apr  7 03:12:49 2020
From: ekaterina.pavlova at oracle.com (Ekaterina Pavlova)
Date: Mon, 6 Apr 2020 20:12:49 -0700
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes
In-Reply-To: 
References: 
Message-ID: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com>

Hi Vladimir,

what kind of testing has been done to verify these changes?
Taking into account that the changes are quite large and touch shared code,
running the HotSpot compiler and perhaps runtime tiers would be very advisable.

thanks,
-katya

On 4/3/20 4:12 PM, Vladimir Ivanov wrote:
> Hi,
>
> Following up on review requests of API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API:
>
> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/
>
> (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.)
>
> Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in C2 JIT-compiler.
>
> As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fall back behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)."
>
> The rest of JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values.
> It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available.
>
> Vector API talk from JVMLS'18 [5] contains brief overview of JVM implementation and some details.
>
> Complete implementation resides in vector-unstable branch of panama/dev repository [6].
>
> Now to gory details (the patch is split in multiple "sub-webrevs"):
>
> ===========================================================
>
> (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/
>
> Ideal vector nodes for new operations introduced by Vector API.
>
> (Platform-specific back end support will be posted for review separately.)
>
> ===========================================================
>
> (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/
>
> JVM Java interface (VectorSupport) and intrinsic support in C2.
>
> Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by a VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before the EA pass.
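The intrinsic-with-Java-fallback pattern Paul describes in the quoted paragraph can be sketched in plain Java. This is an illustrative sketch only: the class, method name, and signature below are invented for the example and are not the actual jdk.internal.vm.vector.VectorSupport API.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical sketch of an intrinsic candidate whose last argument is the
// scalar fallback. When C2 supports the operation for the given arguments it
// would emit a vector instruction instead; in the interpreter and in C1 the
// Java fallback below always runs, element by element.
public class IntrinsicFallbackSketch {
    static int[] binaryOp(int[] a, int[] b, IntBinaryOperator scalarFallback) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = scalarFallback.applyAsInt(a[i], b[i]); // scalar fallback path
        }
        return r;
    }

    public static void main(String[] args) {
        int[] r = binaryOp(new int[]{1, 2, 3}, new int[]{4, 5, 6}, Integer::sum);
        if (r[0] != 5 || r[1] != 7 || r[2] != 9) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Whether the intrinsic or the fallback runs is transparent to the caller; both must produce the same result, which is what makes the fallback a safe default.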
>
> Vectors have a 2-level on-heap representation: a primitive array is used as the backing storage for the vector value, and it is encapsulated in a typed wrapper (e.g., Int256Vector - a vector of 8 ints - contains an int[8] instance which is used to store the vector value).
>
> Unless a VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with the VectorBox node and using it during expansion.
>
> Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over.
>
> ===========================================================
>
> (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/
>
> Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].)
>
> The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or are spilled on the stack, and this allows boxing/unboxing to be avoided when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata, and all the heavy lifting happens only when rematerialization takes place.
>
> The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing).
>
> (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point thus enabling the original instance to go away.)
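The 2-level representation described above, a typed wrapper encapsulating a primitive array used as backing storage, can be sketched as follows. The class below is a simplified stand-in for illustration, not the real Int256Vector implementation.

```java
// Simplified stand-in for the wrapper-around-array layout described above:
// the vector value lives in a primitive int[8] (256 bits), and the typed
// wrapper is the object that C2's vector box elimination tries to remove.
public class VectorBoxSketch {
    static final class Int256 {
        final int[] vec;                    // level 2: primitive backing storage

        Int256(int[] vec) {                 // level 1: typed wrapper ("box")
            if (vec.length != 8) throw new IllegalArgumentException("need 8 lanes");
            this.vec = vec.clone();
        }

        int lane(int i) { return vec[i]; }  // "unboxing" reads the backing array
    }

    public static void main(String[] args) {
        Int256 v = new Int256(new int[]{0, 1, 2, 3, 4, 5, 6, 7});
        if (v.lane(3) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```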
>
> ===========================================================
>
> (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/
>
> HotSpot changes for the jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-modules jdk.incubator.vector" on the command line to be able to use it.
> When the user does that, the JVM automatically enables Vector API support.
> It improves usability (the user doesn't need to separately "open" the API and enable JVM support) while minimizing risks of destabilization from new code when the API is not used.
>
> That's it! Will be happy to answer any questions.
>
> And thanks in advance for any feedback!
>
> Best regards,
> Vladimir Ivanov
>
> [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html
>
> [2] https://openjdk.java.net/jeps/338
>
> [3] https://openjdk.java.net/projects/panama/
>
> [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html
>
> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf
>
> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9
>
$ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.x.ivanov at oracle.com Tue Apr 7 09:39:32 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 7 Apr 2020 12:39:32 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> References: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> Message-ID: <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> Hi Katya, > what kind of testing has been done to verify these changes? > Taking into account the changes are quite large and touch share code > running hs compiler and perhaps runtime tiers would be very advisable. The changes (and previous versions) were tested in 2 modes: * ran through tier1-tier4 with the functionality turned OFF; (also, some previous version went through tier1-tier6 once) * unit tests on Vector API were run on different x86 hardware in the following modes: -XX:UseAVX=[3,2,1,0] -XX:UseSSE=[4,3,2]. Arm engineers tested the version in vector-unstable branch on AArch64 hardware. As of now, the only known test failure is compiler/graalunit/HotspotTest.java in org.graalvm.compiler.hotspot.test.CheckGraalIntrinsics which should be taught about new JVM intrinsics added. Best regards, Vladimir Ivanov > On 4/3/20 4:12 PM, Vladimir Ivanov wrote: >> Hi, >> >> Following up on review requests of API [0] and Java implementation [1] >> for Vector API (JEP 338 [2]), here's a request for review of general >> HotSpot changes (in shared code) required for supporting the API: >> >> >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ >> >> >> (First of all, to set proper expectations: since the JEP is still in >> Candidate state, the intention is to initiate preliminary round(s) of >> review to inform the community and gather feedback before sending out >> final/official RFRs once the JEP is Targeted to a release.) 
>> Vector API (being developed in Project Panama [3]) relies on JVM
>> support to utilize optimal vector hardware instructions at runtime. It
>> interacts with JVM through intrinsics (declared in
>> jdk.internal.vm.vector.VectorSupport [4]) which expose vector
>> operations support in C2 JIT-compiler.
>>
>> As Paul wrote earlier: "A vector intrinsic is an internal low-level
>> vector operation. The last argument to the intrinsic is fall back
>> behavior in Java, implementing the scalar operation over the number of
>> elements held by the vector. Thus, if the intrinsic is not supported
>> in C2 for the other arguments then the Java implementation is executed
>> (the Java implementation is always executed when running in the
>> interpreter or for C1)."
>>
>> The rest of JVM support is about aggressively optimizing vector boxes
>> to minimize (ideally eliminate) the overhead of boxing for vector values.
>> It's a stop-gap solution for the vector box elimination problem until
>> inline classes arrive. Vector classes are value-based and in the
>> longer term will be migrated to inline classes once the support
>> becomes available.
>>
>> Vector API talk from JVMLS'18 [5] contains brief overview of JVM
>> implementation and some details.
>>
>> Complete implementation resides in vector-unstable branch of
>> panama/dev repository [6].
>>
>> Now to gory details (the patch is split in multiple "sub-webrevs"):
>>
>> ===========================================================
>>
>> (1)
>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/
>>
>> Ideal vector nodes for new operations introduced by Vector API.
>>
>> (Platform-specific back end support will be posted for review
>> separately.)
>> >> =========================================================== >> >> (2) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ >> >> >> JVM Java interface (VectorSupport) and intrinsic support in C2. >> >> Vector instances are initially represented as VectorBox macro nodes >> and "unboxing" is represented by VectorUnbox node. It simplifies >> vector box elimination analysis and the nodes are expanded later right >> before EA pass. >> >> Vectors have 2-level on-heap representation: for the vector value >> primitive array is used as a backing storage and it is encapsulated in >> a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a >> int[8] instance which is used to store vector value). >> >> Unless VectorBox node goes away, it needs to be expanded into an >> allocation eventually, but it is a pure node and doesn't have any JVM >> state associated with it. The problem is solved by keeping JVM state >> separately in a VectorBoxAllocate node associated with VectorBox node >> and use it during expansion. >> >> Also, to simplify vector box elimination, inlining of vector reboxing >> calls (VectorSupport::maybeRebox) is delayed until the analysis is over. >> >> =========================================================== >> >> (3) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ >> >> >> Vector box elimination analysis implementation. (Brief overview: >> slides #36-42 [5].) >> >> The main part is devoted to scalarization across safepoints and >> rematerialization support during deoptimization. In C2-generated code >> vector operations work with raw vector values which live in registers >> or spilled on the stack and it allows to avoid boxing/unboxing when a >> vector value is alive across a safepoint. 
As with other values, >> there's just the location of the vector value at the safepoint and >> the vector type information recorded in the relevant nmethod metadata; >> all the heavy lifting happens only when rematerialization takes place. >> >> The analysis preserves object identity invariants except during >> aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). >> >> (Aggressive reboxing is crucial for cases when vectors "escape": it >> allocates a fresh instance at every escape point, thus enabling the >> original instance to go away.) >> >> =========================================================== >> >> (4) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ >> >> >> HotSpot changes for the jdk.incubator.vector module. Vector support is >> marked experimental and turned off by default. JEP 338 proposes the >> API to be released as an incubator module, so a user has to specify >> "--add-modules jdk.incubator.vector" on the command line to be able to >> use it. >> When the user does that, the JVM automatically enables Vector API support. >> This improves usability (the user doesn't need to separately "open" the API >> and enable JVM support) while minimizing the risk of destabilization >> from new code when the API is not used. >> >> >> That's it! Will be happy to answer any questions. >> >> And thanks in advance for any feedback! 
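The 2-level on-heap representation from (2) can be sketched as follows. This is an illustrative model only (hypothetical class name); the real Int256Vector carries more state and is generated as part of the API.

```java
// Simplified model of the 2-level on-heap representation: the vector value
// lives in a primitive array, encapsulated in a typed wrapper. Boxing thus
// costs two allocations (wrapper + array), which is exactly what the vector
// box elimination analysis tries to remove when the value can stay in
// registers. Not the real Int256Vector, just an illustration.
final class Int256VectorSketch {
    static final int LANES = 8;   // 256 bits / 32 bits per int lane
    private final int[] vec;      // backing storage for the vector value

    Int256VectorSketch(int[] v) {
        if (v.length != LANES) throw new IllegalArgumentException("need " + LANES + " lanes");
        this.vec = v.clone();     // value-based: keep the wrapper effectively immutable
    }

    int lane(int i) { return vec[i]; }
}
```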
>> >> Best regards, >> Vladimir Ivanov >> >> [0] >> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >> >> >> [1] >> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >> >> >> [2] https://openjdk.java.net/jeps/338 >> >> [3] https://openjdk.java.net/projects/panama/ >> >> [4] >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >> >> >> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >> >> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >> >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable > From vladimir.kozlov at oracle.com Tue Apr 7 17:15:34 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:15:34 -0700 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> Message-ID: <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> I also agree with these changes. And I see that Tobias's testing did not find issues (except a timeout on SPARC). Thanks, Vladimir On 4/6/20 1:51 AM, Tobias Hartmann wrote: > > On 06.04.20 10:34, Roland Westrelin wrote: >> I've been wondering about that too but couldn't find a scenario where it >> would go wrong. dominated_by() is what's used when an if is replaced by a >> dominating if with the same condition in >> PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: >> we add a dominating if, and then remove the loop copies because they are >> redundant. > > Right, I couldn't find such a scenario either and, as you've pointed out, the same problem would > exist at other places as well. Looks good. 
> > Best regards, > Tobias > From vladimir.x.ivanov at oracle.com Tue Apr 7 17:29:55 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 7 Apr 2020 20:29:55 +0300 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher Message-ID: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8242289 Introduce a platform-specific entry point (Matcher::pd_clone_node) and move platform-specific node cloning during matching into it. The Matcher processes every node only once unless it is marked as shared. This is too restrictive in some cases, so the workaround is to explicitly check for particular IR patterns and clone the relevant nodes during the matching phase. As an example, take a look at ShiftCntV. There are match rules like the following in aarch64.ad: match(Set dst (RShiftVB src (RShiftCntV shift))); By default, a RShiftCntV node is matched only once, so when it has multiple users, it will be folded into only one of them; for the rest, the value it produces will be put in a register. To overcome that, the Matcher is taught to detect such a pattern and "clone" the RShiftCntV input every time it matches an RShiftV node. In the case of RShiftCntV, it's arm32/aarch64-specific and other platforms (x86 in particular) don't optimize for it. To avoid polluting shared code (in matcher.cpp) with platform-specific portions, I propose to add Matcher::pd_clone_node and place the platform-specific checks there. Also, as a cleanup, renamed Matcher::clone_address_expressions() to pd_clone_address_expressions() since it's a platform-specific method. Testing: hs-precheckin-comp, hs-tier1, hs-tier2, cross-builds on all affected platforms Thanks! 
Best regards, Vladimir Ivanov From vladimir.kozlov at oracle.com Tue Apr 7 17:43:25 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:43:25 -0700 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher In-Reply-To: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> References: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> Message-ID: Good. Thanks, Vladimir On 4/7/20 10:29 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242289 > > Introduce a platform-specific entry point (Matcher::pd_clone_node) and move platform-specific node cloning during matching into it. > > The Matcher processes every node only once unless it is marked as shared. > This is too restrictive in some cases, so the workaround is to explicitly check for particular IR patterns and clone > the relevant nodes during the matching phase. > > As an example, take a look at ShiftCntV. There are match rules like the following in aarch64.ad: > > match(Set dst (RShiftVB src (RShiftCntV shift))); > > By default, a RShiftCntV node is matched only once, so when it has multiple users, it will be folded into only one of > them; for the rest, the value it produces will be put in a register. To overcome that, the Matcher is taught to detect such > a pattern and "clone" the RShiftCntV input every time it matches an RShiftV node. In the case of RShiftCntV, it's > arm32/aarch64-specific and other platforms (x86 in particular) don't optimize for it. > > To avoid polluting shared code (in matcher.cpp) with platform-specific portions, I propose to add Matcher::pd_clone_node > and place the platform-specific checks there. > > Also, as a cleanup, renamed Matcher::clone_address_expressions() to pd_clone_address_expressions() since it's a > platform-specific method. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2, > cross-builds on all affected platforms > > Thanks! 
> > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Tue Apr 7 17:54:07 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:54:07 -0700 Subject: Polymorphic Guarded Inlining in C2 In-Reply-To: <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com> <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> Message-ID: <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com> Another thing we can do is collect statistics about how many different receivers can be recorded with a big TypeProfileWidth. My recollection from long ago is that the only case for poly was HashMap usage. It would be nice to collect this data again for modern Java benchmarks. We can use them to see the effects of changes - benchmarks which do not have poly cases are useless in these experiments. On 4/6/20 6:38 AM, Vladimir Ivanov wrote: > I see 2 directions (mostly independent) to proceed: (1) use existing profiling info only; and (2) when more profile info > is available. > > I suggest exploring them independently. > > There's enough profiling data available to introduce a polymorphic case with 2 major receivers ("2-poly"). And it'll > complete the matrix of possible shapes. Please explain how it is different from the current bimorphic case? > > Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic shapes: "N-morphic" and "N-poly". The only > difference between them is what happens on the fallback path - a deopt / uncommon trap or a virtual call. > > Regarding 2-poly, there is TypeProfileMajorReceiverPercent which should be extended to 2 cases, which leads to 2 > parameters: aggregated major receiver percentage and minimum individual percentage. okay > > Also, it makes sense to introduce UseOnlyInlinedPolymorphic which aligns 2-poly with the bimorphic case. > > And, as I mentioned before, IMO it's promising to distinguish invokevirtual and invokeinterface cases. 
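The shapes discussed above (monomorphic, bimorphic, N-poly, megamorphic) follow from how receiver rows fill up at a profiled call site. A toy model of that bookkeeping, assuming a simplified row scheme rather than the real MDO layout (all names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of per-call-site receiver profiling: record up to `width`
// (models -XX:TypeProfileWidth) distinct receiver classes; any further
// receiver finds no empty row and only bumps a total counter, which is
// how the megamorphic case is detected.
class ReceiverProfileSketch {
    final int width;
    final Map<Class<?>, Integer> rows = new LinkedHashMap<>();
    int overflowCount; // receivers that found no free row

    ReceiverProfileSketch(int width) { this.width = width; }

    void record(Object receiver) {
        Class<?> k = receiver.getClass();
        if (rows.containsKey(k)) {
            rows.merge(k, 1, Integer::sum);       // existing row: bump its count
        } else if (rows.size() < width) {
            rows.put(k, 1);                       // claim an empty row
        } else {
            overflowCount++;                      // no empty row: total counter only
        }
    }

    String shape() {
        if (overflowCount > 0) return "megamorphic";
        switch (rows.size()) {
            case 1:  return "monomorphic";
            case 2:  return "bimorphic";
            default: return rows.size() + "-polymorphic";
        }
    }
}
```

With width 2 this reproduces the pre-patch behavior; raising the width is what enables the N-morphic/N-poly shapes discussed above.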
So, an additional > flag to control that would be useful. yes > > Regarding the N-poly/N-morphic cases, they can be generalized from the 2-poly/bimorphic cases. > > I believe experiments on 2-poly will provide useful insights on N-poly/N-morphic, so it makes sense to start with 2-poly > first. Yes Thanks, Vladimir K > > Best regards, > Vladimir Ivanov > > On 01.04.2020 01:29, Vladimir Kozlov wrote: >> Looks like graphs were stripped from the email. I put them on GitHub: >> >> >> >> >> >> Also Vladimir Ivanov forwarded me data he collected. >> >> His next data shows that profiling is not "free". Vladimir I. limited runs to tier3 (-XX:TieredStopAtLevel=3, C1 >> compilation with profiling code) to show that profiling code with TPW=8 is slower. Note, with 4 tiers this may not be >> visible because execution will be switched to C2 compiled code (without profiling code). >> >> >> >> >> The next data was collected for the proposed patch. Vladimir I. collected data for several flag configurations. >> The next graphs are for one of the settings: '-XX:+UsePolymorphicInlining -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4' >> >> >> >> >> It has mixed data, but most benchmarks are not affected, which means we need to spend more time on the proposed changes. >> >> Vladimir K >> >> On 3/31/20 10:39 AM, Vladimir Kozlov wrote: >>> I started looking at it. >>> >>> I think ideally TypeProfileWidth should be per call site and not per method - and it will require a more complicated >>> implementation (another RFE). But for experiments I think setting it to 8 (or higher) for all methods is okay. >>> >>> Note, more profiling lines per call site cost a few MB in the CodeCache (overestimation: 20K nmethods * 10 call >>> sites * 6 * 8 bytes) vs. very complicated code to have a dynamic number of lines. >>> >>> I think we should first investigate the best heuristics for inlining vs direct call vs vcall vs uncommon traps for >>> polymorphic cases and worry about memory and time consumption during profiling later. 
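The inlining-vs-direct-call-vs-vcall trade-off discussed above comes down to the guard chain emitted at a predicted call site: exact-class checks over "inlined" bodies, with a fallback that is either an uncommon trap or a plain virtual call. A toy sketch of that shape (illustrative classes, not C2 code):

```java
// Source-level sketch of guarded dispatch for a profiled call site with
// two recorded receivers. C2 generates the machine-code equivalent of
// this if-chain; here the fallback is a virtual call, where C2 may
// instead emit an uncommon trap when profiling says the site is not
// megamorphic. Illustrative only.
class GuardChainSketch {
    interface Shape { int sides(); }
    static final class Tri   implements Shape { public int sides() { return 3; } }
    static final class Quad  implements Shape { public int sides() { return 4; } }
    static final class Penta implements Shape { public int sides() { return 5; } }

    static int dispatch(Shape s) {
        // guard 0: most frequent profiled receiver, body "inlined"
        if (s.getClass() == Tri.class)  return 3;
        // guard 1: second profiled receiver
        if (s.getClass() == Quad.class) return 4;
        // fallback path: virtual call (or, alternatively, an uncommon trap)
        return s.sides();
    }
}
```

Receivers not covered by a guard (here, Penta) pay for the failed checks plus the fallback, which is why the heuristics above weigh guard count against receiver frequencies.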
>>> I did some performance runs with the latest JDK 15 for TypeProfileWidth=8 vs =2 and didn't see much difference for SPEC >>> benchmarks (see attached graph - grey dots mean no significant difference). But there are regressions (red dots) for >>> Renaissance, which includes some modern benchmarks. >>> >>> I will work this week to get similar data with Ludovic's patch. >>> >>> I am for an incremental approach. I think we can start/push based on what Ludovic is currently suggesting (do more >>> processing for TPW > 2) while preserving the current default behaviour (for TPW <= 2). But only if it gives improvements >>> in these benchmarks. We use these benchmarks as criteria for JDK releases. >>> >>> Regards, >>> Vladimir >>> >>> On 3/20/20 4:52 PM, Ludovic Henry wrote: >>>> Hi Vladimir, >>>> >>>> As requested offline, please find below the latest version of the patch. Contrary to what was discussed >>>> initially, I haven't done the work to support per-method TypeProfileWidth, as that requires extending the >>>> existing CompilerDirectives to be available to the Interpreter. For me to achieve that work, I would need >>>> guidance on how to approach the problem, and what your expectations are. >>>> >>>> Thank you, >>>> >>>> -- >>>> Ludovic >>>> >>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> index 4ed93169c7..bad9cddf20 100644 >>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg >>>> Label found_null; >>>> jccb(Assembler::zero, found_null); >>>> // Item did not match any saved item and there is no empty row for it. >>>> - // Increment total counter to indicate polymorphic case. >>>> + // Increment total counter to indicate megamorphic case. >>>> 
increment_mdp_data_at(mdp, non_profiled_offset); >>>> ??????????? jmp(done); >>>> ??????????? bind(found_null); >>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp >>>> index 73854806ed..c5030149bf 100644 >>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>> @@ -38,7 +38,8 @@ private: >>>> ??? friend class ciMethod; >>>> ??? friend class ciMethodHandle; >>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>>> +? bool _is_megamorphic;????????? // whether the call site is megamorphic >>>> ??? int? _limit;??????????????? // number of receivers have been determined >>>> ??? int? _morphism;???????????? // determined call site's morphism >>>> ??? int? _count;??????????????? // # times has this call been executed >>>> @@ -47,6 +48,8 @@ private: >>>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>> ??? ciCallProfile() { >>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth"); >>>> +??? _is_megamorphic = false; >>>> ????? _limit = 0; >>>> ????? _morphism??? = 0; >>>> ????? _count = -1; >>>> @@ -58,6 +61,8 @@ private: >>>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>>> ? public: >>>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>>> + >>>> ??? // Note:? The following predicates return false for invalid profiles: >>>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>>> ??? int?????? morphism() const????????? { return _morphism; } >>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp >>>> index d771be8dac..c190919708 100644 >>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>> ??????????? 
// If we extend profiling to record methods, >>>> ??????????? // we will set result._method also. >>>> ????????? } >>>> -??????? // Determine call site's morphism. >>>> +??????? // Determine call site's megamorphism. >>>> ????????? // The call site count is 0 with known morphism (only 1 or 2 receivers) >>>> ????????? // or < 0 in the case of a type check failure for checkcast, aastore, instanceof. >>>> -??????? // The call site count is > 0 in the case of a polymorphic virtual call. >>>> +??????? // The call site count is > 0 in the case of a megamorphic virtual call. >>>> ????????? if (morphism > 0 && morphism == result._limit) { >>>> ???????????? // The morphism <= MorphismLimit. >>>> -?????????? if ((morphism >>> -?????????????? (morphism == ciCallProfile::MorphismLimit && count == 0)) { >>>> +?????????? if ((morphism >>> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>>> ? #ifdef ASSERT >>>> ?????????????? if (count > 0) { >>>> ???????????????? this->print_short_name(tty); >>>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>>> ???????????????? this->print_codes(); >>>> -?????????????? assert(false, "this call site should not be polymorphic"); >>>> +?????????????? assert(false, "this call site should not be megamorphic"); >>>> ?????????????? } >>>> ? #endif >>>> -???????????? result._morphism = morphism; >>>> +?????????? } else { >>>> +????????????? result._is_megamorphic = true; >>>> ???????????? } >>>> ????????? } >>>> +??????? result._morphism = morphism; >>>> ????????? // Make the count consistent if this is a call profile. If count is >>>> ????????? // zero or less, presume that this is a typecheck profile and >>>> ????????? // do nothing.? Otherwise, increase count to be the sum of all >>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) { >>>> ??? } >>>> ??? _receiver[i] = receiver; >>>> ??? _receiver_count[i] = receiver_count; >>>> -? if (_limit < MorphismLimit) _limit++; >>>> +? 
if (_limit < TypeProfileWidth) _limit++; >>>> ? } >>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp >>>> index d605bdb7bd..e4a5e7ea8b 100644 >>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>> @@ -389,9 +389,16 @@ >>>> ??? product(bool, UseBimorphicInlining, true,???????????????????????????????? \ >>>> ??????????? "Profiling based inlining for two receivers")???????????????????? \ >>>> \ >>>> +? product(bool, UsePolymorphicInlining, true,?????????????????????????????? \ >>>> +????????? "Profiling based inlining for two or more receivers")???????????? \ >>>> + \ >>>> ??? product(bool, UseOnlyInlinedBimorphic, true,????????????????????????????? \ >>>> ??????????? "Don't use BimorphicInlining if can't inline a second method")??? \ >>>> \ >>>> +? product(bool, UseOnlyInlinedPolymorphic, true,??????????????????????????? \ >>>> +????????? "Don't use PolymorphicInlining if can't inline a secondary "????? \ >>>> + "method")???????????????????????????????????????????????????????? \ >>>> + \ >>>> ??? product(bool, InsertMemBarAfterArraycopy, true,?????????????????????????? \ >>>> ??????????? "Insert memory barrier after arraycopy call")???????????????????? \ >>>> \ >>>> @@ -645,6 +652,10 @@ >>>> ??????????? "% of major receiver type to all profiled receivers")???????????? \ >>>> ??????????? range(0, 100)???????????????????????????????????????????????????? \ >>>> \ >>>> +? product(intx, TypeProfileMinimumReceiverPercent, 20,????????????????????? \ >>>> +????????? "minimum % of receiver type to all profiled receivers")?????????? \ >>>> +????????? range(0, 100)???????????????????????????????????????????????????? \ >>>> + \ >>>> ??? diagnostic(bool, PrintIntrinsics, false,????????????????????????????????? \ >>>> ??????????? "prints attempted and successful inlining of intrinsics")???????? 
\ >>>> \ >>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp >>>> index 44ab387ac8..dba2b114c6 100644 >>>> --- a/src/hotspot/share/opto/doCall.cpp >>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>> ??? // See how many times this site has been invoked. >>>> ??? int site_count = profile.count(); >>>> -? int receiver_count = -1; >>>> -? if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) { >>>> -??? // Receivers in the profile structure are ordered by call counts >>>> -??? // so that the most called (major) receiver is profile.receiver(0). >>>> -??? receiver_count = profile.receiver_count(0); >>>> -? } >>>> ??? CompileLog* log = this->log(); >>>> ??? if (log != NULL) { >>>> -??? int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1; >>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1; >>>> +??? int* rids; >>>> +??? if (call_does_dispatch) { >>>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>> +????? for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>> +??????? rids[i] = log->identify(profile.receiver(i)); >>>> +????? } >>>> +??? } >>>> ????? log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>> ????????????????????? log->identify(callee), site_count, prof_factor); >>>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>>> ????? if (allow_inline)???? log->print(" inline='1'"); >>>> -??? if (receiver_count >= 0) { >>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count); >>>> -????? if (profile.has_receiver(1)) { >>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1)); >>>> +??? if (call_does_dispatch) { >>>> +????? log->print(" virtual='1'"); >>>> +????? 
for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>> +??????? if (i == 0) { >>>> +????????? log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), >>>> profile.receiver_prob(i)); >>>> +??????? } else { >>>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>>> +??????? } >>>> ??????? } >>>> ????? } >>>> ????? if (callee->is_method_handle_intrinsic()) { >>>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>> ??????? // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count. >>>> ??????? bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= >>>> (float)TypeProfileMajorReceiverPercent); >>>> -????? ciMethod* receiver_method = NULL; >>>> ??????? int morphism = profile.morphism(); >>>> + >>>> +????? int width = morphism > 0 ? morphism : 1; >>>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width); >>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>>> +????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width); >>>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>>> + >>>> ??????? if (speculative_receiver_type != NULL) { >>>> ????????? if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) { >>>> ??????????? // We have a speculative type, we should be able to resolve >>>> ??????????? // the call. We do that before looking at the profiling at >>>> -????????? // this invoke because it may lead to bimorphic inlining which >>>> +????????? // this invoke because it may lead to polymorphic inlining which >>>> ??????????? // a speculative type should help us avoid. >>>> -????????? 
receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - speculative_receiver_type); >>>> -????????? if (receiver_method == NULL) { >>>> +????????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>> + speculative_receiver_type); >>>> +????????? if (receiver_methods[0] == NULL) { >>>> ????????????? speculative_receiver_type = NULL; >>>> ??????????? } else { >>>> ????????????? morphism = 1; >>>> ??????????? } >>>> ????????? } else { >>>> ??????????? // speculation failed before. Use profiling at the call >>>> -????????? // (could allow bimorphic inlining for instance). >>>> +????????? // (could allow polymorphic inlining for instance). >>>> ??????????? speculative_receiver_type = NULL; >>>> ????????? } >>>> ??????? } >>>> -????? if (receiver_method == NULL && >>>> -????????? (have_major_receiver || morphism == 1 || >>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>> -??????? // receiver_method = profile.method(); >>>> -??????? // Profiles do not suggest methods now.? Look it up in the major receiver. >>>> -??????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - profile.receiver(0)); >>>> -????? } >>>> -????? if (receiver_method != NULL) { >>>> -??????? // The single majority receiver sufficiently outweighs the minority. >>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor); >>>> -??????? if (hit_cg != NULL) { >>>> -????????? // Look up second receiver. >>>> -????????? CallGenerator* next_hit_cg = NULL; >>>> -????????? ciMethod* next_receiver_method = NULL; >>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>> -??????????? next_receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - profile.receiver(1)); >>>> -??????????? if (next_receiver_method != NULL) { >>>> -????????????? 
next_hit_cg = this->call_generator(next_receiver_method, >>>> -????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>> -????????????????????????????????? allow_inline, prof_factor); >>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>> -????????????????? // Skip if we can't inline second receiver's method >>>> -????????????????? next_hit_cg = NULL; >>>> -????????????? } >>>> -??????????? } >>>> -????????? } >>>> -????????? CallGenerator* miss_cg; >>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>> -?????????????????????????????????????????????? ? Deoptimization::Reason_bimorphic >>>> -?????????????????????????????????????????????? : Deoptimization::reason_class_check(speculative_receiver_type != >>>> NULL)); >>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) && >>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>> -???????????? ) { >>>> -??????????? // Generate uncommon trap for class check failure path >>>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>>> +????? bool removed_cgs = false; >>>> +????? // Look up receivers. >>>> +????? for (int i = 0; i < morphism; i++) { >>>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) { >>>> +????????? break; >>>> +??????? } >>>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>>> +????????? receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(), >>>> + profile.receiver(i)); >>>> +??????? } >>>> +??????? if (receiver_methods[i] != NULL) { >>>> +????????? bool allow_inline; >>>> +????????? if (speculative_receiver_type != NULL) { >>>> +??????????? allow_inline = true; >>>> ??????????? 
} else { >>>> -??????????? // Generate virtual call for class check failure path >>>> -??????????? // in case of polymorphic virtual call site. >>>> -??????????? miss_cg = CallGenerator::for_virtual_call(callee, vtable_index); >>>> +??????????? allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent; >>>> ??????????? } >>>> -????????? if (miss_cg != NULL) { >>>> -??????????? if (next_hit_cg != NULL) { >>>> -????????????? assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation"); >>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>> -????????????? // We don't need to record dependency on a receiver here and below. >>>> -????????????? // Whenever we inline, the dependency is added by Parse::Parse(). >>>> -????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX); >>>> -??????????? } >>>> -??????????? if (miss_cg != NULL) { >>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, >>>> receiver_count); >>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>> -????????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>> -????????????? if (cg != NULL)? return cg; >>>> +????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>> +??????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>> +??????????????????????????????? allow_inline, prof_factor); >>>> +????????? if (hit_cgs[i] != NULL) { >>>> +??????????? if (speculative_receiver_type != NULL) { >>>> +????????????? // Do nothing if it's a speculative type >>>> +??????????? 
} else if (bytecode == Bytecodes::_invokeinterface) {
>>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>>> +            } else if (!have_major_receiver) {
>>>> +              // Do nothing if there is no major receiver
>>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>>> +              // Skip if we can't inline receiver's method
>>>> +              hit_cgs[i] = NULL;
>>>> +              removed_cgs = true;
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>> +
>>>> +      // Generate the fallback path
>>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>>> +                                            ? Deoptimization::Reason_polymorphic
>>>> +                                            : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>>> +      if (log != NULL) {
>>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>>> +                  log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>>> +      }
>>>> +      CallGenerator* miss_cg;
>>>> +      if (!disable_trap) {
>>>> +        // Generate uncommon trap for class check failure path
>>>> +        // in case of polymorphic virtual call site.
>>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>> +                  Deoptimization::Action_maybe_recompile);
>>>> +      } else {
>>>> +        // Generate virtual call for class check failure path
>>>> +        // in case of megamorphic virtual call site.
>>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>> +      }
>>>> +
>>>> +      // Generate the guards
>>>> +      CallGenerator* cg = NULL;
>>>> +      if (speculative_receiver_type != NULL) {
>>>> +        if (hit_cgs[0] != NULL) {
>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>>> +          // We don't need to record dependency on a receiver here and below.
>>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>>> +        }
>>>> +      } else {
>>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>>> +          if (hit_cgs[i] != NULL) {
>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>>> +          }
>>>> +        }
>>>> +        cg = miss_cg;
>>>> +      }
>>>> +      if (cg != NULL)  return cg;
>>>>     }
>>>>     // If there is only one implementor of this interface then we
>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>> index 11df15e004..2d14b52854 100644
>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>    "class_check",
>>>>    "array_check",
>>>>    "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>    "profile_predicate",
>>>>    "unloaded",
>>>>    "uninitialized",
>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>> index 1cfff5394e..c1eb998aba 100644
>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>      Reason_class_check,           // saw unexpected object class (@bci)
>>>>      Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>>      Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>>  #if INCLUDE_JVMCI
>>>>      Reason_unreached0             = Reason_null_assert,
>>>>      Reason_type_checked_inlining  = Reason_intrinsic,
>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>  #endif
>>>>      Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>> index 94b544824e..ee761626c4 100644
>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry  KlassHashtableEntry;
>>>>   declare_constant(Deoptimization::Reason_class_check) \
>>>>   declare_constant(Deoptimization::Reason_array_check) \
>>>>   declare_constant(Deoptimization::Reason_intrinsic) \
>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>   declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>   declare_constant(Deoptimization::Reason_unloaded) \
>>>>   declare_constant(Deoptimization::Reason_uninitialized) \
>>>>
>>>> -----Original Message-----
>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>>>> various TypeProfileWidth values. The results are:
>>>>
>>>> Benchmark                            Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.835 ±
0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> The main thing I observe is that there isn't a linear (or even any apparent)
>>>> correlation between the number of guards generated (guided by
>>>> TypeProfileWidth) and the time taken.
>>>>
>>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>>> to 1 and 8.
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Ludovic Henry
>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Vladimir,
>>>>
>>>> I did a rerun of the following benchmark with various configurations:
>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>
>>>> The results are as follows:
>>>>
>>>> Benchmark                              Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> Benchmark                              Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> The Hotspot logs (with generated assembly) are available at:
>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>
>>>> The main takeaway from that experiment is that direct calls w/o inlining are faster
>>>> than indirect calls for icalls but slower for vcalls, and that inlining is always faster
>>>> than direct calls.
>>>>
>>>> (I fully understand this applies mainly to this microbenchmark, and we need to
>>>> validate on larger benchmarks. I'm working on that next. However, it clearly shows
>>>> gains on a pathological case.)
>>>>
>>>> Next, I want to figure out at how many guards the direct call regresses compared
>>>> to the indirect call in the vcall case, and I want to run larger benchmarks. Any
>>>> particular ones you would like to see running? I am planning on doing SPECjbb2015 first.
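
[Editorial note: to make the vcall/icall distinction in the takeaway above concrete, here is a minimal, self-contained Java sketch. It is not taken from the benchmark sources; all class and method names are illustrative. The same receiver dispatched through a class type compiles to invokevirtual (a vtable call), while dispatch through an interface type compiles to invokeinterface (the more expensive itable lookup being devirtualized by the guards).]

```java
// Illustrative only: same receivers, two dispatch kinds.
interface IShape { int area(); }
abstract class Shape implements IShape { }
class Square extends Shape { public int area() { return 4; } }
class Circle extends Shape { public int area() { return 3; } }

public class CallKinds {
    // Dispatch through the class type: compiles to invokevirtual (vcall).
    static int viaClass(Shape s)  { return s.area(); }
    // Dispatch through the interface type: compiles to invokeinterface (icall).
    static int viaIface(IShape s) { return s.area(); }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(), new Circle() };
        int v = 0, i = 0;
        for (Shape s : shapes) {
            v += viaClass(s); // vtable dispatch
            i += viaIface(s); // itable dispatch, same target methods
        }
        System.out.println(v + " " + i); // both sums are equal: 7 7
    }
}
```

Both paths reach the same method bodies; only the dispatch mechanism (and hence the cost of the non-devirtualized fallback) differs, which is what the indirect-call rows of the two tables compare.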
>>>>
>>>> Thank you,
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Vladimir,
>>>>
>>>> Sorry for the long delay in response, I was at multiple conferences over the past few
>>>> weeks. I'm back to the office now and fully focused on getting progress on that.
>>>>
>>>>>> Possible avenues of improvements I can see are:
>>>>>>    - Gather all the types in an unbounded list so we can know which ones
>>>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>>>> case, there are only a few types present at call-sites. It could, however,
>>>>>> be particularly helpful for languages that tend to have many types at
>>>>>> call-sites, like functional languages, for example.
>>>>>
>>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>>>
>>>> I agree that it isn't very practical. It can be useful in the case where there are
>>>> many types at a call-site, and the first ones end up not being frequent enough to
>>>> mandate a guard. This is clearly an edge case, and I don't think we should optimize
>>>> for it.
>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the following:
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>>>> path (i.e.
many times), the latter are never considered for inlining - because
>>>>>> it was never recorded during profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>>> periodically free some space by removing elements with lower frequencies
>>>>> and give new types a chance to be profiled)?
>>>>
>>>> Doing that reliably relies on the assumption that we know what the shape of
>>>> the workload is going to be in future iterations. Otherwise, how could you
>>>> guarantee that a type that's not currently frequent will not be in the future,
>>>> and that the information that you remove now will not be important later? This
>>>> is an assumption that, IMO, is worse than missing types which are hot later in
>>>> the execution, for two reasons: 1. it's no better, and 2. it's a lot less intuitive and
>>>> harder to debug/understand than a straightforward "overflow".
>>>>
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>>>> paths, it does not generate guards because none are a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>
>>>> It will not necessarily cause twice as much inlining because of late-inlining. Like
>>>> you point out later, it will generate a direct call in case there isn't room for more
>>>> inlinable code.
>>>>
>>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>>   if (recv.klass == C1) {  // >>0%
>>>>>     ... inlined ...
>>>>>   } else if (recv.klass == C2) { // >>0%
>>>>>     m2(); // direct call
>>>>>   } else { // >0%
>>>>>     m(); // virtual call
>>>>>   }
>>>>>
>>>>> vs
>>>>>
>>>>>   if (recv.klass == C1) {  // >>0%
>>>>>     ... inlined ...
>>>>>   } else { // >>0%
>>>>>     m(); // virtual call
>>>>>   }
>>>>
>>>> There is the advantage that modern CPUs are better at predicting instruction branches
>>>> than data branches. These guards will then allow the CPU to make better decisions, allowing
>>>> for better superscalar execution, memory prefetching, etc.
>>>>
>>>> This, IMO, makes sense for warm calls, especially since the cost is a guard + a call, which is
>>>> much lower than an inlined method, but brings benefits over an indirect call.
>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
>>>>
>>>> TypeProfileMajorReceiverPercent is only used today when you have a megamorphic
>>>> call-site (aka more types than TypeProfileWidth) but still one type receiving more than
>>>> N% of the calls. By reducing the value, you would not increase the number of guards,
>>>> only lower the threshold at which you generate the 1st guard in a megamorphic case.
>>>>
>>>>>>>       - for the N-morphic case, what's the negative effect (quantitative) of
>>>>>>> the deopt?
>>>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>
>>>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>>>> that the higher the morphism factor, the fewer the benefits of deopt (compared
>>>>> to a call) are. It would be very good to validate it with some
>>>>> benchmarks (both micro- and larger ones).
>>>>
>>>> I agree that what you are describing makes sense as well. To reduce the cost of deopt
>>>> here, having a TypeProfileMinimumReceiverPercent helps. That is because if any type is
>>>> seen less than this specific frequency, then it won't generate a guard, leading to an indirect
>>>> call in the fallback case.
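
[Editorial note: the 49%/49%/2% scenario quoted above can be reproduced outside JMH with a small self-contained sketch. Class names A/A1/A2/A3, the array size, and the exact distribution are illustrative, not taken from the original benchmark. Under the major-receiver heuristic described above, neither A1 nor A2 qualifies as a major receiver, so the call site in the loop stays an indirect call.]

```java
// Hypothetical sketch of a call site with two ~49% receivers and a rare third.
interface A { int foo(int i); }
class A1 implements A { public int foo(int i) { return i + 1; } }
class A2 implements A { public int foo(int i) { return i + 2; } }
class A3 implements A { public int foo(int i) { return i + 3; } }

public class TwoMajorReceivers {
    public static void main(String[] args) {
        A[] objs = new A[100];
        for (int i = 0; i < objs.length; ++i) {
            if (i % 100 < 49)      objs[i] = new A1(); // ~49% of receivers
            else if (i % 100 < 98) objs[i] = new A2(); // ~49% of receivers
            else                   objs[i] = new A3(); // ~2% of receivers
        }
        long sum = 0;
        for (A a : objs) {
            sum += a.foo(1); // polymorphic call site: profile records A1/A2/A3
        }
        System.out.println(sum); // 49*2 + 49*3 + 2*4 = 253
    }
}
```

Scaled up to JMH-sized iteration counts, this is the shape where generating guards for both A1 and A2 (with a virtual-call fallback for the 2% tail) would be expected to help, even though no single type dominates.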
>>>>
>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>>>> as I have something reliably reproducing.
>>>>>
>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>
>>>> It turns out the guard is only generated once, meaning that if we ever hit it then we
>>>> generate an indirect call.
>>>>
>>>> We also only generate the trap iff all the guards are hot (inlined) or warm (direct call),
>>>> so any of the following cases triggers the creation of an indirect call over a trap:
>>>>   - we hit the trap once before
>>>>   - one or more guards are cold (aka not inlinable even with late-inlining)
>>>>
>>>>> It was more about opportunities for future explorations. I don't think
>>>>> we have to act on it right away.
>>>>>
>>>>> As with "deopt vs call", my guess is the callee should benefit much more
>>>>> from inlining than the caller it is inlined into (the caller sees multiple
>>>>> callee candidates and has to merge the results while each callee
>>>>> observes the full context and can benefit from it).
>>>>>
>>>>> If we can run some sort of static analysis on callee bytecode, what kind
>>>>> of code patterns should we look for to guide inlining decisions?
>>>>
>>>> Any pattern that would benefit from other optimizations (escape analysis,
>>>> dead code elimination, constant propagation, etc.) is good, but short of
>>>> shadowing statically what all these optimizations do, I can't see an easy way
>>>> to do it.
>>>>
>>>> That is where late-inlining, or more advanced dynamic heuristics like the ones you
>>>> can find in Graal EE, is worthwhile.
>>>>
>>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>>
>>>>>     * measure the cost of additional profiling
>>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>
>>>> I am running the following JMH microbenchmark:
>>>>
>>>>     public final static int N = 100_000_000;
>>>>
>>>>     @State(Scope.Benchmark)
>>>>     public static class TypeProfileWidthOverheadBenchmarkState {
>>>>         public A[] objs = new A[N];
>>>>
>>>>         @Setup
>>>>         public void setup() throws Exception {
>>>>             for (int i = 0; i < objs.length; ++i) {
>>>>                 switch (i % 8) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 case 2: objs[i] = new A3(); break;
>>>>                 case 3: objs[i] = new A4(); break;
>>>>                 case 4: objs[i] = new A5(); break;
>>>>                 case 5: objs[i] = new A6(); break;
>>>>                 case 6: objs[i] = new A7(); break;
>>>>                 case 7: objs[i] = new A8(); break;
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     @Benchmark @OperationsPerInvocation(N)
>>>>     public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>>         A[] objs = state.objs;
>>>>         for (int i = 0; i < objs.length; ++i) {
>>>>             objs[i].foo(i, blackhole);
>>>>         }
>>>>     }
>>>>
>>>> And I am running with the following JVM parameters:
>>>>
>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>>>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>>>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>
>>>> I observe no statistically significant difference in ops/s
>>>> between TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe
>>>> no significant difference in the resulting analysis using Intel VTune.
>>>>
>>>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>>>
>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>
>>>> I have done the following microbenchmark, but I am not sure that it's
>>>> going to fully answer the question you are raising here.
>>>>
>>>>     public final static int N = 100_000_000;
>>>>
>>>>     @State(Scope.Benchmark)
>>>>     public static class PolymorphicDeoptBenchmarkState {
>>>>         public A[] objs = new A[N];
>>>>
>>>>         @Setup
>>>>         public void setup() throws Exception {
>>>>             int cutoff1 = (int)(objs.length * .90);
>>>>             int cutoff2 = (int)(objs.length * .95);
>>>>             for (int i = 0; i < cutoff1; ++i) {
>>>>                 switch (i % 2) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 }
>>>>             }
>>>>             for (int i = cutoff1; i < cutoff2; ++i) {
>>>>                 switch (i % 4) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 case 2:
>>>>                 case 3: objs[i] = new A3(); break;
>>>>                 }
>>>>             }
>>>>             for (int i = cutoff2; i < objs.length; ++i) {
>>>>                 switch (i % 4) {
>>>>                 case 0:
>>>>                 case 1: objs[i] = new A3(); break;
>>>>                 case 2:
>>>>                 case 3: objs[i] = new A4(); break;
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     @Benchmark @OperationsPerInvocation(N)
>>>>     public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>>         A[] objs = state.objs;
>>>>         for (int i = 0; i < objs.length; ++i) {
>>>>             objs[i].foo(i, blackhole);
>>>>         }
>>>>     }
>>>>
>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>>> -XX:-PolyGuardDisableTrap, which force-enables/disables the trap in the
>>>> fallback.
>>>>
>>>> For that kind of case, a visitor pattern is what I expect to most largely
>>>> profit/suffer from a deopt or virtual call in the fallback path. Would you
>>>> know of such a benchmark that heavily relies on this pattern, and that I
>>>> could readily reuse?
>>>>
>>>>>     * inlining vs devirtualization
>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>       - measure separately the effects of devirtualization and inlining
>>>>
>>>> For that one, I reused the first microbenchmark I mentioned above, and
>>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>>> direct call or inline.
>>>>
>>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined)
>>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>>>
>>>> This benchmark hasn't been run in the best possible conditions (on my dev
>>>> machine, in WSL), but it gives a strong indication that even a direct call has a
>>>> non-negligible impact, and that inlining leads to better results (again, in this
>>>> microbenchmark).
>>>>
>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything
>>>> that would be readily available from the Interpreter. Would you have any pointer
>>>> to a pre-existing feature that required this specific kind of plumbing? I would otherwise
>>>> find myself in need of making CompilerDirectives available from the Interpreter, and
>>>> that is something outside of my current expertise (always happy to learn, but I
>>>> will need some pointers!).
>>>>
>>>> Thank you,
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov
>>>> Sent: Thursday, February 20, 2020 9:00 AM
>>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> [...]
>>>>
>>>>> Thanks for this explanation, it makes it a lot clearer what the cases and
>>>>> your concerns are. To rephrase in my own words, what you are interested in
>>>>> is not this change in particular, but more the possibility that this change
>>>>> provides and how to take it the next step, correct?
>>>>
>>>> Yes, it's a good summary.
>>>>
>>>> [...]
>>>>
>>>>>>       - affects profiling strategy: majority of receivers vs complete
>>>>>> list of receiver types observed;
>>>>> Today, we only use the N first receivers when the number of types does
>>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>>> Possible avenues of improvements I can see are:
>>>>>    - Gather all the types in an unbounded list so we can know which ones
>>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>>> case, there are only a few types present at call-sites. It could, however,
>>>>> be particularly helpful for languages that tend to have many types at
>>>>> call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>>>
>>>>>   - Use the existing types to generate guards for these types we know are
>>>>> common enough. Then use the types which are hot or warm, even in case of a
>>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>>> nowadays.
>>>>
>>>>> In what we have today, some of the worst-case scenarios are the following:
>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>>> path (i.e.
many times), the latter are never considered for inlining - because
>>>>> it was never recorded during profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically free some space by removing elements with lower frequencies
>>>> and give new types a chance to be profiled)?
>>>>
>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>>> paths, it does not generate guards because none are a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>>
>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>> doesn't happen?
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else if (recv.klass == C2) { // >>0%
>>>>       m2(); // direct call
>>>>    } else { // >0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> vs
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else { // >>0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
>>>>
>>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>>> code shapes comparison: how much does devirtualization help?
>>>>
>>>> Otherwise, enabling the 2-polymorphic shape becomes feasible only if both
>>>> cases are inlined.
>>>>
>>>>>>       - for the N-morphic case, what's the negative effect (quantitative) of
>>>>>> the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>
>>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>>> that the higher the morphism factor, the fewer the benefits of deopt (compared
>>>> to a call) are. It would be very good to validate it with some
>>>> benchmarks (both micro- and larger ones).
>>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>>> as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>
>>>>>>     * invokevirtual vs invokeinterface call sites
>>>>>>       - different cost models;
>>>>>>       - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>> From the profiling information and the inlining mechanism point of view,
>>>>> that it is an invokevirtual or an invokeinterface doesn't change anything
>>>>>
>>>>> Are you saying that we have more to gain from generating a guard for
>>>>> invokeinterface over invokevirtual because the fall-back of the
>>>>> invokeinterface is much more expensive?
>>>>
>>>> Yes, that's the question: if we see an improvement, how much does
>>>> devirtualization contribute to that?
>>>>
>>>> (If we add a type-guarded direct call, but there's no inlining
>>>> happening, an inline cache effectively strength-reduces a virtual call to a
>>>> direct call.)
>>>>
>>>> Considering the current implementation of virtual and interface calls
>>>> (vtables vs itables), the cost model is very different.
>>>>
>>>> For vtable calls, it doesn't look too appealing to introduce large
>>>> inline caches for individual receiver types since a call through a
>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>>> address).
>>>>
>>>> For itable calls it can be a big win in some situations: itable lookup
>>>> iterates over the Klass::_secondary_supers array and it can become quite
>>>> costly. For example, some Scala workloads experience significant
>>>> overheads from megamorphic calls.
>>>>
>>>> If we see an improvement on some benchmark, it would be very useful to
>>>> be able to determine (quantitatively) how much inlining and
>>>> devirtualization each contribute.
>>>>
>>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>>> implementation [4] which brings interface calls close to virtual calls.
>>>> So, if it turns out that devirtualization (and not inlining) of
>>>> interface calls is what contributes the most, then speeding up
>>>> megamorphic interface calls becomes a more attractive alternative.
>>>>
>>>>>>     * inlining heuristics
>>>>>>       - devirtualization vs inlining
>>>>>>         - how much benefit from expanding a call site (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>> I'm also writing a JMH benchmark for this case, and I'll share it as soon
>>>>> as I have it reliably reproducing the issue you describe.
>>>>
>>>> Also, I think it's important to have a knob to control it (inline vs
>>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>>
>>>>>>       - diminishing returns with increase in number of cases
>>>>>>       - expanding a single call site leads to more code, but frequencies
>>>>>> stay the same => colder code
>>>>>>       - based on profiling info (types + frequencies), dynamically
>>>>>> choose morphism factor on per-call site basis?
>>>>> That is where I propose to have a lower receiver probability at which we'll
>>>>> stop adding more guards. I am experimenting with a global flag with a default
>>>>> value of 10%.
>>>>>>       - what optimization opportunities to look for? it looks like in
>>>>>> general callees should benefit more than the caller (due to merges after
>>>>>> the call site)
>>>>> Could you please expand your concern or provide an example.
>>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is the callee should benefit much more
>>>> from inlining than the caller it is inlined into (the caller sees multiple
>>>> callee candidates and has to merge the results while each callee
>>>> observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what kind
>>>> of code patterns should we look for to guide inlining decisions?
>>>>
>>>>  >> What's your take on it? Any other ideas?
>>>>  >
>>>>  > We don't know what we don't know. We need first to improve the logging and
>>>>  > debugging output of uncommon traps for polymorphic call-sites. Then, we
>>>>  > need to gather data about the different cases you talked about.
>>>>  >
>>>>  > We also need to have some microbenchmarks to validate some of the questions
>>>>  > you are raising, and verify what level of gains we can expect from this
>>>>  > optimization. Further validation will be needed on larger benchmarks and
>>>>  > real-world applications as well, and that's where, I think, we need to develop
>>>>  > logging and debugging for this feature.
>>>>
>>>> Yes, sounds good.
>>>>
>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>
>>>>     * measure the cost of additional profiling
>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>
>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>
>>>>     * inlining vs devirtualization
>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>       - measure separately the effects of devirtualization and inlining
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>
>>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>
>>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>
>>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov
>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Ludovic,
>>>>>
>>>>> I fully agree that it's premature to discuss
how default behavior should >>>>> be changed since much more data is needed to be able to proceed with the >>>>> decision. But considering the ultimate goal is to actually improve >>>>> relevant heuristics (and effectively change the default behavior), it's >>>>> the right time to discuss what kind of experiments are needed to gather >>>>> enough data for further analysis. >>>>> >>>>> Though different shapes do look very similar at first, the shape of the >>>>> fallback makes a big difference. That's why monomorphic and polymorphic >>>>> cases are distinct: uncommon traps are effectively exits and can >>>>> significantly simplify the CFG while calls can return and have to be merged >>>>> back. >>>>> >>>>> The polymorphic shape is stable (no deopts/recompiles involved), but doesn't >>>>> simplify the CFG around the call site. >>>>> >>>>> The monomorphic shape gives more optimization opportunities, but deopts are >>>>> highly undesirable due to the associated costs. >>>>> >>>>> For example: >>>>> >>>>>     if (recv.klass != C) { deopt(); } >>>>>     C.m(recv); >>>>> >>>>>     // recv.klass == C - exact type >>>>>     // return value == C.m(recv) >>>>> >>>>> vs >>>>> >>>>>     if (recv.klass == C) { >>>>>       C.m(recv); >>>>>     } else { >>>>>       I.m(recv); >>>>>     } >>>>> >>>>>     // recv.klass <: I - subtype >>>>>     // return value is a phi merging C.m() & I.m() where I.m() is >>>>> completely opaque. >>>>> >>>>> The monomorphic shape can degenerate into the polymorphic one (too many recompiles), >>>>> but that's a forced move to stabilize the behavior and avoid a vicious >>>>> recompilation cycle (which is *very* expensive). (Another alternative is >>>>> to leave the deopt as is - set the deopt action to "none" - but that's usually >>>>> a much worse decision.) >>>>> >>>>> And that's the reason why the monomorphic shape requires a unique receiver >>>>> type in the profile while the polymorphic shape works with the major receiver type >>>>> and probabilities. 
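The two shapes contrasted in the example above can be sketched in plain Java. This is a hypothetical illustration of what the compiled code is equivalent to, not actual C2 output; the interface, class names, and the exception standing in for deoptimization are invented for the example.

```java
// Monomorphic vs polymorphic guard shapes, modeled in plain Java.
interface I { int m(); }
final class C implements I { public int m() { return 1; } }
final class D implements I { public int m() { return 2; } }

public class GuardShapes {
    // Monomorphic shape: one type guard, and the fallback "deopts" (the
    // exception stands in for invalidate + reinterpret). Past the guard the
    // receiver type is exact, so the callee can be optimized aggressively.
    static int monomorphic(I recv) {
        if (recv.getClass() != C.class) throw new IllegalStateException("deopt");
        return ((C) recv).m(); // exact type: result shape fully known
    }

    // Polymorphic shape: same guard, but the fallback is a virtual call.
    // No deopt risk, but the result merges an inlined path with an opaque
    // call, so the CFG around the call site stays complex.
    static int polymorphic(I recv) {
        if (recv.getClass() == C.class) return ((C) recv).m(); // inlined fast path
        return recv.m(); // virtual call, opaque to the optimizer
    }

    public static void main(String[] args) {
        System.out.println(monomorphic(new C())); // 1
        System.out.println(polymorphic(new D())); // 2
    }
}
```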
>>>>> >>>>> >>>>> Considering further steps, IMO for experimental purposes a single knob >>>>> won't cut it: there are multiple degrees of freedom which may play an >>>>> important role in building an accurate performance model. I'm not yet >>>>> convinced it's all about inlining, and narrowing the scope of the discussion >>>>> specifically to type profile width doesn't help. >>>>> >>>>> I'd like to see more knobs introduced before we start conducting >>>>> extensive experiments. So, let's discuss what other information we can >>>>> benefit from. >>>>> >>>>> I mentioned some possible options in the previous email. I find the >>>>> following aspects important for future discussion: >>>>> >>>>>   * shape of the fallback path >>>>>      - what to generalize: 2- to N-morphic vs 1- to N-polymorphic; >>>>>      - affects profiling strategy: majority of receivers vs complete >>>>> list of receiver types observed; >>>>>      - for the N-morphic case, what's the negative effect (quantitative) of >>>>> the deopt? >>>>> >>>>>   * invokevirtual vs invokeinterface call sites >>>>>      - different cost models; >>>>>      - interfaces are harder to optimize, but opportunities for >>>>> strength-reduction from interface to virtual calls exist; >>>>> >>>>>   * inlining heuristics >>>>>      - devirtualization vs inlining >>>>>        - how much benefit from expanding a call site (devirtualize more >>>>> cases) without inlining? should differ for virtual & interface cases >>>>>      - diminishing returns with increase in number of cases >>>>>      - expanding a single call site leads to more code, but frequencies >>>>> stay the same => colder code >>>>>      - based on profiling info (types + frequencies), dynamically >>>>> choose the morphism factor on a per-call site basis? >>>>>      - what optimization opportunities to look for? 
it looks like in >>>>> general callees should benefit more than the caller (due to merges after >>>>> the call site) >>>>> >>>>> What's your take on it? Any other ideas? >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 11.02.2020 02:42, Ludovic Henry wrote: >>>>>> Hello, >>>>>> Thank you very much, John and Vladimir, for your feedback. >>>>>> First, I want to stress that this patch does not change the default. It is still bimorphic guarded inlining >>>>>> by default. This patch, however, provides you the ability to configure the JVM to go for N-morphic guarded >>>>>> inlining, with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are >>>>>> shortcomings with the specifics of this approach so I'll work on fixing those. However, I would want this >>>>>> discussion to focus on this *configurable* feature and not on changing the default. The latter, I think, should be >>>>>> discussed as part of another, more extended running discussion, since, as you pointed out, it has far more >>>>>> reaching consequences than merely improving a micro-benchmark. >>>>>> >>>>>> Now to answer some of your specific questions. >>>>>> >>>>>>> >>>>>>> I haven't looked through the patch in details, but here are some thoughts. >>>>>>> >>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes: >>>>>>> >>>>>>>      if (recv.klass == K1) { >>>>>> m1(...); // either inline or a direct call >>>>>>>      } else if (recv.klass == K2) { >>>>>> m2(...); // either inline or a direct call >>>>>>>      ... >>>>>>>      } else if (recv.klass == Kn) { >>>>>> mn(...); // either inline or a direct call >>>>>>>      } else { >>>>>> deopt(); // invalidate + reinterpret >>>>>>>      } >>>>>> >>>>>> The general shape that exists currently in tip is: >>>>>> >>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>> if (recv.klass == K1) { >>>>>>     m1(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold >>>>>> else if (recv.klass == K2) { >>>>>>     m2(.); // either inline or a direct call >>>>>> } >>>>>> else { >>>>>>     // if (!too_many_traps_or_deopt()) >>>>>>     deopt(); // invalidate + reinterpret >>>>>>     // else >>>>>>     invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>> } >>>>>> There is no particular distinction between Bimorphic, Polymorphic, and Megamorphic. The latter relates more to the >>>>>> fallback rather than the guards. What this change brings is more guards for N-morphic call-sites with N > 2. But >>>>>> it doesn't change why and how these guards are generated (or at least, that is not the intention). >>>>>> The general shape that this change proposes is: >>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>> if (recv.klass == K1) { >>>>>>     m1(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining) >>>>>> && !is_cold >>>>>> else if (recv.klass == K2) { >>>>>>     m2(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold >>>>>> else if (recv.klass == K3) { >>>>>>     m3(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold >>>>>> else if (recv.klass == K4) { >>>>>>     m4(.); // either inline or a direct call >>>>>> } >>>>>> else { >>>>>>     // if (!too_many_traps_or_deopt()) >>>>>>     deopt(); // invalidate + reinterpret >>>>>>     // else >>>>>>     invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>> } >>>>>> You can observe that the condition to create the guards is no different; only the total number increases based on >>>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>>> Question #1: what if you generalize polymorphic shape instead and allow multiple major receivers? Deoptimizing >>>>>>> (and then recompiling) looks less beneficial the higher morphism is (especially considering the inlining on all >>>>>>> paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower frequency) >>>>>>> on the fallback path may be a better option. >>>>>> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each >>>>>> application. That is why the degree of polymorphism needs to rely on a configuration knob, and not be pre-determined >>>>>> on a set of benchmarks. I agree with the proposal to have this knob as a per-method knob, instead of a global knob. >>>>>> As for the impact of a higher morphism, I expect deoptimizations to happen less often as more guards are >>>>>> generated, leading to a lower probability of reaching the fallback path, leading to fewer uncommon >>>>>> traps/deoptimizations. Moreover, the fallback is already going to be a virtual call in case we hit the uncommon >>>>>> trap too often (using too_many_traps_or_recompiles). >>>>>>> Question #2: it would be very interesting to understand what exactly contributes the most to performance >>>>>>> improvements? Is it inlining? Or maybe devirtualization (avoid the cost of virtual call)? How much come from >>>>>>> optimizing interface calls (itable vs vtable stubs)? >>>>>> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from >>>>>> the additional optimizations that are applied by C2 when increasing the scope/size of the code compiled via inlining. 
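The N-morphic guard chain described in the thread behaves like the following plain-Java model for a hypothetical TypeProfileWidth of 3. The interface, the K1..K4 classes, and the dispatch method are invented for illustration; this is what the generated shape is equivalent to, not generated code itself.

```java
// Model of an N-morphic guard chain (width 3) with a virtual-call fallback.
interface A { int foo(); }
final class K1 implements A { public int foo() { return 1; } }
final class K2 implements A { public int foo() { return 2; } }
final class K3 implements A { public int foo() { return 3; } }
final class K4 implements A { public int foo() { return 4; } }

public class NMorphic {
    static int call(A recv) {
        if (recv.getClass() == K1.class) return ((K1) recv).foo();      // guard 1: inlineable
        else if (recv.getClass() == K2.class) return ((K2) recv).foo(); // guard 2
        else if (recv.getClass() == K3.class) return ((K3) recv).foo(); // guard 3
        else return recv.foo(); // fallback: virtual call (or deopt while traps are rare)
    }

    public static void main(String[] args) {
        System.out.println(call(new K2())); // 2: hits guard 2
        System.out.println(call(new K4())); // 4: takes the fallback path
    }
}
```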
>>>>>> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than >>>>>> an indirect call. I haven't measured it, but I expect performance to be positively impacted because of the better >>>>>> ability of modern CPUs to correctly predict instruction branches (a direct call) rather than data branches (an >>>>>> indirect call). >>>>>>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense >>>>>>> to consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining). >>>>>> Incremental inlining is already integrated with the existing solution. In the case of a hot or warm call, in case >>>>>> of failure to inline, it generates a direct call. You still have the guards, reducing the cost of an indirect >>>>>> call, but without the cost of the inlined code. >>>>>>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic >>>>>>> footprint? >>>>>> I'll come back to you with some results. >>>>>>> Getting answers to those (and similar) questions should give us much more insights what is actually happening in >>>>>>> practice. >>>>>>> >>>>>>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily >>>>>>> conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll >>>>>>> enable us to gather important data to guide our decisions about how to enhance the heuristics in the product. >>>>>> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the >>>>>> default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and >>>>>> that I should add a dedicated knob to control the maximum morphism of these guards, then I agree. 
I am using >>>>>> TypeProfileWidth because it's the available and more straightforward knob today. >>>>>> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This >>>>>> change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded >>>>>> inlining. I would want the discussion to change the default to be part of a separate RFR, to separate the feature >>>>>> change discussion from the default change discussion. >>>>>>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where >>>>>>> a local change in one direction might couple to a nearby change in a different direction, with a net change that's >>>>>>> "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph >>>>>>> shaping heuristics, not memory side effects.) >>>>>>> >>>>>>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has >>>>>>> the side effect of making a containing block of code larger than convenient. Three ways of being "larger than >>>>>>> convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the >>>>>>> total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph >>>>>>> to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop, >>>>>>> where an out-of-line call would not. >>>>>>> >>>>>>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") >>>>>>> with respect to the specific path under consideration hurts the optimization of nearby paths which are more >>>>>>> important. >>>>>> I generally agree with this statement and explanation. Again, it is not the intention of this patch to change the >>>>>> default number of guards for polymorphic call-sites, but it is to give users the ability to optimize the code >>>>>> generation of their JVM for their application. >>>>>> Since I am relying on the existing inlining infrastructure, late inlining and the hot/warm/cold call generators allow us >>>>>> to have a "best-of-both-worlds" approach: it inlines code in the hot guards, it direct-calls or inlines (if inlining >>>>>> thresholds permit) the method in the warm guards, and it doesn't even generate the guard in the cold guards. The >>>>>> question then is how do you define hot, warm, and cold. As discussed above, I want to explore using a >>>>>> low threshold even to try to generate a guard (at least 10% of calls are to this specific receiver). >>>>>> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on >>>>>> what code can be inlined. For example, if you have a call-site with 3 types, each with a hit probability of 30%, >>>>>> then with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and >>>>>> inlined. That is despite an apparent gain in guarding and inlining against all 3 types. >>>>>> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing >>>>>> inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this >>>>>> change not to abuse these. >>>>>>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test >>>>>>> this thing shows a 20% gain, and all the functional tests pass.) >>>>>>> >>>>>>> This leads me to the following suggestion: Your code is a very good POC, and deserves more work, and the next >>>>>>> step in that work is probably looking for and thinking about performance regressions, and figuring out how to >>>>>>> throttle this thing. >>>>>> Here again, I want that feature to be behind a configuration knob, and then discuss in a future RFR changing the >>>>>> default. >>>>>>> A specific next step would be to make the throttling of this feature controllable. MorphismLimit should be a >>>>>>> global on its own. And it should be configurable through the CompilerOracle per method. (See similar code for >>>>>>> similar throttles.) And it should be more sensitive to the hotness of the overall call and of the various slices >>>>>>> of the call's profile. (I notice with suspicion that the comment "The single majority receiver sufficiently >>>>>>> outweighs the minority" is missing in the changed code.) And, if the change is as disruptive to heuristics as I >>>>>>> suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some >>>>>>> deopt or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to >>>>>>> show the kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant >>>>>>> optimizations. >>>>>> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch. >>>>>>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming >>>>>>> out of your work-group. Welcome to the adventure! >>>>>> For future improvements, I will keep focusing on inlining as I see it as the door opener to many more >>>>>> optimizations in C2. I am still learning what can be done to reduce the size of the inlined code by, for >>>>>> example, applying specific optimizations that simplify the CG (like dead-code elimination or constant propagation) >>>>>> before inlining the code. 
As you said, we are not short of ideas on *how* to improve it, but we have to be very >>>>>> wary of *what impact* it'll have on real-world applications. We're working with internal customers to figure that >>>>>> out, and we'll share the results as soon as we are ready with benchmarks for those use-case patterns. >>>>>> What I am working on now is: >>>>>>   - Add a per-method flag through CompilerOracle >>>>>>   - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a >>>>>> receiver is observed less than 1 in every 10 calls, then don't generate a guard and use the fallback) >>>>>>   - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code) >>>>>> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or >>>>>> tomorrow). >>>>>> -- >>>>>> Ludovic >>>>>> >>>>>> -----Original Message----- >>>>>> From: Vladimir Ivanov >>>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>>> To: Ludovic Henry ; hotspot-compiler-dev at openjdk.java.net >>>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>>> >>>>>> Very interesting results, Ludovic! 
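The 10% receiver-probability threshold mentioned in the to-do list above amounts to a guard-selection policy. The sketch below models it; `ReceiverProfile` and `selectGuards` are invented names for illustration, not HotSpot code, and the exact policy in the patch may differ.

```java
import java.util.Comparator;
import java.util.List;

// Pick which profiled receivers get a type guard: sort by observed count,
// drop receivers below the probability threshold, cap at the profile width.
public class GuardSelection {
    record ReceiverProfile(String klass, long count) {}

    static List<String> selectGuards(List<ReceiverProfile> profile,
                                     int typeProfileWidth, double minProb) {
        long total = profile.stream().mapToLong(ReceiverProfile::count).sum();
        return profile.stream()
                .sorted(Comparator.comparingLong(ReceiverProfile::count).reversed())
                .filter(r -> total > 0 && (double) r.count() / total >= minProb)
                .limit(typeProfileWidth)
                .map(ReceiverProfile::klass)
                .toList();
    }

    public static void main(String[] args) {
        var profile = List.of(new ReceiverProfile("A", 35),
                              new ReceiverProfile("B", 33),
                              new ReceiverProfile("C", 30),
                              new ReceiverProfile("D", 2));
        // With width 4 and a 10% threshold, D (2% of calls) gets no guard
        // and is left to the fallback path.
        System.out.println(selectGuards(profile, 4, 0.10)); // [A, B, C]
    }
}
```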
>>>>>> >>>>>>> The image can be found at >>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473 >>>>>>> >>>>>> >>>>>> Can you elaborate on the experiment itself, please? In particular, what >>>>>> does PERCENTILES actually mean? >>>>>> >>>>>> I haven't looked through the patch in details, but here are some thoughts. >>>>>> >>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems >>>>>> you try to generalize (b) which becomes: >>>>>> >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else if (recv.klass == K2) { >>>>>>         m2(...); // either inline or a direct call >>>>>>      ... >>>>>>      } else if (recv.klass == Kn) { >>>>>>         mn(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>> Question #1: what if you generalize polymorphic shape instead and allow >>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less >>>>>> beneficial the higher morphism is (especially considering the inlining >>>>>> on all paths becomes less likely as well). 
So, having a virtual call >>>>>> (which becomes less likely due to lower frequency) on the fallback path >>>>>> may be a better option. >>>>>> >>>>>> >>>>>> Question #2: it would be very interesting to understand what exactly >>>>>> contributes the most to performance improvements? Is it inlining? Or >>>>>> maybe devirtualization (avoid the cost of virtual call)? How much come >>>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>>> >>>>>> Deciding how to spend inlining budget on multiple targets with moderate >>>>>> frequency can be hard, so it makes sense to consider expanding >>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental >>>>>> inlining). >>>>>> >>>>>> >>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>> >>>>>> >>>>>> Getting answers to those (and similar) questions should give us much >>>>>> more insights what is actually happening in practice. >>>>>> >>>>>> Speaking of the first deliverables, it would be good to introduce a new >>>>>> experimental mode to be able to easily conduct such experiments with >>>>>> product binaries and I'd like to see the patch evolving in that >>>>>> direction. It'll enable us to gather important data to guide our >>>>>> decisions about how to enhance the heuristics in the product. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>>>> >>>>>> [1] (a) Monomorphic: >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>>      (b) Bimorphic: >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else if (recv.klass == K2) { >>>>>>         m2(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>>      (c) Polymorphic: >>>>>>      if (recv.klass == K1) { // major receiver (by default, >90%) >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else { >>>>>>         K.m(); // virtual call >>>>>>      } >>>>>> >>>>>>      (d) Megamorphic: >>>>>>      K.m(); // virtual (K is either concrete or interface class) >>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry >>>>>>> Sent: Thursday, February 6, 2020 9:18 AM >>>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2 >>>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> In our ongoing search for improved performance, I've looked at inlining and, more specifically, at polymorphic >>>>>>> guarded inlining. Today in HotSpot, the maximum number of guards for types at any call site is two - with >>>>>>> bimorphic guarded inlining. However, Graal and Zing have observed great results with increasing that limit. >>>>>>> >>>>>>> You'll find below a patch that makes the number of guards for types configurable with the `TypeProfileWidth` >>>>>>> global. >>>>>>> >>>>>>> Testing: >>>>>>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts) >>>>>>> >>>>>>> Benchmarking: >>>>>>> To get data, we run a benchmark against Apache Pinot and observe the following results: >>>>>>> >>>>>>> [cid:image001.png at 01D5D2DB.F5165550] >>>>>>> >>>>>>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are >>>>>>> currently validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as >>>>>>> soon as we have them. >>>>>>> >>>>>>> I am happy to provide more information, just let me know if you have any questions. 
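The too_many_traps_or_recompiles behavior discussed earlier in the thread (deopt on the fallback path until a trap budget is exhausted, then fall back to a virtual call) can be modeled roughly as below. The trap limit, names, and return strings are invented for the sketch; the real mechanism recompiles the method rather than flipping a flag.

```java
// Toy model of the fallback policy at a guarded call site: while uncommon
// traps are rare the fallback "deopts"; once a trap budget is exceeded the
// site switches to a virtual-call fallback and stops deoptimizing.
public class FallbackPolicy {
    static final int TRAP_LIMIT = 2; // invented budget, stands in for PerBytecodeTrapLimit-style logic
    private int traps = 0;
    private boolean virtualFallback = false;

    String dispatch(String klass) {
        if (klass.equals("K1")) return "inlined K1.m()"; // guarded fast path
        if (virtualFallback) return "virtual call";      // stable fallback after recompile
        traps++;
        if (traps >= TRAP_LIMIT) virtualFallback = true; // too many traps: stop deopting
        return "deopt + reinterpret";
    }

    public static void main(String[] args) {
        FallbackPolicy site = new FallbackPolicy();
        System.out.println(site.dispatch("K2")); // deopt + reinterpret
        System.out.println(site.dispatch("K2")); // deopt + reinterpret
        System.out.println(site.dispatch("K2")); // virtual call
    }
}
```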
>>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> index 73854806ed..845070fbe1 100644 >>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> @@ -38,7 +38,7 @@ private: >>>>>>>    friend class ciMethod; >>>>>>>    friend class ciMethodHandle; >>>>>>> >>>>>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>>>>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>>>>>>    int  _limit;                // number of receivers have been determined >>>>>>>    int  _morphism;             // determined call site's morphism >>>>>>>    int  _count;                // # times has this call been executed >>>>>>> @@ -47,6 +47,7 @@ private: >>>>>>>    ciKlass*  _receiver[MorphismLimit + 1];  // receivers (exact) >>>>>>> >>>>>>>    ciCallProfile() { >>>>>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth"); >>>>>>>      _limit = 0; >>>>>>>      _morphism    = 0; >>>>>>>      _count = -1; >>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp >>>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>       // Every profiled call site has a counter. >>>>>>>       int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci)); >>>>>>> >>>>>>> -      if (!data->is_ReceiverTypeData()) { >>>>>>> -        result._receiver_count[0] = 0;  // that's a definite zero >>>>>>> -      } else { // ReceiverTypeData is a subclass of CounterData >>>>>>> +      if (data->is_ReceiverTypeData()) { >>>>>>>         ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData(); >>>>>>>         // In addition, virtual call sites have receiver type information >>>>>>>         int receivers_count_total = 0; >>>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>           // is recorded or an associated counter is incremented, but not both. With >>>>>>>           // tiered compilation, however, both can happen due to the interpreter and >>>>>>>           // C1 profiling invocations differently. Address that inconsistency here. >>>>>>> -          if (morphism == 1 && count > 0) { >>>>>>> +          if (morphism >= 1 && count > 0) { >>>>>>>             epsilon = count; >>>>>>>             count = 0; >>>>>>>           } >>>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>          // If we extend profiling to record methods, >>>>>>>           // we will set result._method also. >>>>>>>         } >>>>>>> +        result._morphism = morphism; >>>>>>>         // Determine call site's morphism. >>>>>>>         // The call site count is 0 with known morphism (only 1 or 2 receivers) >>>>>>>         // or < 0 in the case of a type check failure for checkcast, aastore, instanceof. >>>>>>>         // The call site count is > 0 in the case of a polymorphic virtual call. >>>>>>> -        if (morphism > 0 && morphism == result._limit) { >>>>>>> -           // The morphism <= MorphismLimit. >>>>>>> -           if ((morphism < ciCallProfile::MorphismLimit) || >>>>>>> -               (morphism == ciCallProfile::MorphismLimit && count == 0)) { >>>>>>> +        assert(result._morphism == result._limit, ""); >>>>>>> #ifdef ASSERT >>>>>>> +        if (result._morphism > 0) { >>>>>>> +           // The morphism <= TypeProfileWidth. >>>>>>> +           if ((result._morphism < TypeProfileWidth) || >>>>>>> +               (result._morphism == TypeProfileWidth && count == 0)) { >>>>>>>              if (count > 0) { >>>>>>>                this->print_short_name(tty); >>>>>>>                tty->print_cr(" @ bci:%d", bci); >>>>>>>                this->print_codes(); >>>>>>>                assert(false, "this call site should not be polymorphic"); >>>>>>>              } >>>>>>> -#endif >>>>>>> -             result._morphism = morphism; >>>>>>>            } >>>>>>>          } >>>>>>> +#endif >>>>>>>         // Make the count consistent if this is a call profile. If count is >>>>>>>         // zero or less, presume that this is a typecheck profile and >>>>>>>         // do nothing. Otherwise, increase count to be the sum of all >>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) { >>>>>>>    } >>>>>>>    _receiver[i] = receiver; >>>>>>>    _receiver_count[i] = receiver_count; >>>>>>> -  if (_limit < MorphismLimit) _limit++; >>>>>>> +  if (_limit < TypeProfileWidth) _limit++; >>>>>>> } >>>>>>> >>>>>>> >>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp >>>>>>> index d605bdb7bd..7a8dee43e5 100644 >>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>>>> @@ -389,9 +389,16 @@ >>>>>>>    product(bool, UseBimorphicInlining, true,                                 \ >>>>>>>            "Profiling based inlining for two receivers")                     \ >>>>>>> \ >>>>>>> +  product(bool, UsePolymorphicInlining, true,                               \ >>>>>>> +          "Profiling based inlining for two or more receivers")             \ >>>>>>> + \ >>>>>>>    product(bool, UseOnlyInlinedBimorphic, true,                              \ >>>>>>>            "Don't use BimorphicInlining if can't inline a second method")    \ >>>>>>> \ >>>>>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                            \ >>>>>>> +          
"Don't use PolymorphicInlining if can't inline a non-major "      \ >>>>>>> +          "receiver's method")                                              \ >>>>>>> + \ >>>>>>>    product(bool, InsertMemBarAfterArraycopy, true,                           \ >>>>>>>            "Insert memory barrier after arraycopy call")                     \ >>>>>>> \ >>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp >>>>>>> index 44ab387ac8..6f940209ce 100644 >>>>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>> >>>>>>>    // See how many times this site has been invoked. >>>>>>>    int site_count = profile.count(); >>>>>>> -  int receiver_count = -1; >>>>>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) { >>>>>>> -    // Receivers in the profile structure are ordered by call counts >>>>>>> -    // so that the most called (major) receiver is profile.receiver(0). >>>>>>> -    receiver_count = profile.receiver_count(0); >>>>>>> -  } >>>>>>> >>>>>>>    CompileLog* log = this->log(); >>>>>>>    if (log != NULL) { >>>>>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1; >>>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1; >>>>>>> +    ResourceMark rm; >>>>>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>>>>> +      rids[i] = log->identify(profile.receiver(i)); >>>>>>> +    } >>>>>>>     log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>>>>>                     log->identify(callee), site_count, prof_factor); >>>>>>>     if (call_does_dispatch)  log->print(" virtual='1'"); >>>>>>>     if (allow_inline)     log->print(" inline='1'"); >>>>>>> -    if (receiver_count >= 0) { >>>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count); >>>>>>> -       if (profile.has_receiver(1)) { >>>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1)); >>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>>>>> +      if (i == 0) { >>>>>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i)); >>>>>>> +      } else { >>>>>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i)); >>>>>>>       } >>>>>>>     } >>>>>>>     if (callee->is_method_handle_intrinsic()) { >>>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>     if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>>>>>       // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count. >>>>>>>       bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= >>>>>>> (float)TypeProfileMajorReceiverPercent); >>>>>>> -      ciMethod* receiver_method = NULL; >>>>>>> >>>>>>>       int morphism = profile.morphism(); >>>>>>> + >>>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism)); >>>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism)); >>>>>>> + >>>>>>>       if (speculative_receiver_type != NULL) { >>>>>>>         if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) { >>>>>>>           // We have a speculative type, we should be able to resolve >>>>>>>           // the call. We do that before looking at the profiling at >>>>>>> -          // this invoke because it may lead to bimorphic inlining which >>>>>>> +          
// this invoke because it may lead to polymorphic inlining which >>>>>>> ?????????????? // a speculative type should help us avoid. >>>>>>> -????????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - speculative_receiver_type); >>>>>>> -????????? if (receiver_method == NULL) { >>>>>>> +????????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + speculative_receiver_type); >>>>>>> +????????? if (receiver_methods[0] == NULL) { >>>>>>> ???????????????? speculative_receiver_type = NULL; >>>>>>> ?????????????? } else { >>>>>>> ???????????????? morphism = 1; >>>>>>> ?????????????? } >>>>>>> ???????????? } else { >>>>>>> ?????????????? // speculation failed before. Use profiling at the call >>>>>>> -????????? // (could allow bimorphic inlining for instance). >>>>>>> +????????? // (could allow polymorphic inlining for instance). >>>>>>> ?????????????? speculative_receiver_type = NULL; >>>>>>> ???????????? } >>>>>>> ?????????? } >>>>>>> -????? if (receiver_method == NULL && >>>>>>> +????? if (receiver_methods[0] == NULL && >>>>>>> ?????????????? (have_major_receiver || morphism == 1 || >>>>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>>>> -??????? // receiver_method = profile.method(); >>>>>>> +?????????? (morphism == 2 && UseBimorphicInlining) || >>>>>>> +?????????? (morphism >= 2 && UsePolymorphicInlining))) { >>>>>>> +??????? assert(profile.has_receiver(0), "no receiver at 0"); >>>>>>> +??????? // receiver_methods[0] = profile.method(); >>>>>>> ???????????? // Profiles do not suggest methods now.? Look it up in the major receiver. >>>>>>> -??????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - profile.receiver(0)); >>>>>>> +??????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + profile.receiver(0)); >>>>>>> ?????????? } >>>>>>> -????? if (receiver_method != NULL) { >>>>>>> -??????? 
// The single majority receiver sufficiently outweighs the minority. >>>>>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>>>>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor); >>>>>>> -??????? if (hit_cg != NULL) { >>>>>>> -????????? // Look up second receiver. >>>>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>>>> -??????????? next_receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - profile.receiver(1)); >>>>>>> -??????????? if (next_receiver_method != NULL) { >>>>>>> -????????????? next_hit_cg = this->call_generator(next_receiver_method, >>>>>>> -????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> -????????????????????????????????? allow_inline, prof_factor); >>>>>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>>>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>>>>> -????????????????? // Skip if we can't inline second receiver's method >>>>>>> -????????????????? next_hit_cg = NULL; >>>>>>> +????? if (receiver_methods[0] != NULL) { >>>>>>> +??????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism)); >>>>>>> +??????? memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism)); >>>>>>> + >>>>>>> +??????? hit_cgs[0] = this->call_generator(receiver_methods[0], >>>>>>> +??????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> +??????????????????????????? allow_inline, prof_factor); >>>>>>> +??????? if (hit_cgs[0] != NULL) { >>>>>>> +????????? if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) { >>>>>>> +??????????? for (int i = 1; i < morphism; i++) { >>>>>>> +????????????? assert(profile.has_receiver(i), "no receiver at %d", i); >>>>>>> +????????????? 
receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + profile.receiver(i)); >>>>>>> +????????????? if (receiver_methods[i] != NULL) { >>>>>>> +??????????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>>>>> +????????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> +????????????????????????????????????? allow_inline, prof_factor); >>>>>>> +??????????????? if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver && >>>>>>> +??????????????????? ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) { >>>>>>> +????????????????? // Skip if we can't inline non-major receiver's method >>>>>>> +????????????????? hit_cgs[i] = NULL; >>>>>>> +??????????????? } >>>>>>> ?????????????????? } >>>>>>> ???????????????? } >>>>>>> ?????????????? } >>>>>>> ?????????????? CallGenerator* miss_cg; >>>>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>>>> -?????????????????????????????????????????????? ? Deoptimization::Reason_bimorphic >>>>>>> +????????? Deoptimization::DeoptReason reason = (morphism >= 2 >>>>>>> +?????????????????????????????????????????????? ? Deoptimization::Reason_polymorphic >>>>>>> ??????????????????????????????????????????????????? : >>>>>>> Deoptimization::reason_class_check(speculative_receiver_type != NULL)); >>>>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) && >>>>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>>>> -???????????? ) { >>>>>>> +????????? if (!too_many_traps_or_recompiles(caller, bci, reason)) { >>>>>>> ???????????????? // Generate uncommon trap for class check failure path >>>>>>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>>>>>> +??????????? // in case of polymorphic virtual call site. >>>>>>> ???????????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>>>>> ???????????????????????????? 
Deoptimization::Action_maybe_recompile); >>>>>>> ?????????????? } else { >>>>>>> ???????????????? // Generate virtual call for class check failure path >>>>>>> -??????????? // in case of polymorphic virtual call site. >>>>>>> +??????????? // in case of megamorphic virtual call site. >>>>>>> ???????????????? miss_cg = CallGenerator::for_virtual_call(callee, vtable_index); >>>>>>> ?????????????? } >>>>>>> -????????? if (miss_cg != NULL) { >>>>>>> -??????????? if (next_hit_cg != NULL) { >>>>>>> +????????? for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) { >>>>>>> +??????????? if (hit_cgs[i] != NULL) { >>>>>>> ?????????????????? assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation"); >>>>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>>>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>>>>> +????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], >>>>>>> profile.receiver(i), site_count, profile.receiver_count(i)); >>>>>>> ?????????????????? // We don't need to record dependency on a receiver here and below. >>>>>>> ?????????????????? // Whenever we inline, the dependency is added by Parse::Parse(). >>>>>>> -????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX); >>>>>>> -??????????? } >>>>>>> -??????????? if (miss_cg != NULL) { >>>>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, >>>>>>> site_count, receiver_count); >>>>>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>>>>> -????????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>>>> -????????????? 
if (cg != NULL)? return cg; >>>>>>> +????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX); >>>>>>> ???????????????? } >>>>>>> ?????????????? } >>>>>>> +????????? if (miss_cg != NULL) { >>>>>>> +??????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>>>>> +??????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, >>>>>>> site_count, profile.receiver_count(0)); >>>>>>> +??????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>>>>> +??????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob); >>>>>>> +??????????? if (cg != NULL)? return cg; >>>>>>> +????????? } >>>>>>> ???????????? } >>>>>>> ????????? } >>>>>>> ???????? } >>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> index 11df15e004..2d14b52854 100644 >>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = { >>>>>>> ?????? "class_check", >>>>>>> ?????? "array_check", >>>>>>> ?????? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>> ?????? "profile_predicate", >>>>>>> ?????? "unloaded", >>>>>>> ?????? "uninitialized", >>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> index 1cfff5394e..c1eb998aba 100644 >>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>>>> ???????? Reason_class_check,?????????? 
// saw unexpected object class (@bci) >>>>>>> ???????? Reason_array_check,?????????? // saw unexpected array class (aastore @bci) >>>>>>> ???????? Reason_intrinsic,???????????? // saw unexpected operand to intrinsic (@bci) >>>>>>> -??? Reason_bimorphic,???????????? // saw unexpected object class in bimorphic inlining (@bci) >>>>>>> +??? Reason_polymorphic,?????????? // saw unexpected object class in bimorphic inlining (@bci) >>>>>>> >>>>>>> #if INCLUDE_JVMCI >>>>>>> ???????? Reason_unreached0???????????? = Reason_null_assert, >>>>>>> ???????? Reason_type_checked_inlining? = Reason_intrinsic, >>>>>>> -??? Reason_optimized_type_check?? = Reason_bimorphic, >>>>>>> +??? Reason_optimized_type_check?? = Reason_polymorphic, >>>>>>> #endif >>>>>>> >>>>>>> ???????? Reason_profile_predicate,???? // compiler generated predicate moved from frequent branch in a loop failed >>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> index 94b544824e..ee761626c4 100644 >>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry? 
KlassHashtableEntry;
>>>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>>>

From vladimir.x.ivanov at oracle.com Tue Apr 7 19:31:09 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Tue, 7 Apr 2020 22:31:09 +0300
Subject: Polymorphic Guarded Inlining in C2
In-Reply-To: <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com>
References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com>
 <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com>
 <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com>
 <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com>
Message-ID: <0307f0de-4743-5870-6f83-ce2e88d438b0@oracle.com>

> Another thing we can do is collect statistics about how many
> different receivers can be recorded with a big TypeProfileWidth. My
> recollection from long ago was that the only case for poly was HashMap
> usage. It would be nice to collect this data again for modern Java
> benchmarks. We can use them to see effects of changes - benchmarks
> which do not have poly cases are useless in these experiments.

Yes, such data would be very valuable. The last time I looked at
megamorphic call sites, only a few of the standard benchmarks (SPEC*)
had any in hot code.

Additionally, separating data for virtual and interface calls looks
very useful.

> On 4/6/20 6:38 AM, Vladimir Ivanov wrote:
>> I see 2 directions (mostly independent) to proceed: (1) use existing
>> profiling info only; and (2) when more profile info is available.
>>
>> I suggest to explore them independently.
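The per-call-site receiver statistics discussed here can be mimicked with a toy model in plain Java (purely illustrative - this is not the JVM's MethodData machinery, and the class and field names below are invented): keep at most TypeProfileWidth distinct receiver classes per call site, and once the rows are full, route every further class to a shared "nonprofiled" counter, which is how a megamorphic site shows up in the profile.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one call site's receiver-type profile: at most
// WIDTH distinct receiver classes get a (klass, count) row; any further
// class only bumps the shared "nonprofiled" counter (megamorphic case).
public class ReceiverProfile {
    static final int WIDTH = 2;                 // like -XX:TypeProfileWidth=2
    final Map<Class<?>, Integer> rows = new LinkedHashMap<>();
    int nonprofiled = 0;

    void record(Object receiver) {
        Class<?> k = receiver.getClass();
        if (rows.containsKey(k)) {
            rows.merge(k, 1, Integer::sum);     // existing row: bump its count
        } else if (rows.size() < WIDTH) {
            rows.put(k, 1);                     // free row: start tracking klass
        } else {
            nonprofiled++;                      // no free row: megamorphic hit
        }
    }

    int morphism() { return rows.size(); }

    public static void main(String[] args) {
        ReceiverProfile p = new ReceiverProfile();
        for (Object o : new Object[] { "a", "b", 1, 2.0 }) p.record(o);
        // Two Strings share one row, Integer takes the second row,
        // Double overflows into the nonprofiled counter.
        System.out.println(p.morphism() + " " + p.nonprofiled); // prints "2 1"
    }
}
```

Counting how often `nonprofiled` is non-zero across a benchmark run is exactly the kind of statistic being asked for above.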
>>
>> There's enough profiling data available to introduce a polymorphic
>> case with 2 major receivers ("2-poly"). And it'll complete the matrix
>> of possible shapes.

> Please explain how it is different from the current bimorphic case?

The bimorphic case is when there are exactly 2 receivers recorded in the
type profile and an uncommon trap is put on the fallback path.

Polymorphic (1-poly) doesn't care about the total number of receivers,
just that one of them is encountered more frequently than the others
(>TypeProfileMajorReceiverPercent). On the fallback path it has a
virtual call. That's the difference from the monomorphic (1-morphic)
case.

What I call 2-poly is when the number of major receivers is increased to
2, but still keeping a virtual call on the fallback path.

So, the only difference between 2-poly and bimorphic is the shape of the
fallback path.

Best regards,
Vladimir Ivanov

>> Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more
>> generic shapes: "N-morphic" and "N-poly". The only difference between
>> them is what happens on the fallback path - deopt / uncommon trap or
>> a virtual call.
>>
>> Regarding 2-poly, there is TypeProfileMajorReceiverPercent which
>> should be extended to 2 cases, which leads to 2 parameters: aggregated
>> major receiver percentage and minimum individual percentage.
>
> okay
>
>> Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which
>> aligns 2-poly with the bimorphic case.
>>
>> And, as I mentioned before, IMO it's promising to distinguish
>> invokevirtual and invokeinterface cases. So, an additional flag to
>> control that would be useful.
>
> yes
>
>> Regarding the N-poly/N-morphic cases, they can be generalized from
>> the 2-poly/bi-morphic cases.
>>
>> I believe experiments on 2-poly will provide useful insights on
>> N-poly/N-morphic, so it makes sense to start with 2-poly first.
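The call-site shapes described above can be sketched with a toy Java model (purely illustrative - the real shapes are C2-generated machine code with class-pointer guards, and every name below is invented): the guards stand in for inlined receivers, and the two possible fallback paths are a deoptimization (modeled here as an exception) versus a plain virtual call.

```java
interface Shape { int area(); }
class Square implements Shape { public int area() { return 4; } }
class Circle implements Shape { public int area() { return 3; } }
class Oval   implements Shape { public int area() { return 5; } }

public class DispatchShapes {
    // Bimorphic: guards for exactly the two recorded receivers, with an
    // uncommon trap (modeled as an exception) on the fallback path.
    static int bimorphic(Shape s) {
        if (s.getClass() == Square.class) return 4; // "inlined" Square.area()
        if (s.getClass() == Circle.class) return 3; // "inlined" Circle.area()
        throw new IllegalStateException("deopt: unexpected receiver");
    }

    // 2-poly: the same two guards, but the fallback is a virtual call,
    // so any further receiver still works without deoptimizing.
    static int twoPoly(Shape s) {
        if (s.getClass() == Square.class) return 4;
        if (s.getClass() == Circle.class) return 3;
        return s.area();                            // virtual-call fallback
    }

    public static void main(String[] args) {
        System.out.println(twoPoly(new Oval()));    // fallback handles Oval
        try {
            bimorphic(new Oval());
        } catch (IllegalStateException e) {
            System.out.println("deopt");            // bimorphic has to bail out
        }
    }
}
```

Under this model the trade-off is visible directly: 2-poly degrades gracefully on a third receiver, while the bimorphic shape trades that robustness for a cheaper fallback that assumes the profile was exhaustive.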
>
> Yes
>
> Thanks,
> Vladimir K
>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 01.04.2020 01:29, Vladimir Kozlov wrote:
>>> Looks like the graphs were stripped from the email. I put them on
>>> GitHub:
>>>
>>>
>>> Also Vladimir Ivanov forwarded me data he collected.
>>>
>>> His next data shows that profiling is not "free". Vladimir I. limited
>>> to tier3 (-XX:TieredStopAtLevel=3, C1 compilation with profiling
>>> code) to show that profiling code with TPW=8 is slower. Note, with 4
>>> tiers this may not be visible because execution will be switched to
>>> C2 compiled code (without profiling code).
>>>
>>>
>>> The next data was collected for the proposed patch. Vladimir I.
>>> collected data for several flag configurations. The next graphs are
>>> for one of the settings: '-XX:+UsePolymorphicInlining
>>> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>>>
>>>
>>> It has mixed data but most benchmarks are not affected. Which means
>>> we need to spend more time on the proposed changes.
>>>
>>> Vladimir K
>>>
>>> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>>>> I started looking at it.
>>>>
>>>> I think ideally TypeProfileWidth should be per call site and not per
>>>> method - and it will require a more complicated implementation
>>>> (another RFE). But for experiments I think setting it to 8 (or
>>>> higher) for all methods is okay.
>>>>
>>>> Note, more profiling lines per call site cost a few MB in the
>>>> CodeCache (overestimation: 20K nmethods * 10 call sites * 6 * 8
>>>> bytes) vs very complicated code to have a dynamic number of lines.
>>>>
>>>> I think we should first investigate the best heuristics for inlining
>>>> vs direct call vs vcall vs uncommon traps for polymorphic cases and
>>>> worry about memory and time consumption during profiling later.
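Kozlov's back-of-the-envelope CodeCache estimate can be checked directly. All four factors are his stated overestimates, not measured values, and the 8 bytes per extra row is his simplification:

```java
// Sanity-check the quoted overestimate: extra footprint when growing
// type profiles from TypeProfileWidth=2 to 8 across all compiled code.
public class ProfileFootprint {
    public static void main(String[] args) {
        long nmethods    = 20_000; // assumed number of compiled methods
        long callSites   = 10;     // assumed virtual call sites per method
        long extraRows   = 6;      // extra profile rows going from TPW 2 to 8
        long bytesPerRow = 8;      // one word per extra row (simplification)
        long total = nmethods * callSites * extraRows * bytesPerRow;
        System.out.println(total + " bytes, ~" + total / (1024 * 1024) + " MiB");
        // prints "9600000 bytes, ~9 MiB"
    }
}
```

So "a few MB" checks out: roughly 9 MiB even with every factor rounded up.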
>>>>
>>>> I did some performance runs with the latest JDK 15 for
>>>> TypeProfileWidth=8 vs =2 and don't see much difference for the SPEC
>>>> benchmarks (see attached graph - grey dots mean no significant
>>>> difference). But there are regressions (red dots) for Renaissance,
>>>> which includes some modern benchmarks.
>>>>
>>>> I will work this week to get similar data with Ludovic's patch.
>>>>
>>>> I am for an incremental approach. I think we can start/push based on
>>>> what Ludovic is currently suggesting (do more processing for TPW >
>>>> 2) while preserving the current default behaviour (for TPW <= 2).
>>>> But only if it gives improvements in these benchmarks. We use these
>>>> benchmarks as criteria for JDK releases.
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>>>> Hi Vladimir,
>>>>>
>>>>> As requested offline, please find following the latest version of
>>>>> the patch. Contrary to what was discussed initially, I haven't done
>>>>> the work to support per-method TypeProfileWidth, as that requires
>>>>> extending the existing CompilerDirectives to be available to the
>>>>> Interpreter. For me to achieve that work, I would need guidance on
>>>>> how to approach the problem, and what your expectations are.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> index 4ed93169c7..bad9cddf20 100644
>>>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> @@ -1731,7 +1731,7 @@ void
>>>>> InterpreterMacroAssembler::record_item_in_profile_helper(Register
>>>>> item, Reg
>>>>>           Label found_null;
>>>>>           jccb(Assembler::zero, found_null);
>>>>>           // Item did not match any saved item and there is no
>>>>> empty row for it.
>>>>> -         
// Increment total counter to indicate megamorphic case. >>>>> ??????????? increment_mdp_data_at(mdp, non_profiled_offset); >>>>> ??????????? jmp(done); >>>>> ??????????? bind(found_null); >>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>> index 73854806ed..c5030149bf 100644 >>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>> @@ -38,7 +38,8 @@ private: >>>>> ??? friend class ciMethod; >>>>> ??? friend class ciMethodHandle; >>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care >>>>> about >>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care >>>>> about >>>>> +? bool _is_megamorphic;????????? // whether the call site is >>>>> megamorphic >>>>> ??? int? _limit;??????????????? // number of receivers have been >>>>> determined >>>>> ??? int? _morphism;???????????? // determined call site's morphism >>>>> ??? int? _count;??????????????? // # times has this call been executed >>>>> @@ -47,6 +48,8 @@ private: >>>>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>>> ??? ciCallProfile() { >>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>> can't be smaller than TypeProfileWidth"); >>>>> +??? _is_megamorphic = false; >>>>> ????? _limit = 0; >>>>> ????? _morphism??? = 0; >>>>> ????? _count = -1; >>>>> @@ -58,6 +61,8 @@ private: >>>>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>>>> ? public: >>>>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>>>> + >>>>> ??? // Note:? The following predicates return false for invalid >>>>> profiles: >>>>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>>>> ??? int?????? morphism() const????????? 
{ return _morphism; } >>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>> index d771be8dac..c190919708 100644 >>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>> @@ -531,25 +531,27 @@ ciCallProfile >>>>> ciMethod::call_profile_at_bci(int bci) { >>>>> ??????????? // If we extend profiling to record methods, >>>>> ??????????? // we will set result._method also. >>>>> ????????? } >>>>> -??????? // Determine call site's morphism. >>>>> +??????? // Determine call site's megamorphism. >>>>> ????????? // The call site count is 0 with known morphism (only 1 >>>>> or 2 receivers) >>>>> ????????? // or < 0 in the case of a type check failure for >>>>> checkcast, aastore, instanceof. >>>>> -??????? // The call site count is > 0 in the case of a polymorphic >>>>> virtual call. >>>>> +??????? // The call site count is > 0 in the case of a megamorphic >>>>> virtual call. >>>>> ????????? if (morphism > 0 && morphism == result._limit) { >>>>> ???????????? // The morphism <= MorphismLimit. >>>>> -?????????? if ((morphism >>>> -?????????????? (morphism == ciCallProfile::MorphismLimit && count >>>>> == 0)) { >>>>> +?????????? if ((morphism >>>> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>>>> ? #ifdef ASSERT >>>>> ?????????????? if (count > 0) { >>>>> ???????????????? this->print_short_name(tty); >>>>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>>>> ???????????????? this->print_codes(); >>>>> -?????????????? assert(false, "this call site should not be >>>>> polymorphic"); >>>>> +?????????????? assert(false, "this call site should not be >>>>> megamorphic"); >>>>> ?????????????? } >>>>> ? #endif >>>>> -???????????? result._morphism = morphism; >>>>> +?????????? } else { >>>>> +????????????? result._is_megamorphic = true; >>>>> ???????????? } >>>>> ????????? } >>>>> +??????? result._morphism = morphism; >>>>> ????????? 
// Make the count consistent if this is a call profile. >>>>> If count is >>>>> ????????? // zero or less, presume that this is a typecheck profile >>>>> and >>>>> ????????? // do nothing.? Otherwise, increase count to be the sum >>>>> of all >>>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* >>>>> receiver, int receiver_count) { >>>>> ??? } >>>>> ??? _receiver[i] = receiver; >>>>> ??? _receiver_count[i] = receiver_count; >>>>> -? if (_limit < MorphismLimit) _limit++; >>>>> +? if (_limit < TypeProfileWidth) _limit++; >>>>> ? } >>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>>>> b/src/hotspot/share/opto/c2_globals.hpp >>>>> index d605bdb7bd..e4a5e7ea8b 100644 >>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>> @@ -389,9 +389,16 @@ >>>>> ??? product(bool, UseBimorphicInlining, >>>>> true,???????????????????????????????? \ >>>>> ??????????? "Profiling based inlining for two >>>>> receivers")???????????????????? \ >>>>> \ >>>>> +? product(bool, UsePolymorphicInlining, >>>>> true,?????????????????????????????? \ >>>>> +????????? "Profiling based inlining for two or more >>>>> receivers")???????????? \ >>>>> + \ >>>>> ??? product(bool, UseOnlyInlinedBimorphic, >>>>> true,????????????????????????????? \ >>>>> ??????????? "Don't use BimorphicInlining if can't inline a second >>>>> method")??? \ >>>>> \ >>>>> +? product(bool, UseOnlyInlinedPolymorphic, >>>>> true,??????????????????????????? \ >>>>> +????????? "Don't use PolymorphicInlining if can't inline a >>>>> secondary "????? \ >>>>> + "method")???????????????????????????????????????????????????????? \ >>>>> + \ >>>>> ??? product(bool, InsertMemBarAfterArraycopy, >>>>> true,?????????????????????????? \ >>>>> ??????????? "Insert memory barrier after arraycopy >>>>> call")???????????????????? \ >>>>> \ >>>>> @@ -645,6 +652,10 @@ >>>>> ??????????? "% of major receiver type to all profiled >>>>> receivers")???????????? \ >>>>> ??????????? 
range(0, >>>>> 100)???????????????????????????????????????????????????? \ >>>>> \ >>>>> +? product(intx, TypeProfileMinimumReceiverPercent, >>>>> 20,????????????????????? \ >>>>> +????????? "minimum % of receiver type to all profiled >>>>> receivers")?????????? \ >>>>> +????????? range(0, >>>>> 100)???????????????????????????????????????????????????? \ >>>>> + \ >>>>> ??? diagnostic(bool, PrintIntrinsics, >>>>> false,????????????????????????????????? \ >>>>> ??????????? "prints attempted and successful inlining of >>>>> intrinsics")???????? \ >>>>> \ >>>>> diff --git a/src/hotspot/share/opto/doCall.cpp >>>>> b/src/hotspot/share/opto/doCall.cpp >>>>> index 44ab387ac8..dba2b114c6 100644 >>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>> @@ -83,25 +83,27 @@ CallGenerator* >>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>> ??? // See how many times this site has been invoked. >>>>> ??? int site_count = profile.count(); >>>>> -? int receiver_count = -1; >>>>> -? if (call_does_dispatch && UseTypeProfile && >>>>> profile.has_receiver(0)) { >>>>> -??? // Receivers in the profile structure are ordered by call counts >>>>> -??? // so that the most called (major) receiver is >>>>> profile.receiver(0). >>>>> -??? receiver_count = profile.receiver_count(0); >>>>> -? } >>>>> ??? CompileLog* log = this->log(); >>>>> ??? if (log != NULL) { >>>>> -??? int rid = (receiver_count >= 0)? >>>>> log->identify(profile.receiver(0)): -1; >>>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>>>> log->identify(profile.receiver(1)):-1; >>>>> +??? int* rids; >>>>> +??? if (call_does_dispatch) { >>>>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>> +????? for (int i = 0; i < TypeProfileWidth && >>>>> profile.has_receiver(i); i++) { >>>>> +??????? rids[i] = log->identify(profile.receiver(i)); >>>>> +????? } >>>>> +??? } >>>>> ????? 
log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>>> ????????????????????? log->identify(callee), site_count, prof_factor); >>>>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>>>> ????? if (allow_inline)???? log->print(" inline='1'"); >>>>> -??? if (receiver_count >= 0) { >>>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, >>>>> receiver_count); >>>>> -????? if (profile.has_receiver(1)) { >>>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, >>>>> profile.receiver_count(1)); >>>>> +??? if (call_does_dispatch) { >>>>> +????? log->print(" virtual='1'"); >>>>> +????? for (int i = 0; i < TypeProfileWidth && >>>>> profile.has_receiver(i); i++) { >>>>> +??????? if (i == 0) { >>>>> +????????? log->print(" receiver='%d' receiver_count='%d' >>>>> receiver_prob='%f'", rids[i], profile.receiver_count(i), >>>>> profile.receiver_prob(i)); >>>>> +??????? } else { >>>>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' >>>>> receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>>>> +??????? } >>>>> ??????? } >>>>> ????? } >>>>> ????? if (callee->is_method_handle_intrinsic()) { >>>>> @@ -205,92 +207,112 @@ CallGenerator* >>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>>> ??????? // The major receiver's count >= >>>>> TypeProfileMajorReceiverPercent of site_count. >>>>> ??????? bool have_major_receiver = profile.has_receiver(0) && >>>>> (100.*profile.receiver_prob(0) >= >>>>> (float)TypeProfileMajorReceiverPercent); >>>>> -????? ciMethod* receiver_method = NULL; >>>>> ??????? int morphism = profile.morphism(); >>>>> + >>>>> +????? int width = morphism > 0 ? morphism : 1; >>>>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, >>>>> width); >>>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>>>> +????? 
CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, >>>>> width); >>>>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>>>> + >>>>> ??????? if (speculative_receiver_type != NULL) { >>>>> ????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>> Deoptimization::Reason_speculate_class_check)) { >>>>> ??????????? // We have a speculative type, we should be able to >>>>> resolve >>>>> ??????????? // the call. We do that before looking at the profiling at >>>>> -????????? // this invoke because it may lead to bimorphic inlining >>>>> which >>>>> +????????? // this invoke because it may lead to polymorphic >>>>> inlining which >>>>> ??????????? // a speculative type should help us avoid. >>>>> -????????? receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - speculative_receiver_type); >>>>> -????????? if (receiver_method == NULL) { >>>>> +????????? receiver_methods[0] = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> + speculative_receiver_type); >>>>> +????????? if (receiver_methods[0] == NULL) { >>>>> ????????????? speculative_receiver_type = NULL; >>>>> ??????????? } else { >>>>> ????????????? morphism = 1; >>>>> ??????????? } >>>>> ????????? } else { >>>>> ??????????? // speculation failed before. Use profiling at the call >>>>> -????????? // (could allow bimorphic inlining for instance). >>>>> +????????? // (could allow polymorphic inlining for instance). >>>>> ??????????? speculative_receiver_type = NULL; >>>>> ????????? } >>>>> ??????? } >>>>> -????? if (receiver_method == NULL && >>>>> -????????? (have_major_receiver || morphism == 1 || >>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>> -??????? // receiver_method = profile.method(); >>>>> -??????? // Profiles do not suggest methods now.? Look it up in the >>>>> major receiver. >>>>> -??????? receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - profile.receiver(0)); >>>>> -????? } >>>>> -????? 
if (receiver_method != NULL) { >>>>> -??????? // The single majority receiver sufficiently outweighs the >>>>> minority. >>>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>>> -????????????? vtable_index, !call_does_dispatch, jvms, >>>>> allow_inline, prof_factor); >>>>> -??????? if (hit_cg != NULL) { >>>>> -????????? // Look up second receiver. >>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>> -??????????? next_receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - profile.receiver(1)); >>>>> -??????????? if (next_receiver_method != NULL) { >>>>> -????????????? next_hit_cg = >>>>> this->call_generator(next_receiver_method, >>>>> -????????????????????????????????? vtable_index, >>>>> !call_does_dispatch, jvms, >>>>> -????????????????????????????????? allow_inline, prof_factor); >>>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>>> -????????????????? // Skip if we can't inline second receiver's method >>>>> -????????????????? next_hit_cg = NULL; >>>>> -????????????? } >>>>> -??????????? } >>>>> -????????? } >>>>> -????????? CallGenerator* miss_cg; >>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>> -?????????????????????????????????????????????? ? >>>>> Deoptimization::Reason_bimorphic >>>>> -?????????????????????????????????????????????? : >>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>> NULL)); >>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != >>>>> NULL)) && >>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>> -???????????? ) { >>>>> -??????????? // Generate uncommon trap for class check failure path >>>>> -??????????? // in case of monomorphic or bimorphic virtual call site. 
>>>>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, >>>>> reason, >>>>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>>>> +????? bool removed_cgs = false; >>>>> +????? // Look up receivers. >>>>> +????? for (int i = 0; i < morphism; i++) { >>>>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && >>>>> !UsePolymorphicInlining)) { >>>>> +????????? break; >>>>> +??????? } >>>>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>>>> +????????? receiver_methods[i] = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> + profile.receiver(i)); >>>>> +??????? } >>>>> +??????? if (receiver_methods[i] != NULL) { >>>>> +????????? bool allow_inline; >>>>> +????????? if (speculative_receiver_type != NULL) { >>>>> +??????????? allow_inline = true; >>>>> ??????????? } else { >>>>> -??????????? // Generate virtual call for class check failure path >>>>> -??????????? // in case of polymorphic virtual call site. >>>>> -??????????? miss_cg = CallGenerator::for_virtual_call(callee, >>>>> vtable_index); >>>>> +??????????? allow_inline = 100.*profile.receiver_prob(i) >= >>>>> (float)TypeProfileMinimumReceiverPercent; >>>>> ??????????? } >>>>> -????????? if (miss_cg != NULL) { >>>>> -??????????? if (next_hit_cg != NULL) { >>>>> -????????????? assert(speculative_receiver_type == NULL, "shouldn't >>>>> end up here if we used speculation"); >>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() >>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1), >>>>> site_count, profile.receiver_count(1)); >>>>> -????????????? // We don't need to record dependency on a receiver >>>>> here and below. >>>>> -????????????? // Whenever we inline, the dependency is added by >>>>> Parse::Parse(). >>>>> -????????????? miss_cg = >>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, >>>>> next_hit_cg, PROB_MAX); >>>>> -??????????? } >>>>> -??????????? 
if (miss_cg != NULL) { >>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>> speculative_receiver_type : profile.receiver(0); >>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() >>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count); >>>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? >>>>> 1.0 : profile.receiver_prob(0); >>>>> -????????????? CallGenerator* cg = >>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>> -????????????? if (cg != NULL)? return cg; >>>>> +????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>>> +??????????????????????????????? vtable_index, !call_does_dispatch, >>>>> jvms, >>>>> +??????????????????????????????? allow_inline, prof_factor); >>>>> +????????? if (hit_cgs[i] != NULL) { >>>>> +??????????? if (speculative_receiver_type != NULL) { >>>>> +????????????? // Do nothing if it's a speculative type >>>>> +??????????? } else if (bytecode == Bytecodes::_invokeinterface) { >>>>> +????????????? // Do nothing if it's an interface, multiple >>>>> direct-calls are faster than one indirect-call >>>>> +??????????? } else if (!have_major_receiver) { >>>>> +????????????? // Do nothing if there is no major receiver >>>>> +??????????? } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) >>>>> || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) { >>>>> +????????????? // Do nothing if the user allows non-inlined >>>>> polymorphic calls >>>>> +??????????? } else if (!hit_cgs[i]->is_inline()) { >>>>> +????????????? // Skip if we can't inline receiver's method >>>>> +????????????? hit_cgs[i] = NULL; >>>>> +????????????? removed_cgs = true; >>>>> ????????????? } >>>>> ??????????? } >>>>> ????????? } >>>>> ??????? } >>>>> + >>>>> +????? // Generate the fallback path >>>>> +????? Deoptimization::DeoptReason reason = (morphism != 1 >>>>> +??????????????????????????????????????????? ? 
>>>>> Deoptimization::Reason_polymorphic >>>>> +??????????????????????????????????????????? : >>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>> NULL)); >>>>> +????? bool disable_trap = (profile.is_megamorphic() || removed_cgs >>>>> || too_many_traps_or_recompiles(caller, bci, reason)); >>>>> +????? if (log != NULL) { >>>>> +??????? log->elem("call_fallback method='%d' count='%d' >>>>> morphism='%d' trap='%d'", >>>>> +????????????????????? log->identify(callee), site_count, morphism, >>>>> disable_trap ? 0 : 1); >>>>> +????? } >>>>> +????? CallGenerator* miss_cg; >>>>> +????? if (!disable_trap) { >>>>> +??????? // Generate uncommon trap for class check failure path >>>>> +??????? // in case of polymorphic virtual call site. >>>>> +??????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>>> +??????????????????? Deoptimization::Action_maybe_recompile); >>>>> +????? } else { >>>>> +??????? // Generate virtual call for class check failure path >>>>> +??????? // in case of megamorphic virtual call site. >>>>> +??????? miss_cg = CallGenerator::for_virtual_call(callee, >>>>> vtable_index); >>>>> +????? } >>>>> + >>>>> +????? // Generate the guards >>>>> +????? CallGenerator* cg = NULL; >>>>> +????? if (speculative_receiver_type != NULL) { >>>>> +??????? if (hit_cgs[0] != NULL) { >>>>> +????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, >>>>> jvms->bci(), receiver_methods[0], speculative_receiver_type, >>>>> site_count, profile.receiver_count(0)); >>>>> +????????? // We don't need to record dependency on a receiver here >>>>> and below. >>>>> +????????? // Whenever we inline, the dependency is added by >>>>> Parse::Parse(). >>>>> +????????? cg = >>>>> CallGenerator::for_predicted_call(speculative_receiver_type, >>>>> miss_cg, hit_cgs[0], PROB_MAX); >>>>> +??????? } >>>>> +????? } else { >>>>> +??????? for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) { >>>>> +????????? 
if (hit_cgs[i] != NULL) { >>>>> +??????????? trace_type_profile(C, jvms->method(), jvms->depth() - >>>>> 1, jvms->bci(), receiver_methods[i], profile.receiver(i), >>>>> site_count, profile.receiver_count(i)); >>>>> +??????????? miss_cg = >>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, >>>>> hit_cgs[i], profile.receiver_prob(i)); >>>>> +????????? } >>>>> +??????? } >>>>> +??????? cg = miss_cg; >>>>> +????? } >>>>> +????? if (cg != NULL)? return cg; >>>>> ????? } >>>>> ????? // If there is only one implementor of this interface then we >>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp >>>>> b/src/hotspot/share/runtime/deoptimization.cpp >>>>> index 11df15e004..2d14b52854 100644 >>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>> @@ -2382,7 +2382,7 @@ const char* >>>>> Deoptimization::_trap_reason_name[] = { >>>>> ??? "class_check", >>>>> ??? "array_check", >>>>> ??? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>> ??? "profile_predicate", >>>>> ??? "unloaded", >>>>> ??? "uninitialized", >>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp >>>>> b/src/hotspot/share/runtime/deoptimization.hpp >>>>> index 1cfff5394e..c1eb998aba 100644 >>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>> ????? Reason_class_check,?????????? // saw unexpected object class >>>>> (@bci) >>>>> ????? Reason_array_check,?????????? // saw unexpected array class >>>>> (aastore @bci) >>>>> ????? Reason_intrinsic,???????????? // saw unexpected operand to >>>>> intrinsic (@bci) >>>>> -??? Reason_bimorphic,???????????? // saw unexpected object class >>>>> in bimorphic inlining (@bci) >>>>> +??? Reason_polymorphic,?????????? 
>>>>> // saw unexpected object class in bimorphic inlining (@bci)
>>>>>  #if INCLUDE_JVMCI
>>>>>    Reason_unreached0             = Reason_null_assert,
>>>>>    Reason_type_checked_inlining  = Reason_intrinsic,
>>>>> -  Reason_optimized_type_check   = Reason_bimorphic,
>>>>> +  Reason_optimized_type_check   = Reason_polymorphic,
>>>>>  #endif
>>>>>    Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>>    declare_constant(Deoptimization::Reason_class_check) \
>>>>>    declare_constant(Deoptimization::Reason_array_check) \
>>>>>    declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> -  declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> +  declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>    declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>    declare_constant(Deoptimization::Reason_unloaded) \
>>>>>    declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with various TypeProfileWidth values. The results are:
>>>>>
>>>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The main thing I observe is that there isn't a linear (or even any apparent) correlation between the number of guards generated (guided by TypeProfileWidth) and the time taken.
>>>>>
>>>>> I am trying to understand why there is a dip for TypeProfileWidth equal to 1 and 8.
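For readers following along, the code shape being benchmarked here — a chain of exact-type guards with a virtual-call fallback, inlining disabled — can be sketched by hand in plain Java. This is an illustrative analogue only: the class names `A1`..`A3` and the `dispatch` helper are made up, and the real transformation is performed by C2 on compiled code, not by source-level type tests.

```java
// Illustrative only: hand-written analogue of the guard-chain shape C2
// emits for a width-2 type profile with PolyGuardDisableInlining, i.e. a
// chain of exact-type guards, each followed by a direct (non-inlined)
// call, ending in a virtual-call fallback.
interface A { int foo(int i); }
class A1 implements A { public int foo(int i) { return i + 1; } }
class A2 implements A { public int foo(int i) { return i + 2; } }
class A3 implements A { public int foo(int i) { return i + 3; } }

public class GuardChainSketch {
    static int dispatch(A recv, int i) {
        // Guard 1: exact-type check, then a direct call.
        if (recv.getClass() == A1.class) return ((A1) recv).foo(i);
        // Guard 2: second profiled receiver.
        if (recv.getClass() == A2.class) return ((A2) recv).foo(i);
        // Fallback: virtual (indirect) call -- or an uncommon trap, when
        // the trap is enabled and the profile was not megamorphic.
        return recv.foo(i);
    }

    public static void main(String[] args) {
        A[] objs = { new A1(), new A2(), new A3() };
        int sum = 0;
        for (int i = 0; i < objs.length; i++) sum += dispatch(objs[i], i);
        System.out.println(sum); // prints 9
    }
}
```

Varying TypeProfileWidth in the sweep above corresponds to varying the number of guards in the chain before the fallback.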
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I did a rerun of the following benchmark with various configurations:
>>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The Hotspot logs (with generated assembly) are available at:
>>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>>
>>>>> The main takeaway from that experiment is that direct calls w/o inlining are faster than indirect calls for icalls but slower for vcalls, and that inlining is always faster than direct calls.
>>>>>
>>>>> (I fully understand this applies mainly to this microbenchmark, and we need to validate on larger benchmarks. I'm working on that next. However, it clearly shows gains on a pathological case.)
>>>>>
>>>>> Next, I want to figure out at how many guards the direct-call regresses compared to indirect-call in the vcall case, and I want to run larger benchmarks. Any particular ones you would like to see running? I am planning on doing SPECjbb2015 first.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> Sorry for the long delay in response, I was at multiple conferences over the past few weeks. I'm back to the office now and fully focused on getting progress on that.
>>>>>
>>>>>> Possible avenues of improvements I can see are:
>>>>>>    - Gather all the types in an unbounded list so we can know which ones are the most frequent.
It is unlikely to help with Java as, in >>>>>>> the general >>>>>>> case, there are only a few types present a call-sites. It could, >>>>>>> however, >>>>>>> be particularly helpful for languages that tend to have many >>>>>>> types at >>>>>>> call-sites, like functional languages, for example. >>>>>> >>>>>> I doubt having unbounded list of receiver types is practical: it's >>>>>> costly to gather, but isn't too useful for compilation. But measuring >>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some >>>>>> numbers. >>>>> >>>>> I agree that it isn't very practical. It can be useful in the case >>>>> where there are >>>>> many types at a call-site, and the first ones end up not being >>>>> frequent enough to >>>>> mandate a guard. This is clearly an edge-case, and I don't think we >>>>> should optimize >>>>> for it. >>>>> >>>>>>> In what we have today, some of the worst-case scenarios are the >>>>>>> following: >>>>>>> ?? - Assuming you have TypeProfileWidth = 2, and at a call-site, >>>>>>> the first and >>>>>>> second types are types A and B, and the other type(s) is(are) not >>>>>>> recorded, >>>>>>> and it increments the `count` value. Even if A and B are used in >>>>>>> the initialization >>>>>>> path (i.e. only a few times) and the other type(s) is(are) used >>>>>>> in the hot >>>>>>> path (i.e. many times), the latter are never considered for >>>>>>> inlining - because >>>>>>> it was never recorded during profiling. >>>>>> >>>>>> Can it be alleviated by (partially) clearing type profile (e.g., >>>>>> periodically free some space by removing elements with lower >>>>>> frequencies >>>>>> and give new types a chance to be profiled? >>>>> >>>>> Doing that reliably relies on the assumption that we know what the >>>>> shape of >>>>> the workload is going to be in future iterations. 
Otherwise, how >>>>> could you >>>>> guarantee that a type that's not currently frequent will not be in >>>>> the future, >>>>> and that the information that you remove now will not be important >>>>> later. This >>>>> is an assumption that, IMO, is worst than missing types which are >>>>> hot later in >>>>> the execution for two reasons: 1. it's no better, and 2. it's a lot >>>>> less intuitive and >>>>> harder to debug/understand than a straightforward "overflow". >>>>> >>>>>>> ?? - Assuming you have TypeProfileWidth = 2, and at a call-site, >>>>>>> you have the >>>>>>> first type A with 49% probability, the second type B with 49% >>>>>>> probability, and >>>>>>> the other types with 2% probability. Even though A and B are the >>>>>>> two hottest >>>>>>> paths, it does not generate guards because none are a major >>>>>>> receiver. >>>>>> >>>>>> Yes. On the other hand, on average it'll cause inlining twice as much >>>>>> code (2 methods vs 1). >>>>> >>>>> It will not necessarily cause twice as much inlining because of >>>>> late-inlining. Like >>>>> you point out later, it will generate a direct-call in case there >>>>> isn't room for more >>>>> inlinable code. >>>>> >>>>>> Also, does it make sense to increase morphism factor even if inlining >>>>>> doesn't happen? >>>>>> >>>>>> ?? if (recv.klass == C1) {? // >>0% >>>>>> ????? ... inlined ... >>>>>> ?? } else if (recv.klass == C2) { // >>0% >>>>>> ????? m2(); // direct call >>>>>> ?? } else { // >0% >>>>>> ????? m(); // virtual call >>>>>> ?? } >>>>>> >>>>>> vs >>>>>> >>>>>> ?? if (recv.klass == C1) {? // >>0% >>>>>> ????? ... inlined ... >>>>>> ?? } else { // >>0% >>>>>> ????? m(); // virtual call >>>>>> ?? } >>>>> >>>>> There is the advantage that modern CPUs are better at predicting >>>>> instruction-branches >>>>> than data-branches. These guards will then allow the CPU to make >>>>> better decisions allowing >>>>> for better superscalar executions, memory prefetching, etc. 
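The profile-overflow worst case described above — initialization-path types claim the profile rows, and the hot-path type only bumps an anonymous counter — can be modeled in a few lines. This is a toy model for illustration, not HotSpot's actual MethodData layout:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a bounded receiver-type profile (NOT HotSpot's MethodData):
// the first `width` distinct types claim rows; every later type is lumped
// into an anonymous overflow count and can never be considered for
// inlining, no matter how hot it becomes.
public class BoundedProfile {
    final int width;
    final Map<String, Integer> rows = new LinkedHashMap<>();
    int overflow;

    BoundedProfile(int width) { this.width = width; }

    void record(String type) {
        Integer n = rows.get(type);
        if (n != null)                rows.put(type, n + 1);
        else if (rows.size() < width) rows.put(type, 1);
        else                          overflow++; // the type itself is lost
    }

    public static void main(String[] args) {
        BoundedProfile p = new BoundedProfile(2); // TypeProfileWidth = 2
        // Initialization path: a few calls on A and B claim both rows.
        for (int i = 0; i < 5; i++) { p.record("A"); p.record("B"); }
        // Hot path: many calls on C are only counted, never attributed.
        for (int i = 0; i < 1000; i++) p.record("C");
        System.out.println(p.rows + " overflow=" + p.overflow); // prints {A=5, B=5} overflow=1000
    }
}
```

The compiler then sees A and B as the only profiled receivers even though C dominates at runtime, which is exactly the scenario where clearing (or not clearing) the profile matters.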
>>>>> >>>>> This, IMO, makes sense for warm calls, especially since the cost is >>>>> a guard + a call, which is >>>>> much lower than a inlined method, but brings benefits over an >>>>> indirect call. >>>>> >>>>>> In other words, how much could we get just by lowering >>>>>> TypeProfileMajorReceiverPercent? >>>>> >>>>> TypeProfileMajorReceiverPercent is only used today when you have a >>>>> megamorphic >>>>> call-site (aka more types than TypeProfileWidth) but still one type >>>>> receiving more than >>>>> N% of the calls. By reducing the value, you would not increase the >>>>> number of guards, >>>>> but the threshold at which you generate the 1st guard in a >>>>> megamorphic case. >>>>> >>>>>>>> ??????? - for N-morphic case what's the negative effect >>>>>>>> (quantitative) of >>>>>>>> the deopt? >>>>>>> We are triggering the uncommon trap in this case iff we observed >>>>>>> a limited >>>>>>> and stable set of types in the early stages of the Tiered >>>>>>> Compilation >>>>>>> pipeline (making us generate N-morphic guards), and we suddenly >>>>>>> observe a >>>>>>> new type. AFAIU, this is precisely what deopt is for. >>>>>> >>>>>> I should have added "... compared to N-polymorhic case". My >>>>>> intuition is >>>>>> the higher morphism factor is the fewer the benefits of deopt >>>>>> (compared >>>>>> to a call) are. It would be very good to validate it with some >>>>>> benchmarks (both micro- and larger ones). >>>>> >>>>> I agree that what you are describing makes sense as well. To reduce >>>>> the cost of deopt >>>>> here, having a TypeProfileMinimumReceiverPercent helps. That is >>>>> because if any type is >>>>> seen less than this specific frequency, then it won't generate a >>>>> guard, leading to an indirect >>>>> call in the fallback case. >>>>> >>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll >>>>>>> share it as soon >>>>>>> as I have something reliably reproducing. >>>>>> >>>>>> Thanks! 
A representative set of microbenchmarks will be very helpful. >>>>> >>>>> It turns out the guard is only generated once, meaning that if we >>>>> ever hit it then we >>>>> generate an indirect call. >>>>> >>>>> We also only generate the trap iff all the guards are hot (inlined) >>>>> or warm (direct call), >>>>> so any of the following case triggers the creation of an indirect >>>>> call over a trap: >>>>> ? - we hit the trap once before >>>>> ? - one or more guards are cold (aka not inlinable even with >>>>> late-inlining) >>>>> >>>>>> It was more about opportunities for future explorations. I don't >>>>>> think >>>>>> we have to act on it right away. >>>>>> >>>>>> As with "deopt vs call", my guess is callee should benefit much more >>>>>> from inlining than the caller it is inlined into (caller sees >>>>>> multiple >>>>>> callee candidates and has to merge the results while each callee >>>>>> observes the full context and can benefit from it). >>>>>> >>>>>> If we can run some sort of static analysis on callee bytecode, >>>>>> what kind >>>>>> of code patterns should we look for to guide inlining decisions? >>>>> >>>>> Any pattern that would benefit from other optimizations (escape >>>>> analysis, >>>>> dead code elimination, constant propagation, etc.) is good, but >>>>> short of >>>>> shadowing statically what all these optimizations do, I can't see >>>>> an easy way >>>>> to do it. >>>>> >>>>> That is where late-inlining, or more advanced dynamic heuristics >>>>> like the one you >>>>> can find in Graal EE, is worthwhile. >>>>> >>>>>> Regaring experiments to try first, here are some ideas I find >>>>>> promising: >>>>>> >>>>>> ???? * measure the cost of additional profiling >>>>>> ???????? -XX:TypeProfileWidth=N without changing compilers >>>>> >>>>> I am running the following jmh microbenchmark >>>>> >>>>> ???? public final static int N = 100_000_000; >>>>> >>>>> ???? @State(Scope.Benchmark) >>>>> ???? 
>>>>>      public static class TypeProfileWidthOverheadBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              for (int i = 0; i < objs.length; ++i) {
>>>>>                  switch (i % 8) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2: objs[i] = new A3(); break;
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  case 4: objs[i] = new A5(); break;
>>>>>                  case 5: objs[i] = new A6(); break;
>>>>>                  case 6: objs[i] = new A7(); break;
>>>>>                  case 7: objs[i] = new A8(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
>>>>>
>>>>> And I am running with the following JVM parameters:
>>>>>
>>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>>
>>>>> I observe no statistically representative difference in ops/s between TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no significant difference in the resulting analysis using Intel VTune.
>>>>>
>>>>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>>>>
>>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>>
>>>>> I have done the following microbenchmark, but I am not sure that it's going to fully answer the question you are raising here.
>>>>>
>>>>>      public final static int N = 100_000_000;
>>>>>
>>>>>      @State(Scope.Benchmark)
>>>>>      public static class PolymorphicDeoptBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              int cutoff1 = (int)(objs.length * .90);
>>>>>              int cutoff2 = (int)(objs.length * .95);
>>>>>              for (int i = 0; i < cutoff1; ++i) {
>>>>>                  switch (i % 2) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff1; i < cutoff2; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A3(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff2; i < objs.length; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0:
>>>>>                  case 1: objs[i] = new A3(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
>>>>>
>>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or -XX:-PolyGuardDisableTrap, which force-enable/disable the trap in the fallback.
>>>>>
>>>>> For that kind of case, a visitor pattern is what I expect to most largely profit/suffer from a deopt or virtual-call in the fallback path. Would you know of a benchmark that heavily relies on this pattern, and that I could readily reuse?
>>>>>
>>>>>>     * inlining vs devirtualization
>>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>>       - measure separately the effects of devirtualization and inlining
>>>>>
>>>>> For that one, I reused the first microbenchmark I mentioned above, and added a PolyGuardDisableInlining flag that controls whether we create a direct-call or inline.
>>>>>
>>>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined) vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>>>>
>>>>> This benchmark hasn't been run in the best possible conditions (on my dev machine, in WSL), but it gives a strong indication that even a direct call has a non-negligible impact, and that inlining leads to better results (again, in this microbenchmark).
>>>>>
>>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything that would be readily available from the Interpreter. Would you have any pointer to a pre-existing feature that required this specific kind of plumbing? I would otherwise find myself in need of making CompilerDirectives available from the Interpreter, and that is something outside of my current expertise (always happy to learn, but I will need some pointers!).
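On the per-method plumbing question: the closest existing mechanism is the compiler directives file (`-XX:CompilerDirectivesFile=<file>`), which is consumed by the JIT compilers rather than the interpreter — which is exactly the limitation raised above, since TypeProfileWidth governs profiling done before C2 runs. As a sketch of what per-method control looks like today (the match pattern and method names below are illustrative, not from the patch), a directives file can already steer inlining decisions:

```
[
  {
    // Illustrative match pattern -- not a real method from the patch.
    match: "org/sample/TypeProfileWidthOverheadBenchmark::run",
    c2: {
      // Per-method inlining control that exists today:
      // force-inline A1::foo, forbid inlining A8::foo.
      inline: [ "+org/sample/A1::foo", "-org/sample/A8::foo" ]
    }
  }
]
```

A per-method TypeProfileWidth would need the interpreter (and tier-1 profiling code) to consult directives at profile-allocation time, which is the plumbing the message above says is missing.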
>>>>> >>>>> Thank you, >>>>> >>>>> -- >>>>> Ludovic >>>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Thursday, February 20, 2020 9:00 AM >>>>> To: Ludovic Henry ; John Rose >>>>> ; hotspot-compiler-dev at openjdk.java.net >>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>> >>>>> Hi Ludovic, >>>>> >>>>> [...] >>>>> >>>>>> Thanks for this explanation, it makes it a lot clearer what the >>>>>> cases and >>>>>> your concerns are. To rephrase in my own words, what you are >>>>>> interested in >>>>>> is not this change in particular, but more the possibility that >>>>>> this change >>>>>> provides and how to take it the next step, correct? >>>>> >>>>> Yes, it's a good summary. >>>>> >>>>> [...] >>>>> >>>>>>> ??????? - affects profiling strategy: majority of receivers vs >>>>>>> complete >>>>>>> list of receiver types observed; >>>>>> Today, we only use the N first receivers when the number of types >>>>>> does >>>>>> not exceed TypeProfileWidth; otherwise, we use none of them. >>>>>> Possible avenues of improvements I can see are: >>>>>> ??? - Gather all the types in an unbounded list so we can know >>>>>> which ones >>>>>> are the most frequent. It is unlikely to help with Java as, in the >>>>>> general >>>>>> case, there are only a few types present a call-sites. It could, >>>>>> however, >>>>>> be particularly helpful for languages that tend to have many types at >>>>>> call-sites, like functional languages, for example. >>>>> >>>>> I doubt having unbounded list of receiver types is practical: it's >>>>> costly to gather, but isn't too useful for compilation. But measuring >>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some >>>>> numbers. >>>>> >>>>>> ?? - Use the existing types to generate guards for these types we >>>>>> know are >>>>>> common enough. Then use the types which are hot or warm, even in >>>>>> case of a >>>>>> megamorphic call-site. 
It would be a simple iteration of what we have nowadays.
>>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the
>>>>>> following:
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>>>> first and second types are types A and B, and the other type(s) is(are)
>>>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>>>> used in the initialization path (i.e. only a few times) and the other
>>>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>>>> never considered for inlining - because they were never recorded during
>>>>>> profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>>> periodically free some space by removing elements with lower frequencies
>>>>> and give new types a chance to be profiled)?
>>>>>
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have
>>>>>> the first type A with 49% probability, the second type B with 49%
>>>>>> probability, and the other types with 2% probability. Even though A and
>>>>>> B are the two hottest paths, it does not generate guards because neither
>>>>>> is a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>>
>>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else if (recv.klass == C2) { // >>0%
>>>>>        m2(); // direct call
>>>>>     } else { // >0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> vs
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else { // >>0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
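[Editor's note: the 49%/49%/2% scenario discussed above can be reproduced with a small, self-contained Java sketch. All names below (Op, OpA, OpB, OpC, MajorReceiverDemo) are invented for illustration; with the default TypeProfileMajorReceiverPercent of 90, the call site in `run` has no major receiver, so neither of the two hottest types gets a guard even though together they cover 98% of the calls.]

```java
// Hypothetical workload: one call site whose receiver is OpA ~49% of the
// time, OpB ~49%, and OpC ~2%. No single type is a "major receiver".
interface Op { int apply(int x); }

class OpA implements Op { public int apply(int x) { return x + 1; } }
class OpB implements Op { public int apply(int x) { return x + 2; } }
class OpC implements Op { public int apply(int x) { return x + 3; } }

class MajorReceiverDemo {
    // The interesting call site: ops[i % ops.length].apply(i).
    static long run(Op[] ops, int iters) {
        long acc = 0;
        for (int i = 0; i < iters; i++) {
            acc += ops[i % ops.length].apply(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        // 49 OpA's, 49 OpB's and 2 OpC's -> a 49%/49%/2% receiver split.
        Op[] ops = new Op[100];
        for (int i = 0; i < 100; i++) {
            ops[i] = (i < 49) ? new OpA() : (i < 98) ? new OpB() : new OpC();
        }
        System.out.println(run(ops, 100_000));
    }
}
```

Running this under a profiler or with -XX:+PrintInlining is one way to observe whether the call site ends up virtual or guarded.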
>>>>> >>>>> And it relates to "virtual/interface call" vs "type guard + direct >>>>> call" >>>>> code shapes comparison: how much does devirtualization help? >>>>> >>>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both >>>>> cases are inlined. >>>>> >>>>>>> ??????? - for N-morphic case what's the negative effect >>>>>>> (quantitative) of >>>>>>> the deopt? >>>>>> We are triggering the uncommon trap in this case iff we observed a >>>>>> limited >>>>>> and stable set of types in the early stages of the Tiered Compilation >>>>>> pipeline (making us generate N-morphic guards), and we suddenly >>>>>> observe a >>>>>> new type. AFAIU, this is precisely what deopt is for. >>>>> >>>>> I should have added "... compared to N-polymorhic case". My >>>>> intuition is >>>>> the higher morphism factor is the fewer the benefits of deopt >>>>> (compared >>>>> to a call) are. It would be very good to validate it with some >>>>> benchmarks (both micro- and larger ones). >>>>> >>>>>> I'm writing a JMH benchmark to stress that specific case. I'll >>>>>> share it as soon >>>>>> as I have something reliably reproducing. >>>>> >>>>> Thanks! A representative set of microbenchmarks will be very helpful. >>>>> >>>>>>> ???? * invokevirtual vs invokeinterface call sites >>>>>>> ??????? - different cost models; >>>>>>> ??????? - interfaces are harder to optimize, but opportunities for >>>>>>> strength-reduction from interface to virtual calls exist; >>>>>> ? From the profiling information and the inlining mechanism point >>>>>> of view, >>>>>> that it is an invokevirtual or an invokeinterface doesn't change >>>>>> anything >>>>>> >>>>>> Are you saying that we have more to gain from generating a guard for >>>>>> invokeinterface over invokevirtual because the fall-back of the >>>>>> invokeinterface is much more expensive? >>>>> >>>>> Yes, that's the question: if we see an improvement, how much does >>>>> devirtualization contribute to that? 
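[Editor's note: to make the "virtual/interface call" vs "type guard + direct call" comparison concrete, here is a hand-written Java sketch of the two code shapes. All names are invented, and a real measurement would use a JMH harness; this is a correctness-only illustration. `dispatchVirtual` always goes through the interface call, while `dispatchGuarded` mimics the 1-polymorphic shape: a type guard with the callee "inlined" by hand, and the interface call kept on the fallback path.]

```java
interface ShapeIfc { long area(); }

final class Square implements ShapeIfc {
    final long side;
    Square(long side) { this.side = side; }
    public long area() { return side * side; }
}

final class Circle implements ShapeIfc {
    final long r;
    Circle(long r) { this.r = r; }
    public long area() { return 3 * r * r; } // integer approximation of pi*r^2
}

class DispatchShapes {
    // Megamorphic fallback shape: a plain interface call (invokeinterface).
    static long dispatchVirtual(ShapeIfc s) {
        return s.area();
    }

    // 1-polymorphic shape: type guard + hand-"inlined" body for the major
    // receiver, interface call on the fallback path for minority receivers.
    static long dispatchGuarded(ShapeIfc s) {
        if (s.getClass() == Square.class) {
            Square sq = (Square) s;
            return sq.side * sq.side; // "inlined" Square.area()
        }
        return s.area(); // fallback: virtual call
    }

    public static void main(String[] args) {
        ShapeIfc[] shapes = { new Square(3), new Circle(2), new Square(5) };
        long total = 0;
        for (ShapeIfc s : shapes) total += dispatchGuarded(s);
        System.out.println(total); // same total as dispatchVirtual would give
    }
}
```

Both dispatch routines must agree on every receiver; the question the thread raises is how much of the speedup of the guarded shape comes from avoiding the itable lookup versus from the inlining it enables.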
>>>>> >>>>> (If we add a type-guarded direct call, but there's no inlining >>>>> happening, inline cache effectively strength-reduce a virtual call >>>>> to a >>>>> direct call.) >>>>> >>>>> Considering current implementation of virtual and interface calls >>>>> (vtables vs itables), the cost model is very different. >>>>> >>>>> For vtable calls, it doesn't look too appealing to introduce large >>>>> inline caches for individual receiver types since a call through a >>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* => >>>>> address). >>>>> >>>>> For itable calls it can be a big win in some situations: itable lookup >>>>> iterates over Klass::_secondary_supers array and it can become quite >>>>> costly. For example, some Scala workloads experience significant >>>>> overheads from megamorphic calls. >>>>> >>>>> If we see an improvement on some benchmark, it would be very useful to >>>>> be able to determine (quantitatively) how much does inlining and >>>>> devirtualization contribute. >>>>> >>>>> FTR ErikO has been experimenting with an alternative vtable/itable >>>>> implementation [4] which brings interface calls close to virtual >>>>> calls. >>>>> So, if it turns out that devirtualization (and not inlining) of >>>>> interface calls is what contributes the most, then speeding up >>>>> megamorphic interface calls becomes a more attractive alternative. >>>>> >>>>>>> ???? * inlining heuristics >>>>>>> ??????? - devirtualization vs inlining >>>>>>> ????????? - how much benefit from expanding a call site >>>>>>> (devirtualize more >>>>>>> cases) without inlining? should differ for virtual & interface cases >>>>>> I'm also writing a JMH benchmark for this case, and I'll share it >>>>>> as soon >>>>>> as I have it reliably reproducing the issue you describe. >>>>> >>>>> Also, I think it's important to have a knob to control it (inline vs >>>>> devirtualize). It'll enable experiments with larger benchmarks. >>>>> >>>>>>> ??????? 
- diminishing returns with increase in number of cases >>>>>>> ??????? - expanding a single call site leads to more code, but >>>>>>> frequencies >>>>>>> stay the same => colder code >>>>>>> ??????? - based on profiling info (types + frequencies), dynamically >>>>>>> choose morphism factor on per-call site basis? >>>>>> That is where I propose to have a lower receiver probability at >>>>>> which we'll >>>>>> stop adding more guards. I am experimenting with a global flag >>>>>> with a default >>>>>> value of 10%. >>>>>>> ??????? - what optimization opportunities to look for? it looks >>>>>>> like in >>>>>>> general callees should benefit more than the caller (due to >>>>>>> merges after >>>>>>> the call site) >>>>>> Could you please expand your concern or provide an example. >>>>> >>>>> It was more about opportunities for future explorations. I don't think >>>>> we have to act on it right away. >>>>> >>>>> As with "deopt vs call", my guess is callee should benefit much more >>>>> from inlining than the caller it is inlined into (caller sees multiple >>>>> callee candidates and has to merge the results while each callee >>>>> observes the full context and can benefit from it). >>>>> >>>>> If we can run some sort of static analysis on callee bytecode, what >>>>> kind >>>>> of code patterns should we look for to guide inlining decisions? >>>>> >>>>> >>>>> ? >> What's your take on it? Any other ideas? >>>>> ? > >>>>> ? > We don't know what we don't know. We need first to improve the >>>>> logging and >>>>> ? > debugging output of uncommon traps for polymorphic call-sites. >>>>> Then, we >>>>> ? > need to gather data about the different cases you talked about. >>>>> ? > >>>>> ? > We also need to have some microbenchmarks to validate some of the >>>>> questions >>>>> ? > you are raising, and verify what level of gains we can expect >>>>> from this >>>>> ? > optimization. Further validation will be needed on larger >>>>> benchmarks and >>>>> ? 
> real-world applications as well, and that's where, I think, we need to
>>>>>  > develop logging and debugging for this feature.
>>>>>
>>>>> Yes, sounds good.
>>>>>
>>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>>
>>>>>     * measure the cost of additional profiling
>>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>       - how much does deopt help compared to a virtual call on the
>>>>> fallback path?
>>>>>
>>>>>     * inlining vs devirtualization
>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>       - measure separately the effects of devirtualization and inlining
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>>
>>>>> [2]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>>
>>>>> [3]
http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>>
>>>>> [4]
>>>>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Ivanov
>>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>>> To: Ludovic Henry ; John Rose
>>>>>> ; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hi Ludovic,
>>>>>>
>>>>>> I fully agree that it's premature to discuss how the default behavior
>>>>>> should be changed since much more data is needed to be able to proceed
>>>>>> with the decision. But considering the ultimate goal is to actually
>>>>>> improve the relevant heuristics (and effectively change the default
>>>>>> behavior), it's the right time to discuss what kind of experiments are
>>>>>> needed to gather enough data for further analysis.
>>>>>>
>>>>>> Though different shapes do look very similar at first, the shape of the
>>>>>> fallback makes a big difference. That's why the monomorphic and
>>>>>> polymorphic cases are distinct: uncommon traps are effectively exits and
>>>>>> can significantly simplify the CFG, while calls can return and have to
>>>>>> be merged back.
>>>>>>
>>>>>> The polymorphic shape is stable (no deopts/recompiles involved), but
>>>>>> doesn't simplify the CFG around the call site.
>>>>>>
>>>>>> The monomorphic shape gives more optimization opportunities, but deopts
>>>>>> are highly undesirable due to associated costs.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>>     if (recv.klass != C) { deopt(); }
>>>>>>     C.m(recv);
>>>>>>
>>>>>>     // recv.klass == C - exact type
>>>>>>     // return value == C.m(recv)
>>>>>>
>>>>>> vs
>>>>>>
>>>>>>     if (recv.klass == C) {
>>>>>>       C.m(recv);
>>>>>>     } else {
>>>>>>       I.m(recv);
>>>>>>     }
>>>>>>
>>>>>>     // recv.klass <: I - subtype
>>>>>>     // return value is a phi merging C.m() & I.m() where I.m() is
>>>>>> completely opaque.
>>>>>>
>>>>>> The monomorphic shape can degenerate into polymorphic (too many
>>>>>> recompiles), but that's a forced move to stabilize the behavior and
>>>>>> avoid a vicious recompilation cycle (which is *very* expensive).
>>>>>> (Another alternative is to leave the deopt as is - set the deopt action
>>>>>> to "none" - but that's usually a much worse decision.)
>>>>>>
>>>>>> And that's the reason why the monomorphic shape requires a unique
>>>>>> receiver type in the profile while the polymorphic shape works with a
>>>>>> major receiver type and probabilities.
>>>>>>
>>>>>> Considering further steps, IMO for experimental purposes a single knob
>>>>>> won't cut it: there are multiple degrees of freedom which may play an
>>>>>> important role in building an accurate performance model. I'm not yet
>>>>>> convinced it's all about inlining, and narrowing the scope of the
>>>>>> discussion specifically to type profile width doesn't help.
>>>>>>
>>>>>> I'd like to see more knobs introduced before we start conducting
>>>>>> extensive experiments. So, let's discuss what other information we can
>>>>>> benefit from.
>>>>>>
>>>>>> I mentioned some possible options in the previous email. I find the
>>>>>> following aspects important for future discussion:
>>>>>>
>>>>>>     * shape of fallback path
>>>>>>
- what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>>>>       - affects profiling strategy: majority of receivers vs complete
>>>>>> list of receiver types observed;
>>>>>>       - for the N-morphic case what's the negative effect (quantitative)
>>>>>> of the deopt?
>>>>>>
>>>>>>     * invokevirtual vs invokeinterface call sites
>>>>>>       - different cost models;
>>>>>>       - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>>
>>>>>>     * inlining heuristics
>>>>>>       - devirtualization vs inlining
>>>>>>         - how much benefit from expanding a call site (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>>       - diminishing returns with increase in number of cases
>>>>>>       - expanding a single call site leads to more code, but frequencies
>>>>>> stay the same => colder code
>>>>>>       - based on profiling info (types + frequencies), dynamically
>>>>>> choose the morphism factor on a per-call site basis?
>>>>>>       - what optimization opportunities to look for? it looks like in
>>>>>> general callees should benefit more than the caller (due to merges after
>>>>>> the call site)
>>>>>>
>>>>>> What's your take on it? Any other ideas?
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>>>> Hello,
>>>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>>>> First, I want to stress that this patch does not change the default.
>>>>>>> It is still bimorphic guarded inlining by default. This patch,
>>>>>>> however, provides you the ability to configure the JVM to go for
>>>>>>> N-morphic guarded inlining, with N being controlled by the
>>>>>>> -XX:TypeProfileWidth configuration knob.
I understand there are >>>>>>> shortcomings with the specifics of this approach so I'll work on >>>>>>> fixing those. However, I would want this discussion to focus on >>>>>>> this *configurable* feature and not on changing the default. The >>>>>>> latter, I think, should be discussed as part of another, more >>>>>>> extended running discussion, since, as you pointed out, it has >>>>>>> far more reaching consequences that are merely improving a >>>>>>> micro-benchmark. >>>>>>> >>>>>>> Now to answer some of your specific questions. >>>>>>> >>>>>>>> >>>>>>>> I haven't looked through the patch in details, but here are some >>>>>>>> thoughts. >>>>>>>> >>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. >>>>>>>> It seems you try to generalize (b) which becomes: >>>>>>>> >>>>>>>> ????? if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>>> ????? } else if (recv.klass == K2) { >>>>>>> m2(...); // either inline or a direct call >>>>>>>> ????? ... >>>>>>>> ????? } else if (recv.klass == Kn) { >>>>>>> mn(...); // either inline or a direct call >>>>>>>> ????? } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>>> ????? } >>>>>>> >>>>>>> The general shape that exist currently in tip is: >>>>>>> >>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>>> if (recv.klass == K1) { >>>>>>> ???? m1(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>>>> UseBimorphicInlining && !is_cold >>>>>>> else if (recv.klass == K2) { >>>>>>> ???? m2(.); // either inline or a direct call >>>>>>> } >>>>>>> else { >>>>>>> ???? // if (!too_many_traps_or_deopt()) >>>>>>> ???? deopt(); // invalidate + reinterpret >>>>>>> ???? // else >>>>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>>> } >>>>>>> There is no particular distinction between Bimorphic, >>>>>>> Polymorphic, and Megamorphic. 
The latter relates more to the >>>>>>> fallback rather than the guards. What this change brings is more >>>>>>> guards for N-morphic call-sites with N > 2. But it doesn't change >>>>>>> why and how these guards are generated (or at least, that is not >>>>>>> the intention). >>>>>>> The general shape that this change proposes is: >>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>>> if (recv.klass == K1) { >>>>>>> ???? m1(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>>>> (UseBimorphicInlining || UsePolymorphicInling) >>>>>>> && !is_cold >>>>>>> else if (recv.klass == K2) { >>>>>>> ???? m2(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && >>>>>>> UsePolymorphicInling && !is_cold >>>>>>> else if (recv.klass == K3) { >>>>>>> ???? m3(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && >>>>>>> UsePolymorphicInling && !is_cold >>>>>>> else if (recv.klass == K4) { >>>>>>> ???? m4(.); // either inline or a direct call >>>>>>> } >>>>>>> else { >>>>>>> ???? // if (!too_many_traps_or_deopt()) >>>>>>> ???? deopt(); // invalidate + reinterpret >>>>>>> ???? // else >>>>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>>> } >>>>>>> You can observe that the condition to create the guards is no >>>>>>> different; only the total number increases based on >>>>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>>>> Question #1: what if you generalize polymorphic shape instead >>>>>>>> and allow multiple major receivers? Deoptimizing (and then >>>>>>>> recompiling) look less beneficial the higher morphism is >>>>>>>> (especially considering the inlining on all paths becomes less >>>>>>>> likely as well). So, having a virtual call (which becomes less >>>>>>>> likely due to lower frequency) on the fallback path may be a >>>>>>>> better option. 
>>>>>>> I agree with this statement in the general sense. However, in >>>>>>> practice, it depends on the specifics of each application. That >>>>>>> is why the degree of polymorphism needs to rely on a >>>>>>> configuration knob, and not pre-determined on a set of >>>>>>> benchmarks. I agree with the proposal to have this knob as a >>>>>>> per-method knob, instead of a global knob. >>>>>>> As for the impact of a higher morphism, I expect deoptimizations >>>>>>> to happen less often as more guards are generated, leading to a >>>>>>> lower probability of reaching the fallback path, leading to less >>>>>>> uncommon trap/deoptimizations. Moreover, the fallback is already >>>>>>> going to be a virtual call in case we hit the uncommon trap too >>>>>>> often (using too_many_traps_or_recompiles). >>>>>>>> Question #2: it would be very interesting to understand what >>>>>>>> exactly contributes the most to performance improvements? Is it >>>>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual >>>>>>>> call)? How much come from optimizing interface calls (itable vs >>>>>>>> vtable stubs)? >>>>>>> Devirtualization in itself (direct vs. indirect call) is not the >>>>>>> *primary* source of the gain. The gain comes from the additional >>>>>>> optimizations that are applied by C2 when increasing the >>>>>>> scope/size of the code compiled via inlining. >>>>>>> In the case of warm code that's not inlined as part of >>>>>>> incremental inlining, the call is a direct call rather than an >>>>>>> indirect call. I haven't measured it, but I expect performance to >>>>>>> be positively impacted because of the better ability of modern >>>>>>> CPUs to correctly predict instruction branches (a direct call) >>>>>>> rather than data branches (an indirect call). 
>>>>>>>> Deciding how to spend inlining budget on multiple targets with >>>>>>>> moderate frequency can be hard, so it makes sense to consider >>>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase >>>>>>>> (during incremental inlining). >>>>>>> Incremental inlining is already integrated with the existing >>>>>>> solution. In the case of a hot or warm call, in case of failure >>>>>>> to inline, it generates a direct call. You still have the guards, >>>>>>> reducing the cost of an indirect call, but without the cost of >>>>>>> the inlined code. >>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>>> I'll come back to you with some results. >>>>>>>> Getting answers to those (and similar) questions should give us >>>>>>>> much more insights what is actually happening in practice. >>>>>>>> >>>>>>>> Speaking of the first deliverables, it would be good to >>>>>>>> introduce a new experimental mode to be able to easily conduct >>>>>>>> such experiments with product binaries and I'd like to see the >>>>>>>> patch evolving in that direction. It'll enable us to gather >>>>>>>> important data to guide our decisions about how to enhance the >>>>>>>> heuristics in the product. >>>>>>> This patch does not change the default shape of the generated >>>>>>> code with bimorphic guarded inlining, because the default value >>>>>>> of TypeProfileWidth is 2. If your concern is that >>>>>>> TypeProfileWidth is used for other purposes and that I should add >>>>>>> a dedicated knob to control the maximum morphism of these guards, >>>>>>> then I agree. I am using TypeProfileWidth because it's the >>>>>>> available and more straightforward knob today. >>>>>>> Overall, this change does not propose to go from bimorphic to >>>>>>> N-morphic by default (with N between 0 and 8). 
This change >>>>>>> focuses on using an existing knob (TypeProfileWidth) to open the >>>>>>> possibility for N-morphic guarded inlining. I would want the >>>>>>> discussion to change the default to be part of a separate RFR, to >>>>>>> separate the feature change discussion from the default change >>>>>>> discussion. >>>>>>>> Such optimizations are usually not unqualified wins because of >>>>>>>> highly "non-linear" or "non-local" effects, where a local change >>>>>>>> in one direction might couple to nearby change in a different >>>>>>>> direction, with a net change that's "wrong", due to side effects >>>>>>>> rolling out from the "good" change. (I'm talking about side >>>>>>>> effects in our IR graph shaping heuristics, not memory side >>>>>>>> effects.) >>>>>>>> >>>>>>>> One out of many such "wrong" changes is a local optimization >>>>>>>> which expands code on a medium-hot path, which has the side >>>>>>>> effect of making a containing block of code larger than >>>>>>>> convenient.? Three ways of being "larger than convenient" are a. >>>>>>>> the object code of some containing loop doesn't fit as well in >>>>>>>> the instruction memory, b. the total IR size tips over some >>>>>>>> budgetary limit which causes further IR creation to be throttled >>>>>>>> (or the whole graph to be thrown away!), or c. some loop gains >>>>>>>> additional branch structure that impedes the optimization of the >>>>>>>> loop, where an out of line call would not. >>>>>>>> >>>>>>>> My overall point here is that an eager expansion of IR that is >>>>>>>> locally "better" (we might even say "optimal") with respect to >>>>>>>> the specific path under consideration hurts the optimization of >>>>>>>> nearby paths which are more important. >>>>>>> I generally agree with this statement and explanation. 
Again, it >>>>>>> is not the intention of this patch to change the default number >>>>>>> of guards for polymorphic call-sites, but it is to give users the >>>>>>> ability to optimize the code generation of their JVM to their >>>>>>> application. >>>>>>> Since I am relying on the existing inlining infrastructure, late >>>>>>> inlining and hot/warm/cold call generators allows to have a >>>>>>> "best-of-both-world" approach: it inlines code in the hot guards, >>>>>>> it direct calls or inline (if inlining thresholds permits) the >>>>>>> method in the warm guards, and it doesn't even generate the guard >>>>>>> in the cold guards. The question here is, then how do you define >>>>>>> hot, warm, and cold. As discussed above, I want to explore using >>>>>>> a low-threshold even to try to generate a guard (at least 10% of >>>>>>> calls are to this specific receiver). >>>>>>> On the overhead of adding more guards, I see this change as >>>>>>> beneficial because it removes an arbitrary limit on what code can >>>>>>> be inlined. For example, if you have a call-site with 3 types, >>>>>>> each with a hit probability of 30%, then with a maximum limit of >>>>>>> 2 types (with bimorphic guarded inlining), only the first 2 types >>>>>>> are guarded and inlined. That is despite an apparent gain in >>>>>>> guarding and inlining against the 3 types. >>>>>>> I agree we want to have guardrails to avoid worst-case >>>>>>> degradations. It is my understanding that the existing inlining >>>>>>> infrastructure (with late inlining, for example) provides many >>>>>>> safeguards already, and it is up to this change not to abuse these. >>>>>>>> (It clearly doesn't work to tell an impacted customer, well, you >>>>>>>> may get a 5% loss, but the micro created to test this thing >>>>>>>> shows a 20% gain, and all the functional tests pass.) >>>>>>>> >>>>>>>> This leads me to the following suggestion:? 
Your code is a very >>>>>>>> good POC, and deserves more work, and the next step in that work >>>>>>>> is probably looking for and thinking about performance >>>>>>>> regressions, and figuring out how to throttle this thing. >>>>>>> Here again, I want that feature to be behind a configuration >>>>>>> knob, and then discuss in a future RFR to change the default. >>>>>>>> A specific next step would be to make the throttling of this >>>>>>>> feature be controllable. MorphismLimit should be a global on its >>>>>>>> own.? And it should be configurable through the CompilerOracle >>>>>>>> per method.? (See similar code for similar throttles.)? And it >>>>>>>> should be more sensitive to the hotness of the overall call and >>>>>>>> of the various slices of the call's profile.? (I notice with >>>>>>>> suspicion that the comment "The single majority receiver >>>>>>>> sufficiently outweighs the minority" is missing in the changed >>>>>>>> code.)? And, if the change is as disruptive to heuristics as I >>>>>>>> suspect it *might* be, the call site itself *might* need some >>>>>>>> kind of dynamic feedback which says, after some deopt or >>>>>>>> reprofiling, "take it easy here, try plan B." That last point is >>>>>>>> just speculation, but I threw it in to show the kinds of >>>>>>>> measures we *sometimes* have to take in avoiding "side effects" >>>>>>>> to our locally pleasant optimizations. >>>>>>> I'll add this per-method knob on the CompilerOracle in the next >>>>>>> iteration of this patch. >>>>>>>> But, let me repeat: I'm glad to see this experiment. And very, >>>>>>>> very glad to see all the cool stuff that is coming out of your >>>>>>>> work-group.? Welcome to the adventure! >>>>>>> For future improvements, I will keep focusing on inlining as I >>>>>>> see it as the door opener to many more optimizations in C2. 
I am >>>>>>> still learning at what can be done to reduce the size of the >>>>>>> inlined code by, for example, applying specific optimizations >>>>>>> that simplify the CG (like dead-code elimination or constant >>>>>>> propagation) before inlining the code. As you said, we are not >>>>>>> short of ideas on *how* to improve it, but we have to be very >>>>>>> wary of *what impact* it'll have on real-world applications. >>>>>>> We're working with internal customers to figure that out, and >>>>>>> we'll share them as soon as we are ready with benchmarks for >>>>>>> those use-case patterns. >>>>>>> What I am working on now is: >>>>>>> ??? - Add a per-method flag through CompilerOracle >>>>>>> ??? - Add a threshold on the probability of a receiver to >>>>>>> generate a guard (I am thinking of 10%, i.e., if a receiver is >>>>>>> observed less than 1 in every 10 calls, then don't generate a >>>>>>> guard and use the fallback) >>>>>>> ??? - Check the overhead of increasing TypeProfileWidth on >>>>>>> profiling speed (in the interpreter and level #3 code) >>>>>>> Thank you, and looking forward to the next review (I expect to >>>>>>> post the next iteration of the patch today or tomorrow). >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Vladimir Ivanov >>>>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>>>> To: Ludovic Henry ; >>>>>>> hotspot-compiler-dev at openjdk.java.net >>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>>>> >>>>>>> Very interesting results, Ludovic! 
>>>>>>>
>>>>>>>> The image can be found at
>>>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>>
>>>>>>> Can you elaborate on the experiment itself, please? In particular,
>>>>>>> what does PERCENTILES actually mean?
>>>>>>>
>>>>>>> I haven't looked through the patch in detail, but here are some
>>>>>>> thoughts.
>>>>>>>
>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>>>> seems you try to generalize (b), which becomes:
>>>>>>>
>>>>>>>       if (recv.klass == K1) {
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else if (recv.klass == K2) {
>>>>>>>          m2(...); // either inline or a direct call
>>>>>>>       ...
>>>>>>>       } else if (recv.klass == Kn) {
>>>>>>>          mn(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          deopt(); // invalidate + reinterpret
>>>>>>>       }
>>>>>>>
>>>>>>> Question #1: what if you generalize the polymorphic shape instead and
>>>>>>> allow multiple major receivers? Deoptimizing (and then recompiling)
>>>>>>> looks less beneficial the higher the morphism is (especially
>>>>>>> considering that inlining on all paths becomes less likely as well).
So, having a virtual call >>>>>>> (which becomes less likely due to lower frequency) on the >>>>>>> fallback path >>>>>>> may be a better option. >>>>>>> >>>>>>> >>>>>>> Question #2: it would be very interesting to understand what exactly >>>>>>> contributes the most to the performance improvements. Is it inlining? Or >>>>>>> maybe devirtualization (avoiding the cost of a virtual call)? How much >>>>>>> comes >>>>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>>>> >>>>>>> Deciding how to spend the inlining budget on multiple targets with >>>>>>> moderate >>>>>>> frequency can be hard, so it makes sense to consider expanding >>>>>>> 3/4/mega-morphic call sites in the post-parse phase (during incremental >>>>>>> inlining). >>>>>>> >>>>>>> >>>>>>> Question #3: how much does TypeProfileWidth affect profiling speed >>>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>>> >>>>>>> >>>>>>> Getting answers to those (and similar) questions should give us much >>>>>>> more insight into what is actually happening in practice. >>>>>>> >>>>>>> Speaking of the first deliverables, it would be good to introduce >>>>>>> a new >>>>>>> experimental mode to be able to easily conduct such experiments with >>>>>>> product binaries, and I'd like to see the patch evolving in that >>>>>>> direction. It'll enable us to gather important data to guide our >>>>>>> decisions about how to enhance the heuristics in the product. >>>>>>> >>>>>>> Best regards, >>>>>>> Vladimir Ivanov >>>>>>> >>>>>>> [1] (a) Monomorphic: >>>>>>> if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>> } >>>>>>> >>>>>>> (b) Bimorphic: >>>>>>> if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else if (recv.klass == K2) { >>>>>>> m2(...); // either inline or a direct call >>>>>>> } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>> } >>>>>>> >>>>>>> (c) Polymorphic: >>>>>>> if (recv.klass == K1) { // major receiver (by default, >90%) >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else { >>>>>>> K.m(); // virtual call >>>>>>> } >>>>>>> >>>>>>> (d) Megamorphic: >>>>>>> K.m(); // virtual (K is either concrete or interface class) >>>>>>>> >>>>>>>> -- >>>>>>>> Ludovic >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: hotspot-compiler-dev >>>>>>>> On Behalf Of >>>>>>>> Ludovic Henry >>>>>>>> Sent: Thursday, February 6, 2020 9:18 AM >>>>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2 >>>>>>>> >>>>>>>> Hello, >>>>>>>> >>>>>>>> In our ongoing search to improve performance, I've looked at >>>>>>>> inlining and, more specifically, at polymorphic guarded >>>>>>>> inlining. Today in HotSpot, the maximum number of guards for >>>>>>>> types at any call site is two - with bimorphic guarded inlining. >>>>>>>> However, Graal and Zing have observed great results with >>>>>>>> increasing that limit. >>>>>>>> >>>>>>>> You'll find below a patch that makes the number of guards >>>>>>>> for types configurable with the `TypeProfileWidth` global. >>>>>>>> >>>>>>>> Testing: >>>>>>>> Passing tier1 on Linux and Windows, plus other large >>>>>>>> applications (through the Adopt testing scripts) >>>>>>>> >>>>>>>> Benchmarking: >>>>>>>> To get data, we run a benchmark against Apache Pinot and observe >>>>>>>> the following results: >>>>>>>> >>>>>>>> [inline image: Apache Pinot benchmark results chart] >>>>>>>> >>>>>>>> We observe close to 20% improvement on this sample benchmark >>>>>>>> with a morphism (=width) of 3 or 4. We are currently validating >>>>>>>> these numbers on a more extensive set of benchmarks and >>>>>>>> platforms, and I'll share them as soon as we have them.
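The kind of call site the patch targets, a single virtual call that sees more than two receiver classes, can be sketched in plain Java (all names below are illustrative only, not taken from the patch or the benchmark):

```java
// A trimorphic virtual call site: shape.area() sees three receiver classes,
// which is beyond the current bimorphic limit and exactly the case that
// TypeProfileWidth > 2 is meant to cover.
interface Shape { int area(); }

class Square implements Shape {
    final int side;
    Square(int side) { this.side = side; }
    public int area() { return side * side; }        // candidate for guard #1
}

class Rect implements Shape {
    final int w, h;
    Rect(int w, int h) { this.w = w; this.h = h; }
    public int area() { return w * h; }              // candidate for guard #2
}

class Tri implements Shape {
    final int base, height;
    Tri(int base, int height) { this.base = base; this.height = height; }
    public int area() { return base * height / 2; }  // beyond the bimorphic limit
}

public class PolyDemo {
    static int totalArea(Shape[] shapes) {
        int sum = 0;
        for (Shape s : shapes) {
            sum += s.area();  // one trimorphic virtual call site
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Rect(2, 3), new Tri(4, 2) };
        System.out.println(totalArea(shapes)); // 4 + 6 + 4 = 14
    }
}
```

With bimorphic inlining only two of the three receivers can get guarded inline paths; a TypeProfileWidth of 3 would allow the third path to be guarded and inlined as well, with the remaining question being whether the fallback path should deoptimize or make a virtual call.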
>>>>>>>> >>>>>>>> I am happy to provide more information, just let me know if you >>>>>>>> have any question. >>>>>>>> >>>>>>>> Thank you, >>>>>>>> >>>>>>>> -- >>>>>>>> Ludovic >>>>>>>> >>>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> index 73854806ed..845070fbe1 100644 >>>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> @@ -38,7 +38,7 @@ private: >>>>>>>> ?????? friend class ciMethod; >>>>>>>> ?????? friend class ciMethodHandle; >>>>>>>> >>>>>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we >>>>>>>> care about >>>>>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we >>>>>>>> care about >>>>>>>> ?????? int? _limit;??????????????? // number of receivers have >>>>>>>> been determined >>>>>>>> ?????? int? _morphism;???????????? // determined call site's >>>>>>>> morphism >>>>>>>> ?????? int? _count;??????????????? // # times has this call been >>>>>>>> executed >>>>>>>> @@ -47,6 +47,7 @@ private: >>>>>>>> ?????? ciKlass*? _receiver[MorphismLimit + 1];? // receivers >>>>>>>> (exact) >>>>>>>> >>>>>>>> ?????? ciCallProfile() { >>>>>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>>>>> can't be smaller than TypeProfileWidth"); >>>>>>>> ???????? _limit = 0; >>>>>>>> ???????? _morphism??? = 0; >>>>>>>> ???????? _count = -1; >>>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> @@ -496,9 +496,7 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ?????????? // Every profiled call site has a counter. >>>>>>>> ?????????? int count = >>>>>>>> check_overflow(data->as_CounterData()->count(), >>>>>>>> java_code_at_bci(bci)); >>>>>>>> >>>>>>>> -????? 
if (!data->is_ReceiverTypeData()) { >>>>>>>> -??????? result._receiver_count[0] = 0;? // that's a definite zero >>>>>>>> -????? } else { // ReceiverTypeData is a subclass of CounterData >>>>>>>> +????? if (data->is_ReceiverTypeData()) { >>>>>>>> ???????????? ciReceiverTypeData* call = >>>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData(); >>>>>>>> ???????????? // In addition, virtual call sites have receiver >>>>>>>> type information >>>>>>>> ???????????? int receivers_count_total = 0; >>>>>>>> @@ -515,7 +513,7 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ?????????????? // is recorded or an associated counter is >>>>>>>> incremented, but not both. With >>>>>>>> ?????????????? // tiered compilation, however, both can happen >>>>>>>> due to the interpreter and >>>>>>>> ?????????????? // C1 profiling invocations differently. Address >>>>>>>> that inconsistency here. >>>>>>>> -????????? if (morphism == 1 && count > 0) { >>>>>>>> +????????? if (morphism >= 1 && count > 0) { >>>>>>>> ???????????????? epsilon = count; >>>>>>>> ???????????????? count = 0; >>>>>>>> ?????????????? } >>>>>>>> @@ -531,25 +529,26 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ????????????? // If we extend profiling to record methods, >>>>>>>> ?????????????? // we will set result._method also. >>>>>>>> ???????????? } >>>>>>>> +??????? result._morphism = morphism; >>>>>>>> ???????????? // Determine call site's morphism. >>>>>>>> ???????????? // The call site count is 0 with known morphism >>>>>>>> (only 1 or 2 receivers) >>>>>>>> ???????????? // or < 0 in the case of a type check failure for >>>>>>>> checkcast, aastore, instanceof. >>>>>>>> ???????????? // The call site count is > 0 in the case of a >>>>>>>> polymorphic virtual call. >>>>>>>> -??????? if (morphism > 0 && morphism == result._limit) { >>>>>>>> -?????????? // The morphism <= MorphismLimit. >>>>>>>> -?????????? if ((morphism >>>>>>> -?????????????? 
(morphism == ciCallProfile::MorphismLimit && >>>>>>>> count == 0)) { >>>>>>>> +??????? assert(result._morphism == result._limit, ""); >>>>>>>> #ifdef ASSERT >>>>>>>> +??????? if (result._morphism > 0) { >>>>>>>> +?????????? // The morphism <= TypeProfileWidth. >>>>>>>> +?????????? if ((result._morphism >>>>>>> +?????????????? (result._morphism == TypeProfileWidth && count >>>>>>>> == 0)) { >>>>>>>> ????????????????? if (count > 0) { >>>>>>>> ??????????????????? this->print_short_name(tty); >>>>>>>> ??????????????????? tty->print_cr(" @ bci:%d", bci); >>>>>>>> ??????????????????? this->print_codes(); >>>>>>>> ??????????????????? assert(false, "this call site should not be >>>>>>>> polymorphic"); >>>>>>>> ????????????????? } >>>>>>>> -#endif >>>>>>>> -???????????? result._morphism = morphism; >>>>>>>> ??????????????? } >>>>>>>> ???????????? } >>>>>>>> +#endif >>>>>>>> ???????????? // Make the count consistent if this is a call >>>>>>>> profile. If count is >>>>>>>> ???????????? // zero or less, presume that this is a typecheck >>>>>>>> profile and >>>>>>>> ???????????? // do nothing.? Otherwise, increase count to be the >>>>>>>> sum of all >>>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* >>>>>>>> receiver, int receiver_count) { >>>>>>>> ?????? } >>>>>>>> ?????? _receiver[i] = receiver; >>>>>>>> ?????? _receiver_count[i] = receiver_count; >>>>>>>> -? if (_limit < MorphismLimit) _limit++; >>>>>>>> +? if (_limit < TypeProfileWidth) _limit++; >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> b/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> index d605bdb7bd..7a8dee43e5 100644 >>>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> @@ -389,9 +389,16 @@ >>>>>>>> ?????? product(bool, UseBimorphicInlining, >>>>>>>> true,???????????????????????????????? \ >>>>>>>> ?????????????? 
"Profiling based inlining for two >>>>>>>> receivers")???????????????????? \ >>>>>>>> \ >>>>>>>> +? product(bool, UsePolymorphicInlining, >>>>>>>> true,?????????????????????????????? \ >>>>>>>> +????????? "Profiling based inlining for two or more >>>>>>>> receivers")???????????? \ >>>>>>>> + \ >>>>>>>> ?????? product(bool, UseOnlyInlinedBimorphic, >>>>>>>> true,????????????????????????????? \ >>>>>>>> ?????????????? "Don't use BimorphicInlining if can't inline a >>>>>>>> second method")??? \ >>>>>>>> \ >>>>>>>> +? product(bool, UseOnlyInlinedPolymorphic, >>>>>>>> true,??????????????????????????? \ >>>>>>>> +????????? "Don't use PolymorphicInlining if can't inline a >>>>>>>> non-major "????? \ >>>>>>>> +????????? "receiver's >>>>>>>> method")????????????????????????????????????????????? \ >>>>>>>> + \ >>>>>>>> ?????? product(bool, InsertMemBarAfterArraycopy, >>>>>>>> true,?????????????????????????? \ >>>>>>>> ?????????????? "Insert memory barrier after arraycopy >>>>>>>> call")???????????????????? \ >>>>>>>> \ >>>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp >>>>>>>> b/src/hotspot/share/opto/doCall.cpp >>>>>>>> index 44ab387ac8..6f940209ce 100644 >>>>>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>>>>> @@ -83,25 +83,23 @@ CallGenerator* >>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>> >>>>>>>> ?????? // See how many times this site has been invoked. >>>>>>>> ?????? int site_count = profile.count(); >>>>>>>> -? int receiver_count = -1; >>>>>>>> -? if (call_does_dispatch && UseTypeProfile && >>>>>>>> profile.has_receiver(0)) { >>>>>>>> -??? // Receivers in the profile structure are ordered by call >>>>>>>> counts >>>>>>>> -??? // so that the most called (major) receiver is >>>>>>>> profile.receiver(0). >>>>>>>> -??? receiver_count = profile.receiver_count(0); >>>>>>>> -? } >>>>>>>> >>>>>>>> ?????? CompileLog* log = this->log(); >>>>>>>> ?????? if (log != NULL) { >>>>>>>> -??? 
int rid = (receiver_count >= 0)? >>>>>>>> log->identify(profile.receiver(0)): -1; >>>>>>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>>>>>>> log->identify(profile.receiver(1)):-1; >>>>>>>> +??? ResourceMark rm; >>>>>>>> +??? int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>>>>> +??? for (int i = 0; i < TypeProfileWidth && >>>>>>>> profile.has_receiver(i); i++) { >>>>>>>> +????? rids[i] = log->identify(profile.receiver(i)); >>>>>>>> +??? } >>>>>>>> ???????? log->begin_elem("call method='%d' count='%d' >>>>>>>> prof_factor='%f'", >>>>>>>> ???????????????????????? log->identify(callee), site_count, >>>>>>>> prof_factor); >>>>>>>> ???????? if (call_does_dispatch)? log->print(" virtual='1'"); >>>>>>>> ???????? if (allow_inline)???? log->print(" inline='1'"); >>>>>>>> -??? if (receiver_count >= 0) { >>>>>>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, >>>>>>>> receiver_count); >>>>>>>> -?????? if (profile.has_receiver(1)) { >>>>>>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", >>>>>>>> r2id, profile.receiver_count(1)); >>>>>>>> +??? for (int i = 0; i < TypeProfileWidth && >>>>>>>> profile.has_receiver(i); i++) { >>>>>>>> +????? if (i == 0) { >>>>>>>> +??????? log->print(" receiver='%d' receiver_count='%d'", >>>>>>>> rids[i], profile.receiver_count(i)); >>>>>>>> +????? } else { >>>>>>>> +??????? log->print(" receiver%d='%d' receiver%d_count='%d'", i >>>>>>>> + 1, rids[i], i + 1, profile.receiver_count(i)); >>>>>>>> ?????????? } >>>>>>>> ???????? } >>>>>>>> ???????? if (callee->is_method_handle_intrinsic()) { >>>>>>>> @@ -205,90 +203,96 @@ CallGenerator* >>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>> ???????? if (call_does_dispatch && site_count > 0 && >>>>>>>> UseTypeProfile) { >>>>>>>> ?????????? // The major receiver's count >= >>>>>>>> TypeProfileMajorReceiverPercent of site_count. >>>>>>>> ?????????? 
bool have_major_receiver = profile.has_receiver(0) && >>>>>>>> (100.*profile.receiver_prob(0) >= >>>>>>>> (float)TypeProfileMajorReceiverPercent); >>>>>>>> -????? ciMethod* receiver_method = NULL; >>>>>>>> >>>>>>>> ?????????? int morphism = profile.morphism(); >>>>>>>> + >>>>>>>> +????? ciMethod** receiver_methods = >>>>>>>> NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism)); >>>>>>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, >>>>>>>> morphism)); >>>>>>>> + >>>>>>>> ?????????? if (speculative_receiver_type != NULL) { >>>>>>>> ???????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>>>>> Deoptimization::Reason_speculate_class_check)) { >>>>>>>> ?????????????? // We have a speculative type, we should be able >>>>>>>> to resolve >>>>>>>> ?????????????? // the call. We do that before looking at the >>>>>>>> profiling at >>>>>>>> -????????? // this invoke because it may lead to bimorphic >>>>>>>> inlining which >>>>>>>> +????????? // this invoke because it may lead to polymorphic >>>>>>>> inlining which >>>>>>>> ?????????????? // a speculative type should help us avoid. >>>>>>>> -????????? receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - speculative_receiver_type); >>>>>>>> -????????? if (receiver_method == NULL) { >>>>>>>> +????????? receiver_methods[0] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + speculative_receiver_type); >>>>>>>> +????????? if (receiver_methods[0] == NULL) { >>>>>>>> ???????????????? speculative_receiver_type = NULL; >>>>>>>> ?????????????? } else { >>>>>>>> ???????????????? morphism = 1; >>>>>>>> ?????????????? } >>>>>>>> ???????????? } else { >>>>>>>> ?????????????? // speculation failed before. Use profiling at >>>>>>>> the call >>>>>>>> -????????? // (could allow bimorphic inlining for instance). >>>>>>>> +????????? // (could allow polymorphic inlining for instance). >>>>>>>> ?????????????? speculative_receiver_type = NULL; >>>>>>>> ???????????? 
} >>>>>>>> ?????????? } >>>>>>>> -????? if (receiver_method == NULL && >>>>>>>> +????? if (receiver_methods[0] == NULL && >>>>>>>> ?????????????? (have_major_receiver || morphism == 1 || >>>>>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>>>>> -??????? // receiver_method = profile.method(); >>>>>>>> +?????????? (morphism == 2 && UseBimorphicInlining) || >>>>>>>> +?????????? (morphism >= 2 && UsePolymorphicInlining))) { >>>>>>>> +??????? assert(profile.has_receiver(0), "no receiver at 0"); >>>>>>>> +??????? // receiver_methods[0] = profile.method(); >>>>>>>> ???????????? // Profiles do not suggest methods now.? Look it up >>>>>>>> in the major receiver. >>>>>>>> -??????? receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - profile.receiver(0)); >>>>>>>> +??????? receiver_methods[0] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + profile.receiver(0)); >>>>>>>> ?????????? } >>>>>>>> -????? if (receiver_method != NULL) { >>>>>>>> -??????? // The single majority receiver sufficiently outweighs >>>>>>>> the minority. >>>>>>>> -??????? CallGenerator* hit_cg = >>>>>>>> this->call_generator(receiver_method, >>>>>>>> -????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>>> allow_inline, prof_factor); >>>>>>>> -??????? if (hit_cg != NULL) { >>>>>>>> -????????? // Look up second receiver. >>>>>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>>>>> -??????????? next_receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - profile.receiver(1)); >>>>>>>> -??????????? if (next_receiver_method != NULL) { >>>>>>>> -????????????? next_hit_cg = >>>>>>>> this->call_generator(next_receiver_method, >>>>>>>> -????????????????????????????????? vtable_index, >>>>>>>> !call_does_dispatch, jvms, >>>>>>>> -????????????????????????????????? 
allow_inline, prof_factor); >>>>>>>> -????????????? if (next_hit_cg != NULL && >>>>>>>> !next_hit_cg->is_inline() && >>>>>>>> -????????????????? have_major_receiver && >>>>>>>> UseOnlyInlinedBimorphic) { >>>>>>>> -????????????????? // Skip if we can't inline second receiver's >>>>>>>> method >>>>>>>> -????????????????? next_hit_cg = NULL; >>>>>>>> +????? if (receiver_methods[0] != NULL) { >>>>>>>> +??????? CallGenerator** hit_cgs = >>>>>>>> NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism)); >>>>>>>> +??????? memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, >>>>>>>> morphism)); >>>>>>>> + >>>>>>>> +??????? hit_cgs[0] = this->call_generator(receiver_methods[0], >>>>>>>> +??????????????????????????? vtable_index, !call_does_dispatch, >>>>>>>> jvms, >>>>>>>> +??????????????????????????? allow_inline, prof_factor); >>>>>>>> +??????? if (hit_cgs[0] != NULL) { >>>>>>>> +????????? if ((morphism == 2 && UseBimorphicInlining) || >>>>>>>> (morphism >= 2 && UsePolymorphicInlining)) { >>>>>>>> +??????????? for (int i = 1; i < morphism; i++) { >>>>>>>> +????????????? assert(profile.has_receiver(i), "no receiver at >>>>>>>> %d", i); >>>>>>>> +????????????? receiver_methods[i] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + profile.receiver(i)); >>>>>>>> +????????????? if (receiver_methods[i] != NULL) { >>>>>>>> +??????????????? hit_cgs[i] = >>>>>>>> this->call_generator(receiver_methods[i], >>>>>>>> +????????????????????????????????????? vtable_index, >>>>>>>> !call_does_dispatch, jvms, >>>>>>>> +????????????????????????????????????? allow_inline, prof_factor); >>>>>>>> +??????????????? if (hit_cgs[i] != NULL && >>>>>>>> !hit_cgs[i]->is_inline() && have_major_receiver && >>>>>>>> +??????????????????? ((morphism == 2 && UseOnlyInlinedBimorphic) >>>>>>>> || (morphism >= 2 && UseOnlyInlinedPolymorphic))) { >>>>>>>> +????????????????? // Skip if we can't inline non-major >>>>>>>> receiver's method >>>>>>>> +????????????????? 
hit_cgs[i] = NULL; >>>>>>>> +??????????????? } >>>>>>>> ?????????????????? } >>>>>>>> ???????????????? } >>>>>>>> ?????????????? } >>>>>>>> ?????????????? CallGenerator* miss_cg; >>>>>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>>>>> -?????????????????????????????????????????????? ? >>>>>>>> Deoptimization::Reason_bimorphic >>>>>>>> +????????? Deoptimization::DeoptReason reason = (morphism >= 2 >>>>>>>> +?????????????????????????????????????????????? ? >>>>>>>> Deoptimization::Reason_polymorphic >>>>>>>> ??????????????????????????????????????????????????? : >>>>>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>>>>> NULL)); >>>>>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg >>>>>>>> != NULL)) && >>>>>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>>>>> -???????????? ) { >>>>>>>> +????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>>>>> reason)) { >>>>>>>> ???????????????? // Generate uncommon trap for class check >>>>>>>> failure path >>>>>>>> -??????????? // in case of monomorphic or bimorphic virtual call >>>>>>>> site. >>>>>>>> +??????????? // in case of polymorphic virtual call site. >>>>>>>> ???????????????? miss_cg = >>>>>>>> CallGenerator::for_uncommon_trap(callee, reason, >>>>>>>> >>>>>>>> Deoptimization::Action_maybe_recompile); >>>>>>>> ?????????????? } else { >>>>>>>> ???????????????? // Generate virtual call for class check >>>>>>>> failure path >>>>>>>> -??????????? // in case of polymorphic virtual call site. >>>>>>>> +??????????? // in case of megamorphic virtual call site. >>>>>>>> ???????????????? miss_cg = >>>>>>>> CallGenerator::for_virtual_call(callee, vtable_index); >>>>>>>> ?????????????? } >>>>>>>> -????????? if (miss_cg != NULL) { >>>>>>>> -??????????? if (next_hit_cg != NULL) { >>>>>>>> +????????? for (int i = morphism - 1; i >= 1 && miss_cg != NULL; >>>>>>>> i--) { >>>>>>>> +??????????? 
if (hit_cgs[i] != NULL) { >>>>>>>> ?????????????????? assert(speculative_receiver_type == NULL, >>>>>>>> "shouldn't end up here if we used speculation"); >>>>>>>> -????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>>>>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>>>>>> +????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), receiver_methods[i], >>>>>>>> profile.receiver(i), site_count, profile.receiver_count(i)); >>>>>>>> ?????????????????? // We don't need to record dependency on a >>>>>>>> receiver here and below. >>>>>>>> ?????????????????? // Whenever we inline, the dependency is >>>>>>>> added by Parse::Parse(). >>>>>>>> -????????????? miss_cg = >>>>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, >>>>>>>> next_hit_cg, PROB_MAX); >>>>>>>> -??????????? } >>>>>>>> -??????????? if (miss_cg != NULL) { >>>>>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>>>>> speculative_receiver_type : profile.receiver(0); >>>>>>>> -????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, >>>>>>>> receiver_count); >>>>>>>> -????????????? float hit_prob = speculative_receiver_type != >>>>>>>> NULL ? 1.0 : profile.receiver_prob(0); >>>>>>>> -????????????? CallGenerator* cg = >>>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>>>>> -????????????? if (cg != NULL)? return cg; >>>>>>>> +????????????? miss_cg = >>>>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, >>>>>>>> hit_cgs[i], PROB_MAX); >>>>>>>> ???????????????? } >>>>>>>> ?????????????? } >>>>>>>> +????????? if (miss_cg != NULL) { >>>>>>>> +??????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>>>>> speculative_receiver_type : profile.receiver(0); >>>>>>>> +??????????? 
trace_type_profile(C, jvms->method(), jvms->depth() >>>>>>>> - 1, jvms->bci(), receiver_methods[0], k, site_count, >>>>>>>> profile.receiver_count(0)); >>>>>>>> +??????????? float hit_prob = speculative_receiver_type != NULL >>>>>>>> ? 1.0 : profile.receiver_prob(0); >>>>>>>> +??????????? CallGenerator* cg = >>>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], >>>>>>>> hit_prob); >>>>>>>> +??????????? if (cg != NULL)? return cg; >>>>>>>> +????????? } >>>>>>>> ???????????? } >>>>>>>> ????????? } >>>>>>>> ???????? } >>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> index 11df15e004..2d14b52854 100644 >>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> @@ -2382,7 +2382,7 @@ const char* >>>>>>>> Deoptimization::_trap_reason_name[] = { >>>>>>>> ?????? "class_check", >>>>>>>> ?????? "array_check", >>>>>>>> ?????? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>>> ?????? "profile_predicate", >>>>>>>> ?????? "unloaded", >>>>>>>> ?????? "uninitialized", >>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> index 1cfff5394e..c1eb998aba 100644 >>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>>>>> ???????? Reason_class_check,?????????? // saw unexpected object >>>>>>>> class (@bci) >>>>>>>> ???????? Reason_array_check,?????????? // saw unexpected array >>>>>>>> class (aastore @bci) >>>>>>>> ???????? Reason_intrinsic,???????????? // saw unexpected operand >>>>>>>> to intrinsic (@bci) >>>>>>>> -??? Reason_bimorphic,???????????? 
// saw unexpected object >>>>>>>> class in bimorphic inlining (@bci) >>>>>>>> +??? Reason_polymorphic,?????????? // saw unexpected object >>>>>>>> class in bimorphic inlining (@bci) >>>>>>>> >>>>>>>> #if INCLUDE_JVMCI >>>>>>>> ???????? Reason_unreached0???????????? = Reason_null_assert, >>>>>>>> ???????? Reason_type_checked_inlining? = Reason_intrinsic, >>>>>>>> -??? Reason_optimized_type_check?? = Reason_bimorphic, >>>>>>>> +??? Reason_optimized_type_check?? = Reason_polymorphic, >>>>>>>> #endif >>>>>>>> >>>>>>>> ???????? Reason_profile_predicate,???? // compiler generated >>>>>>>> predicate moved from frequent branch in a loop failed >>>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> index 94b544824e..ee761626c4 100644 >>>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry>>>>>>> mtClass>? KlassHashtableEntry; >>>>>>>> declare_constant(Deoptimization::Reason_class_check) \ >>>>>>>> declare_constant(Deoptimization::Reason_array_check) \ >>>>>>>> declare_constant(Deoptimization::Reason_intrinsic) \ >>>>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \ >>>>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \ >>>>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \ >>>>>>>> declare_constant(Deoptimization::Reason_unloaded) \ >>>>>>>> declare_constant(Deoptimization::Reason_uninitialized) \ >>>>>>>> From ekaterina.pavlova at oracle.com Tue Apr 7 21:05:55 2020 From: ekaterina.pavlova at oracle.com (Ekaterina Pavlova) Date: Tue, 7 Apr 2020 14:05:55 -0700 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> References: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> Message-ID: Thanks Vladimir, Running 
tier1-tier4 tests and not getting any regressions is very good. I would also recommend running other tiers as they contain more stress tests as well as jck. Doing it at least once before the integration would be very helpful and would help us avoid late issues. Please let me know if you need any help with this. regards, -katya On 4/7/20 2:39 AM, Vladimir Ivanov wrote: > Hi Katya, > >> what kind of testing has been done to verify these changes? >> Taking into account the changes are quite large and touch shared code, >> running hs compiler and perhaps runtime tiers would be very advisable. > > The changes (and previous versions) were tested in 2 modes: > > * ran through tier1-tier4 with the functionality turned OFF; (also, some previous version went through tier1-tier6 once) > > * unit tests on Vector API were run on different x86 hardware in the following modes: -XX:UseAVX=[3,2,1,0] -XX:UseSSE=[4,3,2]. Arm engineers tested the version in the vector-unstable branch on AArch64 hardware. > > As of now, the only known test failure is compiler/graalunit/HotspotTest.java in org.graalvm.compiler.hotspot.test.CheckGraalIntrinsics, which should be taught about the newly added JVM intrinsics. > > Best regards, > Vladimir Ivanov > >> On 4/3/20 4:12 PM, Vladimir Ivanov wrote: >>> Hi, >>> >>> Following up on review requests of the API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API: >>> >>> >>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ >>> >>> (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.)
>>> >>> Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with the JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in the C2 JIT-compiler. >>> >>> As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fallback behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)." >>> >>> The rest of the JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values. >>> It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available. >>> >>> The Vector API talk from JVMLS'18 [5] contains a brief overview of the JVM implementation and some details. >>> >>> The complete implementation resides in the vector-unstable branch of the panama/dev repository [6]. >>> >>> Now to the gory details (the patch is split in multiple "sub-webrevs"): >>> >>> =========================================================== >>> >>> (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ >>> >>> Ideal vector nodes for new operations introduced by Vector API. >>> >>> (Platform-specific back end support will be posted for review separately.) >>> >>> =========================================================== >>> >>> (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ >>> >>> JVM Java interface (VectorSupport) and intrinsic support in C2.
>>> >>> Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before EA pass. >>> >>> Vectors have 2-level on-heap representation: for the vector value primitive array is used as a backing storage and it is encapsulated in a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a int[8] instance which is used to store vector value). >>> >>> Unless VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with VectorBox node and use it during expansion. >>> >>> Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over. >>> >>> =========================================================== >>> >>> (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ >>> >>> Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].) >>> >>> The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or spilled on the stack and it allows to avoid boxing/unboxing when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata and all the heavy-lifting happens only when rematerialization takes place. >>> >>> The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). 
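The 2-level on-heap representation described above, a typed wrapper over a primitive backing array, can be sketched roughly as follows (this is an illustrative stand-in, not the real Int256Vector or jdk.incubator.vector code):

```java
// Illustrative sketch of a "2-level" vector box: a typed wrapper (level 1)
// holding a primitive array (level 2) as the backing storage for 8 int lanes.
public class IntVec8 {
    private final int[] lanes;   // backing storage, always length 8

    private IntVec8(int[] lanes) { this.lanes = lanes; }

    static IntVec8 broadcast(int v) {
        int[] a = new int[8];
        java.util.Arrays.fill(a, v);
        return new IntVec8(a);   // a fresh box; this is what box elimination removes
    }

    IntVec8 add(IntVec8 other) {
        int[] r = new int[8];
        for (int i = 0; i < 8; i++) {
            r[i] = lanes[i] + other.lanes[i];
        }
        return new IntVec8(r);   // every lane-wise op boxes again unless C2 elides it
    }

    int lane(int i) { return lanes[i]; }

    public static void main(String[] args) {
        IntVec8 v = IntVec8.broadcast(3).add(IntVec8.broadcast(4));
        System.out.println(v.lane(0)); // 7
    }
}
```

Each operation on such a wrapper allocates a new box, which is why the analysis above goes to such lengths to keep raw vector values in registers across safepoints and only rematerialize the wrapper when deoptimization actually needs it.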
>>> >>> (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point, thus enabling the original instance to go away.) >>> >>> =========================================================== >>> >>> (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ >>> >>> HotSpot changes for the jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-modules jdk.incubator.vector" on the command line to be able to use it. >>> When the user does that, the JVM automatically enables Vector API support. >>> It improves usability (the user doesn't need to separately "open" the API and enable JVM support) while minimizing the risk of destabilization from new code when the API is not used. >>> >>> >>> That's it! Will be happy to answer any questions. >>> >>> And thanks in advance for any feedback! >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >>> >>> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >>> >>> [2] https://openjdk.java.net/jeps/338 >>> >>> [3] https://openjdk.java.net/projects/panama/ >>> >>> [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >>> >>> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >>> >>> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >>> >>>
$ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable >> From igor.ignatyev at oracle.com Wed Apr 8 01:04:54 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Tue, 7 Apr 2020 18:04:54 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests Message-ID: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > 282 lines changed: 123 ins; 24 del; 135 mod; Hi all, could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in the form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. the patch also - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use shared classes which might use random. 
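As an aside for readers unfamiliar with the helper: the pattern the patch standardizes on can be sketched roughly as below. The property name and message text here are illustrative assumptions; the real helper is jdk.test.lib.Utils.getRandomInstance().

```java
import java.util.Random;

public class ReproducibleRandomSketch {
    // Sketch of a reproducible-random helper: take the seed from a system
    // property when one is given (replaying a failure), otherwise pick a
    // fresh seed, and always print it so the run can be reproduced.
    static Random getRandomInstance() {
        String prop = System.getProperty("test.seed");  // illustrative name
        long seed = (prop != null) ? Long.parseLong(prop)
                                   : new Random().nextLong();
        System.out.println("To re-run test with same seed value please add "
                           + "-Dtest.seed=" + seed);
        return new Random(seed);
    }

    public static void main(String[] args) {
        Random r = getRandomInstance();   // prints the seed for replay
        r.nextInt();                      // use it like a test would
        // The property the whole scheme relies on:
        // equal seeds give equal sequences.
        Random r1 = new Random(42);
        Random r2 = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            if (r1.nextInt() != r2.nextInt()) {
                throw new AssertionError("sequences diverged at step " + i);
            }
        }
    }
}
```

The point of the design is that the seed is always printed, so any failure seen with a "random" run can be replayed deterministically.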
webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in a few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 > 109 lines changed: 41 ins; 24 del; 44 mod; http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 > 84 lines changed: 82 ins; 0 del; 2 mod; http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > 282 lines changed: 123 ins; 24 del; 135 mod; NB the patch depends on 8241707[1], which is currently under review[2]. testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 [1] https://bugs.openjdk.java.net/browse/JDK-8241707 [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html Thanks, -- Igor From tobias.hartmann at oracle.com Wed Apr 8 06:17:46 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 8 Apr 2020 08:17:46 +0200 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> Message-ID: <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> Hi Igor, On 08.04.20 03:04, Igor Ignatyev wrote: > - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; What's the reason to use a fixed seed in the first place? Seems to me that even if the test does not directly use the random value, it doesn't hurt to use a non-fixed seed. In fact, wouldn't using a non-fixed seed increase coverage? Even if the value is not checked, it's still propagated through registers, stack and heap space and might therefore make a difference. > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 Looks good. 
Best regards, Tobias From rwestrel at redhat.com Wed Apr 8 07:32:16 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 09:32:16 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> Message-ID: <87wo6qbfgf.fsf@redhat.com> Thanks for the review, Vladimir. Roland. From jiefu at tencent.com Wed Apr 8 13:51:34 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Wed, 8 Apr 2020 13:51:34 +0000 Subject: RFR: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs Message-ID: Hi all, JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ Please review this trivial fix. It only adds -XX:+UnlockDiagnosticVMOptions in the test. Thanks a lot. Best regards, Jie From rwestrel at redhat.com Wed Apr 8 13:56:35 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 15:56:35 +0200 Subject: RFR: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs In-Reply-To: References: Message-ID: <87r1wyaxnw.fsf@redhat.com> > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ That looks good to me. Thanks for fixing this. Roland. From jiefu at tencent.com Wed Apr 8 14:03:00 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Wed, 8 Apr 2020 14:03:00 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) Message-ID: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Thanks for your review, Roland. 
Do you think it's trivial to be pushed now? Thanks a lot. Best regards, Jie On 2020/4/8, 9:56 PM, "Roland Westrelin" wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ That looks good to me. Thanks for fixing this. Roland. From igor.ignatyev at oracle.com Wed Apr 8 14:47:15 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 8 Apr 2020 07:47:15 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> Message-ID: <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> > On Apr 7, 2020, at 11:17 PM, Tobias Hartmann wrote: > > Hi Igor, > > On 08.04.20 03:04, Igor Ignatyev wrote: >> - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; > > What's the reason to use a fixed seed in the first place? Seems to me that even if the test does not > directly use the random value, it doesn't hurt to use a non-fixed seed. In fact, wouldn't using a > non-fixed seed increase coverage? Even if the value is not checked, it's still propagated through > registers, stack and heap space and might therefore make a difference. the thing is randomness (even reproducible) in tests comes w/ a price -- you have to be more careful when using such tests to verify fixes, compare results across different runs, etc. so in some cases, the possible gain in code coverage doesn't justify the drawbacks, and frankly I'm not a big fan of using something just b/c it might increase coverage in areas unrelated to the original goals of a test. 
I have to admit though that I had several internal discussions w/ myself; at first I removed almost all fixed seed values, then I was going back and forth weighing pros and cons; at the end I decided to leave it as-is for now and reevaluate later on a test-by-test basis. > >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > > Looks good. > > Best regards, > Tobias From tobias.hartmann at oracle.com Wed Apr 8 14:56:23 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 8 Apr 2020 16:56:23 +0200 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> Message-ID: <6009359f-bf3e-e37d-6f39-7f8a1c604a2c@oracle.com> Hi Igor, On 08.04.20 16:47, Igor Ignatyev wrote: > the thing is randomness (even reproducible) in tests comes w/ a price -- you have to be more careful when using such tests to verify fixes, compare results across different runs, etc. so in some cases, the possible gain in code coverage doesn't justify the drawbacks, and frankly I'm not a big fan of using something just b/c it might increase coverage in areas unrelated to the original goals of a test. I have to admit though that I had several internal discussions w/ myself; at first I removed almost all fixed seed values, then I was going back and forth weighing pros and cons; at the end I decided to leave it as-is for now and reevaluate later on a test-by-test basis. Okay, fair enough. I agree that this discussion is independent of your fix. 
Best regards, Tobias From rwestrel at redhat.com Wed Apr 8 15:11:35 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 17:11:35 +0200 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: <87o8s2au6w.fsf@redhat.com> > Do you think it's trivial to be pushed now? Yes I think it is. Roland. From vladimir.kozlov at oracle.com Wed Apr 8 18:10:06 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 8 Apr 2020 11:10:06 -0700 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: Please, also add: * @requires vm.compiler2.enabled because both Stress flags are C2 flags. Thanks, Vladimir On 4/8/20 7:03 AM, jiefu(傅杰) wrote: > Thanks for your review, Roland. > > Do you think it's trivial to be pushed now? > > Thanks a lot. > Best regards, > Jie > > On 2020/4/8, 9:56 PM, "Roland Westrelin" wrote: > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ > > That looks good to me. Thanks for fixing this. > > Roland. > > > > From vladimir.kozlov at oracle.com Wed Apr 8 18:54:48 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 8 Apr 2020 11:54:48 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> Message-ID: <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> Looks good. 
Thanks, Vladimir On 4/7/20 6:04 PM, Igor Ignatyev wrote: > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >> 282 lines changed: 123 ins; 24 del; 135 mod; > > Hi all, > > could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in a form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. > > the patch also > - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; > - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use share classes which might use random. > > webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 >> 109 lines changed: 41 ins; 24 del; 44 mod; > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 >> 84 lines changed: 82 ins; 0 del; 2 mod; > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >> 282 lines changed: 123 ins; 24 del; 135 mod; > > NB the patch depends on 8241707[1], which is currently under review[2]. 
> > testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 > JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 > > [1] https://bugs.openjdk.java.net/browse/JDK-8241707 > [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html > > Thanks, > -- Igor > From john.r.rose at oracle.com Wed Apr 8 20:11:38 2020 From: john.r.rose at oracle.com (John Rose) Date: Wed, 8 Apr 2020 13:11:38 -0700 Subject: is it time fully optimize long loops? (JDK-8223051) Message-ID: I see that strip mining [1] is pretty mature now. I think this may open up new options for dealing with an RFE for 64-bit iteration variables [2], specifically using some combination of predication and/or strip mining for strength-reducing 64-bit-tripcount loops into one or more 32-bit-tripcount loops. Because Project Panama works on loops over native addresses, and is attempting to produce code that is competitive with C code, it is necessary that Panama code uses 64-bit iteration variables ("long loops"), but it also expects that such loops get optimized fully, including (but not limited to) iteration range splitting, predication, unswitching (e.g., of type tests), and escape analysis. Some of this stuff works best (or only works) with 32-bit iteration variables (we can call them "short loops", can't we?). To get good performance today, Panama library code sometimes has to perform predication or strip mining manually, in Java code, but this is risky (like any premature optimization) because it makes the intention of the code more obscure to the real optimizer, such as C2. When we get long loops fully supported, Panama's performance model will get more reliable. But for now, Panama is making uncomfortable compromises (e.g., [3]). Getting the whole story working well, especially for explicitly vectorized loops, may require new intrinsics (such as [4]), but I think we can make progress with strip mining or predication alone. Is now a good time to investigate this? -- 
John [1] https://bugs.openjdk.java.net/browse/JDK-8186027 [2] https://bugs.openjdk.java.net/browse/JDK-8223051 [3] https://mail.openjdk.java.net/pipermail/panama-dev/2020-April/008411.html [4] https://bugs.openjdk.java.net/browse/JDK-8221358 From igor.ignatyev at oracle.com Wed Apr 8 22:31:24 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 8 Apr 2020 15:31:24 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> Message-ID: <8B9A462B-8594-4CEF-9102-813C47772ABE@oracle.com> Vladimir, Tobias, thank you for review! could you please also review 8241707 (on hotspot-dev) which prevents me from pushing this patch? -- Igor > On Apr 8, 2020, at 11:54 AM, Vladimir Kozlov wrote: > > Looks good. > > Thanks, > Vladimir > > On 4/7/20 6:04 PM, Igor Ignatyev wrote: >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >>> 282 lines changed: 123 ins; 24 del; 135 mod; >> Hi all, >> could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in a form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. 
>> the patch also >> - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; >> - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use shared classes which might use random. >> webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in a few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 >>> 109 lines changed: 41 ins; 24 del; 44 mod; >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 >>> 84 lines changed: 82 ins; 0 del; 2 mod; >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >>> 282 lines changed: 123 ins; 24 del; 135 mod; >> NB the patch depends on 8241707[1], which is currently under review[2]. >> testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 >> [1] https://bugs.openjdk.java.net/browse/JDK-8241707 >> [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html >> Thanks, >> -- Igor From jiefu at tencent.com Thu Apr 9 01:23:47 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Thu, 9 Apr 2020 01:23:47 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <87o8s2au6w.fsf@redhat.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> <87o8s2au6w.fsf@redhat.com> Message-ID: Pushed: http://hg.openjdk.java.net/jdk/jdk/rev/801bd63c32f2 Thanks. On 2020/4/8, 11:11 PM, "Roland Westrelin" wrote: > Do you think it's trivial to be pushed now? Yes I think it is. Roland. 
From jiefu at tencent.com Thu Apr 9 01:23:03 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Thu, 9 Apr 2020 01:23:03 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: <246370D4-F29E-41AB-A33F-94ED368DFD4B@tencent.com> On 2020/4/9, 2:11 AM, "Vladimir Kozlov" wrote: Please, also add: * @requires vm.compiler2.enabled because both Stress flags are C2 flags. Done. Thanks for your review, Vladimir K. From Yang.Zhang at arm.com Thu Apr 9 06:43:12 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 9 Apr 2020 06:43:12 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: Message-ID: Hi Update the patch a little. Could you please help to review it? http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ Test: tier1. -----Original Message----- From: aarch64-port-dev On Behalf Of Yang Zhang Sent: Friday, April 3, 2020 6:49 PM To: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I Hi, Could you please help to review this patch? In original reduce_add2I, dst may be the same as tmp2, which may get incorrect result. Some reduction operation instruct code formats are also cleaned up. 
JBS: https://bugs.openjdk.java.net/browse/JDK-8241911 Webrev: http://cr.openjdk.java.net/~yzhang/8241911/webrev.00/ Regards Yang From aph at redhat.com Thu Apr 9 09:41:59 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 10:41:59 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: Message-ID: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> On 4/9/20 7:43 AM, Yang Zhang wrote: > Hi > > Update the patch a little. Could you please help to review it? > http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ I've been trying to figure out why this code is so difficult to understand. I think it's because names like tmp1 and src1 are used regardless of what kind of thing tmp1 is. I suggest something like instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ match(Set dst (AddReductionVI i_src v_src)); ins_cost(INSN_COST); effect(TEMP v_tmp, TEMP i_tmp); format %{ "addv $v_tmp, T4S, $v_src\n\t" "umov $i_tmp, $v_tmp, S, 0\n\t" "addw $dst, $i_tmp, $i_src\t# add reduction4I" %} ins_encode %{ __ addv(as_FloatRegister($v_tmp$$reg), __ T4S, as_FloatRegister($v_src$$reg)); __ umov($i_tmp$$Register, as_FloatRegister($v_tmp$$reg), __ S, 0); __ addw($dst$$Register, $i_tmp$$Register, $i_src$$Register); %} ins_pipe(pipe_class_default); %} I think this makes the intent much clearer. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Thu Apr 9 11:21:42 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 9 Apr 2020 11:21:42 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: Hi Andrew >instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? -----Original Message----- From: Andrew Haley Sent: Thursday, April 9, 2020 5:42 PM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I On 4/9/20 7:43 AM, Yang Zhang wrote: > Hi > > Update the patch a little. Could you please help to review it? > http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ I've been trying to figure out why this code is so difficult to understand. I think it's because names like tmp1 and src1 are used regardless of what kind of thing tmp1 is. 
I suggest something like instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ match(Set dst (AddReductionVI i_src v_src)); ins_cost(INSN_COST); effect(TEMP v_tmp, TEMP i_tmp); format %{ "addv $v_tmp, T4S, $v_src\n\t" "umov $i_tmp, $v_tmp, S, 0\n\t" "addw $dst, $i_tmp, $i_src\t# add reduction4I" %} ins_encode %{ __ addv(as_FloatRegister($v_tmp$$reg), __ T4S, as_FloatRegister($v_src$$reg)); __ umov($i_tmp$$Register, as_FloatRegister($v_tmp$$reg), __ S, 0); __ addw($dst$$Register, $i_tmp$$Register, $i_src$$Register); %} ins_pipe(pipe_class_default); %} I think this makes the intent much clearer. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From kuaiwei.kw at alibaba-inc.com Thu Apr 9 11:58:36 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Thu, 09 Apr 2020 19:58:36 +0800 Subject: =?UTF-8?B?UkZSOiBoZWFwYmFzZSByZWdpc3RlciBjYW4gYmUgYWxsb2NhdGVkIGluIGNvbXByZXNzZWQg?= =?UTF-8?B?bW9kZQ==?= Message-ID: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> Hi, I made an enhancement for the aarch64 platform. It's based on the great work of https://bugs.openjdk.java.net/browse/JDK-8233743 and . In compressed oops mode, if the heapbase is zero, the JVM doesn't use the heapbase register to encode/decode. So it can be allocated by the JIT compiler. The webrev is: http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/ The bug link: https://bugs.openjdk.java.net/browse/JDK-8242449 Thanks, Kuai Wei From eric.c.liu at arm.com Thu Apr 9 12:17:08 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 9 Apr 2020 12:17:08 +0000 Subject: RFR(S):8242429:Better implementation for signed extract Message-ID: Hi, This is a small enhancement for the C2 compiler. For Java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". AArch64 has implemented this in back-end match rules, while AMD64 hasn't. 
Indeed, this pattern can be optimized in the mid-end by adding some simple transformations. Besides, "0 - (i >> 31)" could also be optimized to "i >>> 31". This patch adds two conversions: 1. URShiftINode: (i >> 31) >>> 31 ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | | / | | / | +---------+ | | RShiftI | | +---------+ | \ | \ | \ | +----------+ | URShiftI | +----------+ 2. SubINode: 0 - (i >> 31) ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ \ | \ | \ | \ | +---------+ +---------+ | ConI(0) | | RShiftI | +---------+ +---------+ \ | \ | \ | +------+ | SubI | +------+ With this patch, these two graphs above both can be optimized to below: +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | / | / | / +----------+ | URShiftI | +----------+ This patch solved the same issue for long type and also removed the relevant match rules in "aarch64.ad" which become useless now. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. -- Thanks, Eric From aph at redhat.com Thu Apr 9 12:21:22 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 13:21:22 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: On 4/9/20 12:21 PM, Yang Zhang wrote: > Hi Andrew > >> instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ > > Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? That's a good point. 
I'll accept http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ as it is, with a separate patch to clarify those reduction operations. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From eric.c.liu at arm.com Thu Apr 9 12:57:32 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 9 Apr 2020 12:57:32 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi, This is a small enhancement for the C2 compiler. For Java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". AArch64 has implemented this in back-end match rules, while AMD64 hasn't. Indeed, this pattern can be optimized in the mid-end by adding some simple transformations. Besides, "0 - (i >> 31)" could also be optimized to "i >>> 31". This patch adds two conversions: 1. URShiftINode: (i >> 31) >>> 31 ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | | / | | / | +---------+ | | RShiftI | | +---------+ | \ | \ | \ | +----------+ | URShiftI | +----------+ 2. SubINode: 0 - (i >> 31) ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ \ | \ | \ | \ | +---------+ +---------+ | ConI(0) | | RShiftI | +---------+ +---------+ \ | \ | \ | +------+ | SubI | +------+ With this patch, these two graphs above both can be optimized to below: +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | / | / | / +----------+ | URShiftI | +----------+ This patch solved the same issue for long type and also removed the relevant match rules in "aarch64.ad" which become useless now. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. 
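A reader can sanity-check the two rewrites described above in plain Java; this snippet is illustrative only and not part of the patch:

```java
public class SignExtractCheck {
    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -42, -1, 0, 1, 42, Integer.MAX_VALUE};
        for (int i : samples) {
            // Conversion 1: (i >> 31) >>> 31  ==>  i >>> 31
            assertEq((i >> 31) >>> 31, i >>> 31);
            // Conversion 2: 0 - (i >> 31)  ==>  i >>> 31
            // (i >> 31 is 0 or -1, so negating it yields 0 or 1.)
            assertEq(0 - (i >> 31), i >>> 31);
        }
        // Same identities for the long type, with shift count 63.
        long[] lsamples = {Long.MIN_VALUE, -1L, 0L, 7L, Long.MAX_VALUE};
        for (long l : lsamples) {
            assertEq((l >> 63) >>> 63, l >>> 63);
            assertEq(0L - (l >> 63), l >>> 63);
        }
    }

    static void assertEq(long a, long b) {
        if (a != b) throw new AssertionError(a + " != " + b);
    }
}
```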
-- Thanks, Eric From rwestrel at redhat.com Thu Apr 9 14:28:28 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 09 Apr 2020 16:28:28 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: References: Message-ID: <87imi8bunn.fsf@redhat.com> > Getting the whole story working well, especially for > explicitly vectorized loops, may require new intrinsics > (such as [4]), but I think we can make progress with strip > mining or predication alone. Is now a good time to > investigate this? I'll give it a shot. Roland. From aph at redhat.com Thu Apr 9 17:00:38 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 18:00:38 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> Message-ID: <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> Hi, On 4/9/20 12:58 PM, Kuai Wei wrote: > I made an enhancement for aarch64 platform. It's based on great work of https://bugs.openjdk.java.net/browse/JDK-8233743 > and . > > In compressed oops mode , if heapbase is zero, jvm don't use heapbase register to encode/decode. So it can be allocated by > JIT compiler. > > The webrev is: > http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/ > > The bug link: > https://bugs.openjdk.java.net/browse/JDK-8242449 That looks safe. I think the only reason we never did something like that before was because no-one felt brave enough, but perhaps we should do it now. MacroAssembler::reinit_heapbase() points to a potential problem, though: we generate some of this code before we know what the heapbase is going to be, so we unconditionally write to rheapbase. I think this only happens in three places: generate_call_stub, interpreter::generate_throw_exception, and interpreter::generate_native_entry, so we should be safe. It's tricky to test this stuff, though. 
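For context on why the register becomes free: compressed-oop decoding computes oop = heap_base + (narrow << shift), so with a zero base the heapbase register never enters the computation. A rough model in plain Java follows; this is an illustrative sketch, not HotSpot's actual code:

```java
public class CompressedOopsSketch {
    // Illustrative model of compressed-oop decoding:
    // oop = heapBase + ((long) narrow << shift).
    static long decode(long heapBase, int shift, int narrow) {
        return heapBase + (Integer.toUnsignedLong(narrow) << shift);
    }

    public static void main(String[] args) {
        int shift = 3;          // 8-byte object alignment
        int narrow = 0x12345;   // arbitrary compressed oop
        // With a non-zero base, the base register takes part in every decode.
        if (decode(0x8_0000_0000L, shift, narrow)
                != 0x8_0000_0000L + (0x12345L << 3)) {
            throw new AssertionError();
        }
        // With a zero base, decoding degenerates to a pure shift; the
        // register that would hold the base is never read, so the JIT can
        // hand it to the register allocator.
        if (decode(0L, shift, narrow) != (0x12345L << 3)) {
            throw new AssertionError();
        }
    }
}
```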
OK for mainline, and let's test it as much as we can. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vladimir.x.ivanov at oracle.com Thu Apr 9 18:29:18 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 9 Apr 2020 21:29:18 +0300 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher In-Reply-To: References: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> Message-ID: <967a7fb2-931a-e0fc-d8e0-88166d8ffe43@oracle.com> Thanks, Vladimir. Best regards, Vladimir Ivanov On 07.04.2020 20:43, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir > > On 4/7/20 10:29 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242289 >> >> Introduce a platform-specific entry point (Matcher::pd_clone_node) and >> move platform-specific node cloning during matching. >> >> Matcher processes every node only once unless it is marked as shared. >> It is too restrictive in some cases, so the workaround is to >> explicitly check for particular IR patterns and clone relevant nodes >> during matching phase. >> >> As an example, take a look at ShiftCntV. There are the following match >> rules in aarch64.ad: >> >> match(Set dst (RShiftVB src (RShiftCntV shift))); >> >> By default, RShiftCntV node is matched only once, so when it has >> multiple users, it will be folded only into one of them and for >> the rest the value it produces will be put in a register. To overcome >> that, Matcher is taught to detect such pattern and "clone" RShiftCntV >> input every time it matches RShiftV node. In case of RShiftCntV, it's >> arm32/aarch64-specific and other platforms (x86 in particular) don't >> optimize for it. 
>> >> To avoid polluting shared code (in matcher.cpp) with platform-specific >> portions, I propose to add Matcher::pd_clone_node and place >> platform-specific checks there. >> >> Also, as a cleanup, renamed Matcher::clone_address_expressions() to >> pd_clone_address_expressions since it's a platform-specific method. >> >> Testing: hs-precheckin-comp, hs-tier1, hs-tier2, >> ????????? cross-builds on all affected platforms >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From john.r.rose at oracle.com Thu Apr 9 21:59:40 2020 From: john.r.rose at oracle.com (John Rose) Date: Thu, 9 Apr 2020 14:59:40 -0700 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: <87imi8bunn.fsf@redhat.com> References: <87imi8bunn.fsf@redhat.com> Message-ID: On Apr 9, 2020, at 7:28 AM, Roland Westrelin wrote: > >> Is now a good time to >> investigate this? > > I'll give it a shot. Thanks Roland! From Yang.Zhang at arm.com Fri Apr 10 02:45:45 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 10 Apr 2020 02:45:45 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: Okay. When the patch is ready, I will send it for review. Regards Yang -----Original Message----- From: Andrew Haley Sent: Thursday, April 9, 2020 8:21 PM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I On 4/9/20 12:21 PM, Yang Zhang wrote: > Hi Andrew > >> instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, >> vecX v_tmp, iRegINoSp i_tmp) %{ > > Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? That's a good point. 
I'll accept http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ as it is, with a separate patch to clarify those reduction operations.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From Yang.Zhang at arm.com Fri Apr 10 02:52:45 2020
From: Yang.Zhang at arm.com (Yang Zhang)
Date: Fri, 10 Apr 2020 02:52:45 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690
Message-ID:

Hi, Could you please help to review this patch?

JBS: https://bugs.openjdk.java.net/browse/JDK-8242070
Webrev: http://cr.openjdk.java.net/~yzhang/8242070/webrev.00/

In JDK-8238690, it unified the IR shape for vector shifts by a scalar and always used ShiftV src (ShiftCntV shift). When shift is a scalar, the following IR nodes are generated.

    scalar_shift
        |        src
    ShiftCntV
        |       /
        |      /
      ShiftV

But when implementing this on AArch64, there is an issue in the match rule of vector shift right with imm shift for the short type.

    match(Set dst (RShiftVS src (LShiftCntV shift)));

LShiftCntV should be RShiftCntV here.

Test case:

    public static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short)(a[i] >> 2);
        }
    }

IR nodes:

      imm:2
        |       LoadVector
    RShiftCntV
        |        /
        |       /
     RShiftVS

C2 assembly generated:

Before:
    0x0000ffffac563764: orr  w11, wzr, #0x2
    0x0000ffffac563768: dup  v16.16b, w11            -------- vshiftcnt16B
    0x0000ffffac5637a8: ldr  q24, [x18, #16]
    0x0000ffffac5637ac: neg  v25.16b, v16.16b        ------
    0x0000ffffac5637b0: sshl v24.8h, v24.8h, v25.8h  ------ vsra8S
    0x0000ffffac5637b8: str  q24, [x14, #16]

"match(Set dst (RShiftVS src (LShiftCntV shift)));" matching fails. RShiftCntV and RShiftVS are matched separately by vshiftcnt16B and vsra8S.

After:
    0x0000ffffac563808: ldr  q16, [x15, #16]
    0x0000ffffac56380c: sshr v16.8h, v16.8h, #2
    0x0000ffffac563814: str  q16, [x14, #16]

"match(Set dst (RShiftVS src (RShiftCntV shift)));" matching succeeds.
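Whichever match rule fires, the generated code must preserve the Java semantics of the constant arithmetic shift; a minimal self-check of the shiftR pattern above (illustrative only, not part of the patch):

```java
public class ShiftRightCheck {
    // Same pattern as the shiftR test case above: arithmetic shift
    // right by a constant, narrowed back to short.
    static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short) (a[i] >> 2);
        }
    }

    public static void main(String[] args) {
        short[] a = { -32768, -5, -1, 0, 1, 5, 32767 };
        short[] c = new short[a.length];
        shiftR(a, c);
        // >> is an arithmetic shift: the sign bit is replicated,
        // so negative values round toward negative infinity.
        short[] expected = { -8192, -2, -1, 0, 0, 1, 8191 };
        for (int i = 0; i < a.length; i++) {
            if (c[i] != expected[i]) {
                throw new AssertionError("mismatch at " + i + ": " + c[i]);
            }
        }
        System.out.println("ok"); // → ok
    }
}
```

Running this on both a build with the typo and a fixed build should print the same result; only the instruction sequence (and the JMH score below) differs.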
Performance: The JMH test case is attached in the JBS.

Before:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  66.964 ± 0.052  us/op
After:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  56.156 ± 0.053  us/op

Testing: tier1 passes and no new failure.

Regards Yang

From kuaiwei.kw at alibaba-inc.com Fri Apr 10 04:16:30 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Fri, 10 Apr 2020 12:16:30 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com>, <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
Message-ID:

Hi Andrew, Thanks for your review. As you pointed out, some stubs are generated before the universe is fully initialized and they will reset r27 in reinit_heap. My initial thinking is they are not the problem. Interpreters can be safe because they are initialized after the heap. I can change them not to depend on the fully_initialized flag. I will check the call stubs to guarantee they are safe.

Thanks, Kuai Wei

------------------------------------------------------------------
From: Andrew Haley
Send Time: 2020-04-10 (Fri) 01:01
To: Kuai Wei ; hotspot compiler
Subject: Re: RFR: heapbase register can be allocated in compressed mode

Hi, On 4/9/20 12:58 PM, Kuai Wei wrote:
> I made an enhancement for the aarch64 platform. It's based on the great work of https://bugs.openjdk.java.net/browse/JDK-8233743
> and .
>
> In compressed oops mode, if the heap base is zero, the JVM doesn't use the heapbase register to encode/decode, so it can be allocated by
> the JIT compiler.
>
> The webrev is: http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/
> The bug link: https://bugs.openjdk.java.net/browse/JDK-8242449

That looks safe. I think the only reason we never did something like that before was because no-one felt brave enough, but perhaps we should do it now.
MacroAssembler::reinit_heapbase() points to a potential problem, though: we generate some of this code before we know what the heapbase is going to be, so we unconditionally write to rheapbase. I think this only happens in three places: generate_call_stub, interpreter::generate_throw_exception, and interpreter::generate_native_entry, so we should be safe. It's tricky to test this stuff, though. OK for mainline, and let's test it as much as we can. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From rwestrel at redhat.com Fri Apr 10 07:38:38 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 10 Apr 2020 09:38:38 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: References: <87imi8bunn.fsf@redhat.com> Message-ID: <87ftdbbxj5.fsf@redhat.com> Once the long loop is transformed to an int counted loop what are the optimizations that need to trigger reliably? In particular do we need range check elimination? Can you or someone from the panama project shar code samples that I can use to verify the long loop optimizes well? Roland. From HORIE at jp.ibm.com Fri Apr 10 08:47:42 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Fri, 10 Apr 2020 17:47:42 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked JTREG results, which look no problem. BTW, I cannot find further points of improvement in your change. 
Best regards, Michihiro

----- Original message -----
From: "Corey Ashford"
To: Michihiro Horie/Japan/IBM at IBMJP
Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero"
Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9
Date: Fri, Apr 3, 2020 8:07 AM

On 4/2/20 7:27 AM, Michihiro Horie wrote:
> Hi Corey,
>
> I'm not a reviewer, but I can run your benchmark in my local P9 node if
> you share it.
>
> Best regards,
> Michihiro

The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting the code for which it could predetermine the result. Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                        " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                        " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From rwestrel at redhat.com Fri Apr 10 07:38:38 2020
From: rwestrel at redhat.com (Roland Westrelin)
Date: Fri, 10 Apr 2020 09:38:38 +0200
Subject: is it time fully optimize long loops? (JDK-8223051)
In-Reply-To: References: <87imi8bunn.fsf@redhat.com>
Message-ID: <87ftdbbxj5.fsf@redhat.com>

Once the long loop is transformed to an int counted loop, what are the optimizations that need to trigger reliably? In particular, do we need range check elimination? Can you or someone from the panama project share code samples that I can use to verify the long loop optimizes well? Roland.

From aph at redhat.com Fri Apr 10 12:19:01 2020
From: aph at redhat.com (Andrew Haley)
Date: Fri, 10 Apr 2020 13:19:01 +0100
Subject: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
Message-ID: <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>

On 4/10/20 5:16 AM, Kuai Wei wrote:
> As you pointed out, some stubs are generated before the universe is fully
> initialized and they will reset r27 in reinit_heap. My initial
> thinking is they are not the problem. Interpreters can be safe because
> they are initialized after the heap. I can change them not to depend
> on the fully_initialized flag.

Please don't change that; there's no need. Loading r27 unnecessarily in these places does no harm.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.x.ivanov at oracle.com Fri Apr 10 14:07:08 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Fri, 10 Apr 2020 17:07:08 +0300
Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV
Message-ID: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com>

http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8242491

Asserts on input types for MacroLogicV are too strong. The SuperWord pass can mix vectors of distinct subword types (byte and boolean, or short and char). Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe.

Testing: hs-precheckin-comp, hs-tier1, hs-tier2

Thanks! Best regards, Vladimir Ivanov

From vladimir.x.ivanov at oracle.com Fri Apr 10 14:25:56 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Fri, 10 Apr 2020 17:25:56 +0300
Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg()
Message-ID:

http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8242492

Matcher::vector_shift_count_ideal_reg() was introduced specifically for x86 to communicate that only the low 32 bits are used by vector shift instructions, so only those bits should be spilled when needed. Unfortunately, it is broken for LShiftCntV/RShiftCntV: the Matcher doesn't capture the overridden ideal_reg value and spills use the bottom type instead. So, it causes a mismatch during RA. Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. Considering how simple the AD instructions for LShiftCntV/RShiftCntV are, RA prefers to rematerialize the value instead (which is a reg-to-reg move).
I propose to simplify the implementation and completely remove Matcher::vector_shift_count_ideal_reg() along with the additional special handling logic for LShiftCntV/RShiftCntV.

Testing: hs-precheckin-comp, hs-tier1, hs-tier2

Thanks! Best regards, Vladimir Ivanov

From john.r.rose at oracle.com Sat Apr 11 05:37:23 2020
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 10 Apr 2020 22:37:23 -0700
Subject: is it time fully optimize long loops? (JDK-8223051)
In-Reply-To: <87d08fbmyn.fsf@redhat.com>
References: <87imi8bunn.fsf@redhat.com> <87ftdbbxj5.fsf@redhat.com> <87d08fbmyn.fsf@redhat.com>
Message-ID:

On Apr 10, 2020, at 4:26 AM, Roland Westrelin wrote:
>
>> Once the long loop is transformed to an int counted loop what are the
>> optimizations that need to trigger reliably? In particular do we need
>> range check elimination? Can you or someone from the panama project share
>> code samples that I can use to verify the long loop optimizes well?
>
> I see now that you mentioned RCE in JDK-8223051.

RCE focuses on comparisons against array lengths but it is more general than that. If long loops are strip mined into short loops, and if the range checks in those short loops are somehow transformed into 32-bit comparisons, they should be amenable to RCE transformations. I hope we don't need to generalize RCE transformations to know about 64-bit comparisons; that seems to be harder.

--
John

From kuaiwei.kw at alibaba-inc.com Mon Apr 13 01:32:45 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Mon, 13 Apr 2020 09:32:45 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>
References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>, <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>
Message-ID: <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com>

Ok, I will keep the original change. I cannot push to the tip branch. Can you help me push it? Or do we need another review?

Thanks, Kuai Wei

------------------------------------------------------------------
From: Andrew Haley
Send Time: 2020-04-10 (Fri) 20:19
To: Kuai Wei ; hotspot compiler
Subject: Re: RFR: heapbase register can be allocated in compressed mode

On 4/10/20 5:16 AM, Kuai Wei wrote:
> As you pointed out, some stubs are generated before the universe is fully
> initialized and they will reset r27 in reinit_heap. My initial
> thinking is they are not the problem. Interpreters can be safe because
> they are initialized after the heap. I can change them not to depend
> on the fully_initialized flag.

Please don't change that; there's no need. Loading r27 unnecessarily in these places does no harm.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 13 02:22:40 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 13 Apr 2020 02:22:40 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> , <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> Message-ID: Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From vladimir.x.ivanov at oracle.com Mon Apr 13 08:41:21 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 13 Apr 2020 11:41:21 +0300 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi Eric, I was confused at first by what "signed extract" means. It should be "sign extract". Overall, the changes look good. One comment: (i >> 31) >>> 31 ==> i >>> 31 The shift count value is irrelevant here, isn't it? So, the transformation can be generalized to: (i >> n) >>> 31 ==> i >>> 31 Best regards, Vladimir Ivanov On 09.04.2020 15:17, Eric Liu wrote: > Hi, > > This is a small enhancement for C2 compiler. > > > For java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". > AArch64 has implemented this in back-end match rules, while AMD64 > hasn?t. > > Indeed, this pattern can be optimized in mid-end by adding some simple > transformations. Besides, "0 - (i >> 31)" could also be optimized to > "i >>> 31". > > This patch adds two conversions: > > 1. 
URShiftINode: (i >> 31) >>> 31 ==> i >>> 31
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>     |      /       |
>     |     /        |
>     |    /         |
>  +---------+       |
>  | RShiftI |       |
>  +---------+       |
>        \           |
>         \          |
>          \         |
>       +----------+
>       | URShiftI |
>       +----------+
>
> 2. SubINode: 0 - (i >> 31) ==> i >>> 31
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>       \         |
>        \        |
>         \       |
>          \      |
> +---------+   +---------+
> | ConI(0) |   | RShiftI |
> +---------+   +---------+
>       \          |
>        \         |
>         \        |
>        +------+
>        | SubI |
>        +------+
>
> With this patch, these two graphs above can both be optimized to the one below:
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>     |       /
>     |      /
>     |     /
>     |    /
> +----------+
> | URShiftI |
> +----------+
>
> This patch solved the same issue for the long type and also removed the
> relevant match rules in "aarch64.ad" which become useless now.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8242429
> Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/
>
> [Tests]
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1.
> No new failure found.
>
> -- Thanks, Eric

From kuaiwei.kw at alibaba-inc.com Mon Apr 13 09:52:33 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Mon, 13 Apr 2020 17:52:33 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: References: Message-ID:

Hi Pengfei, Thanks for your help. Kuai Wei

------------------------------------------------------------------
From: Pengfei Li
Send Time: 2020-04-13 (Mon) 10:37
To: Kuai Wei
; Andrew Haley ; hotspot compiler Cc:nd Subject:RE: RFR: heapbase register can be allocated in compressed mode Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From sandhya.viswanathan at intel.com Mon Apr 13 17:02:15 2020 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Mon, 13 Apr 2020 17:02:15 +0000 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: Hi Vladimir, Your change looks good to me. Best Regards, Sandhya -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Ivanov Sent: Friday, April 10, 2020 7:07 AM To: hotspot compiler Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8242491 Asserts on input types for MacroLogicV are too strong. SuperWord pass can mix vectors of distinct subword types (byte and boolean or short and char). Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe. Testing: hs-precheckin-comp,hs-tier1,hs-tier2 Thanks! Best regards, Vladimir Ivanov From xxinliu at amazon.com Mon Apr 13 17:33:54 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Mon, 13 Apr 2020 17:33:54 +0000 Subject: FR[M]: 8151779: Some intrinsic flags could be replaced with one general flag Message-ID: Hi, compiler developers, I attempt to refactor UseXXXIntrinsics for JDK-8151779. 
I think we still need to keep UseXXXIntrinsics options because many applications may be using them. My change provide 2 new features: 1) a shorthand to enable/disable intrinsics. A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. If the tailing symbol is missing, it means enable. Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics 2) provide a set of macro to declare intrinsic options Developers declare once in intrinsics.hpp and macros will take care all other places. Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? I plan to write a gtest to test intrinsics.cpp and finalize the webrev when Ion finalize his overhaul. But here is quick preview of my change. I really appreciate if you can give me some feedback. 
https://cr.openjdk.java.net/~xliu/8151779/00/webrev/ I use -XX:+PrintFlagsFinal to verify my expression work or not. eg. $java -XX:UseIntrinsics=",AESCTR-,CRC32C,,CRC32-,,MathExact," -XX:+PrintFlagsFinal -version |& grep "Use.*Intrinsics" Thanks. --lx From jatin.bhateja at intel.com Mon Apr 13 19:07:00 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Mon, 13 Apr 2020 19:07:00 +0000 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: +1 Looks good to me. Regards, Jatin > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Viswanathan, Sandhya > Sent: Monday, April 13, 2020 10:32 PM > To: Vladimir Ivanov ; hotspot compiler > > Subject: RE: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) > failed: mismatch when creating MacroLogicV > > Hi Vladimir, > > Your change looks good to me. > > Best Regards, > Sandhya > > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Vladimir Ivanov > Sent: Friday, April 10, 2020 7:07 AM > To: hotspot compiler > Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: > mismatch when creating MacroLogicV > > http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242491 > > Asserts on input types for MacroLogicV are too strong. > SuperWord pass can mix vectors of distinct subword types (byte and boolean > or short and char). > > Though it's possible to explicitly check for such particular cases, the fix > relaxes the assert even more and only verifies that inputs are of the same > size (in bytes), so bitwise reinterpretation of vector values is safe. > > Testing: hs-precheckin-comp,hs-tier1,hs-tier2 > > Thanks! 
> > Best regards, > Vladimir Ivanov From cjashfor at linux.ibm.com Mon Apr 13 20:42:40 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Mon, 13 Apr 2020 13:42:40 -0700 Subject: FR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: References: Message-ID: <2cfeb040-72a6-7e40-8356-56ee3bda3cdf@linux.ibm.com> On 4/13/20 10:33 AM, Liu, Xin wrote: > Hi, compiler developers, > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > My change provide 2 new features: > 1) a shorthand to enable/disable intrinsics. > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > If the tailing symbol is missing, it means enable. > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > 2) provide a set of macro to declare intrinsic options > Developers declare once in intrinsics.hpp and macros will take care all other places. > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. > Great idea, though to be consistent with the original syntax, I think the +/- should be in front of the name: -XX:UseIntrinsics=-AESCTR,+CRC32C,... > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. 
It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? Some (many?) intrinsic options turn on more than one .ad instruct instrinsic, or library instrinsics at the same time. I think that's why the plural is there. Also, consistently adding the plural allows you to add more capabilities to a flag that initially only had one intrinsic without changing the plurality (and thus backward compatibility). Regards, - Corey From xxinliu at amazon.com Tue Apr 14 07:41:25 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 14 Apr 2020 07:41:25 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> Message-ID: <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Hi, Wei, Your change of aarch64.ad is definitely correct, but I feel that's the only place c2 refers to reg_class heapbase_reg. If it's gone, is that possible we use R27 no matter what UseCompressedOops is? I read JDK-8234794 but I don't understand why that change involves in r27 and CompressedOop. Btw, I think you can just keep the assignment in MacroAssembler::reinit_heapbase() for simplicity. Leaving a comment is better. I think Assignment of rheapbase is harmless. Only c2-generated code will use rheapbase and it's for locals. I still can pass hotspot-tier1 without your change of macroAssembler_aarch64.cpp. Another argument is that your change of reinit_heapbase() makes verify_heapbase() more complex. 
I don't know why it is commented out, but it looks quite easy to fix currently. Thanks, --lx ?On 4/13/20, 2:55 AM, "hotspot-compiler-dev on behalf of Kuai Wei" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Pengfei, Thanks for your help. Kuai Wei ------------------------------------------------------------------ From:Pengfei Li Send Time:2020?4?13?(???) 10:37 To:??(??) ; Andrew Haley ; hotspot compiler Cc:nd Subject:RE: RFR: heapbase register can be allocated in compressed mode Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From Pengfei.Li at arm.com Tue Apr 14 08:38:37 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Tue, 14 Apr 2020 08:38:37 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Message-ID: Hi Xin, > I read JDK-8234794 but I don't understand why that change involves in r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we have to consider if r27 is allocatable if compressed class pointers is on. But after that patch, r27 is for compressed oops only. 
That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From xxinliu at amazon.com Tue Apr 14 09:37:22 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 14 Apr 2020 09:37:22 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Message-ID: <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> Hi, Pengfei and Kuai, Thanks to point out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it. thanks! --lx ?On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves in r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we have to consider if r27 is allocatable if compressed class pointers is on. But after that patch, r27 is for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. 
-- Thanks, Pengfei From kuaiwei.kw at alibaba-inc.com Tue Apr 14 13:25:01 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Tue, 14 Apr 2020 21:25:01 +0800 Subject: Re: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com>, <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> Message-ID: <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> Hi Xin and Pengfei, Thanks for your comments. I checked the change in reinit_heapbase and decided to revert it, since there is no harm in setting rheapbase. I also made a change in verify_heapbase in case someone wants to enable this check again. The new patch is at http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tier 1 testing without new failures. Thanks, Kuai Wei ------------------------------------------------------------------ From: Liu, Xin Send Time: 2020-04-14 (Tue) 17:37 To: Pengfei Li ; Kuai Wei ; Andrew Haley ; hotspot compiler Cc: nd Subject: Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing that out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it, thanks! --lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix.
It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 was used for both compressed oops and compressed class pointers, so at that time we had to consider whether r27 was allocatable when compressed class pointers were on. But after that patch, r27 is used for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From martin.doerr at sap.com Tue Apr 14 13:26:08 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Tue, 14 Apr 2020 13:26:08 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, thanks for contributing it. Looks good to me. I'll run it through our testing and let you know about the results. Best regards, Martin From: ppc-aix-port-dev On Behalf Of Michihiro Horie Sent: Friday, 10. April 2020 10:48 To: cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked the JTREG results, which show no problems. BTW, I cannot find any further points of improvement in your change.
Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" > To: Michihiro Horie/Japan/IBM at IBMJP Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero" > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Fri, Apr 3, 2020 8:07 AM On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine. Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                                   " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                                   " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From martin.doerr at sap.com Tue Apr 14 14:07:06 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Tue, 14 Apr 2020 14:07:06 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Message-ID: Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4, which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default, which doesn't make sense to me. PPC64 has an automatic prefetch engine, and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check the performance impact of changing AllocatePrefetchLines + Distance, I'll be glad to receive feedback.
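[Editorial aside on the reverseBytes() thread above] What the Power9 patch accelerates is plain byte reversal, exactly as Long.reverseBytes() specifies it. A portable sketch of that semantics, checked against the JDK method (the intrinsic replacement itself is generated by the JIT, not written like this):

```java
// Byte-by-byte equivalent of Long.reverseBytes(); on Power9 the JIT
// intrinsic can emit a single brd instruction instead of a sequence of
// shifts and masks like this loop.
class ReverseBytesSketch {
    static long reverseLong(long v) {
        long r = 0;
        for (int i = 0; i < 8; i++) {
            r = (r << 8) | (v & 0xFF); // append lowest byte of v
            v >>>= 8;
        }
        return r;
    }

    public static void main(String[] args) {
        long x = 0x1122334455667788L;
        if (reverseLong(x) != Long.reverseBytes(x)) throw new AssertionError();
        System.out.println(Long.toHexString(reverseLong(x))); // 8877665544332211
    }
}
```

Reversing twice is the identity, which is what the benchmarks in this thread rely on for their self-check.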
Best regards, Martin From tom.rodriguez at oracle.com Tue Apr 14 20:44:20 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Tue, 14 Apr 2020 13:44:20 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> Message-ID: <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> Vladimir Kozlov wrote on 4/3/20 5:41 PM: > I think new code in deoptimize.cpp should be JVMCI specific. > > I filed 8242150 for the serviceability test failures in testing. It seems > caused by recent changes. > > It is weird to see SPARC_32 checks in deoptimization.cpp, which we should > not have in new code: > > #ifdef _LP64 >         jlong res = (jlong) *((jlong *) &val); > #else > #ifdef SPARC >       // For SPARC we have to swap high and low words. > > We haven't supported such a configuration in eons. Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. Should I remove those and the logic in my new code? output.cpp appears to have a case as well. > > I don't see where _support_large_access_byte_array_virtualization is > checked. If it is only in Graal then it should be guarded by #if. I'll add the requested ifdefs. tom > > Thanks, > Vladimir > > On 4/3/20 12:37 PM, Tom Rodriguez wrote: >> >> >> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>> Hi Tom, >>> >>> I looked at the testing results and one test fails consistently: >>> >>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >> >> >> Sorry, that was an old mach5 run and I forgot to update with the new >> one. There are some failures but they seem unrelated to me. >> >> tom >> >>> >>> >>> Vladimir K >>> >>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>> >>>> This adds support for deoptimizing with non-byte primitive values >>>> stored on top of a byte array, similarly to the way that a double or >>>> long can be stored on top of 2 int fields. More detail is provided >>>> in the bug report and new unit tests exercise the deoptimization. >>>> mach5 testing is in progress. >>>> >>>> tom From vladimir.kozlov at oracle.com Tue Apr 14 21:07:42 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 14 Apr 2020 14:07:42 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> Message-ID: <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> On 4/14/20 1:44 PM, Tom Rodriguez wrote: > > > Vladimir Kozlov wrote on 4/3/20 5:41 PM: >> I think new code in deoptimize.cpp should be JVMCI specific. >> >> I filed 8242150 for the serviceability test failures in testing. It seems caused by recent changes. >> >> It is weird to see SPARC_32 checks in deoptimization.cpp, which we should not have in new code: >> >> #ifdef _LP64 >>         jlong res = (jlong) *((jlong *) &val); >> #else >> #ifdef SPARC >>       // For SPARC we have to swap high and low words. >> >> We haven't supported such a configuration in eons. > > Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like > http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084.
Should I remove those and > the logic in my new code? output.cpp appears to have a case as well. No, we will remove them soon for JEP 381: Remove the Solaris and SPARC Ports. I don't want you to add a new case. > >> >>> >>> I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it should >>> be guarded by #if. >> >> I'll add the requested ifdefs. > > Good. Thanks, Vladimir > > tom > >> >> Thanks, >> Vladimir >> >> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>> >>> >>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>> Hi Tom, >>>> >>>> I looked at the testing results and one test fails consistently: >>>> >>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>> >>> >>> Sorry, that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem >>> unrelated to me. >>> >>> tom >>> >>>> >>>> >>>> Vladimir K >>>> >>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>> >>>>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the >>>>> way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new >>>>> unit tests exercise the deoptimization. mach5 testing is in progress.
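[Editorial aside on 8231756] The change under review lets deoptimization rematerialize a scalar-replaced byte[] even when the compiler proved that a wider primitive, such as a long, was written across several of its elements. A hedged sketch of the packing involved (ByteBuffer is used purely for illustration; HotSpot's actual code reassembles the values directly in the frame, and the byte order must match the platform's array layout):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Write a long over 8 consecutive byte-array slots and read it back,
// mirroring the reassembly a deoptimizing VM must perform when a
// virtualized byte[] encodes a non-byte primitive.
class ByteArrayVirtualization {
    static void putLong(byte[] a, int off, long v) {
        ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).putLong(off, v);
    }

    static long getLong(byte[] a, int off) {
        return ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong(off);
    }

    static long roundTrip(long v) {
        byte[] a = new byte[8];
        putLong(a, 0, v);
        return getLong(a, 0);
    }

    public static void main(String[] args) {
        if (roundTrip(0x1122334455667788L) != 0x1122334455667788L)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

The SPARC discussion in the thread is about exactly this step: on a 32-bit big-endian target the two halves of the long land in the opposite order, hence the word-swapping code being debated.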
>>>>> >>>>> tom From xxinliu at amazon.com Wed Apr 15 03:16:55 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Wed, 15 Apr 2020 03:16:55 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> Message-ID: <781CB090-0386-4D32-8465-8238E516789B@amazon.com> Hi, Wei, LGTM. Thanks. --lx From: Kuai Wei Reply-To: Kuai Wei Date: Tuesday, April 14, 2020 at 6:26 AM To: "Liu, Xin" , Pengfei Li , Andrew Haley , hotspot compiler Cc: nd Subject: RE: RFR: heapbase register can be allocated in compressed mode Hi Xin and Pengfei, Thanks for your comments. I checked the change in reinit_heapbase and decided to revert it, since there is no harm in setting rheapbase. I also made a change in verify_heapbase in case someone wants to enable this check again. The new patch is at http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tier 1 testing without new failures. Thanks, Kuai Wei ------------------------------------------------------------------ From: Liu, Xin Send Time: 2020-04-14 (Tue) 17:37 To: Pengfei Li ; Kuai Wei ; Andrew Haley ; hotspot compiler Cc: nd Subject: Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing that out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it, thanks!
--lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 was used for both compressed oops and compressed class pointers, so at that time we had to consider whether r27 was allocatable when compressed class pointers were on. But after that patch, r27 is used for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From martin.doerr at sap.com Wed Apr 15 12:33:16 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Wed, 15 Apr 2020 12:33:16 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi again, testing didn't show any new issues. Only the copyright years should get updated before pushing. Is there already a sponsor or do you want me to push it? Best regards, Martin From: Doerr, Martin Sent: Tuesday, 14. April 2020 15:26 To: Michihiro Horie ; cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: RE: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, thanks for contributing it. Looks good to me. I'll run it through our testing and let you know about the results. Best regards, Martin From: ppc-aix-port-dev > On Behalf Of Michihiro Horie Sent: Friday, 10.
April 2020 10:48 To: cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked the JTREG results, which show no problems. BTW, I cannot find any further points of improvement in your change. Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" > To: Michihiro Horie/Japan/IBM at IBMJP Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero" > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Fri, Apr 3, 2020 8:07 AM On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine.
Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                                   " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                                   " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From vladimir.kozlov at oracle.com Wed Apr 15 18:12:53 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 11:12:53 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com>
<19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> Message-ID: <73936b07-976d-52aa-6427-339878a571b0@oracle.com> After discussion with Tom offline I agree to keep his SPARC code because we would need to backport this later into 11u. Thanks, Vladimir On 4/14/20 2:07 PM, Vladimir Kozlov wrote: > On 4/14/20 1:44 PM, Tom Rodriguez wrote: >> >> >> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>> I think new code in deoptimize.cpp should be JVMCI specific. >>> >>> I filed 8242150 for serviceability tests failures in testing. It seems caused by recent changes. >>> >>> It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: >>> >>> #ifdef _LP64 >>> ???????? jlong res = (jlong) *((jlong *) &val); >>> #else >>> #ifdef SPARC >>> ?????? // For SPARC we have to swap high and low words. >>> >>> We don't support such configuration for eons. >> >> Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like >> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. ?Should remove those >> and the logic in my new code?? output.cpp appears to have a case as well. > > No, we will remove them soon for JEP: 381: Remove the Solaris and SPARC Ports. > > I don't want you to add new case. > >> >>> >>> I don't see? where _support_large_access_byte_array_virtualization? is checked. If it is only in Graal then it should >>> be guarded by #if. >> >> I'll add the requested ifdefs. > > Good. 
> > Thanks, > Vladimir > >> >> tom >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>> >>>> >>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>> Hi Tom, >>>>> >>>>> I looked at the testing results and one test fails consistently: >>>>> >>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>> >>>> >>>> Sorry, that was an old mach5 run and I forgot to update with the new >>>> one. There are some failures but they seem unrelated to me. >>>> >>>> tom >>>> >>>>> >>>>> >>>>> Vladimir K >>>>> >>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>> >>>>>> This adds support for deoptimizing with non-byte primitive values >>>>>> stored on top of a byte array, similarly to the way that a double >>>>>> or long can be stored on top of 2 int fields. More detail is >>>>>> provided in the bug report and new unit tests exercise the >>>>>> deoptimization. mach5 testing is in progress. >>>>>> >>>>>> tom From tkachuk.vladyslav at gmail.com Wed Apr 15 22:05:27 2020 From: tkachuk.vladyslav at gmail.com (Vladyslav Tkachuk) Date: Thu, 16 Apr 2020 00:05:27 +0200 Subject: Master Thesis Research Advice. JIT Message-ID: Hello, I am a Master's student at the University of Passau, Germany. My master thesis research is concerned with detecting equivalent mutants in Java. The main research question is whether the Trivial Compiler Equivalency technique can be applied. This means that we acquire the assembly code produced by the Java JIT compiler for the original and mutated sources and then compare them. I have previously contacted Tobias Hartmann, who advised me to write here regarding technical questions. I would like to ask you if there is any solution to a problem I have. Last time Tobias recommended me to use Opto-Assembly to achieve my purpose. It was a good hint and it helped me to get more precise data.
However, after doing some research I noticed that in some cases the C2 compiler unloaded the method code which I expected to find in the assembly. As I found out, this was part of deoptimization, and the method code was meant to be executed by the interpreter. Here is an example of what I mean:

{method}
  - this oop: 0x000000000d2319c8
  - method holder: 'Rational'
  - constants: 0x000000000d230cf8 constant pool [85] {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8
  - access: 0x81000001 public
  - name: 'toString'
  - signature: '()Ljava/lang/String;'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
some setup code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
02c   movq RBP, RDX   # spill
02f   movl RDX, #11   # int
      nop             # 3 bytes pad for loops and calls
037   call,static wrapper for: uncommon_trap(reason='unloaded' action='reinterpret' index='11')
      # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_
      # OopMap{rbp=Oop off=60}
03c   int3            # ShouldNotReachHere
03c

This is a 'toString' method, and as I could see and understand, there is no actual method code, but only a call to the uncommon trap. I would like to know if it is possible to completely disable any deoptimizations and consistently receive the full assembly code. I concede that it is not practical and hurts performance, but that is not a goal in this scope. According to my observations, in most cases the method code is complete, but strangely here it did not work. I have tried to find useful information online; unfortunately, I did not see anything helpful beyond explanations of what deoptimization is and its types. I would be grateful if you could shed some light on the issue. Thanks in advance for any useful information.
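[Editorial aside] A trap with reason='unloaded' is emitted when C2 compiles a method before some class or constant it references has been resolved, so the compiler plants a placeholder that falls back to the interpreter. A common way to get complete Opto-Assembly is therefore to warm the method up before compilation. A minimal sketch, assuming a stand-in Rational class (this is not the thesis code, and the iteration count is an assumed warm-up threshold); run with e.g. -Xbatch and a diagnostic print flag when inspecting the output:

```java
// Warm up toString() so every class/constant it touches is resolved
// before C2 compiles it; this usually removes
// uncommon_trap(reason='unloaded') placeholders from the generated code.
class WarmupSketch {
    static final class Rational {       // hypothetical stand-in class
        final int num, den;
        Rational(int n, int d) { num = n; den = d; }
        @Override public String toString() { return num + "/" + den; }
    }

    static long run(int iters) {
        Rational r = new Rational(1, 3);
        long sink = 0;
        for (int i = 0; i < iters; i++) {
            sink += r.toString().length(); // keeps the call from being dead code
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(20_000)); // 60000
    }
}
```

This does not disable deoptimization outright, but with everything resolved at compile time the method body is compiled in full rather than replaced by a trap call.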
Best regards, Vladyslav Tkachuk From vladimir.kozlov at oracle.com Wed Apr 15 23:29:25 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 16:29:25 -0700 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> Good. Thanks, Vladimir K On 4/10/20 7:07 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242491 > > Asserts on input types for MacroLogicV are too strong. > The SuperWord pass can mix vectors of distinct subword types (byte and boolean, or short and char). > > Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only > verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2 > > Thanks! > > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Wed Apr 15 23:33:54 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 16:33:54 -0700 Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg() In-Reply-To: References: Message-ID: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> Good. Thanks, Vladimir K On 4/10/20 7:25 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242492 > > Matcher::vector_shift_count_ideal_reg() was introduced specifically for x86 to communicate that only the low 32 bits are > used by vector shift instructions, so only those bits should be spilled when needed. > > Unfortunately, it is broken for LShiftCntV/RShiftCntV: the Matcher doesn't capture the overridden ideal_reg value and spills use > the bottom type instead.
So, it causes a mismatch during RA. > > Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. Considering how simple AD instructions for > LShiftCntV/RShiftCntV are, RA prefers to rematerialize the value instead (which is a reg-to-reg move). > > I propose to simplify the implementation and completely remove Matcher::vector_shift_count_ideal_reg() along with > additional special handling logic for LShiftCntV/RShiftCntV. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2 > > Thanks! > > Best regards, > Vladimir Ivanov From tom.rodriguez at oracle.com Thu Apr 16 00:34:35 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Wed, 15 Apr 2020 17:34:35 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <73936b07-976d-52aa-6427-339878a571b0@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> <73936b07-976d-52aa-6427-339878a571b0@oracle.com> Message-ID: <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> I've updated the webrev in place with the new ifdefs in deoptimization.cpp. The mach5 run was clean apart from known failures. tom Vladimir Kozlov wrote on 4/15/20 11:12 AM: > After discussion with Tom offline I agree to keep his SPARC code because > we would need to backport this later into 11u. > > Thanks, > Vladimir > > On 4/14/20 2:07 PM, Vladimir Kozlov wrote: >> On 4/14/20 1:44 PM, Tom Rodriguez wrote: >>> >>> >>> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>>> I think new code in deoptimize.cpp should be JVMCI specific. >>>> >>>> I filed 8242150 for serviceability tests failures in testing. It >>>> seems caused by recent changes. 
>>>> >>>> It is weird to see SPARC_32 checks in deoptimization.cpp which we >>>> should not have in new code: >>>> >>>> #ifdef _LP64 >>>> ???????? jlong res = (jlong) *((jlong *) &val); >>>> #else >>>> #ifdef SPARC >>>> ?????? // For SPARC we have to swap high and low words. >>>> >>>> We don't support such configuration for eons. >>> >>> Currently there are 3 places in deoptimization.cpp that handle sparc >>> 32 bit, like >>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. >>> ?Should remove those and the logic in my new code?? output.cpp >>> appears to have a case as well. >> >> No, we will remove them soon for JEP: 381: Remove the Solaris and >> SPARC Ports. >> >> I don't want you to add new case. >> >>> >>>> >>>> I don't see? where _support_large_access_byte_array_virtualization >>>> is checked. If it is only in Graal then it should be guarded by #if. >>> >>> I'll add the requested ifdefs. >> >> Good. >> >> Thanks, >> Vladimir >> >>> >>> tom >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>>> >>>>> >>>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>>> Hi Tom, >>>>>> >>>>>> I looked on testing results and one test fails consistently: >>>>>> >>>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>>> >>>>> >>>>> >>>>> Sorry that was an old mach5 run and I forgot to update with the new >>>>> one. ?There are some failures but they seem unrelated to me. >>>>> >>>>> tom >>>>> >>>>>> >>>>>> >>>>>> Vladimir K >>>>>> >>>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>>> >>>>>>> This adds support for deoptimizing with non-byte primitive values >>>>>>> stored on top of a byte array, similarly to the way that a double >>>>>>> or long can be stored on top of 2 int fields.? 
More detail is >>>>>>> provided in the bug report and new unit tests exercise the >>>>>>> deoptimization. mach5 testing is in progress. >>>>>>> >>>>>>> tom From vladimir.kozlov at oracle.com Thu Apr 16 00:40:19 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 17:40:19 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> <73936b07-976d-52aa-6427-339878a571b0@oracle.com> <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> Message-ID: <6e075a13-cda5-9d2d-5d96-5b2c7c2c7cdd@oracle.com> Good. Thanks, Vladimir On 4/15/20 5:34 PM, Tom Rodriguez wrote: > I've updated the webrev in place with the new ifdefs in deoptimization.cpp.? The mach5 run was clean apart from known > failures. > > tom > > Vladimir Kozlov wrote on 4/15/20 11:12 AM: >> After discussion with Tom offline I agree to keep his SPARC code because we would need to backport this later into 11u. >> >> Thanks, >> Vladimir >> >> On 4/14/20 2:07 PM, Vladimir Kozlov wrote: >>> On 4/14/20 1:44 PM, Tom Rodriguez wrote: >>>> >>>> >>>> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>>>> I think new code in deoptimize.cpp should be JVMCI specific. >>>>> >>>>> I filed 8242150 for serviceability tests failures in testing. It seems caused by recent changes. >>>>> >>>>> It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: >>>>> >>>>> #ifdef _LP64 >>>>> ???????? jlong res = (jlong) *((jlong *) &val); >>>>> #else >>>>> #ifdef SPARC >>>>> ?????? // For SPARC we have to swap high and low words. 
>>>>> >>>>> We don't support such configuration for eons. >>>> >>>> Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like >>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. Should remove those >>>> and the logic in my new code? output.cpp appears to have a case as well. >>> >>> No, we will remove them soon for JEP: 381: Remove the Solaris and SPARC Ports. >>> >>> I don't want you to add new case. >>> >>>> >>>>> >>>>> I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it >>>>> should be guarded by #if. >>>> >>>> I'll add the requested ifdefs. >>> >>> Good. >>> >>> Thanks, >>> Vladimir >>> >>>> >>>> tom >>>> >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>>>> >>>>>> >>>>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>>>> Hi Tom, >>>>>>> >>>>>>> I looked on testing results and one test fails consistently: >>>>>>> >>>>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>>>> >>>>>> >>>>>> >>>>>> Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem >>>>>> unrelated to me. >>>>>> >>>>>> tom >>>>>> >>>>>>> >>>>>>> >>>>>>> Vladimir K >>>>>>> >>>>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>>>> >>>>>>>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to >>>>>>>> the way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report >>>>>>>> and new unit tests exercise the deoptimization. mach5 testing is in progress. 
>>>>>>>> >>>>>>>> tom From cjashfor at linux.ibm.com Thu Apr 16 01:34:46 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Wed, 15 Apr 2020 18:34:46 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Hello Martin, I'm having some trouble with my email server, so I'm having to reply to your earlier post, but I saw your most recent post on the mailing list archive. Thanks for reviewing and testing this patch. I went to look at the copyright dates, and see two date ranges: one for Oracle and its affiliates, and another for SAP. In the files I looked at, the end date wasn't the same between the two. Which one (or both) should I modify? Thanks, - Corey On 4/14/20 6:26 AM, Doerr, Martin wrote: > Hi Corey, > > thanks for contributing it. Looks good to me. I'll run it through our > testing and let you know about the results. > > Best regards, > > Martin > > *From:*ppc-aix-port-dev *On > Behalf Of *Michihiro Horie > *Sent:* Freitag, 10. April 2020 10:48 > *To:* cjashfor at linux.ibm.com > *Cc:* hotspot-compiler-dev at openjdk.java.net; > ppc-aix-port-dev at openjdk.java.net > *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Hi Corey, > > Thank you for sharing your benchmarks. I confirmed your change reduced > the elapsed time of the benchmarks by more than 30% on my P9 node. Also, > I checked JTREG results, which look no problem. > > BTW, I cannot find further points of improvement in your change. 
> > Best regards, > Michihiro > > > ----- Original message ----- > From: "Corey Ashford" > > To: Michihiro Horie/Japan/IBM at IBMJP > Cc: hotspot-compiler-dev at openjdk.java.net > , > ppc-aix-port-dev at openjdk.java.net > , "Gustavo Romero" > > > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > Date: Fri, Apr 3, 2020 8:07 AM > > On 4/2/20 7:27 AM, Michihiro Horie wrote: >> Hi Corey, >> >> I'm not a reviewer, but I can run your benchmark in my local P9 node if >> you share it. >> >> Best regards, >> Michihiro > > The tests are somewhat hokey; I added the shifts to keep the compiler > from hoisting the code that it could predetermine the result. > > Here's the one for Long.reverseBytes(): > > import java.lang.*; > > class ReverseLong > { >     public static void main(String args[]) >     { >         long reversed, re_reversed; > long accum = 0; > long orig = 0x1122334455667788L; > long start = System.currentTimeMillis(); > for (int i = 0; i < 1_000_000_000; i++) { > // Try to keep java from figuring out stuff in advance > reversed = Long.reverseBytes(orig); > re_reversed = Long.reverseBytes(reversed); > if (re_reversed != orig) { >         System.out.println("Orig: " + String.format("%16x", orig) + > " Re-reversed: " + String.format("%16x", re_reversed)); > } > accum += orig; > orig = Long.rotateRight(orig, 3); > } > System.out.println("Elapsed time: " + > Long.toString(System.currentTimeMillis() - start)); > System.out.println("accum: " + Long.toString(accum)); >     } > } > > > And the one for Integer.reverseBytes(): > > import java.lang.*; > > class ReverseInt > { >     public static void main(String args[]) >     { >         
int reversed, re_reversed; > int orig = 0x11223344; > int accum = 0; > long start = System.currentTimeMillis(); > for (int i = 0; i < 1_000_000_000; i++) { > // Try to keep java from figuring out stuff in advance > reversed = Integer.reverseBytes(orig); > re_reversed = Integer.reverseBytes(reversed); > if (re_reversed != orig) { >         System.out.println("Orig: " + String.format("%08x", orig) + > " Re-reversed: " + String.format("%08x", re_reversed)); > } > accum += orig; > orig = Integer.rotateRight(orig, 3); > } > System.out.println("Elapsed time: " + > Long.toString(System.currentTimeMillis() - start)); > System.out.println("accum: " + Integer.toString(accum)); >     } > } > From eric.c.liu at arm.com Thu Apr 16 04:13:32 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 16 Apr 2020 04:13:32 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, Thanks for your review. > One comment: > >   (i >> 31) >>> 31 ==> i >>> 31 > > The shift count value is irrelevant here, isn't it? > > So, the transformation can be generalized to: > >   (i >> n) >>> 31 ==> i >>> 31 Yes. This match rule exactly could be more general. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. JMH: A simple JMH case [1] on AArch64 and AMD64 machines. For AArch64, one platform has no obvious improvement, but on others the performance gain is 7.3%~32.7%. For AMD64, one platform has no obvious improvement, but on others the performance gain is 13.7%~32.4%. A simple test case [2] has checked the correctness for some corner cases. 
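[Editorial aside: the identity discussed in this thread can be sanity-checked with a small standalone Java program, shown below. This is illustrative only and not part of the webrev. For any shift count n in 0..31, `(i >> n) >>> 31` and `i >>> 31` produce the same value, because the arithmetic right shift preserves the sign bit that the final unsigned shift extracts.]

```java
// Standalone check of the transformation (i >> n) >>> 31 ==> i >>> 31.
// Arithmetic shift keeps the sign bit, so the trailing unsigned shift by 31
// extracts the same bit for every n in [0, 31].
public class SignExtractCheck {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, -42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int i : samples) {
            for (int n = 0; n < 32; n++) {
                int before = (i >> n) >>> 31; // shape matched by the rule
                int after  = i >>> 31;        // simplified shape
                if (before != after) {
                    throw new AssertionError("mismatch for i=" + i + ", n=" + n);
                }
            }
        }
        System.out.println("ok");
    }
}
```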
[1] https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java [2] https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest.java Thanks, Eric From martin.doerr at sap.com Thu Apr 16 08:08:24 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Thu, 16 Apr 2020 08:08:24 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: Hi Corey, please use 2020 for both, the Oracle and the SAP copyright. Usually, both should be the same, but some people forget to update one of them. Best regards, Martin > -----Original Message----- > From: Corey Ashford > Sent: Donnerstag, 16. April 2020 03:35 > To: Doerr, Martin > Cc: Michihiro Horie ; hotspot-compiler- > dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Hello Martin, > > I'm having some trouble with my email server, so I'm having to reply to > your earlier post, but I saw your most recent post on the mailing list > archive. > > Thanks for reviewing and testing this patch. I went to look at the > copyright dates, and see two date ranges: one for Oracle and its > affiliates, and another for SAP. In the files I looked at, the end date > wasn't the same between the two. Which one (or both) should I modify? > > Thanks, > > - Corey > > On 4/14/20 6:26 AM, Doerr, Martin wrote: > > Hi Corey, > > > > thanks for contributing it. Looks good to me. I'll run it through our > > testing and let you know about the results. > > > > Best regards, > > > > Martin > > > > *From:*ppc-aix-port-dev > *On > > Behalf Of *Michihiro Horie > > *Sent:* Freitag, 10. 
April 2020 10:48 > > *To:* cjashfor at linux.ibm.com > > *Cc:* hotspot-compiler-dev at openjdk.java.net; > > ppc-aix-port-dev at openjdk.java.net > > *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of > > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > > > Hi Corey, > > > > Thank you for sharing your benchmarks. I confirmed your change reduced > > the elapsed time of the benchmarks by more than 30% on my P9 node. > Also, > > I checked JTREG results, which look no problem. > > > > BTW, I cannot find further points of improvement in your change. > > > > Best regards, > > Michihiro > > > > > > ----- Original message ----- > > From: "Corey Ashford" > > > > To: Michihiro Horie/Japan/IBM at IBMJP > > Cc: hotspot-compiler-dev at openjdk.java.net > > , > > ppc-aix-port-dev at openjdk.java.net > > , "Gustavo Romero" > > > > > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Date: Fri, Apr 3, 2020 8:07 AM > > > > On 4/2/20 7:27 AM, Michihiro Horie wrote: > >> Hi Corey, > >> > >> I'm not a reviewer, but I can run your benchmark in my local P9 node if > >> you share it. > >> > >> Best regards, > >> Michihiro > > > > The tests are somewhat hokey; I added the shifts to keep the compiler > > from hoisting the code that it could predetermine the result. > > > > Here's the one for Long.reverseBytes(): > > > > import java.lang.*; > > > > class ReverseLong > > { > >     public static void main(String args[]) > >     { > >         long reversed, re_reversed; > > long accum = 0; > > long orig = 0x1122334455667788L; > > long start = System.currentTimeMillis(); > > for (int i = 0; i < 1_000_000_000; i++) { > > // Try to keep java from figuring out stuff in advance > > reversed = Long.reverseBytes(orig); > > re_reversed = Long.reverseBytes(reversed); > > if (re_reversed != orig) { > >         
System.out.println("Orig: " + String.format("%16x", orig) + > > " Re-reversed: " + String.format("%16x", re_reversed)); > > } > > accum += orig; > > orig = Long.rotateRight(orig, 3); > > } > > System.out.println("Elapsed time: " + > > Long.toString(System.currentTimeMillis() - start)); > > System.out.println("accum: " + Long.toString(accum)); > >     } > > } > > > > > > And the one for Integer.reverseBytes(): > > > > import java.lang.*; > > > > class ReverseInt > > { > >     public static void main(String args[]) > >     { > >         int reversed, re_reversed; > > int orig = 0x11223344; > > int accum = 0; > > long start = System.currentTimeMillis(); > > for (int i = 0; i < 1_000_000_000; i++) { > > // Try to keep java from figuring out stuff in advance > > reversed = Integer.reverseBytes(orig); > > re_reversed = Integer.reverseBytes(reversed); > > if (re_reversed != orig) { > >         System.out.println("Orig: " + String.format("%08x", orig) + > > " Re-reversed: " + String.format("%08x", re_reversed)); > > } > > accum += orig; > > orig = Integer.rotateRight(orig, 3); > > } > > System.out.println("Elapsed time: " + > > Long.toString(System.currentTimeMillis() - start)); > > System.out.println("accum: " + Integer.toString(accum)); > >     } > > } > > From Yang.Zhang at arm.com Thu Apr 16 08:58:15 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 16 Apr 2020 08:58:15 +0000 Subject: RFR(XS): 8242796: Fix client build failure Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR compiler phase/inlining events. C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. With this patch, x86 client build succeeds. But AArch64 client build still fails, which is caused by [1]. 
I have filed [2] for AArch64 client build failure and will submit another patch for that. [1] https://bugs.openjdk.java.net/browse/JDK-8241665 [2] https://bugs.openjdk.java.net/browse/JDK-8242905 Regards Yang From richard.reingruber at sap.com Thu Apr 16 09:57:22 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Thu, 16 Apr 2020 09:57:22 +0000 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java Message-ID: Hi, please review this trivial patch that adds a comma to the copyright header of the test ContinuousCallSiteTargetChange.java Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 The test still succeeds with the patch. The license check fails without and succeeds with the patch. sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java Thanks, Richard. From vladimir.x.ivanov at oracle.com Thu Apr 16 10:08:59 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:08:59 +0300 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> Message-ID: Thanks for the reviews, Vladimir, Sandhya, and Jatin. Best regards, Vladimir Ivanov On 16.04.2020 02:29, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir K > > On 4/10/20 7:07 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242491 >> >> Asserts on input types for MacroLogicV are too strong. >> SuperWord pass can mix vectors of distinct subword types (byte and >> boolean or short and char). 
>> >> Though it's possible to explicitly check for such particular cases, >> the fix relaxes the assert even more and only verifies that inputs are >> of the same size (in bytes), so bitwise reinterpretation of vector >> values is safe. >> >> Testing: hs-precheckin-comp,hs-tier1,hs-tier2 >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu Apr 16 10:09:56 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:09:56 +0300 Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg() In-Reply-To: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> References: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> Message-ID: Thanks for the review, Vladimir. Best regards, Vladimir Ivanov On 16.04.2020 02:33, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir K > > On 4/10/20 7:25 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242492 >> >> Matcher::vector_shift_count_ideal_reg() was introduced specifically >> for x86 to communicate that only low 32 bits are used by vector shift >> instructions, so only those bits should be spilled when needed. >> >> Unfortunately, it is broken for LShiftCntV/RShiftCntV: Matcher doesn't >> capture overridden ideal_reg value and spills use bottom type instead. >> So, it causes a mismatch during RA. >> >> Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. >> Considering how simple AD instructions for LShiftCntV/RShiftCntV are, >> RA prefers to rematerialize the value instead (which is a reg-to-reg >> move). >> >> I propose to simplify the implementation and completely remove >> Matcher::vector_shift_count_ideal_reg() along with additional special >> handling logic for LShiftCntV/RShiftCntV. >> >> Testing: hs-precheckin-comp, hs-tier1, hs-tier2 >> >> Thanks! 
>> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu Apr 16 10:28:38 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:28:38 +0300 Subject: Master Thesis Research Advice. JIT In-Reply-To: References: Message-ID: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> Hi Vladyslav, C2 has a number of aggressive optimizations which heavily rely on profiling data. It leads to numerous uncommon traps in the generated code. You can disable some of such optimizations, but there's no way to completely eliminate uncommon traps in the generated code: they are a core piece of the design. Have you tried switching to C1 instead? C1 doesn't rely on profiling data that much and use code patching techniques in place of uncommon traps. So, the generated code usually has complete coverage of the compiled method. Best regards, Vladimir Ivanov On 16.04.2020 01:05, Vladyslav Tkachuk wrote: > Hello, > > I am a Master's student at the University of Passau, Germany. > My master thesis research is concerned with detecting equivalent mutants in > Java. > The main research question is to use the Trivial Compiler Equivalency > technique. This means that we acquire Assembly code produced by Java JIT > compiler for initial and mutated source and then compare them. > > I have previously contacted Tobias Hartmann, who advised me to write here > regarding technical questions. I would like to ask you if there is any > solution to a problem I have. > > Last time Tobias recommended me to use Opto-Assembly to achieve my purpose. > It was a good hint and it helped me to get more precise data. > However, after doing some research I noticed that in some cases C2 compiler > unloaded the method code which I expected to find in assembly. As I found > out this was a part of deoptimization and the method code was meant to be > executed by the interpreter. 
> Here is an example of what I mean: > > {method} > - this oop: 0x000000000d2319c8 > - method holder: 'Rational' > - constants: 0x000000000d230cf8 constant pool [85] > {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8 > - access: 0x81000001 public > - name: 'toString' > - signature: '()Ljava/lang/String;' > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > some setup code > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > 02c movq RBP, RDX # spill > 02f movl RDX, #11 # int > nop # 3 bytes pad for loops and calls > *037 call,static wrapper for: uncommon_trap(reason='unloaded' > action='reinterpret' index='11')* > * # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ > L[5]=_ L[6]=_ L[7]=_* > * # OopMap{rbp=Oop off=60}* > 03c int3 # ShouldNotReachHere > 03c > > > This is a 'toString' method and as I could see and understand, there is no > actual method code, but only a call to it. > > I would like to know if it is possible to completely disable any > deoptimizations and consistently receive the full asm code? I consent that > it is not practical and hurts performance, but it is not a goal in this > scope. According to my observations, in most cases the method code is full, > but strangely here it did not work. I have tried to google any useful info, > unfortunately, I did not see anything helpful, despite the explanations > about what deoptimization is and its types. > > I would be grateful if you could shed some light on the issue. > Thanks in advance for any useful information. > > Best regards, > Vladyslav Tkachuk > From vladimir.x.ivanov at oracle.com Thu Apr 16 12:28:46 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 15:28:46 +0300 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: > Webrev:?http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ Looks good. Have you tested it through submit repo? 
Best regards, Vladimir Ivanov > [Tests] > Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. > No new failure found. > > JMH: A simple JMH case [1] on AArch64 and AMD64 machines. > > For AArch64, one platform has no obvious improvement, but on others the > performance gain is 7.3%~32.7%. > > For AMD64, one platform has no obvious improvement, but on others the > performance gain is 13.7%~32.4%. > > A simple test case [2] has checked the correctness for some corner > cases. > > [1] https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java > [2] https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest.java > > > Thanks, > Eric > From vladimir.x.ivanov at oracle.com Thu Apr 16 12:32:52 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 15:32:52 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: Message-ID: <25a564a1-7f40-6988-060f-86b06e02ad21@oracle.com> Hi, Any more reviews, please? Especially, compiler and runtime-related changes. Thanks in advance! Best regards, Vladimir Ivanov On 04.04.2020 02:12, Vladimir Ivanov wrote: > Hi, > > Following up on review requests of API [0] and Java implementation [1] > for Vector API (JEP 338 [2]), here's a request for review of general > HotSpot changes (in shared code) required for supporting the API: > > > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ > > > (First of all, to set proper expectations: since the JEP is still in > Candidate state, the intention is to initiate preliminary round(s) of > review to inform the community and gather feedback before sending out > final/official RFRs once the JEP is Targeted to a release.) > > Vector API (being developed in Project Panama [3]) relies on JVM support > to utilize optimal vector hardware instructions at runtime. 
It interacts > with JVM through intrinsics (declared in > jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations > support in C2 JIT-compiler. > > As Paul wrote earlier: "A vector intrinsic is an internal low-level > vector operation. The last argument to the intrinsic is fall back > behavior in Java, implementing the scalar operation over the number of > elements held by the vector. Thus, if the intrinsic is not supported in > C2 for the other arguments then the Java implementation is executed (the > Java implementation is always executed when running in the interpreter > or for C1)." > > The rest of JVM support is about aggressively optimizing vector boxes to > minimize (ideally eliminate) the overhead of boxing for vector values. > It's a stop-gap solution for the vector box elimination problem until > inline classes arrive. Vector classes are value-based and in the longer > term will be migrated to inline classes once the support becomes available. > > Vector API talk from JVMLS'18 [5] contains brief overview of JVM > implementation and some details. > > Complete implementation resides in vector-unstable branch of panama/dev > repository [6]. > > Now to gory details (the patch is split in multiple "sub-webrevs"): > > =========================================================== > > (1) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ > > > Ideal vector nodes for new operations introduced by Vector API. > > (Platform-specific back end support will be posted for review separately). > > =========================================================== > > (2) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ > > > JVM Java interface (VectorSupport) and intrinsic support in C2. > > Vector instances are initially represented as VectorBox macro nodes and > "unboxing" is represented by VectorUnbox node. 
It simplifies vector box > elimination analysis and the nodes are expanded later right before EA pass. > > Vectors have 2-level on-heap representation: for the vector value > primitive array is used as a backing storage and it is encapsulated in a > typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a int[8] > instance which is used to store vector value). > > Unless VectorBox node goes away, it needs to be expanded into an > allocation eventually, but it is a pure node and doesn't have any JVM > state associated with it. The problem is solved by keeping JVM state > separately in a VectorBoxAllocate node associated with VectorBox node > and use it during expansion. > > Also, to simplify vector box elimination, inlining of vector reboxing > calls (VectorSupport::maybeRebox) is delayed until the analysis is over. > > =========================================================== > > (3) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ > > > Vector box elimination analysis implementation. (Brief overview: slides > #36-42 [5].) > > The main part is devoted to scalarization across safepoints and > rematerialization support during deoptimization. In C2-generated code > vector operations work with raw vector values which live in registers or > spilled on the stack and it allows to avoid boxing/unboxing when a > vector value is alive across a safepoint. As with other values, there's > just a location of the vector value at the safepoint and vector type > information recorded in the relevant nmethod metadata and all the > heavy-lifting happens only when rematerialization takes place. > > The analysis preserves object identity invariants except during > aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). > > (Aggressive reboxing is crucial for cases when vectors "escape": it > allocates a fresh instance at every escape point thus enabling original > instance to go away.) 
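[Editorial aside: the "intrinsic with a Java fallback" pattern described earlier in this message can be sketched in plain Java. The sketch below uses illustrative names, not the actual jdk.internal.vm.vector.VectorSupport signatures (those are in the webrev [4]); it shows only the shape: the last argument is the scalar fallback, which is what the interpreter and C1 execute when C2 does not intrinsify the call.]

```java
import java.util.function.IntBinaryOperator;

// Schematic of an intrinsic candidate whose last argument is the Java fallback.
// Names and signatures are illustrative; they are not the real VectorSupport API.
public class FallbackSketch {
    // C2 would replace calls to this method with vector IR when it supports
    // the (opcode, operand) combination; otherwise the fallback below runs.
    static int binaryOp(int opcode, int a, int b, IntBinaryOperator defaultImpl) {
        return defaultImpl.applyAsInt(a, b); // scalar fallback path
    }

    public static void main(String[] args) {
        // Hypothetical ADD opcode = 0; the lambda is the scalar implementation.
        int sum = binaryOp(0, 2, 3, (x, y) -> x + y);
        System.out.println(sum); // prints 5
    }
}
```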
> > =========================================================== > > (4) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ > > > HotSpot changes for jdk.incubator.vector module. Vector support is > marked experimental and turned off by default. JEP 338 proposes the API > to be released as an incubator module, so a user has to specify > "--add-modules jdk.incubator.vector" on the command line to be able to > use it. > When user does that, JVM automatically enables Vector API support. > It improves usability (user doesn't need to separately "open" the API > and enable JVM support) while minimizing risks of destabilization from > new code when the API is not used. > > > That's it! Will be happy to answer any questions. > > And thanks in advance for any feedback! > > Best regards, > Vladimir Ivanov > > [0] > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html > > > [1] > https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html > > [2] https://openjdk.java.net/jeps/338 > > [3] https://openjdk.java.net/projects/panama/ > > [4] > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html > > > [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf > > [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 > >     $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From jamsheed.c.m at oracle.com Thu Apr 16 13:12:49 2020 From: jamsheed.c.m at oracle.com (Jamsheed C M) Date: Thu, 16 Apr 2020 18:42:49 +0530 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" Message-ID: Hi all, As part of the enhancement requirement from truffle use case [1] OopMapValue was extended by 2 bits, this change will be automatically handled in c1 here [2]. 
There was a day one code[3] that handled this case before [2] covering more cases than Oop cases. But it seems this extension is not really useful for C1 java use case. So the earlier bailout is preserved with change in the comments. [4] Request for review JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ Best regards, Jamsheed [1] https://bugs.openjdk.java.net/browse/JDK-8231586 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 [4] http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html From vladimir.x.ivanov at oracle.com Thu Apr 16 13:29:55 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 16:29:55 +0300 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" In-Reply-To: References: Message-ID: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> Looks good and trivial. Best regards, Vladimir Ivanov On 16.04.2020 16:12, Jamsheed C M wrote: > Hi all, > > As part of the enhancement requirement from truffle use case [1] > OopMapValue was extended by 2 bits,? this change will be automatically > handled in c1 here [2]. > > There was a day one code[3] that handled this case before [2] covering > more cases than Oop cases. But it seems this extension is not really > useful for C1 java use case. > > So the earlier bailout is preserved with change in the comments. 
[4] > > Request for review > > JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 > > webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ > > Best regards, > > Jamsheed > > [1] https://bugs.openjdk.java.net/browse/JDK-8231586 > > [2] > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 > > > [3] > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 > > > [4] > http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html > > From jamsheed.c.m at oracle.com Thu Apr 16 13:52:51 2020 From: jamsheed.c.m at oracle.com (Jamsheed C M) Date: Thu, 16 Apr 2020 19:22:51 +0530 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" In-Reply-To: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> References: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> Message-ID: <25345e8a-0c14-95e4-91af-41427a408f85@oracle.com> Hi Vladimir Ivanov, Thank you for the review Best regards, Jamsheed On 16/04/2020 18:59, Vladimir Ivanov wrote: > Looks good and trivial. > > Best regards, > Vladimir Ivanov > > On 16.04.2020 16:12, Jamsheed C M wrote: >> Hi all, >> >> As part of the enhancement requirement from truffle use case [1] >> OopMapValue was extended by 2 bits,? this change will be >> automatically handled in c1 here [2]. >> >> There was a day one code[3] that handled this case before [2] >> covering more cases than Oop cases. But it seems this extension is >> not really useful for C1 java use case. >> >> So the earlier bailout is preserved with change in the comments. 
[4] >> >> Request for review >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 >> >> webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ >> >> Best regards, >> >> Jamsheed >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8231586 >> >> [2] >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 >> >> >> [3] >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 >> >> >> [4] >> http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html >> >> From vladimir.kozlov at oracle.com Thu Apr 16 21:27:10 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 16 Apr 2020 14:27:10 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. I think you need to put whole method under checks: #if INCLUDE_JFR && COMPILER2_OR_JVMCI // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. Thanks, Vladimir On 4/16/20 1:58 AM, Yang Zhang wrote: > Hi, > > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 > Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ > > This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR > compiler phase/inlining events. > C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. > > With this patch, x86 client build succeeds. But AArch64 client build > still fails, which is caused by [1]. I have filed [2] for AArch64 > client build failure and will summit another patch for that. 
> > [1] https://bugs.openjdk.java.net/browse/JDK-8241665 > [2] https://bugs.openjdk.java.net/browse/JDK-8242905 > > Regards > Yang > From vladimir.kozlov at oracle.com Thu Apr 16 21:28:26 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 16 Apr 2020 14:28:26 -0700 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java In-Reply-To: References: Message-ID: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> Good and trivial. Thanks, Vladimir K On 4/16/20 2:57 AM, Reingruber, Richard wrote: > Hi, > > please review this trivial patch that adds a comma to the copyright header of the test > ContinuousCallSiteTargetChange.java > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 > > The test still succeeds with the patch. The license check fails without and succeeds with the patch. > > sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java > > Thanks, > Richard. > From Yang.Zhang at arm.com Fri Apr 17 06:34:20 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 06:34:20 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ This patch is a followup patch of previous discussion. https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/008740.html To make the intent clear, the scalar parameter name is changed to isrc, fsrc or dsrc based on its data type. The vector parameter name is changed to vsrc. And so does temp register. 
Testing: tier1 Regards Yang From eric.c.liu at arm.com Fri Apr 17 06:39:53 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Fri, 17 Apr 2020 06:39:53 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, Thanks for your review. Ningsheng will help me to submit it. Thanks, Eric From xxinliu at amazon.com Fri Apr 17 06:58:35 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 17 Apr 2020 06:58:35 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag Message-ID: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Hi, Corey and Vladimir, I recently went through vmSymbols.hpp/cpp. I think I understand your comments. Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. Even though I feel I understand the intrinsics mechanism of HotSpot better, I still need a clarification of JDK-8151779. There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). If there isn't any option, they are all available for the compilers. That makes sense because intrinsics are always beneficial. But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. Currently, the JDK provides developers 2 ways to control intrinsics. 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. 2. DisableIntrinsic="a,b,c" By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. But even putting the above 2 approaches together, we still can't precisely control an arbitrary intrinsic. If we want to enable an intrinsic which is under control of InlineMathNatives but keep the others disabled, it's impossible now. [please correct me if I am wrong here]. I think that's the motivation JDK-8151779 tried to address.
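To make the comparison concrete, a comma-separated enable/disable list of vmIntrinsics::IDs could be parsed roughly as follows. This is a toy sketch only: ControlLists, parse_control_list and is_intrinsic_enabled are invented names for illustration, not HotSpot code. The one rule taken from this thread is that a disable entry prevails when an id appears on both sides.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Parsed form of a list such as "+_dabs,-_fabs,-_getClass".
struct ControlLists {
  std::set<std::string> enabled;
  std::set<std::string> disabled;
};

ControlLists parse_control_list(const std::string& value) {
  ControlLists lists;
  std::istringstream in(value);
  std::string item;
  while (std::getline(in, item, ',')) {
    if (item.empty()) continue;
    if (item[0] == '-') {
      lists.disabled.insert(item.substr(1));
    } else if (item[0] == '+') {
      lists.enabled.insert(item.substr(1));
    } else {
      lists.enabled.insert(item);  // no prefix means enable
    }
  }
  return lists;
}

// The disable list prevails if an id is present in both lists.
bool is_intrinsic_enabled(const ControlLists& lists, const std::string& id,
                          bool enabled_by_default) {
  if (lists.disabled.count(id)) return false;
  if (lists.enabled.count(id)) return true;
  return enabled_by_default;
}
```

With this shape, one flag can both enable an intrinsic that a coarse-grained option left off and disable a buggy one, which is exactly the gap described above.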
If we provide a new option EnableIntrinsic and give it the lowest priority, then we can precisely control any intrinsic. Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. What do you think? Thanks, --lx On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. On 4/13/20 10:33 AM, Liu, Xin wrote: > Hi, compiler developers, > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > My change provide 2 new features: > 1) a shorthand to enable/disable intrinsics. > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > If the tailing symbol is missing, it means enable. > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > 2) provide a set of macro to declare intrinsic options > Developers declare once in intrinsics.hpp and macros will take care all other places. > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal.
> Great idea, though to be consistent with the original syntax, I think the +/- should be in front of the name: -XX:UseIntrinsics=-AESCTR,+CRC32C,... > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? Some (many?) intrinsic options turn on more than one .ad instruct instrinsic, or library instrinsics at the same time. I think that's why the plural is there. Also, consistently adding the plural allows you to add more capabilities to a flag that initially only had one intrinsic without changing the plurality (and thus backward compatibility). Regards, - Corey From Yang.Zhang at arm.com Fri Apr 17 08:37:16 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 08:37:16 +0000 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Vladimir I update the patch according to your comment. http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ These checks are needed. #if INCLUDE_JFR && COMPILER2_OR_JVMCI #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. 
Regards Yang -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov Sent: Friday, April 17, 2020 5:27 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(XS): 8242796: Fix client build failure Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. I think you need to put whole method under checks: #if INCLUDE_JFR && COMPILER2_OR_JVMCI // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. Thanks, Vladimir On 4/16/20 1:58 AM, Yang Zhang wrote: > Hi, > > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 > Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ > > This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR > compiler phase/inlining events. > C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. > > With this patch, x86 client build succeeds. But AArch64 client build > still fails, which is caused by [1]. I have filed [2] for AArch64 > client build failure and will summit another patch for that. > > [1] https://bugs.openjdk.java.net/browse/JDK-8241665 > [2] https://bugs.openjdk.java.net/browse/JDK-8242905 > > Regards > Yang > From aph at redhat.com Fri Apr 17 08:42:10 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 17 Apr 2020 09:42:10 +0100 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: On 4/17/20 7:34 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 > Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ > > This patch is a followup patch of previous discussion. 
> https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/008740.html > > To make the intent clear, the scalar parameter name is changed to isrc, fsrc or dsrc based on > its data type. The vector parameter name is changed to vsrc. And so does temp register. Thanks, that's much nicer. I haven't been able to check every substitution, though. I'm not quite sure about how to do that. Is all this stuff covered by our test cases? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Fri Apr 17 09:13:11 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 09:13:11 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: Hi Andrew Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. In this directory, there are also some test cases about reduction operations, which is added in [1]. https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtreg/compiler/loopopts/superword [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Regards Yang -----Original Message----- From: Andrew Haley Sent: Friday, April 17, 2020 4:42 PM To: Yang Zhang ; aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear On 4/17/20 7:34 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 > Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ > > This patch is a followup patch of previous discussion. > https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/00 > 8740.html > > To make the intent clear, the scalar parameter name is changed to > isrc, fsrc or dsrc based on its data type. 
The vector parameter name is changed to vsrc. And so does temp register. Thanks, that's much nicer. I haven't been able to check every substitution, though. I'm not quite sure about how to do that. Is all this stuff covered by our test cases? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Fri Apr 17 09:14:24 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 09:14:24 +0000 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: Hi Andrew Ping it again. Could you please help to review this? Regards Yang -----Original Message----- From: aarch64-port-dev On Behalf Of Yang Zhang Sent: Friday, April 10, 2020 10:53 AM To: aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242070 Webrev: http://cr.openjdk.java.net/~yzhang/8242070/webrev.00/ In JDK-8238690, it unified IR shape for vector shifts by scalar and always used ShiftV src (ShiftCntV shift) When shift is scalar, the following IR nodes are generated. scalar_shift | src ShiftCntV | / | / ShiftV But when implementing this on AArch64, there is an issue in match rule of vector shift right with imm shift for short type. match(Set dst (RShiftVS src (LShiftCntV shift))); LShiftCntV should be RShiftCntV here. 
Test case:

    public static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short)(a[i] >> 2);
        }
    }

IR nodes:

       imm:2
         |        LoadVector
    RShiftCntV        |
         |           /
         |          /
     RShiftVS

C2 assembly generated:

Before:

    0x0000ffffac563764: orr w11, wzr, #0x2
    0x0000ffffac563768: dup v16.16b, w11              -------- vshiftcnt16B
    0x0000ffffac5637a8: ldr q24, [x18, #16]
    0x0000ffffac5637ac: neg v25.16b, v16.16b          ------
    0x0000ffffac5637b0: sshl v24.8h, v24.8h, v25.8h   ------ vsra8S
    0x0000ffffac5637b8: str q24, [x14, #16]

"match(Set dst (RShiftVS src (LShiftCntV shift)));" matching fails. RShiftCntV and RShiftVS are matched separately by vshiftcnt16B and vsra8S.

After:

    0x0000ffffac563808: ldr q16, [x15, #16]
    0x0000ffffac56380c: sshr v16.8h, v16.8h, #2
    0x0000ffffac563814: str q16, [x14, #16]

"match(Set dst (RShiftVS src (RShiftCntV shift)));" matching succeeds.

Performance: the JMH test case is attached in JBS.

    Before:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  66.964 ± 0.052  us/op

    After:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  56.156 ± 0.053  us/op

Testing: tier1 Pass and no new failure. Regards Yang From richard.reingruber at sap.com Fri Apr 17 14:55:01 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 17 Apr 2020 14:55:01 +0000 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java In-Reply-To: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> References: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> Message-ID: Thank you, Vladimir. Richard. -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov Sent: Donnerstag, 16. April 2020 23:28 To: hotspot-compiler-dev at openjdk.java.net Subject: Re: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java Good and trivial.
Thanks, Vladimir K On 4/16/20 2:57 AM, Reingruber, Richard wrote: > Hi, > > please review this trivial patch that adds a comma to the copyright header of the test > ContinuousCallSiteTargetChange.java > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 > > The test still succeeds with the patch. The license check fails without and succeeds with the patch. > > sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java > > Thanks, > Richard. > From rwestrel at redhat.com Fri Apr 17 15:51:13 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 17 Apr 2020 17:51:13 +0200 Subject: RFR(XS): 8242502: UnexpectedDeoptimizationTest.java failed "assert(phase->type(obj)->isa_oopptr()) failed: only for oop input" Message-ID: <878siu9klq.fsf@redhat.com> https://bugs.openjdk.java.net/browse/JDK-8242502 http://cr.openjdk.java.net/~roland/8242502/webrev.00/ I wasn't able to reproduce that failure (neither by running the test or with the replay file) but I suspect the assert fails because it encounters a unexpected top node. Roland. From vladimir.kozlov at oracle.com Fri Apr 17 19:07:17 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 12:07:17 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Yang On 4/17/20 1:37 AM, Yang Zhang wrote: > Hi Vladimir > > I update the patch according to your comment. > http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ > > These checks are needed. > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. Yes, I agree that additional #ifdef COMPILER2 is needed. 
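The nested guard being agreed on here can be illustrated with a small compilable sketch. The macro names below are toy stand-ins (the real ones are INCLUDE_JFR, COMPILER2_OR_JVMCI and COMPILER2), and the function body is invented; it only shows why the outer #if keeps the symbol out of a client build entirely while the inner #ifdef fences the C2-only step.

```cpp
#include <cassert>

// Toy stand-ins for the build-feature macros; a real build would get these
// values from the configured JVM features, not hard-coded defines.
#define TOY_INCLUDE_JFR 1
#define TOY_COMPILER2   1
#define TOY_JVMCI       0

#if TOY_INCLUDE_JFR && (TOY_COMPILER2 || TOY_JVMCI)
// The whole method exists only when JFR plus at least one of C2/JVMCI is
// built, so a C1-only (client) build never references it at all.
int register_phasetype_serializer() {
  int registrations = 1;   // base registration, shared by C2 and JVMCI
#if TOY_COMPILER2
  registrations += 1;      // extra step that would touch C2-only code
#endif // TOY_COMPILER2
  return registrations;
}
#endif // TOY_INCLUDE_JFR && (TOY_COMPILER2 || TOY_JVMCI)
```

Flipping TOY_COMPILER2 to 0 and TOY_JVMCI to 1 keeps the function but drops the C2-only step, which mirrors the --with-jvm-features=-compiler2 configuration discussed in this thread.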
The only comment I have is that you could maybe include the compiler_c2 check under that #ifdef, leaving the #endif at the same place:

    + #ifdef COMPILER2
      } else if (compiler_type == compiler_c2) {
        first_registration = false;
    + #endif // COMPILER2
      }

Thanks, Vladimir > > Regards > Yang > > -----Original Message----- > From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov > Sent: Friday, April 17, 2020 5:27 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. > I think you need to put whole method under checks: > > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. > > Thanks, > Vladimir > > On 4/16/20 1:58 AM, Yang Zhang wrote: >> Hi, >> >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >> >> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >> compiler phase/inlining events. >> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >> >> With this patch, x86 client build succeeds. But AArch64 client build >> still fails, which is caused by [1]. I have filed [2] for AArch64 >> client build failure and will summit another patch for that.
>> >> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >> >> Regards >> Yang >> From vladimir.kozlov at oracle.com Fri Apr 17 23:58:10 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 16:58:10 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement Message-ID: https://bugs.openjdk.java.net/browse/JDK-8242357 CHECK macros can't be used on a return statement - they expand to include code after the return [1] and so have no effect. Fix:

    src/hotspot/share/jvmci/jvmciEnv.hpp
    @@ -262,7 +262,8 @@
       char* as_utf8_string(JVMCIObject str, char* buf, int buflen);

       JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) {
    -    return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject()));
    +    JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject()));
    +    return s;
       }

I tried to find similar cases but it was the only one. Clang -Wunreachable-code-aggressive does not catch this case. Tested hs-tier1,hs-tier3-graal Thanks, Vladimir [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From xxinliu at amazon.com Sat Apr 18 00:36:43 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Sat, 18 Apr 2020 00:36:43 +0000 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: References: Message-ID: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> LGTM. I used to backport a similar change (exceptions.hpp) to jdk8u. I also used a regex to scan the whole source code; I think it's the only place in hotspot. Thanks, --lx On 4/17/20, 5:02 PM, "hotspot-compiler-dev on behalf of Vladimir Kozlov" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
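The pitfall behind the JVMCI_CHECK_ fix discussed in this thread, a CHECK-style macro expanding into dead code when used on a return statement, can be reproduced with a toy macro. Everything below is an invented imitation for illustration; it is not the real HotSpot/JVMCI TRAPS machinery.

```cpp
#include <cassert>

bool g_pending = false;  // stands in for the thread's pending-exception flag

// TOY_CHECK_(v) supplies the "thread" argument and appends an exception test
// after the enclosing statement, mimicking how CHECK-style macros expand.
#define TOY_CHECK_(v) g_pending); if (g_pending) return (v); (void)(0

int failing_call(bool /*thread*/) {
  g_pending = true;  // simulate the callee raising an exception
  return 42;
}

// Wrong: the expanded "if (g_pending) return (-1);" lands after the return
// statement, is unreachable, and the pending exception is never checked here.
int wrong_use() {
  return failing_call(TOY_CHECK_(-1));
}

// Right (the shape of the fix): bind the result first so the expanded check
// actually runs before this function returns.
int right_use() {
  int r = failing_call(TOY_CHECK_(-1));
  return r;
}
```

wrong_use() returns 42 and silently leaves the exception pending, while right_use() notices it and returns the default value, which is the same difference the create_string patch makes.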
https://bugs.openjdk.java.net/browse/JDK-8242357 CHECK macros can't be used on a return statement - they expand to include code after the return [2] and so have no affect. Fix: src/hotspot/share/jvmci/jvmciEnv.hpp @@ -262,7 +262,8 @@ char* as_utf8_string(JVMCIObject str, char* buf, int buflen); JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); + return s; } I tried to find similar cases but it was the only one. Clang -Wunreachable-code-aggressive does not catch this case. Tested hs-tier1,hs-tier3-graal Thanks, Vladimir [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From vladimir.kozlov at oracle.com Sat Apr 18 00:43:01 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 17:43:01 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> References: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> Message-ID: <58abe636-27d9-b027-b8b1-8f7ed862d7bc@oracle.com> Thank you, Xin Vladimir K On 4/17/20 5:36 PM, Liu, Xin wrote: > LGTM. I used to backport a similar change (exceptions.hpp) to jdk8u. > I also use regex to scan the whole source code, I think it?s the only place in hotspot. > > Thanks, > --lx > > ?On 4/17/20, 5:02 PM, "hotspot-compiler-dev on behalf of Vladimir Kozlov" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > https://bugs.openjdk.java.net/browse/JDK-8242357 > > CHECK macros can't be used on a return statement - they expand to include code after the return [2] and so have no affect. 
> > Fix: > > src/hotspot/share/jvmci/jvmciEnv.hpp > @@ -262,7 +262,8 @@ > char* as_utf8_string(JVMCIObject str, char* buf, int buflen); > > JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { > - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + return s; > } > > I tried to find similar cases but it was the only one. > Clang -Wunreachable-code-aggressive does not catch this case. > > Tested hs-tier1,hs-tier3-graal > > Thanks, > Vladimir > > [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 > From vladimir.kozlov at oracle.com Sat Apr 18 01:44:55 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 18:44:55 -0700 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Message-ID: I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace existing DisableIntrinsic. Note, we can start deprecating Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently go through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know intrinsics mechanism of hotspot better, I still need a clarification of JDK- 8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no any option, they are all available for compilers. That makes sense because intrinsics are always beneficial. 
> But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. It seems we are trying to put these 2 different ways into one flag, which may be a mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsic ids. Maybe we can have a 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases.
> If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > ?On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > > > My change provide 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > > If the tailing symbol is missing, it means enable. > > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > > > 2) provide a set of macro to declare intrinsic options > > Developers declare once in intrinsics.hpp and macros will take care all other places. > > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. 
> > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? > > Some (many?) intrinsic options turn on more than one .ad instruct > instrinsic, or library instrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). > > Regards, > > - Corey > > From xxinliu at amazon.com Sat Apr 18 02:19:11 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Sat, 18 Apr 2020 02:19:11 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Message-ID: <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> Hi, Vladimir, Thanks for the clarification. Oh, yes, it's theoretically possible, but it's tedious. I was wrong on that point. I think I got your point. ControlIntrinsics will make developer's life easier. I will implement it. Thanks, --lx On 4/17/20, 6:46 PM, "Vladimir Kozlov" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace existing DisableIntrinsic. Note, we can start deprecating Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently go through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know intrinsics mechanism of hotspot better, I still need a clarification of JDK- 8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no any option, they are all available for compilers. That makes sense because intrinsics are always beneficial. > But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. We seems are trying to put these 2 different ways into one flag which may be mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is a similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsics ids. May be we can have 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases. 
> If we want to enable an intrinsic which is under control of InlineMathNatives but keep others disable, it's impossible now. [please correct if I am wrong here]. You can disable all other from 321 intrinsics with DisableIntrinsic flag which is very tedious I agree. > I think that the motivation JDK-8151779 tried to solve. The idea is that instead of flags we use to control particular intrinsics depending on CPU we will use vmIntrinsics::IDs or other tables as you showed in your changes. It will require changes in vm_version_ codes. > > If we provide a new option EnableIntrinsic and put it least priority, then we can precisely control any intrinsic. > Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." > > "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. > If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > > > My change provide 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. 
> > If the trailing symbol is missing, it means enable. > > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:+UseMathExactIntrinsics > > > > 2) provide a set of macros to declare intrinsic options > > Developers declare once in intrinsics.hpp and the macros will take care of all other places. > > Here is an example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ioi Lam is overhauling jvm options. I am thinking about how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > > If we do that after VM_Version::initialize, some intrinsics may cause a JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is the same as -XX:+UseXXXIntrinsics, from the user's perspective, it's not straightforward when the JVM implicitly overrides what users specify. It's a dilemma: a stable jvm or fidelity to the cmdline. What do you think? > > > > Another problem is the naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this naming convention? > > Some (many?) intrinsic options turn on more than one .ad instruct > intrinsic, or library intrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). 
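[Editorial note: the shorthand expansion Xin describes, using the leading +/- placement Corey suggests, could be sketched roughly as follows. The class and method names are invented; this only illustrates the idea, not the actual argument-processing code in HotSpot.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: expand a list like "-AESCTR,+CRC32C,MathExact" into individual
// UseXXXIntrinsics flag settings. A leading '+' (or no sign at all)
// enables, a leading '-' disables. Names are invented for illustration.
public class UseIntrinsicsParser {
    // Returns a map from flag name (e.g. "UseCRC32CIntrinsics") to value.
    public static Map<String, Boolean> expand(String list) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String item : list.split(",")) {
            item = item.trim();
            if (item.isEmpty()) continue;
            boolean enable = true;                    // missing sign means enable
            if (item.charAt(0) == '+' || item.charAt(0) == '-') {
                enable = item.charAt(0) == '+';
                item = item.substring(1);
            }
            flags.put("Use" + item + "Intrinsics", enable);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Expands to -UseAESCTRIntrinsics, +UseCRC32CIntrinsics,
        // -UseCRC32Intrinsics, +UseMathExactIntrinsics.
        System.out.println(expand("-AESCTR,+CRC32C,-CRC32,MathExact"));
    }
}
```

As the thread notes, any real implementation would have to run this expansion before VM_Version::initialize so that platform-specific code can still veto unsupported intrinsics.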
> > Regards, > > - Corey > > From david.holmes at oracle.com Sat Apr 18 13:34:11 2020 From: david.holmes at oracle.com (David Holmes) Date: Sat, 18 Apr 2020 23:34:11 +1000 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: References: Message-ID: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> Looks good! Thanks, David On 18/04/2020 9:58 am, Vladimir Kozlov wrote: > https://bugs.openjdk.java.net/browse/JDK-8242357 > > CHECK macros can't be used on a return statement - they expand to > include code after the return [1] and so have no effect. > > Fix: > > src/hotspot/share/jvmci/jvmciEnv.hpp > @@ -262,7 +262,8 @@ > char* as_utf8_string(JVMCIObject str, char* buf, int buflen); > > JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { > - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + JVMCIObject s = create_string(str->as_C_string(), > JVMCI_CHECK_(JVMCIObject())); > + return s; > } > > I tried to find similar cases but it was the only one. > Clang -Wunreachable-code-aggressive does not catch this case. > > Tested hs-tier1,hs-tier3-graal > > Thanks, > Vladimir > > [1] > http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 > From vladimir.kozlov at oracle.com Sat Apr 18 14:41:19 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Sat, 18 Apr 2020 07:41:19 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> References: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> Message-ID: <1f8bf985-1290-088c-1982-1d058076cbcb@oracle.com> Thank you, David Vladimir On 4/18/20 6:34 AM, David Holmes wrote: > Looks good! 
> > Thanks, > David > > On 18/04/2020 9:58 am, Vladimir Kozlov wrote: >> https://bugs.openjdk.java.net/browse/JDK-8242357 >> >> CHECK macros can't be used on a return statement - they expand to include code after the return [1] and so have no >> effect. >> >> Fix: >> >> src/hotspot/share/jvmci/jvmciEnv.hpp >> @@ -262,7 +262,8 @@ >> char* as_utf8_string(JVMCIObject str, char* buf, int buflen); >> >> JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { >> - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); >> + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); >> + return s; >> } >> >> I tried to find similar cases but it was the only one. >> Clang -Wunreachable-code-aggressive does not catch this case. >> >> Tested hs-tier1,hs-tier3-graal >> >> Thanks, >> Vladimir >> >> [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From tkachuk.vladyslav at gmail.com Sun Apr 19 19:56:57 2020 From: tkachuk.vladyslav at gmail.com (Vladyslav Tkachuk) Date: Sun, 19 Apr 2020 21:56:57 +0200 Subject: Master Thesis Research Advice. JIT In-Reply-To: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> References: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> Message-ID: Hello Vladimir, Thank you for your reply. I have considered all compiler levels from C1 and C2, but the main problem was that the code produced by them has too many aspects that make it hard to analyze. The point of my task is Trivial Compiler Equivalence, meaning that I literally compare the Asm code for a source class and mutants line by line and I expect that the same Java code produces the same Asm code. However, the code produced by C1 contains many addresses which vary every time the code is run. That is why I switched to Opto-Asm which has much less "variability". Best regards, Vladyslav Tkachuk On Thu, 16 Apr 2020 at 
12:26, Vladimir Ivanov wrote: > Hi Vladyslav, > > C2 has a number of aggressive optimizations which heavily rely on > profiling data. It leads to numerous uncommon traps in the generated > code. You can disable some such optimizations, but there's no way to > completely eliminate uncommon traps in the generated code: they are a > core piece of the design. > > Have you tried switching to C1 instead? C1 doesn't rely on profiling > data that much and uses code patching techniques in place of uncommon > traps. So, the generated code usually has complete coverage of the > compiled method. > > Best regards, > Vladimir Ivanov > > On 16.04.2020 01:05, Vladyslav Tkachuk wrote: > > Hello, > > > > I am a Master's student at the University of Passau, Germany. > > My master thesis research is concerned with detecting equivalent mutants > in > > Java. > > The main research question is to use the Trivial Compiler Equivalency > > technique. This means that we acquire the Assembly code produced by the Java JIT > > compiler for the initial and mutated source and then compare them. > > > > I have previously contacted Tobias Hartmann, who advised me to write here > > regarding technical questions. I would like to ask you if there is any > > solution to a problem I have. > > > > Last time Tobias recommended me to use Opto-Assembly to achieve my > purpose. > > It was a good hint and it helped me to get more precise data. > > However, after doing some research I noticed that in some cases the C2 > compiler > > unloaded the method code which I expected to find in the assembly. As I found > > out, this was a part of deoptimization and the method code was meant to be > > executed by the interpreter. 
> > Here is an example of what I mean: > > > > {method} > > - this oop: 0x000000000d2319c8 > > - method holder: 'Rational' > > - constants: 0x000000000d230cf8 constant pool [85] > > {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8 > > - access: 0x81000001 public > > - name: 'toString' > > - signature: '()Ljava/lang/String;' > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > some setup code > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > 02c movq RBP, RDX # spill > > 02f movl RDX, #11 # int > > nop # 3 bytes pad for loops and calls > > *037 call,static wrapper for: uncommon_trap(reason='unloaded' > > action='reinterpret' index='11')* > > * # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ > L[4]=_ > > L[5]=_ L[6]=_ L[7]=_* > > * # OopMap{rbp=Oop off=60}* > > 03c int3 # ShouldNotReachHere > > 03c > > > > > > This is a 'toString' method and as I could see and understand, there is > no > > actual method code, but only a call to it. > > > > I would like to know if it is possible to completely disable any > > deoptimizations and consistently receive the full asm code? I concede > that > > it is not practical and hurts performance, but it is not a goal in this > > scope. According to my observations, in most cases the method code is full, > but strangely here it did not work. I have tried to google any useful info, > unfortunately, I did not see anything helpful, beyond the explanations > > about what deoptimization is and its types. > > > > I would be grateful if you could shed some light on the issue. > > Thanks in advance for any useful information. 
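[Editorial note: a reason='unloaded' trap like the one in the dump above is typically emitted when a class or constant-pool entry the method references had not yet been resolved at the time C2 compiled it, so the compiler plants a trap instead of real code. The usual remedy is to execute the path in the interpreter first so everything is resolved before compilation. The sketch below illustrates that idea with an invented stand-in for Rational::toString; the iteration count and the flags in the comment are illustrative assumptions, not a guaranteed way to suppress all deoptimization.]

```java
// Warm up a method so the constant-pool entries it touches are resolved
// before the JIT compiles it, avoiding reason='unloaded' traps.
// Could be run with e.g.: java -XX:-TieredCompilation Warmup
public class Warmup {
    // Invented stand-in for the Rational::toString from the thread.
    static String describe(long num, long den) {
        return num + "/" + den;
    }

    public static void main(String[] args) {
        String last = "";
        // Interpreted executions resolve everything describe() needs, so a
        // later C2 compilation sees all referenced classes as loaded.
        for (int i = 0; i < 20_000; i++) {
            last = describe(i, i + 1);
        }
        System.out.println(last);
    }
}
```

With the profile saturated this way, the compiled version of describe() should contain the full method body rather than a trap call, which is usually enough for line-by-line asm comparison.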
> > > > Best regards, > > Vladyslav Tkachuk > > > From kuaiwei.kw at alibaba-inc.com Mon Apr 20 02:19:20 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Mon, 20 Apr 2020 10:19:20 +0800 Subject: =?UTF-8?B?UmU6IFJGUjogaGVhcGJhc2UgcmVnaXN0ZXIgY2FuIGJlIGFsbG9jYXRlZCBpbiBjb21wcmVz?= =?UTF-8?B?c2VkIG1vZGU=?= In-Reply-To: <781CB090-0386-4D32-8465-8238E516789B@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com>, <781CB090-0386-4D32-8465-8238E516789B@amazon.com> Message-ID: <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Thanks for all feedback. I think this patch has enough review and can be merged. Hi Pengfei, I need help to push it. Could you help to merge it? Thanks, Kuai Wei ------------------------------------------------------------------ From:Liu, Xin Send Time:2020?4?15?(???) 11:17 To:??(??) ; Pengfei Li ; Andrew Haley ; hotspot compiler Cc:nd Subject:Re: RFR: heapbase register can be allocated in compressed mode Hi, Wei, LGTM. Thanks. --lx From: Kuai Wei Reply-To: Kuai Wei Date: Tuesday, April 14, 2020 at 6:26 AM To: "Liu, Xin" , Pengfei Li , Andrew Haley , hotspot compiler Cc: nd Subject: RE: RFR: heapbase register can be allocated in compressed mode Hi Xin and Pengfei, Thanks for your comments. I checked change in reinit_heapbase and decide to revert it since it's no harm to set rheapbase. I also made change in verify_heapbase in case someone want to enable this check again. The new patch is in http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tiered 1 test without new failure. 
Thanks, Kuai Wei ------------------------------------------------------------------ From:Liu, Xin Send Time: Tue, Apr 14, 2020 17:37 To: Kuai Wei ; Pengfei Li ; Andrew Haley ; hotspot compiler Cc:nd Subject:Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing it out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it. thanks! --lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we had to consider whether r27 is allocatable when compressed class pointers is on. But after that patch, r27 is for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. 
-- Thanks, Pengfei From Pengfei.Li at arm.com Mon Apr 20 04:32:00 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 04:32:00 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com>, <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: Hi Wei, > Thanks for all feedback. I think this patch has enough review and can be merged. > > Hi Pengfei, > > I need help to push it. Could you help to merge it? I'm not a reviewer, and not sure whether your updated webrev.01 [1] still requires an official reviewer to confirm. Maybe Andrew Haley or other AArch64 reviewers can help? [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ -- Thanks, Pengfei From Yang.Zhang at arm.com Mon Apr 20 06:30:47 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Mon, 20 Apr 2020 06:30:47 +0000 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Vladimir Thanks for your comment. I update the patch. http://cr.openjdk.java.net/~yzhang/8242796/webrev.02/ Regards Yang -----Original Message----- From: Vladimir Kozlov Sent: Saturday, April 18, 2020 3:07 AM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(XS): 8242796: Fix client build failure Hi Yang On 4/17/20 1:37 AM, Yang Zhang wrote: > Hi Vladimir > > I update the patch according to your comment. > http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ > > These checks are needed. 
> #if INCLUDE_JFR && COMPILER2_OR_JVMCI > #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. Yes, I agree that additional #ifdef COMPILER2 is needed. The only comment I have is to may be include compiler_c2 check under that #ifdef and leaving #endif at the same place: + #ifdef COMPILER2 } else if (compiler_type == compiler_c2) { first_registration = false; + #endif // COMPILER2 } Thanks, Vladimir > > Regards > Yang > > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Vladimir > Kozlov > Sent: Friday, April 17, 2020 5:27 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. > I think you need to put whole method under checks: > > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. > > Thanks, > Vladimir > > On 4/16/20 1:58 AM, Yang Zhang wrote: >> Hi, >> >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >> >> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >> compiler phase/inlining events. >> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >> >> With this patch, x86 client build succeeds. But AArch64 client build >> still fails, which is caused by [1]. I have filed [2] for AArch64 >> client build failure and will summit another patch for that. 
>> >> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >> >> Regards >> Yang >> From aph at redhat.com Mon Apr 20 08:48:50 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 09:48:50 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: On 4/20/20 5:32 AM, Pengfei Li wrote: > Maybe Andrew Haley or other AArch64 reviewers can help? > > [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It's fine. At some point in the future maybe we can get round to taking out all references to rheapbase, but it'll require careful thinking about JVMCI and Graal-precompiled code. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 20 09:54:40 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 09:54:40 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: > It's fine. At some point in the future maybe we can get round to taking out all > references to rheapbase, but it'll require careful thinking about JVMCI and > Graal-precompiled code. Thanks Andrew. 
Pushed here http://hg.openjdk.java.net/jdk/jdk/rev/aedc9bf21743 -- Thanks, Pengfei From aph at redhat.com Mon Apr 20 10:01:10 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:01:10 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> On 4/20/20 9:48 AM, Andrew Haley wrote: > On 4/20/20 5:32 AM, Pengfei Li wrote: >> Maybe Andrew Haley or other AArch64 reviewers can help? >> >> [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ > It's fine. Sorry, no it isn't fine. Please get rid of this hunk: --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:52.009758661 +0800 +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:51.785764043 +0800 @@ -2185,6 +2185,10 @@ #if 0 assert (UseCompressedOops || UseCompressedClassPointers, "should be compressed"); assert (Universe::heap() != NULL, "java heap should be initialized"); + if (!UseCompressedOops || Universe::ptr_base() == NULL) { + // rheapbase is allocated as general register + return; + } if (CheckCompressedOops) { Label ok; push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 20 10:10:05 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 10:10:05 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> Message-ID: Hi Andrew, > Sorry, no it isn't fine. Please get rid of this hunk: > > --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020- > 04-14 21:18:52.009758661 +0800 > +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020- > 04-14 21:18:51.785764043 +0800 > @@ -2185,6 +2185,10 @@ > #if 0 > assert (UseCompressedOops || UseCompressedClassPointers, "should be > compressed"); > assert (Universe::heap() != NULL, "java heap should be initialized"); > + if (!UseCompressedOops || Universe::ptr_base() == NULL) { > + // rheapbase is allocated as general register > + return; > + } > if (CheckCompressedOops) { > Label ok; > push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 Oh. It's already pushed just now. According to the process, we may need Wei to create another JBS to backout that part? 
-- Thanks, Pengfei From aph at redhat.com Mon Apr 20 10:23:41 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:23:41 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> On 4/17/20 10:14 AM, Yang Zhang wrote: > Ping it again. Could you please help to review this? I'm running it, and I get no vector code generated. How did you test it? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 10:36:19 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:36:19 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> References: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> Message-ID: On 4/20/20 11:23 AM, Andrew Haley wrote: > On 4/17/20 10:14 AM, Yang Zhang wrote: >> Ping it again. Could you please help to review this? > > I'm running it, and I get no vector code generated. How did you test it? Sorry, my mistake. I'm testing it now. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From kuaiwei.kw at alibaba-inc.com Mon Apr 20 11:12:55 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Mon, 20 Apr 2020 19:12:55 +0800 Subject: =?UTF-8?B?UmU6IFJGUjogaGVhcGJhc2UgcmVnaXN0ZXIgY2FuIGJlIGFsbG9jYXRlZCBpbiBjb21wcmVz?= =?UTF-8?B?c2VkIG1vZGU=?= In-Reply-To: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> , <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> Message-ID: <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> Hi Andrew, Could you tell more detail about it? I can start a new patch for it if it break anything. Kuai Wei ------------------------------------------------------------------ From:Andrew Haley Send Time:2020?4?20?(???) 18:01 To:Pengfei Li ; ??(??) ; "Liu, Xin" ; hotspot compiler Cc:nd ; aarch64-port-dev at openjdk.java.net Subject:Re: RFR: heapbase register can be allocated in compressed mode On 4/20/20 9:48 AM, Andrew Haley wrote: > On 4/20/20 5:32 AM, Pengfei Li wrote: >> Maybe Andrew Haley or other AArch64 reviewers can help? >> >> [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ > It's fine. Sorry, no it isn't fine. 
Please get rid of this hunk: --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:52.009758661 +0800 +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:51.785764043 +0800 @@ -2185,6 +2185,10 @@ #if 0 assert (UseCompressedOops || UseCompressedClassPointers, "should be compressed"); assert (Universe::heap() != NULL, "java heap should be initialized"); + if (!UseCompressedOops || Universe::ptr_base() == NULL) { + // rheapbase is allocated as general register + return; + } if (CheckCompressedOops) { Label ok; push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 11:50:33 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 12:50:33 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: On 4/17/20 10:14 AM, Yang Zhang wrote: > > Ping it again. Could you please help to review this? Before: Benchmark Mode Cnt Score Error Units TestVect.testVectShift avgt 5 141.027 ± 0.117 us/op 0.41% 0x0000ffffa8c5fc40: sbfiz x15, x11, #1, #32 0x0000ffffa8c5fc44: add x16, x18, x15 ;*saload {reexecute=0 rethrow=0 return_oop=0} ; - org.sample.TestVect::testVectShift at 16 (line 31) 0x0000ffffa8c5fc48: ldr q16, [x16, #16] 0.51% 0x0000ffffa8c5fc4c: neg v17.16b, v18.16b 0x0000ffffa8c5fc50: sshl v16.8h, v16.8h, v17.8h 0x0000ffffa8c5fc54: add x15, x17, x15 After: Benchmark Mode Cnt Score Error Units TestVect.testVectShift avgt 5 143.021 ± 
0.506 us/op 0.46% 0x0000ffff78c61f00: sbfiz x13, x15, #1, #32 0x0000ffff78c61f04: add x14, x17, x13 ;*saload {reexecute=0 rethrow=0 return_oop=0} ; - org.sample.TestVect::testVectShift at 16 (line 31) 0x0000ffff78c61f08: ldr q16, [x14, #16] 0.36% 0x0000ffff78c61f0c: sshr v16.8h, v16.8h, #2 0x0000ffff78c61f10: add x13, x16, x13 So, at least on this thing it makes no difference. I'll grant you it's less code, so OK. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 12:14:29 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 13:14:29 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.co m> <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> Message-ID: On 4/20/20 12:12 PM, Kuai Wei wrote: > Could you tell more detail about it? I can start a new patch for it > if it break anything. Well, it's ifdef'd out at the moment, so by definition it can't break anything. But there may be issues with Graal whereby we really do need to check rheapbase, but it's OK for now. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From maurizio.cimadamore at oracle.com Mon Apr 20 14:59:49 2020 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Mon, 20 Apr 2020 15:59:49 +0100 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: References: Message-ID: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Hi David, did you mean to write to hotspot compiler (CCed) ? Maurizio On 20/04/2020 15:38, David Lloyd wrote: > Am I correct in understanding that there are no compiler intrinsics > for Long.divideUnsigned/remainderUnsigned? > > The implementation seems pretty expensive for an operation that is, if > I understand correctly, a single instruction on many CPU > architectures. But maybe these methods are not very frequently used? > (My clue was a comment in the source referencing an algorithm from > Hacker's Delight that could be used - if such an algorithm exists, but > wasn't implemented, presumably demand is low?) From david.lloyd at redhat.com Mon Apr 20 15:07:56 2020 From: david.lloyd at redhat.com (David Lloyd) Date: Mon, 20 Apr 2020 10:07:56 -0500 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Message-ID: Yes, I did, sorry about that. On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore wrote: > > Hi David, > did you mean to write to hotspot compiler (CCed) ? > > Maurizio > > On 20/04/2020 15:38, David Lloyd wrote: > > Am I correct in understanding that there are no compiler intrinsics > > for Long.divideUnsigned/remainderUnsigned? > > > > The implementation seems pretty expensive for an operation that is, if > > I understand correctly, a single instruction on many CPU > > architectures. But maybe these methods are not very frequently used? 
> > (My clue was a comment in the source referencing an algorithm from > > Hacker's Delight that could be used - if such an algorithm exists, but > > wasn't implemented, presumably demand is low?) > -- - DML From tobias.hartmann at oracle.com Mon Apr 20 15:52:27 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 20 Apr 2020 17:52:27 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Message-ID: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8242108 http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is the additional control dependency on a CastII/LL which restricts optimizations due to _carry_dependency being set (which was necessary because we can not represent non-null integers/long values in C2's type system). While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but in a more elegant way, avoiding an impact on performance. I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms vs. 5048 ops/ms on my machine). 
Thanks, Tobias [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html From joe.darcy at oracle.com Mon Apr 20 17:40:52 2020 From: joe.darcy at oracle.com (Joe Darcy) Date: Mon, 20 Apr 2020 10:40:52 -0700 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Message-ID: <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> The divideUnsigned methods in question are not marked with the @HotSpotIntrinsicCandidate annotation so it doesn't look like there are currently intrinsics. Cheers, -Joe On 4/20/2020 8:07 AM, David Lloyd wrote: > Yes, I did, sorry about that. > > On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore > wrote: >> Hi David, >> did you mean to write to hotspot compiler (CCed) ? >> >> Maurizio >> >> On 20/04/2020 15:38, David Lloyd wrote: >>> Am I correct in understanding that there are no compiler intrinsics >>> for Long.divideUnsigned/remainderUnsigned? >>> >>> The implementation seems pretty expensive for an operation that is, if >>> I understand correctly, a single instruction on many CPU >>> architectures. But maybe these methods are not very frequently used? >>> (My clue was a comment in the source referencing an algorithm from >>> Hacker's Delight that could be used - if such an algorithm exists, but >>> wasn't implemented, presumably demand is low?) > From vladimir.kozlov at oracle.com Mon Apr 20 19:32:16 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 20 Apr 2020 12:32:16 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Looks good. Thanks, Vladimir On 4/19/20 11:30 PM, Yang Zhang wrote: > Hi Vladimir > > Thanks for your comment. I update the patch. 
> http://cr.openjdk.java.net/~yzhang/8242796/webrev.02/ > > Regards > Yang > > -----Original Message----- > From: Vladimir Kozlov > Sent: Saturday, April 18, 2020 3:07 AM > To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Hi Yang > > On 4/17/20 1:37 AM, Yang Zhang wrote: >> Hi Vladimir >> >> I update the patch according to your comment. >> http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ >> >> These checks are needed. >> #if INCLUDE_JFR && COMPILER2_OR_JVMCI >> #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. > > Yes, I agree that additional #ifdef COMPILER2 is needed. > The only comment I have is to may be include compiler_c2 check under that #ifdef and leaving #endif at the same place: > > + #ifdef COMPILER2 > } else if (compiler_type == compiler_c2) { > > first_registration = false; > + #endif // COMPILER2 > } > > Thanks, > Vladimir > >> >> Regards >> Yang >> >> -----Original Message----- >> From: hotspot-compiler-dev >> On Behalf Of Vladimir >> Kozlov >> Sent: Friday, April 17, 2020 5:27 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR(XS): 8242796: Fix client build failure >> >> Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. >> I think you need to put whole method under checks: >> >> #if INCLUDE_JFR && COMPILER2_OR_JVMCI >> // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. >> >> Thanks, >> Vladimir >> >> On 4/16/20 1:58 AM, Yang Zhang wrote: >>> Hi, >>> >>> Could you please help to review this patch? 
>>> >>> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >>> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >>> >>> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >>> compiler phase/inlining events. >>> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >>> >>> With this patch, x86 client build succeeds. But AArch64 client build >>> still fails, which is caused by [1]. I have filed [2] for AArch64 >>> client build failure and will summit another patch for that. >>> >>> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >>> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >>> >>> Regards >>> Yang >>> From vladimir.kozlov at oracle.com Mon Apr 20 19:55:39 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 20 Apr 2020 12:55:39 -0700 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> Hi Tobias, aarch64.ad has more changes than just undo 8229496. Otherwise it is good. Does it affect performance of our standard benchmarks? Thanks, Vladimir K On 4/20/20 8:52 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch: > https://bugs.openjdk.java.net/browse/JDK-8242108 > http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ > > The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is > the additional control dependency on a CastII/LL which restricts optimizations due to > _carry_dependency being set (which was necessary because we can not represent non-null integers/long > values in C2's type system). > > While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but > in a more elegant way, avoiding an impact on performance. 
> > I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and > also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms > vs. 5048 ops/ms on my machine). > > Thanks, > Tobias > > [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html > [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html > From cjashfor at linux.ibm.com Mon Apr 20 20:39:33 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Mon, 20 Apr 2020 13:39:33 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com> Hi Martin, Sorry for the delay on getting the copyright changes in (I work half time). Here's the revised patch, with all copyright dates set to 2020: https://bugs.openjdk.java.net/browse/JDK-8241874 http://cr.openjdk.java.net/~gromero/8241874/v2/ Thanks for your consideration, - Corey On 4/16/20 1:08 AM, Doerr, Martin wrote: > Hi Corey, > > please use 2020 for both, the Oracle and the SAP copyright. > Usually, both should be the same, but some people forget to update one of them. > > Best regards, > Martin > > >> -----Original Message----- >> From: Corey Ashford >> Sent: Donnerstag, 16. April 2020 03:35 >> To: Doerr, Martin >> Cc: Michihiro Horie ; hotspot-compiler- >> dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net >> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >> Long.reverseBytes() and Integer.reverseBytes() on Power9 >> >> Hello Martin, >> >> I'm having some trouble with my email server, so I'm having to reply to >> your earlier post, but I saw your most recent post on the mailing list >> archive. 
>> >> Thanks for reviewing and testing this patch. I went to look at the >> copyright dates, and see two date ranges: one for Oracle and its >> affiliates, and another for SAP. In the files I looked at, the end date >> wasn't the same between the two. Which one (or both) should I modify? >> >> Thanks, >> >> - Corey >> >> On 4/14/20 6:26 AM, Doerr, Martin wrote: >>> Hi Corey, >>> >>> thanks for contributing it. Looks good to me. I'll run it through our >>> testing and let you know about the results. >>> >>> Best regards, >>> >>> Martin >>> >>> *From:*ppc-aix-port-dev >> *On >>> Behalf Of *Michihiro Horie >>> *Sent:* Friday, 10 April 2020 10:48 >>> *To:* cjashfor at linux.ibm.com >>> *Cc:* hotspot-compiler-dev at openjdk.java.net; >>> ppc-aix-port-dev at openjdk.java.net >>> *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> >>> Hi Corey, >>> >>> Thank you for sharing your benchmarks. I confirmed your change reduced >>> the elapsed time of the benchmarks by more than 30% on my P9 node. >> Also, >>> I checked JTREG results, which look fine. >>> >>> BTW, I cannot find further points of improvement in your change. >>> >>> Best regards, >>> Michihiro >>> >>> >>> ----- Original message ----- >>> From: "Corey Ashford" >> > >>> To: Michihiro Horie/Japan/IBM at IBMJP >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> , >>> ppc-aix-port-dev at openjdk.java.net >>> , "Gustavo Romero" >>> > >>> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> Date: Fri, Apr 3, 2020 8:07 AM >>> >>> On 4/2/20 7:27 AM, Michihiro Horie wrote: >>>> Hi Corey, >>>> >>>> I'm not a reviewer, but I can run your benchmark in my local P9 node if >>>> you share it. >>>> >>>> Best regards, >>>> Michihiro >>> >>> The tests are somewhat hokey; I added the shifts to keep the compiler >>> from hoisting code whose result it could predetermine.
>>> Here's the one for Long.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseLong >>> { >>> public static void main(String args[]) >>> { >>> long reversed, re_reversed; >>> long accum = 0; >>> long orig = 0x1122334455667788L; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Long.reverseBytes(orig); >>> re_reversed = Long.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%16x", orig) + >>> " Re-reversed: " + String.format("%16x", re_reversed)); >>> } >>> accum += orig; >>> orig = Long.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Long.toString(accum)); >>> } >>> } >>> >>> >>> And the one for Integer.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseInt >>> { >>> public static void main(String args[]) >>> { >>> int reversed, re_reversed; >>> int orig = 0x11223344; >>> int accum = 0; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Integer.reverseBytes(orig); >>> re_reversed = Integer.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%08x", orig) + >>> " Re-reversed: " + String.format("%08x", re_reversed)); >>> } >>> accum += orig; >>> orig = Integer.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Integer.toString(accum)); >>> } >>> }
>>> > From eric.c.liu at arm.com Tue Apr 21 03:20:44 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 03:20:44 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, There's one failure, but I don't know whether it's caused by my patch. Unfortunately I don't have a detailed report. Could you help to check the result? http://hg.openjdk.java.net/jdk/submit/rev/01cbc15277b8 Thanks, Eric From eric.c.liu at arm.com Tue Apr 21 04:12:33 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 04:12:33 +0000 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: Hi Tobias, I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN, making it unable to recognize the same 'CmpNode' as before. E.g. for java code: ``` public int foo(int a, int b) { int r = a / b; r = r / b; // no need zero-check r = r / b; // no need zero-check return r; } ``` The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. I think backing out the original fix for 8229496 would solve this problem. One comment: The test case [1] tries to detect the wrong dependency order, but I assume it is unable to find the same issue on AArch64 due to the different behavior of div/mod compared with AMD64.
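To make the AArch64/AMD64 point above concrete: regardless of whether the hardware traps on division by zero (x86 raises SIGFPE via idiv) or quietly produces 0 (AArch64 sdiv), the Java-level semantics are fixed — integer division by zero must throw ArithmeticException on every platform. A small plain-Java illustration (class and method names are mine):

```java
class DivZeroSemantics {
    // Java requires ArithmeticException for integer division by zero on
    // every platform. On x86 the JVM can rely on the idiv trap (SIGFPE);
    // on AArch64, where sdiv yields 0 without trapping, the JIT has to
    // emit an explicit zero check to produce the same behavior.
    static int divOrSentinel(int a, int b) {
        try {
            return a / b;
        } catch (ArithmeticException e) {
            return Integer.MIN_VALUE; // sentinel: the zero check fired
        }
    }

    public static void main(String[] args) {
        System.out.println(divOrSentinel(10, 2)); // 5
        System.out.println(divOrSentinel(10, 0)); // sentinel value
    }
}
```

This is also why letting a div node float above its zero check is dangerous on both architectures, even though only one of them traps in hardware.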
[1] http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java Thanks, Eric -----Original Message----- From: hotspot-compiler-dev On Behalf Of Tobias Hartmann Sent: Monday, April 20, 2020 11:52 PM To: hotspot compiler Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8242108 http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is the additional control dependency on a CastII/LL which restricts optimizations due to _carry_dependency being set (which was necessary because we can not represent non-null integers/long values in C2's type system). While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but in a more elegant way, avoiding an impact on performance. I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms vs. 5048 ops/ms on my machine). Thanks, Tobias [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html From eric.c.liu at arm.com Tue Apr 21 05:03:00 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 05:03:00 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, The test report I received: [Mach5] mach5-one-yzhang-JDK-8242429-1-20200420-1153-10322515: [FAILED] 1 Failed tier1-debug-open_test_hotspot_jtreg_tier1_serviceability-macosx-x64-debug-64 TimeoutException in EXECUTION. 
Thanks, Eric -----Original Message----- From: Eric Liu Sent: Tuesday, April 21, 2020 11:21 AM To: Eric Liu ; Vladimir Ivanov ; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: RE: RFR(S):8242429:Better implementation for signed extract Hi Vladimir, There's one failure, but I don't know whether it's cause by my patch. Unfortunately I don't have detailed report. Could you help to check the result? http://hg.openjdk.java.net/jdk/submit/rev/01cbc15277b8 Thanks, Eric From tobias.hartmann at oracle.com Tue Apr 21 06:29:51 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 08:29:51 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> Message-ID: <19e8467e-4abc-5f26-bf49-cec6b5aa29e6@oracle.com> Hi Vladimir, On 20.04.20 21:55, Vladimir Kozlov wrote: > aarch64.ad has more changes than just undo 8229496. Oops, not sure how that happened. I've updated the webrev in-place. > Otherwise it is good. Thanks for the review! > Does it affect performance of our standard benchmarks? No, I've checked performance already with the fix for 8229496 and there was no measurable difference. Thanks, Tobias From tobias.hartmann at oracle.com Tue Apr 21 06:44:12 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 08:44:12 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Hi Eric, thanks for looking at this! On 21.04.20 06:12, Eric Liu wrote: > I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN that make it unable to recognize the same 'CmpNode' as before. > > E.g. 
for java code: > ``` > public int foo(int a, int b) { > int r = a / b; > r = r / b; // no need zero-check > r = r / b; // no need zero-check > return r; > } > ``` > > The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. Right and there are also some other problems (for example, CastLL does not implement the Value optimizations that CastII has). > I think backing out the original fix for 8229496 would solve this problem. Yes, I've verified that. > One comment: > > The test case [1] try to detect the wrong dependency order but I assume it unable to find the same issue in AArch64 due to the different behavior of div/mod compared with AMD64. > > [1] http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java I'm not familiar with the div/mod implementation on AArch64 but the underlying issue, which is a div/mod node floating above the null-check, is platform independent. Thanks, Tobias From rwestrel at redhat.com Tue Apr 21 07:21:31 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Tue, 21 Apr 2020 09:21:31 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <875zdt9udg.fsf@redhat.com> > http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ Looks good to me. Roland. From tobias.hartmann at oracle.com Tue Apr 21 07:34:51 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 09:34:51 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <875zdt9udg.fsf@redhat.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <875zdt9udg.fsf@redhat.com> Message-ID: Hi Roland, thanks for the review! 
Best regards, Tobias On 21.04.20 09:21, Roland Westrelin wrote: > >> http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ > > Looks good to me. > > Roland. > From aph at redhat.com Tue Apr 21 09:23:33 2020 From: aph at redhat.com (Andrew Haley) Date: Tue, 21 Apr 2020 10:23:33 +0100 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: On 4/17/20 10:13 AM, Yang Zhang wrote: > Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. > > In this directory, there are also some test cases about reduction operations, which is added in [1]. > https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtreg/compiler/loopopts/superword > > [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Sounds good. Thanks! -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From eric.c.liu at arm.com Tue Apr 21 10:14:05 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 10:14:05 +0000 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Message-ID: Hi Tobias, > I'm not familiar with the div/mod implementation on AArch64 but the > underlying issue, which is a div/mod node floating above the null-check, > is platform independent. Yes, it's platform independent. As you said, this test case intends to detect whether a div/mod node floats above the null-check. But on AArch64, division by zero would not throw any exception, while AMD64 would generate a SIGFPE. I'm not sure whether this test case should only be used for AMD64.
Thanks, Eric -----Original Message----- From: Tobias Hartmann Sent: Tuesday, April 21, 2020 2:44 PM To: Eric Liu ; hotspot compiler Cc: nd Subject: Re: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Hi Eric, thanks for looking at this! On 21.04.20 06:12, Eric Liu wrote: > I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN that make it unable to recognize the same 'CmpNode' as before. > > E.g. for java code: > ``` > public int foo(int a, int b) { > int r = a / b; > r = r / b; // no need zero-check > r = r / b; // no need zero-check > return r; > } > ``` > > The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. Right and there are also some other problems (for example, CastLL does not implement the Value optimizations that CastII has). > I think backing out the original fix for 8229496 would solve this problem. Yes, I've verified that. > One comment: > > The test case [1] try to detect the wrong dependency order but I assume it unable to find the same issue in AArch64 due to the different behavior of div/mod compared with AMD64. > > [1] > http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/ > test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java I'm not familiar with the div/mod implementation on AArch64 but the underlying issue, which is a div/mod node floating above the null-check, is platform independent. 
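The "div/mod node floating above the null-check" hazard discussed above can be reduced to a plain-Java exception-ordering example (hypothetical names; the real jtreg regression test is TestDivZeroCheckControl): if a compiler hoisted the division above the null check, the wrong exception would surface, violating Java semantics on any platform.

```java
class CheckOrdering {
    static int[] data = null; // intentionally null

    // The array access precedes the division in program order, so the
    // NullPointerException must be thrown first. If a compiler floated
    // the division above the access, observe(10, 0) would instead throw
    // ArithmeticException -- the bug pattern the test guards against.
    static String observe(int x, int y) {
        try {
            int len = data.length; // must throw NPE: data is null
            return "len=" + (x / y + len);
        } catch (NullPointerException e) {
            return "NPE";
        } catch (ArithmeticException e) {
            return "ArithmeticException";
        }
    }

    public static void main(String[] args) {
        System.out.println(observe(10, 0)); // prints NPE
    }
}
```

Since the required ordering is part of the language specification, the check applies equally whether the backend detects division by zero via a hardware trap or an explicit compare.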
Thanks, Tobias From tobias.hartmann at oracle.com Tue Apr 21 12:09:53 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 14:09:53 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Message-ID: <405e0fc6-f4ef-e660-7c69-6f90d589c60d@oracle.com> Hi Eric, On 21.04.20 12:14, Eric Liu wrote: > Yes, it's platform independent. > > As you said, this test case intends to detect whether a div/mod node floats > above the null-check. But on AArch64, division by zero would not throw any > exception, while AMD64 would generate a SIGFPE. Okay, thanks for the details. > I'm not sure whether this test case should only be used for AMD64. Right, but I think in any case it doesn't hurt to execute it on AARCH64 as well. Best regards, Tobias From tobias.hartmann at oracle.com Tue Apr 21 12:12:23 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 14:12:23 +0200 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> Message-ID: <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> That's correct, these methods are currently not intrinsified by the JITs. Best regards, Tobias On 20.04.20 19:40, Joe Darcy wrote: > The divideUnsigned methods in question are not marked with the @HotSpotIntrinsicCandidate annotation > so it doesn't look like there are currently intrinsics. > > Cheers, > > -Joe > > On 4/20/2020 8:07 AM, David Lloyd wrote: >> Yes, I did, sorry about that. >> >> On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore >> wrote: >>> Hi David, >>> did you mean to write to hotspot compiler (CCed) ?
>>> >>> Maurizio >>> >>> On 20/04/2020 15:38, David Lloyd wrote: >>>> Am I correct in understanding that there are no compiler intrinsics >>>> for Long.divideUnsigned/remainderUnsigned? >>>> >>>> The implementation seems pretty expensive for an operation that is, if >>>> I understand correctly, a single instruction on many CPU >>>> architectures. But maybe these methods are not very frequently used? >>>> (My clue was a comment in the source referencing an algorithm from >>>> Hacker's Delight that could be used - if such an algorithm exists, but >>>> wasn't implemented, presumably demand is low?) >> From HORIE at jp.ibm.com Tue Apr 21 13:21:32 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Tue, 21 Apr 2020 22:21:32 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com> References: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com>, <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: Hi Corey, Martin, I confirmed the latest webrev fixes the copyright years properly, so the change looks ready to be pushed. I will push the change by tomorrow. Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" To: "Doerr, Martin" Cc: Michihiro Horie/Japan/IBM at IBMJP, "hotspot-compiler-dev at openjdk.java.net" , "ppc-aix-port-dev at openjdk.java.net" Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Tue, Apr 21, 2020 5:39 AM Hi Martin, Sorry for the delay on getting the copyright changes in (I work half time).
Here's the revised patch, with all copyright dates set to 2020: https://bugs.openjdk.java.net/browse/JDK-8241874 http://cr.openjdk.java.net/~gromero/8241874/v2/ Thanks for your consideration, - Corey On 4/16/20 1:08 AM, Doerr, Martin wrote: > Hi Corey, > > please use 2020 for both, the Oracle and the SAP copyright. > Usually, both should be the same, but some people forget to update one of them. > > Best regards, > Martin > > >> -----Original Message----- >> From: Corey Ashford >> Sent: Donnerstag, 16. April 2020 03:35 >> To: Doerr, Martin >> Cc: Michihiro Horie ; hotspot-compiler- >> dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net >> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >> Long.reverseBytes() and Integer.reverseBytes() on Power9 >> >> Hello Martin, >> >> I'm having some trouble with my email server, so I'm having to reply to >> your earlier post, but I saw your most recent post on the mailing list >> archive. >> >> Thanks for reviewing and testing this patch. I went to look at the >> copyright dates, and see two date ranges: one for Oracle and its >> affiliates, and another for SAP. In the files I looked at, the end date >> wasn't the same between the two. Which one (or both) should I modify? >> >> Thanks, >> >> - Corey >> >> On 4/14/20 6:26 AM, Doerr, Martin wrote: >>> Hi Corey, >>> >>> thanks for contributing it. Looks good to me. I?ll run it through our >>> testing and let you know about the results. >>> >>> Best regards, >>> >>> Martin >>> >>> *From:*ppc-aix-port-dev >> *On >>> Behalf Of *Michihiro Horie >>> *Sent:* Freitag, 10. April 2020 10:48 >>> *To:* cjashfor at linux.ibm.com >>> *Cc:* hotspot-compiler-dev at openjdk.java.net; >>> ppc-aix-port-dev at openjdk.java.net >>> *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> >>> Hi Corey, >>> >>> Thank you for sharing your benchmarks. 
I confirmed your change reduced >>> the elapsed time of the benchmarks by more than 30% on my P9 node. >> Also, >>> I checked JTREG results, which look no problem. >>> >>> BTW, I cannot find further points of improvement in your change. >>> >>> Best regards, >>> Michihiro >>> >>> >>> ----- Original message ----- >>> From: "Corey Ashford" >> > >>> To: Michihiro Horie/Japan/IBM at IBMJP >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> , >>> ppc-aix-port-dev at openjdk.java.net >>> , "Gustavo Romero" >>> > >>> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> Date: Fri, Apr 3, 2020 8:07 AM >>> >>> On 4/2/20 7:27 AM, Michihiro Horie wrote: >>>> Hi Corey, >>>> >>>> I?m not a reviewer, but I can run your benchmark in my local P9 node if >>>> you share it. >>>> >>>> Best regards, >>>> Michihiro >>> >>> The tests are somewhat hokey; I added the shifts to keep the compiler >>> from hoisting the code that it could predetermine the result. 
>>> >>> Here's the one for Long.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseLong >>> { >>> public static void main(String args[]) >>> { >>> long reversed, re_reversed; >>> long accum = 0; >>> long orig = 0x1122334455667788L; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Long.reverseBytes(orig); >>> re_reversed = Long.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%16x", orig) + >>> " Re-reversed: " + String.format("%16x", re_reversed)); >>> } >>> accum += orig; >>> orig = Long.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Long.toString(accum)); >>> } >>> } >>> >>> >>> And the one for Integer.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseInt >>> { >>> public static void main(String args[]) >>> { >>> int reversed, re_reversed; >>> int orig = 0x11223344; >>> int accum = 0; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Integer.reverseBytes(orig); >>> re_reversed = Integer.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%08x", orig) + >>> " Re-reversed: " + String.format("%08x", re_reversed)); >>> } >>> accum += orig; >>> orig = Integer.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Integer.toString(accum)); >>> } >>> } >>> > From rahul.v.raghavan at oracle.com Tue Apr 21 13:26:52 2020 From: rahul.v.raghavan at oracle.com (Rahul Raghavan) Date: Tue, 21 Apr 2020 18:56:52 +0530 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should 
refer XTEST Message-ID: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> Hi, Please review the following very trivial fix for a typo in the man page. http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8241986 Thanks, Rahul From tobias.hartmann at oracle.com Tue Apr 21 13:40:17 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 15:40:17 +0200 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should refer XTEST In-Reply-To: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> References: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> Message-ID: <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> Hi Rahul, looks good to me. Best regards, Tobias On 21.04.20 15:26, Rahul Raghavan wrote: > Hi, > > Please review the following very trivial fix for a typo in the man page. > > http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8241986 > > Thanks, > Rahul From HORIE at jp.ibm.com Tue Apr 21 14:57:30 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Tue, 21 Apr 2020 23:57:30 +0900 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: Message-ID: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing the same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option.
It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default, which doesn't make sense to me. PPC64 has an automatic prefetch engine, and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check the performance impact of changing AllocatePrefetchLines + Distance, I'll be glad to receive feedback. Best regards, Martin From vladimir.kozlov at oracle.com Tue Apr 21 18:41:32 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 21 Apr 2020 11:41:32 -0700 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should refer XTEST In-Reply-To: <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> References: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> Message-ID: +1 Vladimir On 4/21/20 6:40 AM, Tobias Hartmann wrote: > Hi Rahul, > > looks good to me. > > Best regards, > Tobias > > On 21.04.20 15:26, Rahul Raghavan wrote: >> Hi, >> >> Please review the following very trivial fix for a typo in the man page.
>> >> http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8241986 >> >> Thanks, >> Rahul From Yang.Zhang at arm.com Wed Apr 22 04:23:51 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Wed, 22 Apr 2020 04:23:51 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: Hi Andrew Thanks for your review. I will ask Pengfei to help push it. Regards Yang -----Original Message----- From: Andrew Haley Sent: Tuesday, April 21, 2020 5:24 PM To: Yang Zhang ; aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear On 4/17/20 10:13 AM, Yang Zhang wrote: > Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. > > In this directory, there are also some test cases about reduction operations, which is added in [1]. > https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtr > eg/compiler/loopopts/superword > > [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Sounds good. Thanks! -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From forax at univ-mlv.fr Wed Apr 22 14:03:05 2020 From: forax at univ-mlv.fr (Remi Forax) Date: Wed, 22 Apr 2020 16:03:05 +0200 (CEST) Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> Message-ID: <39292789.1313445.1587564185591.JavaMail.zimbra@u-pem.fr> And don't forget compareUnsigned ! 
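For readers following the thread, a minimal self-contained sketch of why the unsigned comparison matters for raw 64-bit bit patterns such as NaN-boxed values (the example values are invented for illustration; `Long.compareUnsigned` is the standard JDK API):

```java
// Why Long.compareUnsigned matters for raw 64-bit bit patterns:
// a signed compare misorders any pair that straddles the sign bit.
public class UnsignedCompareDemo {
    public static void main(String[] args) {
        long small = 1L;
        long big = 0x8000_0000_0000_0000L; // Long.MIN_VALUE when read as signed

        // Signed order: big < small, because big is negative as a long.
        if (Long.compare(big, small) >= 0) throw new AssertionError();

        // Unsigned order: big > small, the bit-pattern order tag checks need.
        if (Long.compareUnsigned(big, small) <= 0) throw new AssertionError();

        // Portable fallback without the library method: flip the sign bit
        // of both operands, then compare signed.
        if (Long.compare(big ^ Long.MIN_VALUE, small ^ Long.MIN_VALUE) <= 0)
            throw new AssertionError();

        System.out.println("ok");
    }
}
```

The sign-bit flip is the classic pre-Java-8 idiom; `Long.compareUnsigned` expresses it in one call, which is why intrinsifying it would pay off on hot comparison paths.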
I believe you cannot have an efficient implementation of Mozilla SpiderMonkey's representation (NaN boxing [1]) without it. regards, Rémi [1] https://brionv.com/log/2018/05/17/javascript-engine-internals-nan-boxing/ ----- Original Message ----- > From: "Tobias Hartmann" > To: "joe darcy" , "David Lloyd" , "Maurizio Cimadamore" > > Cc: "hotspot compiler" > Sent: Tuesday, April 21, 2020 14:12:23 > Subject: Re: Intrinsics for divideUnsigned/remainderUnsigned > That's correct, these methods are currently not intrinsified by the JITs. > > Best regards, > Tobias > > On 20.04.20 19:40, Joe Darcy wrote: >> The divideUnsigned methods in question are not marked with the >> @HotSpotIntrinsicCandidate annotation >> so it doesn't look like there are currently intrinsics. >> >> Cheers, >> >> -Joe >> >> On 4/20/2020 8:07 AM, David Lloyd wrote: >>> Yes, I did, sorry about that. >>> >>> On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore >>> wrote: >>>> Hi David, >>>> did you mean to write to hotspot compiler (CCed) ? >>>> >>>> Maurizio >>>> >>>> On 20/04/2020 15:38, David Lloyd wrote: >>>>> Am I correct in understanding that there are no compiler intrinsics >>>>> for Long.divideUnsigned/remainderUnsigned? >>>>> >>>>> The implementation seems pretty expensive for an operation that is, if >>>>> I understand correctly, a single instruction on many CPU >>>>> architectures. But maybe these methods are not very frequently used? >>>>> (My clue was a comment in the source referencing an algorithm from >>>>> Hacker's Delight that could be used - if such an algorithm exists, but >>>>> wasn't implemented, presumably demand is low?) From lutz.schmidt at sap.com Wed Apr 22 18:01:44 2020 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Wed, 22 Apr 2020 18:01:44 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: Message-ID: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> Hi Martin, your change looks good to me. 
I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. 
Best regards, Martin From Yang.Zhang at arm.com Thu Apr 23 02:39:26 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 23 Apr 2020 02:39:26 +0000 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ This issue is introduced by [1]. In this commit, pop_CPU_state(restore_vectors) and leave() are included under COMPILER2_OR_JVMCI check in AArch64 restore_live_registers[2]. But restore_live_registers is used in generate_resolve_blob[3] which might be called from c1. In x86 restore_live_registers, pop_CPU_state() and pop(rbp) are always done [4]. To fix this issue, pop_CPU_state(restore_vectors) and leave() are also moved outside of COMPILER2_OR_JVMCI check in AArch64 restore_live_registers. Testing on AArch64 platform: tier1 test with server build server build with configuring --with-jvm-features=-compiler2 client build and ran HelloWorld [1] https://bugs.openjdk.java.net/browse/JDK-8241665 [2] https://hg.openjdk.java.net/jdk/jdk/rev/53568400fec3#l1.23 [3] http://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#l2850 [4] http://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#l378 From eric.c.liu at arm.com Thu Apr 23 03:57:46 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 23 Apr 2020 03:57:46 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi Vladimir, Today we retriggered the job and it's passed all test cases. 
The details are as below: Job: mach5-one-njian-JDK-8242429-2-20200423-0236-10421472 BuildId: 2020-04-23-0235070.ningsheng.jian.source No failed tests Tasks Summary NOTHING_TO_RUN: 0 UNABLE_TO_RUN: 0 KILLED: 0 NA: 0 HARNESS_ERROR: 0 FAILED: 0 EXECUTED_WITH_FAILURE: 0 PASSED: 84 I'm wondering whether it's necessary to check it again by another reviewer. Thanks, Eric -----Original Message----- From: Vladimir Ivanov Sent: Thursday, April 16, 2020 8:29 PM To: Eric Liu ; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: RFR(S):8242429:Better implementation for signed extract > Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ Looks good. Have you tested it through submit repo? Best regards, Vladimir Ivanov > [Tests] > Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. > No new failure found. > > JMH: A simple JMH case [1] on AArch64 and AMD64 machines. > > For AArch64, one platform has no obvious improvement, but on others > the performance gain is 7.3%~32.7%. > > For AMD64, one platform has no obvious improvement, but on others the > performance gain is 13.7%~32.4%. > > A simple test case [2] has checked the correctness for some corner > cases. > > [1] > https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java > [2] > https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest. > java > > > Thanks, > Eric > From aph at redhat.com Thu Apr 23 08:44:15 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 23 Apr 2020 09:44:15 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed In-Reply-To: References: Message-ID: On 4/23/20 3:39 AM, Yang Zhang wrote: > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 > Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ Ok, thanks. Does anyone in the real world use AArch64 client builds? I'm wondering if we'd be better off without that option. 
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From jorn.vernee at oracle.com Thu Apr 23 12:52:38 2020 From: jorn.vernee at oracle.com (Jorn Vernee) Date: Thu, 23 Apr 2020 14:52:38 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: <87ftdbbxj5.fsf@redhat.com> References: <87imi8bunn.fsf@redhat.com> <87ftdbbxj5.fsf@redhat.com> Message-ID: Hi Roland, Sorry, I'm just now seeing this. I was using the following test to diagnose C2 loop predication:

public class Main {
    static final int SIZE = 1_000_000;

    final long bound_long;
    final int bound_int;

    public Main() {
        this.bound_long = SIZE;
        this.bound_int = SIZE;
    }

    public static void main(String[] args) {
        System.out.println(ProcessHandle.current().pid());
        run();
    }

    public static void run() {
        Main m = new Main();
        System.out.println("=========================================================================");
        for (int i = 0; i < 20_000; i++) {
            m.invoke();
        }
    }

    public int invoke() {
        int sum = 0;
        var bound = this.bound_int;
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }
}

Together with explicitly disabling the inlining of the 'invoke' method. Switching between `var bound = this.bound_int` and `var bound = this.bound_long` you should see that the bound check in the `if` is being eliminated in the int case, but not in the long case. After some debugging the switch point between the 2 cases seems to be in 'IdealLoopTree::iteration_split_impl' when initializing `should_rce` [1], but ultimately this call seems to bottom out in 'PhaseIdealLoop::is_scaled_iv_plus_offset' in loopTransform.cpp, which is checking the nodes involved for integer opcodes explicitly [2]. 
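For contrast, here is a sketch of the long-typed variant of the loop above together with the kind of narrowing workaround this thread alludes to (class and method names are invented for illustration; the cast is only valid when the bound is known to fit in an int):

```java
public class LongBoundDemo {
    static final int SIZE = 1_000_000;

    // Hypothetical field mirroring the bound_long field in the test above.
    final long boundLong = SIZE;

    // Long-typed bound: with range-check elimination matching only int
    // opcodes, the check inside the loop would not be optimized away.
    int invokeLong() {
        int sum = 0;
        long bound = this.boundLong;
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }

    // Workaround sketch: narrow the bound to int up front (safe only when
    // the bound fits), so the in-loop comparison is an int comparison again.
    int invokeNarrowed() {
        int sum = 0;
        int bound = (int) Math.min(this.boundLong, Integer.MAX_VALUE);
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        LongBoundDemo d = new LongBoundDemo();
        // Both variants compute the same sum; only the generated code differs.
        System.out.println(d.invokeLong() == d.invokeNarrowed()); // prints true
    }
}
```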
In the Panama code we are currently working around this by assuming the operands of the calculation fit into `int` in some cases, and then explicitly casting them to ints, which then enables the optimization [3]. But, as John says, this is not ideal. HTH, Jorn [1] : https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L3308 [2] : https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L2402 [3] : https://github.com/openjdk/panama-foreign/blob/c8fc03351277f318f86d333f7fff1338fe17a247/src/java.base/share/classes/jdk/internal/access/foreign/MemoryAddressProxy.java#L50-L94 On 10/04/2020 09:38, Roland Westrelin wrote: > Once the long loop is transformed to an int counted loop what are the > optimizations that need to trigger reliably? In particular do we need > range check elimination? Can you or someone from the panama project share > code samples that I can use to verify the long loop optimizes well? > > Roland. > From aleksei.voitylov at bell-sw.com Thu Apr 23 13:12:16 2020 From: aleksei.voitylov at bell-sw.com (Aleksei Voitylov) Date: Thu, 23 Apr 2020 16:12:16 +0300 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed In-Reply-To: References: Message-ID: <7b98219a-e45b-f0e8-9008-0c7a712c06f4@bell-sw.com> Yes, in the embedded space. On 23/04/2020 11:44, Andrew Haley wrote: > On 4/23/20 3:39 AM, Yang Zhang wrote: >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ > Ok, thanks. > > Does anyone in the real world use AArch64 client builds? I'm wondering if > we'd be better off without that option. 
> From dean.long at oracle.com Thu Apr 23 23:48:06 2020 From: dean.long at oracle.com (Dean Long) Date: Thu, 23 Apr 2020 16:48:06 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class Message-ID: https://bugs.openjdk.java.net/browse/JDK-8219607 http://cr.openjdk.java.net/~dlong/8219607/webrev/ This change adds support for the Class.isHidden() intrinsic to Graal. thanks, dl From vladimir.kozlov at oracle.com Fri Apr 24 00:57:04 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 23 Apr 2020 17:57:04 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class In-Reply-To: References: Message-ID: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> Hi Dean, Changes looks good. I see that compiler/graalunit/HotspotTest.java failed in tier1 (and tier3-graal). I assume it is 8243381. Thanks, Vladimir K On 4/23/20 4:48 PM, Dean Long wrote: > https://bugs.openjdk.java.net/browse/JDK-8219607 > http://cr.openjdk.java.net/~dlong/8219607/webrev/ > > This change adds support for the Class.isHidden() intrinsic to Graal. > > thanks, > > dl From dean.long at oracle.com Fri Apr 24 02:20:44 2020 From: dean.long at oracle.com (Dean Long) Date: Thu, 23 Apr 2020 19:20:44 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class In-Reply-To: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> References: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> Message-ID: On 4/23/20 5:57 PM, Vladimir Kozlov wrote: > Hi Dean, > > Changes looks good. Thanks Vladimir. > > I see that compiler/graalunit/HotspotTest.java failed in tier1 (and > tier3-graal). I assume it is 8243381. Yes, I accidentally removed that sub-test from the problem list during testing, so it added some "noise" to the test results. 
dl > > Thanks, > Vladimir K > > On 4/23/20 4:48 PM, Dean Long wrote: >> https://bugs.openjdk.java.net/browse/JDK-8219607 >> http://cr.openjdk.java.net/~dlong/8219607/webrev/ >> >> This change adds support for the Class.isHidden() intrinsic to Graal. >> >> thanks, >> >> dl From HORIE at jp.ibm.com Fri Apr 24 05:40:00 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Fri, 24 Apr 2020 14:40:00 +0900 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> References: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com>, Message-ID: Hi Martin, Lutz, I have not seen big differences in SPECjbb2015 scores both on P8 and P9. Best regards, Michihiro ----- Original message ----- From: "Schmidt, Lutz" To: Michihiro Horie , "Doerr, Martin" Cc: "ppc-aix-port-dev at openjdk.java.net" , "hotspot-compiler-dev at openjdk.java.net" Subject: [EXTERNAL] Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Thu, Apr 23, 2020 3:01 AM Hi Martin, your change looks good to me. I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. 
Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. 
Best regards, Martin From Yang.Zhang at arm.com Fri Apr 24 06:01:28 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 24 Apr 2020 06:01:28 +0000 Subject: [aarch64-port-dev ] RFR(S): 8243240: AArch64: Add support for MulVB Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8243240 Webrev: http://cr.openjdk.java.net/~yzhang/8243240/webrev.00/ In this patch, the missing MulVB support for AArch64 is added. Testing: tier1 Test case: public static void mulvb(byte[] a, byte[] b, byte[] c) { for (int i = 0; i < a.length; i++) { c[i] = (byte)(a[i] * b[i]); } } Assembly generated by C2: 0x0000ffffacafdbac: ldr q17, [x15, #16] 0x0000ffffacafdbb0: ldr q16, [x14, #16] 0x0000ffffacafdbb4: mul v16.16b, v16.16b, v17.16b 0x0000ffffacafdbbc: str q16, [x11, #16] Performance: JMH test case is attached in JBS. Before: Benchmark (size) Mode Cnt Score Error Units TestVect.testVectMulVB 1024 avgt 5 0.952 0.005 us/op After: Benchmark (size) Mode Cnt Score Error Units TestVect.testVectMulVB 1024 avgt 5 0.110 0.001 us/op Regards Yang From rwestrel at redhat.com Fri Apr 24 08:14:15 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 24 Apr 2020 10:14:15 +0200 Subject: RFR(S): 8239569: PublicMethodsTest.java failed due to NPE in java.base/java.nio.file.FileSystems.getFileSystem(FileSystems.java:230) Message-ID: <87zhb18fmw.fsf@redhat.com> https://bugs.openjdk.java.net/browse/JDK-8239569 http://cr.openjdk.java.net/~roland/8239569/webrev.00/ The bug occurs when reading from a constant array after a loop is fully unrolled. Reading an element in the loop has the shape: (LoadB (AddP base (AddP base base index) ..) ..) A load from the same element is also out of the loop: (LoadUB (AddP base (AddP base base index) ..) ..) The AddPs are shared between the LoadB in the loop and the LoadUB out of the loop. After full unrolling the load out of the loop becomes: (LoadUB (Phi (AddP base (AddP base base index1) ..) 
(AddP base (AddP base base index2) ..) ..) ..) The AddPs are then pushed through the Phi and that's where the bug is. - index1 is 0 and so the type of (AddP base base index1) is a constant array pointer with no offset. - that type is met with the type of the base of the second AddP instead of the type of the address of the second AddP. The result is a constant array pointer. The resulting Phi for the address input is created as a Phi of type constant array with no offset instead of constant array with offset. As a result, the Phi constant folds and the offset is lost. Roland. From richard.reingruber at sap.com Fri Apr 24 08:18:31 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 08:18:31 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Patricio, Vladimir, and Serguei, now that direct handshakes are available, I've updated the patch to make use of them. In addition I have done some clean-up changes I missed in the first webrev. Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake into the vm operation VM_SetFramePop [1] Kindly review again: Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a direct handshake: JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 Testing: * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 Thanks, Richard. [1] An assertion in Handshake::execute_direct() fails, if called by the VMThread, because it is not a JavaThread. 
-----Original Message----- From: hotspot-dev On Behalf Of Reingruber, Richard Sent: Freitag, 14. Februar 2020 19:47 To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Patricio, > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > > > Alternatively I think you could do something similar to what we do in > > > Deoptimization::deoptimize_all_marked(): > > > > > > EnterInterpOnlyModeClosure hs; > > > if (SafepointSynchronize::is_at_safepoint()) { > > > hs.do_thread(state->get_thread()); > > > } else { > > > Handshake::execute(&hs, state->get_thread()); > > > } > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > HandshakeClosure() constructor) > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that. Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. Thanks for taking care of this and creating the RFE. > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > always called in a nested operation or just sometimes. 
> > > > At least one execution path without vm operation exists: > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > JvmtiEventControllerPrivate::recompute_enabled() : void > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > > encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) Well, I think that's enough encouragement :) I'll wait for 8239084 and try then again. (no urgency and all) Thanks, Richard. -----Original Message----- From: Patricio Chilano Sent: Freitag, 14. Februar 2020 15:54 To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, On 2/14/20 9:58 AM, Reingruber, Richard wrote: > Hi Patricio, > > thanks for having a look. > > > I'm only commenting on the handshake changes. > > I see that operation VM_EnterInterpOnlyMode can be called inside > > operation VM_SetFramePop which also allows nested operations. 
Here is a > > comment in VM_SetFramePop definition: > > > > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is > > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. > > > > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we > > could have a handshake inside a safepoint operation. The issue I see > > there is that at the end of the handshake the polling page of the target > > thread could be disarmed. So if the target thread happens to be in a > > blocked state just transiently and wakes up then it will not stop for > > the ongoing safepoint. Maybe I can file an RFE to assert that the > > polling page is armed at the beginning of disarm_safepoint(). > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > Alternatively I think you could do something similar to what we do in > > Deoptimization::deoptimize_all_marked(): > > > > EnterInterpOnlyModeClosure hs; > > if (SafepointSynchronize::is_at_safepoint()) { > > hs.do_thread(state->get_thread()); > > } else { > > Handshake::execute(&hs, state->get_thread()); > > } > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > HandshakeClosure() constructor) > > Maybe this could be used also in the Handshake::execute() methods as general solution? Right, we could also do that. Avoiding to clear the polling page in HandshakeState::clear_handshake() should be enough to fix this issue and execute a handshake inside a safepoint, but adding that "if" statement in Handshake::execute() sounds good to avoid all the extra code that we go through when executing a handshake. I filed 8239084 to make that change. 
> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > always called in a nested operation or just sometimes. > > At least one execution path without vm operation exists: > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > JvmtiEventControllerPrivate::recompute_enabled() : void > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > encouraged to do it with a handshake :) Ah! I think you can still do it with a handshake with the Deoptimization::deoptimize_all_marked() like solution. I can change the if-else statement with just the Handshake::execute() call in 8239084. But up to you. : ) Thanks, Patricio > Thanks again, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Donnerstag, 13. Februar 2020 18:47 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I'm only commenting on the handshake changes. > I see that operation VM_EnterInterpOnlyMode can be called inside > operation VM_SetFramePop which also allows nested operations. 
Here is a > comment in VM_SetFramePop definition: > > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. > > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we > could have a handshake inside a safepoint operation. The issue I see > there is that at the end of the handshake the polling page of the target > thread could be disarmed. So if the target thread happens to be in a > blocked state just transiently and wakes up then it will not stop for > the ongoing safepoint. Maybe I can file an RFE to assert that the > polling page is armed at the beginning of disarm_safepoint(). > > I think one option could be to remove > SafepointMechanism::disarm_if_needed() in > HandshakeState::clear_handshake() and let each JavaThread disarm itself > for the handshake case. > > Alternatively I think you could do something similar to what we do in > Deoptimization::deoptimize_all_marked(): > > EnterInterpOnlyModeClosure hs; > if (SafepointSynchronize::is_at_safepoint()) { >     hs.do_thread(state->get_thread()); > } else { >     Handshake::execute(&hs, state->get_thread()); > } > (you could pass "EnterInterpOnlyModeClosure" directly to the > HandshakeClosure() constructor) > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > always called in a nested operation or just sometimes. > > Thanks, > Patricio > > On 2/12/20 7:23 AM, Reingruber, Richard wrote: >> // Repost including hotspot runtime and gc lists. >> // Dean Long suggested to do so, because the enhancement replaces a vm operation >> // with a handshake. 
>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >> >> Hi, >> >> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >> >> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >> >> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> Thanks, Richard. >> >> See also my question if anyone knows a reason for making the compiled methods not_entrant: >> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html From tobias.hartmann at oracle.com Fri Apr 24 08:24:08 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 24 Apr 2020 10:24:08 +0200 Subject: RFR(S): 8239569: PublicMethodsTest.java failed due to NPE in java.base/java.nio.file.FileSystems.getFileSystem(FileSystems.java:230) In-Reply-To: <87zhb18fmw.fsf@redhat.com> References: <87zhb18fmw.fsf@redhat.com> Message-ID: Hi Roland, Ouh, good catch! Looks good. Best regards, Tobias On 24.04.20 10:14, Roland Westrelin wrote: > > https://bugs.openjdk.java.net/browse/JDK-8239569 > http://cr.openjdk.java.net/~roland/8239569/webrev.00/ > > The bug occurs when reading from a constant array after a loop is fully > unrolled. Reading an element in the loop has the shape: > (LoadB (AddP base (AddP base base index) ..) ..) > A load from the same element is also out of the loop: > (LoadUB (AddP base (AddP base base index) ..) ..) > The AddPs are shared between the LoadB in the loop and the LoadUB out of > the loop. 
> > After full unrolling the load out of the loop becomes: > (LoadUB (Phi (AddP base (AddP base base index1) ..) (AddP base (AddP base base index2) ..) ..) ..) > > The AddPs are then pushed through the Phi and that's where the bug > is. > > - index1 is 0 and so the type of (AddP base base index1) is a constant > array pointer with no offset. > > - that type is met with the type of the base of the second AddP instead > of the type of the address of the second AddP. The result is a > constant array pointer. > > The resulting Phi for the address input is created as a Phi of type > constant array with no offset instead of constant array with offset. As > a result, the Phi constant folds and the offset is lost. > > Roland. > From xxinliu at amazon.com Fri Apr 24 08:33:40 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 24 Apr 2020 08:33:40 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> Message-ID: <801D878C-CAE5-4EBE-8AFE-4E35346CD5BD@amazon.com> Hi, May I get a review of this new revision? JBS: https://bugs.openjdk.java.net/browse/JDK-8151779 webrev: https://cr.openjdk.java.net/~xliu/8151779/01/webrev/ I introduce a new option -XX:ControlIntrinsic=+_id1,-id2... The id is vmIntrinsics::ID. As per prior discussion, ControlIntrinsic is expected to replace DisableIntrinsic. I keep DisableIntrinsic in this revision. DisableIntrinsic prevails when an intrinsic appears on both lists. I use an array of tribool to mark whether each intrinsic is enabled or not. In this way, hotspot can avoid expensive string querying among intrinsics. A Tribool value has 3 states: Default, true, or false. If developers don't explicitly set an intrinsic, it will be available unless it is disabled by the corresponding UseXXXIntrinsics.
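The tribool scheme described above can be sketched as follows. This is an illustrative model only; the names `Tribool` and `intrinsic_available` are made up here and are not the actual HotSpot implementation:

```cpp
#include <cassert>

// Three-state flag: Default (not explicitly controlled), or an explicit
// true/false set via -XX:ControlIntrinsic=+.../-...
// Hypothetical sketch, not HotSpot code.
class Tribool {
  enum State { DEFAULT, SET_TRUE, SET_FALSE };
  State _state;
public:
  Tribool() : _state(DEFAULT) {}
  void set(bool v) { _state = v ? SET_TRUE : SET_FALSE; }
  bool is_default() const { return _state == DEFAULT; }
  bool value() const { assert(_state != DEFAULT); return _state == SET_TRUE; }
};

// If the developer did not control the intrinsic explicitly, fall back
// to the coarse-grained UseXXXIntrinsics-style group switch.
inline bool intrinsic_available(const Tribool& ctrl, bool group_switch) {
  return ctrl.is_default() ? group_switch : ctrl.value();
}
```

With this shape an explicit setting always wins over the group switch, which matches the fine-over-coarse priority described in the mail, and a flat array of such values indexed by vmIntrinsics::ID avoids string lookups at query time.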
A traditional Boolean value can't express both fine- and coarse-grained control, i.e. we only go through those auxiliary UseXXXIntrinsics options if developers don't control a specific intrinsic. I also add support for ControlIntrinsic to CompilerDirectives. Test: I reuse the jtreg tests of DisableIntrinsic and add more @run annotations to verify ControlIntrinsic. I passed the hotspot:tier1 tests and all tests on x86_64/linux. Thanks, --lx On 4/17/20, 7:22 PM, "hotspot-compiler-dev on behalf of Liu, Xin" wrote: Hi, Vladimir, Thanks for the clarification. Oh, yes, it's theoretically possible, but it's tedious. I was wrong on that point. I think I got your point. ControlIntrinsics will make developers' lives easier. I will implement it. Thanks, --lx On 4/17/20, 6:46 PM, "Vladimir Kozlov" wrote: I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace the existing DisableIntrinsic. Note, we can start deprecating the Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now is a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently went through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know the intrinsics mechanism of hotspot better, I still need a clarification of JDK-8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no option at all, they are all available for compilers. That makes sense because intrinsics are always beneficial. > But there're reasons we need to disable a subset of them.
A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, the JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. E.g. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting the above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. We seem to be trying to put these 2 different ways into one flag, which may be a mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsic ids. Maybe we can have a 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases. > If we want to enable an intrinsic which is under control of InlineMathNatives but keep the others disabled, it's impossible now. [please correct me if I am wrong here]. You can disable all the others of the 321 intrinsics with the DisableIntrinsic flag, which is very tedious, I agree. > I think that is the motivation JDK-8151779 tried to address. The idea is that instead of the flags we use to control particular intrinsics depending on the CPU, we will use vmIntrinsics::IDs or other tables as you showed in your changes. It will require changes in the vm_version_ codes. > > If we provide a new option EnableIntrinsic and give it least priority, then we can precisely control any intrinsic. > Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." > > "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. > If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option.
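The comma-separated, +/- prefixed list syntax being debated above could be parsed along these lines. `parse_control_list` and its return type are invented for illustration; this is not the proposed HotSpot parsing code:

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse e.g. "+_dabs,-_fabs,_getClass" into name -> enabled.
// A missing prefix means enable; the last occurrence of a name wins.
// Illustrative sketch only, not HotSpot option parsing.
std::map<std::string, bool> parse_control_list(const std::string& list) {
  std::map<std::string, bool> result;
  std::istringstream in(list);
  std::string item;
  while (std::getline(in, item, ',')) {  // split on commas
    if (item.empty()) continue;
    bool enable = true;
    if (item[0] == '+' || item[0] == '-') {
      enable = (item[0] == '+');
      item.erase(0, 1);  // strip the prefix
    }
    result[item] = enable;
  }
  return result;
}
```

A rule like "DisableIntrinsic prevails" would then just be a second pass that forces entries named on the DisableIntrinsic list back to false after this parse.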
Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep the UseXXXIntrinsics options because many applications may be using them. > > > > My change provides 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional trailing symbol '+' or '-' denotes enabling or disabling. > > If the trailing symbol is missing, it means enable. > > E.g. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > > > 2) provide a set of macros to declare intrinsic options > > Developers declare once in intrinsics.hpp and the macros will take care of all other places. > > Here is an example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ioi Lam is overhauling jvm options. I am thinking about how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > > If we do that after VM_Version::initialize, some intrinsics may cause a JVM crash. E.g.
+UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is the same as -XX:+UseXXXIntrinsics, from the user's perspective, it's not straightforward when the JVM implicitly overrides what users specify. It's a dilemma here: a stable JVM or fidelity to the cmdline. What do you think? > > > > Another problem is the naming convention. Almost all intrinsic options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this naming convention? > > Some (many?) intrinsic options turn on more than one .ad instruct intrinsic, or library intrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). > > Regards, > > - Corey > > From aph at redhat.com Fri Apr 24 09:31:59 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 24 Apr 2020 10:31:59 +0100 Subject: [aarch64-port-dev ] RFR(S): 8243240: AArch64: Add support for MulVB In-Reply-To: References: Message-ID: <893f6983-7e3c-adc0-ecf4-48e57312c456@redhat.com> On 4/24/20 7:01 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8243240 > Webrev: http://cr.openjdk.java.net/~yzhang/8243240/webrev.00/ OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From suenaga at oss.nttdata.com Fri Apr 24 11:34:22 2020 From: suenaga at oss.nttdata.com (Yasumasa Suenaga) Date: Fri, 24 Apr 2020 20:34:22 +0900 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Richard, I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. Does it help you? I think it lets you remove the workaround. (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Thanks, Yasumasa On 2020/04/24 17:18, Reingruber, Richard wrote: > Hi Patricio, Vladimir, and Serguei, > > now that direct handshakes are available, I've updated the patch to make use of them. > > In addition I have done some clean-up changes I missed in the first webrev. > > Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake > into the vm operation VM_SetFramePop [1] > > Kindly review again: > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ > Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ > > I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a > direct handshake: > > JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 > > Testing: > > * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. > > * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 > > Thanks, > Richard. > > [1] An assertion in Handshake::execute_direct() fails if called by the VMThread, because it is not a JavaThread. > > -----Original Message----- > From: hotspot-dev On Behalf Of Reingruber, Richard > Sent: Friday, 14
February 2020 19:47 > To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Patricio, > > > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > > > > > Alternatively I think you could do something similar to what we do in > > > > Deoptimization::deoptimize_all_marked(): > > > > > > > > EnterInterpOnlyModeClosure hs; > > > > if (SafepointSynchronize::is_at_safepoint()) { > > > > hs.do_thread(state->get_thread()); > > > > } else { > > > > Handshake::execute(&hs, state->get_thread()); > > > > } > > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > > HandshakeClosure() constructor) > > > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > > Right, we could also do that. Avoiding to clear the polling page in > > HandshakeState::clear_handshake() should be enough to fix this issue and > > execute a handshake inside a safepoint, but adding that "if" statement > > in Handshake::execute() sounds good to avoid all the extra code that we > > go through when executing a handshake. I filed 8239084 to make that change. > > Thanks for taking care of this and creating the RFE. > > > > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > > always called in a nested operation or just sometimes.
> > > > > > At least one execution path without vm operation exists: > > > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > > JvmtiEventControllerPrivate::recompute_enabled() : void > > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > > > encouraged to do it with a handshake :) > > Ah! I think you can still do it with a handshake with the > > Deoptimization::deoptimize_all_marked() like solution. I can change the > > if-else statement with just the Handshake::execute() call in 8239084. > > But up to you. : ) > > Well, I think that's enough encouragement :) > I'll wait for 8239084 and try then again. > (no urgency and all) > > Thanks, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Friday, 14 February 2020 15:54 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > On 2/14/20 9:58 AM, Reingruber, Richard wrote: >> Hi Patricio, >> >> thanks for having a look. >> >> > I'm only commenting on the handshake changes.
>> > I see that operation VM_EnterInterpOnlyMode can be called inside >> > operation VM_SetFramePop which also allows nested operations. Here is a >> > comment in VM_SetFramePop definition: >> > >> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> > >> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> > could have a handshake inside a safepoint operation. The issue I see >> > there is that at the end of the handshake the polling page of the target >> > thread could be disarmed. So if the target thread happens to be in a >> > blocked state just transiently and wakes up then it will not stop for >> > the ongoing safepoint. Maybe I can file an RFE to assert that the >> > polling page is armed at the beginning of disarm_safepoint(). >> >> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? >> >> > Alternatively I think you could do something similar to what we do in >> > Deoptimization::deoptimize_all_marked(): >> > >> > EnterInterpOnlyModeClosure hs; >> > if (SafepointSynchronize::is_at_safepoint()) { >> > hs.do_thread(state->get_thread()); >> > } else { >> > Handshake::execute(&hs, state->get_thread()); >> > } >> > (you could pass "EnterInterpOnlyModeClosure" directly to the >> > HandshakeClosure() constructor) >> >> Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that.
Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. > >> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> > always called in a nested operation or just sometimes. >> >> At least one execution path without vm operation exists: >> >> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> JvmtiEventControllerPrivate::recompute_enabled() : void >> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> >> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further >> encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) > > Thanks, > Patricio >> Thanks again, >> Richard. >> >> -----Original Message----- >> From: Patricio Chilano >> Sent: Thursday, 13
February 2020 18:47 >> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Richard, >> >> I'm only commenting on the handshake changes. >> I see that operation VM_EnterInterpOnlyMode can be called inside >> operation VM_SetFramePop which also allows nested operations. Here is a >> comment in VM_SetFramePop definition: >> >> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> >> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> could have a handshake inside a safepoint operation. The issue I see >> there is that at the end of the handshake the polling page of the target >> thread could be disarmed. So if the target thread happens to be in a >> blocked state just transiently and wakes up then it will not stop for >> the ongoing safepoint. Maybe I can file an RFE to assert that the >> polling page is armed at the beginning of disarm_safepoint(). >> >> I think one option could be to remove >> SafepointMechanism::disarm_if_needed() in >> HandshakeState::clear_handshake() and let each JavaThread disarm itself >> for the handshake case. >> >> Alternatively I think you could do something similar to what we do in >> Deoptimization::deoptimize_all_marked(): >> >> EnterInterpOnlyModeClosure hs; >> if (SafepointSynchronize::is_at_safepoint()) { >> hs.do_thread(state->get_thread()); >> } else { >> Handshake::execute(&hs, state->get_thread()); >> } >> (you could pass "EnterInterpOnlyModeClosure"
directly to the >> HandshakeClosure() constructor) >> >> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> always called in a nested operation or just sometimes. >> >> Thanks, >> Patricio >> >> On 2/12/20 7:23 AM, Reingruber, Richard wrote: >>> // Repost including hotspot runtime and gc lists. >>> // Dean Long suggested to do so, because the enhancement replaces a vm operation >>> // with a handshake. >>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>> >>> Hi, >>> >>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>> >>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>> >>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>> >>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>> >>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>> >>> Thanks, Richard. >>> >>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html > From christian.hagedorn at oracle.com Fri Apr 24 14:37:39 2020 From: christian.hagedorn at oracle.com (Christian Hagedorn) Date: Fri, 24 Apr 2020 16:37:39 +0200 Subject: [15] RFR(S): 8230402: Allocation of compile task fails with assert: "Leaking compilation tasks?" Message-ID: <27dd5ff1-9f91-d8c1-ecee-a77e6ecdb558@oracle.com> Hi Please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8230402 http://cr.openjdk.java.net/~chagedorn/8230402/webrev.00/ This assert was hit very intermittently in an internal test until jdk-14+19.
The test was changed afterwards and the assert was not observed to fail anymore. However, the problem of having too many tasks in the queue is still present (i.e. the compile queue is growing too quickly and the compiler(s) are too slow to catch up). This assert can easily be hit by creating many class loaders which load many methods which are immediately compiled by setting a low compilation threshold as used in runA() in the testcase. Therefore, I suggest tackling this problem with a general solution to drop half of the compilation tasks in CompileQueue::add() when a queue size of 10000 is reached and none of the other conditions of this assert hold (no Whitebox or JVMCI compiler). For tiered compilation, the tasks with the lowest method weight() or which are unloaded are removed from the queue (without altering the order of the remaining tasks in the queue). Without tiered compilation (i.e. SimpleCompPolicy), the tasks from the tail of the queue are removed. An additional verification in debug builds should ensure that there are no duplicated tasks. I assume that part of the reason for the original assert was to detect such duplicates. Thank you! Best regards, Christian From richard.reingruber at sap.com Fri Apr 24 14:44:29 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 14:44:29 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Yasumasa, > I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. > Does it help you? I think it lets you remove the workaround. I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1].
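Christian's tiered-compilation pruning in JDK-8230402 above (remove the lowest-weight tasks while keeping the surviving queue order intact) could be sketched like this. `Task`, its `weight` field, and `prune_low_weight_half` are simplified stand-ins for illustration, not the actual CompileTask/CompileQueue code:

```cpp
#include <algorithm>
#include <vector>

struct Task { int id; int weight; };  // stand-in for a CompileTask

// Keep the higher-weight half of the queue, preserving the relative
// order of the survivors. Illustrative sketch only.
void prune_low_weight_half(std::vector<Task>& queue) {
  const size_t keep = queue.size() / 2;
  std::vector<size_t> idx(queue.size());
  for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
  // Rank positions by weight, highest first; a stable sort keeps
  // earlier tasks ahead of equal-weight later ones.
  std::stable_sort(idx.begin(), idx.end(),
                   [&](size_t a, size_t b) { return queue[a].weight > queue[b].weight; });
  std::vector<bool> survive(queue.size(), false);
  for (size_t r = 0; r < keep; r++) survive[idx[r]] = true;
  std::vector<Task> kept;
  for (size_t i = 0; i < queue.size(); i++) {
    if (survive[i]) kept.push_back(queue[i]);  // original order preserved
  }
  queue.swap(kept);
}
```

Selecting survivors by rank and then compacting in one ordered pass is what keeps the remaining tasks in their original queue order, as the proposal requires.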
So you would have to change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Also my first impression was that it won't be that easy from a synchronization point of view to replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear to me how this has to be handled. So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Would be interesting to see how you handled the issues above :) Thanks, Richard. [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 -----Original Message----- From: Yasumasa Suenaga Sent: Friday, 24 April 2020 13:34 To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. Does it help you? I think it lets you remove the workaround. (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Thanks, Yasumasa On 2020/04/24 17:18, Reingruber, Richard wrote: > Hi Patricio, Vladimir, and Serguei, > > now that direct handshakes are available, I've updated the patch to make use of them.
> > In addition I have done some clean-up changes I missed in the first webrev. > > Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake > into the vm operation VM_SetFramePop [1] > > Kindly review again: > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ > Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ > > I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a > direct handshake: > > JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 > > Testing: > > * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. > > * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 > > Thanks, > Richard. > > [1] An assertion in Handshake::execute_direct() fails if called by the VMThread, because it is not a JavaThread. > > -----Original Message----- > From: hotspot-dev On Behalf Of Reingruber, Richard > Sent: Friday, 14 February 2020 19:47 > To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Patricio, > > > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
> > > > > > > Alternatively I think you could do something similar to what we do in > > > > Deoptimization::deoptimize_all_marked(): > > > > > > > > EnterInterpOnlyModeClosure hs; > > > > if (SafepointSynchronize::is_at_safepoint()) { > > > > hs.do_thread(state->get_thread()); > > > > } else { > > > > Handshake::execute(&hs, state->get_thread()); > > > > } > > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > > HandshakeClosure() constructor) > > > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > > Right, we could also do that. Avoiding to clear the polling page in > > HandshakeState::clear_handshake() should be enough to fix this issue and > > execute a handshake inside a safepoint, but adding that "if" statement > > in Handshake::execute() sounds good to avoid all the extra code that we > > go through when executing a handshake. I filed 8239084 to make that change. > > Thanks for taking care of this and creating the RFE. > > > > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > > always called in a nested operation or just sometimes. > > > > > > At least one execution path without vm operation exists: > > > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > > JvmtiEventControllerPrivate::recompute_enabled() : void > > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > > handshake, but to avoid making the compiled methods on stack not_entrant....
unless I'm further > > > encouraged to do it with a handshake :) > > Ah! I think you can still do it with a handshake with the > > Deoptimization::deoptimize_all_marked() like solution. I can change the > > if-else statement with just the Handshake::execute() call in 8239084. > > But up to you. : ) > > Well, I think that's enough encouragement :) > I'll wait for 8239084 and try then again. > (no urgency and all) > > Thanks, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Friday, 14 February 2020 15:54 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > On 2/14/20 9:58 AM, Reingruber, Richard wrote: >> Hi Patricio, >> >> thanks for having a look. >> >> > I'm only commenting on the handshake changes. >> > I see that operation VM_EnterInterpOnlyMode can be called inside >> > operation VM_SetFramePop which also allows nested operations. Here is a >> > comment in VM_SetFramePop definition: >> > >> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> > >> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> > could have a handshake inside a safepoint operation. The issue I see >> > there is that at the end of the handshake the polling page of the target >> > thread could be disarmed. So if the target thread happens to be in a >> > blocked state just transiently and wakes up then it will not stop for >> > the ongoing safepoint. Maybe I can file an RFE to assert that the >> > polling page is armed at the beginning of disarm_safepoint().
>> >> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? >> >> > Alternatively I think you could do something similar to what we do in >> > Deoptimization::deoptimize_all_marked(): >> > >> > EnterInterpOnlyModeClosure hs; >> > if (SafepointSynchronize::is_at_safepoint()) { >> > hs.do_thread(state->get_thread()); >> > } else { >> > Handshake::execute(&hs, state->get_thread()); >> > } >> > (you could pass "EnterInterpOnlyModeClosure" directly to the >> > HandshakeClosure() constructor) >> >> Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that. Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. > >> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> > always called in a nested operation or just sometimes.
>> >> At least one execution path without vm operation exists: >> >> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> JvmtiEventControllerPrivate::recompute_enabled() : void >> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> >> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further >> encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) > > Thanks, > Patricio >> Thanks again, >> Richard. >> >> -----Original Message----- >> From: Patricio Chilano >> Sent: Donnerstag, 13. Februar 2020 18:47 >> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Richard, >> >> I'm only commenting on the handshake changes. >> I see that operation VM_EnterInterpOnlyMode can be called inside >> operation VM_SetFramePop which also allows nested operations.
Here is a >> comment in VM_SetFramePop definition: >> >> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> >> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> could have a handshake inside a safepoint operation. The issue I see >> there is that at the end of the handshake the polling page of the target >> thread could be disarmed. So if the target thread happens to be in a >> blocked state just transiently and wakes up then it will not stop for >> the ongoing safepoint. Maybe I can file an RFE to assert that the >> polling page is armed at the beginning of disarm_safepoint(). >> >> I think one option could be to remove >> SafepointMechanism::disarm_if_needed() in >> HandshakeState::clear_handshake() and let each JavaThread disarm itself >> for the handshake case. >> >> Alternatively I think you could do something similar to what we do in >> Deoptimization::deoptimize_all_marked(): >> >> EnterInterpOnlyModeClosure hs; >> if (SafepointSynchronize::is_at_safepoint()) { >> hs.do_thread(state->get_thread()); >> } else { >> Handshake::execute(&hs, state->get_thread()); >> } >> (you could pass 'EnterInterpOnlyModeClosure' directly to the >> HandshakeClosure() constructor) >> >> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> always called in a nested operation or just sometimes. >> >> Thanks, >> Patricio >> >> On 2/12/20 7:23 AM, Reingruber, Richard wrote: >>> // Repost including hotspot runtime and gc lists. >>> // Dean Long suggested to do so, because the enhancement replaces a vm operation >>> // with a handshake.
>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>> >>> Hi, >>> >>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>> >>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>> >>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>> >>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>> >>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>> >>> Thanks, Richard. >>> >>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html > From lutz.schmidt at sap.com Fri Apr 24 14:51:01 2020 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Fri, 24 Apr 2020 14:51:01 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> Message-ID: Hi Martin, SAP-internal testing revealed no problems related to this patch. As Michihiro did not find performance issues, the patch is good to go from my perspective. Regards, Lutz From: Michihiro Horie on behalf of Michihiro Horie Date: Friday, 24. April 2020 at 07:40 To: Lutz Schmidt Cc: "hotspot-compiler-dev at openjdk.java.net" , "Doerr, Martin (martin.doerr at sap.com)" , "ppc-aix-port-dev at openjdk.java.net" Subject: Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Hi Martin, Lutz, I have not seen big differences in SPECjbb2015 scores both on P8 and P9. 
Best regards, Michihiro ----- Original message ----- From: "Schmidt, Lutz" To: Michihiro Horie , "Doerr, Martin" Cc: "ppc-aix-port-dev at openjdk.java.net" , "hotspot-compiler-dev at openjdk.java.net" Subject: [EXTERNAL] Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Thu, Apr 23, 2020 3:01 AM Hi Martin, your change looks good to me. I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch
engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. Best regards, Martin From suenaga at oss.nttdata.com Fri Apr 24 15:23:06 2020 From: suenaga at oss.nttdata.com (Yasumasa Suenaga) Date: Sat, 25 Apr 2020 00:23:06 +0900 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> Hi Richard, On 2020/04/24 23:44, Reingruber, Richard wrote: > Hi Yasumasa, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. >> Does it help you? I think it gives you to remove workaround. > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Thanks for your information. I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. I will modify and will test it after yours. > Also my first impression was that it won't be that easy from a synchronization point of view to > replace VM_SetFramePop with a direct handshake. E.g.
VM_SetFramePop::doit() indirectly calls > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > to me, how this has to be handled. I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Thanks, Yasumasa > So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > >> (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > Would be interesting to see how you handled the issues above :) > > Thanks, Richard. > > [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 > > -----Original Message----- > From: Yasumasa Suenaga > Sent: Freitag, 24. April 2020 13:34 > To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > Does it help you? I think it gives you to remove workaround. > > (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > > Thanks, > > Yasumasa > > > On 2020/04/24 17:18, Reingruber, Richard wrote: >> Hi Patricio, Vladimir, and Serguei, >> >> now that direct handshakes are available, I've updated the patch to make use of them. 
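As background for the patch under review here: the core idea of JDK-8238585, deoptimizing only the compiled frames on the target thread's own stack instead of making their methods not_entrant (which would penalize every thread calling them), can be sketched as follows. The types are illustrative stand-ins, not HotSpot's real frame or Deoptimization API:

```cpp
#include <cassert>
#include <vector>

// Toy model of the stack walk performed per target thread by the
// enter-interp-only-mode closure: mark only the compiled frames of
// this one thread for deoptimization; interpreted frames and other
// threads' use of the same nmethods are unaffected.
enum class FrameKind { interpreted, compiled };

struct Frame {
  FrameKind kind;
  bool marked_for_deopt = false;
};

struct JavaThread {
  std::vector<Frame> stack;  // innermost frame first
};

// What the handshake closure's do_thread() effectively does: walk the
// stack and deoptimize the compiled frames found there.
void deoptimize_compiled_frames(JavaThread* thread) {
  for (Frame& f : thread->stack) {
    if (f.kind == FrameKind::compiled) {
      // The real code calls into HotSpot's deoptimization machinery here.
      f.marked_for_deopt = true;
    }
  }
}
```

The point of the sketch is the scope of the change: the work is per-thread and per-frame, which is also why a per-thread handshake is a natural fit for it.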
>> >> In addition I have done some clean-up changes I missed in the first webrev. >> >> Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake >> into the vm operation VM_SetFramePop [1] >> >> Kindly review again: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ >> Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ >> >> I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a >> direct handshake: >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> Testing: >> >> * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 >> >> Thanks, >> Richard. >> >> [1] An assertion in Handshake::execute_direct() fails, if called by the VMThread, because it is not a JavaThread. >> >> -----Original Message----- >> From: hotspot-dev On Behalf Of Reingruber, Richard >> Sent: Freitag, 14. Februar 2020 19:47 >> To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Patricio, >> >> > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
>>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>>> >>>> Hi, >>>> >>>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>>> >>>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>>> >>>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>>> >>>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>>> >>>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>>> >>>> Thanks, Richard. >>>> >>>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html >> From richard.reingruber at sap.com Fri Apr 24 16:08:57 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 16:08:57 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> Message-ID: Hi Yasumasa, Patricio, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > >> Does it help you? I think it gives you to remove workaround. > > > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. 
So you would have to > > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. > Thanks for your information. > I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. > I will modify and will test it after yours. Thanks :) > > Also my first impression was that it won't be that easy from a synchronization point of view to > > replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls > > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > > to me, how this has to be handled. > I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Yes. To me it is unclear what synchronization is necessary, if it is called during a handshake. And also I'm unsure if a thread should do safepoint checks while executing a handshake. @Patricio, coming back to my question [1]: In the example you gave in your answer [2]: the java thread would execute a vm operation during a direct handshake operation, while the VMThread is actually in the middle of a VM_HandshakeAllThreads operation, waiting to handshake the same handshakee: why can't the VMThread just proceed? The handshakee would be safepoint safe, wouldn't it? Thanks, Richard. [1] https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14301677&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14301677 [2] https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14301763&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14301763 -----Original Message----- From: Yasumasa Suenaga Sent: Freitag, 24. 
April 2020 17:23 To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, On 2020/04/24 23:44, Reingruber, Richard wrote: > Hi Yasumasa, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. >> Does it help you? I think it gives you to remove workaround. > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Thanks for your information. I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. I will modify and will test it after yours. > Also my first impression was that it won't be that easy from a synchronization point of view to > replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > to me, how this has to be handled. I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Thanks, Yasumasa > So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > >> (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) 
> > Would be interesting to see how you handled the issues above :) > > Thanks, Richard. > > [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 > > -----Original Message----- > From: Yasumasa Suenaga > Sent: Freitag, 24. April 2020 13:34 > To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > Does it help you? I think it gives you to remove workaround. > > (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > > Thanks, > > Yasumasa > > > On 2020/04/24 17:18, Reingruber, Richard wrote: >> Hi Patricio, Vladimir, and Serguei, >> >> now that direct handshakes are available, I've updated the patch to make use of them. >> >> In addition I have done some clean-up changes I missed in the first webrev. 
>> >> Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake >> into the vm operation VM_SetFramePop [1] >> >> Kindly review again: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ >> Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ >> >> I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a >> direct handshake: >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> Testing: >> >> * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 >> >> Thanks, >> Richard. >> >> [1] An assertion in Handshake::execute_direct() fails, if called be VMThread, because it is no JavaThread. >> >> -----Original Message----- >> From: hotspot-dev On Behalf Of Reingruber, Richard >> Sent: Freitag, 14. Februar 2020 19:47 >> To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Patricio, >> >> > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? 
>> > > >> > > > Alternatively I think you could do something similar to what we do in >> > > > Deoptimization::deoptimize_all_marked(): >> > > > >> > > > EnterInterpOnlyModeClosure hs; >> > > > if (SafepointSynchronize::is_at_safepoint()) { >> > > > hs.do_thread(state->get_thread()); >> > > > } else { >> > > > Handshake::execute(&hs, state->get_thread()); >> > > > } >> > > > (you could pass ?EnterInterpOnlyModeClosure? directly to the >> > > > HandshakeClosure() constructor) >> > > >> > > Maybe this could be used also in the Handshake::execute() methods as general solution? >> > Right, we could also do that. Avoiding to clear the polling page in >> > HandshakeState::clear_handshake() should be enough to fix this issue and >> > execute a handshake inside a safepoint, but adding that "if" statement >> > in Hanshake::execute() sounds good to avoid all the extra code that we >> > go through when executing a handshake. I filed 8239084 to make that change. >> >> Thanks for taking care of this and creating the RFE. >> >> > >> > > > I don?t know JVMTI code so I?m not sure if VM_EnterInterpOnlyMode is >> > > > always called in a nested operation or just sometimes. >> > > >> > > At least one execution path without vm operation exists: >> > > >> > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> > > JvmtiEventControllerPrivate::recompute_enabled() : void >> > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> > > >> > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> > > handshake, but to avoid making the compiled methods on stack not_entrant.... 
unless I'm further
>> > > encouraged to do it with a handshake :)
>> > Ah! I think you can still do it with a handshake with the
>> > Deoptimization::deoptimize_all_marked() like solution. I can change the
>> > if-else statement with just the Handshake::execute() call in 8239084.
>> > But up to you. : )
>>
>> Well, I think that's enough encouragement :)
>> I'll wait for 8239084 and try then again.
>> (no urgency and all)
>>
>> Thanks,
>> Richard.
>>
>> -----Original Message-----
>> From: Patricio Chilano
>> Sent: Freitag, 14. Februar 2020 15:54
>> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net
>> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
>>
>> Hi Richard,
>>
>> On 2/14/20 9:58 AM, Reingruber, Richard wrote:
>>> Hi Patricio,
>>>
>>> thanks for having a look.
>>>
>>> > I'm only commenting on the handshake changes.
>>> > I see that operation VM_EnterInterpOnlyMode can be called inside
>>> > operation VM_SetFramePop which also allows nested operations. Here is a
>>> > comment in VM_SetFramePop definition:
>>> >
>>> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is
>>> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled.
>>> >
>>> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we
>>> > could have a handshake inside a safepoint operation. The issue I see
>>> > there is that at the end of the handshake the polling page of the target
>>> > thread could be disarmed. So if the target thread happens to be in a
>>> > blocked state just transiently and wakes up then it will not stop for
>>> > the ongoing safepoint. Maybe I can file an RFE to assert that the
>>> > polling page is armed at the beginning of disarm_safepoint().
>>>
>>> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a
>>> handshake cannot be nested in a vm operation. Maybe it should be asserted in the
>>> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
>>>
>>> > Alternatively I think you could do something similar to what we do in
>>> > Deoptimization::deoptimize_all_marked():
>>> >
>>> > EnterInterpOnlyModeClosure hs;
>>> > if (SafepointSynchronize::is_at_safepoint()) {
>>> > hs.do_thread(state->get_thread());
>>> > } else {
>>> > Handshake::execute(&hs, state->get_thread());
>>> > }
>>> > (you could pass "EnterInterpOnlyModeClosure" directly to the
>>> > HandshakeClosure() constructor)
>>>
>>> Maybe this could be used also in the Handshake::execute() methods as a general solution?
>> Right, we could also do that. Avoiding clearing the polling page in
>> HandshakeState::clear_handshake() should be enough to fix this issue and
>> execute a handshake inside a safepoint, but adding that "if" statement
>> in Handshake::execute() sounds good to avoid all the extra code that we
>> go through when executing a handshake. I filed 8239084 to make that change.
>>
>>> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is
>>> > always called in a nested operation or just sometimes.
>>>
>>> At least one execution path without vm operation exists:
>>>
>>> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void
>>> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong
>>> JvmtiEventControllerPrivate::recompute_enabled() : void
>>> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches)
>>> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void
>>> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError
>>> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError
>>>
>>> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a
>>> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further
>>> encouraged to do it with a handshake :)
>> Ah! I think you can still do it with a handshake with the
>> Deoptimization::deoptimize_all_marked() like solution. I can change the
>> if-else statement with just the Handshake::execute() call in 8239084.
>> But up to you. : )
>>
>> Thanks,
>> Patricio
>>> Thanks again,
>>> Richard.
>>>
>>> -----Original Message-----
>>> From: Patricio Chilano
>>> Sent: Donnerstag, 13. Februar 2020 18:47
>>> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net
>>> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
>>>
>>> Hi Richard,
>>>
>>> I'm only commenting on the handshake changes.
>>> I see that operation VM_EnterInterpOnlyMode can be called inside
>>> operation VM_SetFramePop which also allows nested operations.
Here is a
>>> comment in VM_SetFramePop definition:
>>>
>>> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is
>>> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled.
>>>
>>> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we
>>> could have a handshake inside a safepoint operation. The issue I see
>>> there is that at the end of the handshake the polling page of the target
>>> thread could be disarmed. So if the target thread happens to be in a
>>> blocked state just transiently and wakes up then it will not stop for
>>> the ongoing safepoint. Maybe I can file an RFE to assert that the
>>> polling page is armed at the beginning of disarm_safepoint().
>>>
>>> I think one option could be to remove
>>> SafepointMechanism::disarm_if_needed() in
>>> HandshakeState::clear_handshake() and let each JavaThread disarm itself
>>> for the handshake case.
>>>
>>> Alternatively I think you could do something similar to what we do in
>>> Deoptimization::deoptimize_all_marked():
>>>
>>>   EnterInterpOnlyModeClosure hs;
>>>   if (SafepointSynchronize::is_at_safepoint()) {
>>>     hs.do_thread(state->get_thread());
>>>   } else {
>>>     Handshake::execute(&hs, state->get_thread());
>>>   }
>>> (you could pass "EnterInterpOnlyModeClosure" directly to the
>>> HandshakeClosure() constructor)
>>>
>>> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is
>>> always called in a nested operation or just sometimes.
>>>
>>> Thanks,
>>> Patricio
>>>
>>> On 2/12/20 7:23 AM, Reingruber, Richard wrote:
>>>> // Repost including hotspot runtime and gc lists.
>>>> // Dean Long suggested to do so, because the enhancement replaces a vm operation
>>>> // with a handshake.
>>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html
>>>>
>>>> Hi,
>>>>
>>>> could I please get reviews for this small enhancement in hotspot's jvmti implementation:
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585
>>>>
>>>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to
>>>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack.
>>>>
>>>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations.
>>>>
>>>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms.
>>>>
>>>> Thanks, Richard.
>>>>
>>>> See also my question if anyone knows a reason for making the compiled methods not_entrant:
>>>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html
>>

From patricio.chilano.mateo at oracle.com  Fri Apr 24 17:13:43 2020
From: patricio.chilano.mateo at oracle.com (Patricio Chilano)
Date: Fri, 24 Apr 2020 14:13:43 -0300
Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
In-Reply-To:
References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com>
Message-ID: <11c78b30-de04-544d-3a10-811ebf663bf2@oracle.com>

Hi Richard,

Just jumping into your last question for now. : )

On 4/24/20 1:08 PM, Reingruber, Richard wrote:
> Hi Yasumasa, Patricio,
>
>>>> I will send review request to replace VM_SetFramePop with a handshake in early next week in JDK-8242427.
>>>> Does it help you? I think it allows you to remove the workaround.
>>> I think it would not help that much.
>>> Note that when replacing VM_SetFramePop with a direct handshake
>>> you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to
>>> change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes.
>> Thanks for your information.
>> I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop.
>> I will modify and test it after yours.
> Thanks :)
>
>>> Also my first impression was that it won't be that easy from a synchronization point of view to
>>> replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls
>>> JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where
>>> JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear
>>> to me how this has to be handled.
>> I thin