From Pengfei.Li at arm.com Wed Apr 1 02:05:04 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Wed, 1 Apr 2020 02:05:04 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> References: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> Message-ID: Hi Andrew, Thanks for review. > INSN(absr, 0, 0b100000101110, 1); // accepted arrangements: T8B, T16B, > T4H, T8H, T4S > - INSN(negr, 1, 0b100000101110, 2); // accepted arrangements: T8B, T16B, > T4H, T8H, T2S, T4S, T2D > > is actually related to some other work you are doing? This change is related to - if (accepted < 2) guarantee(T != T2S && T != T2D, "incorrect arrangement"); \ - if (accepted == 0) guarantee(T == T8B || T == T16B, "incorrect arrangement"); \ + if (accepted < 3) guarantee(T != T2D, "incorrect arrangement"); \ + if (accepted < 2) guarantee(T != T2S, "incorrect arrangement"); \ + if (accepted < 1) guarantee(T == T8B || T == T16B, "incorrect arrangement"); \ Before my patch, the candidate values of "accepted" are 0, 1 and 2 meaning different accepted arrangements as below: 0 - Only T8B and T16B are accepted 1 - All arrangements but T2S and T2D are accepted 2 - All arrangements are accepted In my patch, the newly added instruction UADDLP supports T2S but doesn't support T2D. So I changed the value range to 0 - 3, where 3 means all arrangements are accepted now. That's why the value for parameter "accepted" of NEGR is promoted from 2 to 3 now. 
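The 0 - 3 encoding of "accepted" described above can be modeled standalone. The following is a minimal sketch; the enum and function names are invented for illustration and this is not the actual HotSpot assembler code, only the guarantee chain quoted above:

```cpp
#include <cassert>

// Hypothetical standalone model of the "accepted" arrangement levels.
enum Arrangement { T8B, T16B, T4H, T8H, T2S, T4S, T2D };

// Mirrors the patched guarantee chain:
//   accepted < 3 -> T2D rejected
//   accepted < 2 -> T2S rejected
//   accepted < 1 -> only T8B/T16B allowed
bool arrangement_accepted(int accepted, Arrangement T) {
  if (accepted < 3 && T == T2D) return false;
  if (accepted < 2 && T == T2S) return false;
  if (accepted < 1 && !(T == T8B || T == T16B)) return false;
  return true;
}
```

Under this model, UADDLP would use accepted == 2 (T2S allowed, T2D rejected), and NEGR moves from 2 to 3 so that T2D remains legal.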
-- Thanks, Pengfei From richard.reingruber at sap.com Wed Apr 1 06:15:12 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 1 Apr 2020 06:15:12 +0000 Subject: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents In-Reply-To: References: <1f8a3c7a-fa0f-b5b2-4a8a-7d3d8dbbe1b5@oracle.com> <4b56a45c-a14c-6f74-2bfd-25deaabe8201@oracle.com> <5271429a-481d-ddb9-99dc-b3f6670fcc0b@oracle.com> Message-ID: Hi Martin, > thanks for addressing all my points. I've looked over webrev.5 and I'm satisfied with your changes. Thanks! > I had also promised to review the tests. Thanks++ I appreciate it very much, the tests are many lines of code. > test/jdk/com/sun/jdi/EATests.java > This is a substantial amount of tests which is appropriate for such a large change. Skipping some subtests with UseJVMCICompiler makes sense because it doesn't provide the necessary JVMTI functionality, yet. > Nice work! > I also like that you test with and without BiasedLocking. Your tests will still be fine after BiasedLocking deprecation. Hope so :) > Very minor nits: > - 2 typos in comment above EARelockingNestedInflatedTarget: "lockes are ommitted" (sounds funny) > - You sometimes write "graal" and sometimes "Graal". I guess the capital G is better. (Also in EATestsJVMCI.java.) > test/jdk/com/sun/jdi/EATestsJVMCI.java > EATests with Graal enabled. Nice that you support Graal to some extent. Maybe Graal folks want to enhance them in the future. I think this is a good starting point. Will change this in the next webrev. > Conclusion: Looks good and not trivial :-) > Now, you have one full review. I'd be ok with covering 2nd review by partial reviews. > Compiler and JVMTI parts are not too complicated IMHO. > Runtime part should get at least one additional careful review. Thanks a lot, Richard. -----Original Message----- From: Doerr, Martin Sent: Dienstag, 31.
März 2020 16:01 To: Reingruber, Richard ; 'Robbin Ehn' ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents Hi Richard, thanks for addressing all my points. I've looked over webrev.5 and I'm satisfied with your changes. I had also promised to review the tests. test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysisEnabled.java Thanks for updating the @summary comment. Looks good in webrev.5. test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnalysisEnabled.c JVMTI agent for object tagging and heap iteration. Good. test/jdk/com/sun/jdi/EATests.java This is a substantial amount of tests which is appropriate for such a large change. Skipping some subtests with UseJVMCICompiler makes sense because it doesn't provide the necessary JVMTI functionality, yet. Nice work! I also like that you test with and without BiasedLocking. Your tests will still be fine after BiasedLocking deprecation. Very minor nits: - 2 typos in comment above EARelockingNestedInflatedTarget: "lockes are ommitted" (sounds funny) - You sometimes write "graal" and sometimes "Graal". I guess the capital G is better. (Also in EATestsJVMCI.java.) test/jdk/com/sun/jdi/EATestsJVMCI.java EATests with Graal enabled. Nice that you support Graal to some extent. Maybe Graal folks want to enhance them in the future. I think this is a good starting point. Conclusion: Looks good and not trivial :-) Now, you have one full review. I'd be ok with covering 2nd review by partial reviews. Compiler and JVMTI parts are not too complicated IMHO. Runtime part should get at least one additional careful review. Best regards, Martin > -----Original Message----- > From: Reingruber, Richard > Sent: Montag, 30.
März 2020 10:32 > To: Doerr, Martin ; 'Robbin Ehn' > ; Lindenmaier, Goetz > ; David Holmes ; > Vladimir Kozlov (vladimir.kozlov at oracle.com) > ; serviceability-dev at openjdk.java.net; > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance > in the Presence of JVMTI Agents > > Hi, > > this is webrev.5 based on Robbin's feedback and Martin's review - thanks! :) > > The change affects jvmti, hotspot and c2. Partial reviews are very welcome > too. > > Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/ > Delta: > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5.inc/ > > Robbin, Martin, please let me know, if anything shouldn't be quite as you > wanted it. Also find my > comments on your feedback below. > > Robbin, can I count you as Reviewer for the runtime part? > > Thanks, Richard. > > -- > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > You can move both declaration and definition to that file, no need to > clobber > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > Done. > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its > own > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > I moved JvmtiDeferredUpdates to vframe_hp.hpp where preexisting > jvmtiDeferredLocalVariableSet is > declared. > > > src/hotspot/share/code/compiledMethod.cpp > > Nice cleanup! > > Thanks :) > > > src/hotspot/share/code/debugInfoRec.cpp > > src/hotspot/share/code/debugInfoRec.hpp > > Additional parameters. (Remark: I think "non_global_escape_in_scope" > would read better than "not_global_escape_in_scope", but your version is > consistent with existing code, so no change request from my side.) Ok. > > I've been thinking about this too and finally stayed with > not_global_escape_in_scope.
It's supposed > to mean an object whose escape state is not GlobalEscape is in scope. > > > src/hotspot/share/compiler/compileBroker.cpp > > src/hotspot/share/compiler/compileBroker.hpp > > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into > a follow up change together with the test in order to make this webrev > smaller, but since it is included, I'm reviewing everything at once. Not a big > deal.) Ok. > > Yes the change would be a little smaller. And if it helps I'll split it off. In > general I prefer > patches that bring along a suitable amount of tests. > > > src/hotspot/share/opto/c2compiler.cpp > > Make do_escape_analysis independent of JVMCI capabilities. Nice! > > It is the main goal of the enhancement. It is done for C2, but could be done > for JVMCI compilers > with just a small effort as well. > > > src/hotspot/share/opto/escape.cpp > > Annotation for MachSafePointNodes. Your added functionality looks > correct. > > But I'd prefer to move the bulky code out of the large function. > > I suggest to factor out something like has_not_global_escape and > has_arg_escape. So the code could look like this: > > SafePointNode* sfn = sfn_worklist.at(next); > > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > > if (sfn->is_CallJava()) { > > CallJavaNode* call = sfn->as_CallJava(); > > call->set_arg_escape(has_arg_escape(call)); > > } > > This would also allow us to get rid of the found_..._escape_in_args > variables making the loops better readable. > > Done. > > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems > to be the way to do it (there are more such places). So it's ok. > > Yeah. I copied the snippet. > > > src/hotspot/share/prims/jvmtiImpl.cpp > > src/hotspot/share/prims/jvmtiImpl.hpp > > The sequence is pretty complex: > > VM_GetOrSetLocal element initialization executes EscapeBarrier code > which suspends the target thread (extra VM Operation). 
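The escape.cpp refactoring suggested above (factoring out has_not_global_escape and has_arg_escape) can be sketched standalone. This is only an illustration of the predicates; the enum and the vector of escape states are invented stand-ins for C2's ConnectionGraph queries, not actual HotSpot code:

```cpp
#include <cassert>
#include <vector>

// Invented stand-in for C2's per-object escape state.
enum EscapeState { NoEscape, ArgEscape, GlobalEscape };

// True if any object in scope at the safepoint has an escape state other
// than GlobalEscape (i.e. it may need object deoptimization / relocking).
bool has_not_global_escape(const std::vector<EscapeState>& objs_in_scope) {
  for (EscapeState es : objs_in_scope) {
    if (es != GlobalEscape) return true;
  }
  return false;
}

// True if any argument of the call site is an ArgEscape object.
bool has_arg_escape(const std::vector<EscapeState>& call_args) {
  for (EscapeState es : call_args) {
    if (es == ArgEscape) return true;
  }
  return false;
}
```

With helpers like these, the annotation loop reduces to the two calls shown in the quoted snippet and the found_..._escape_in_args flag variables disappear.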
> > Note that the target threads have to be suspended already for > VM_GetOrSetLocal*. So it's mainly the > synchronization effect of EscapeBarrier::sync_and_suspend_one() that is > required here. Also no extra > _handshake_ is executed, since sync_and_suspend_one() will find the > target threads already > suspended. > > > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM > Thread to prepare VM Operation with frame deoptimization). > > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor > which resumes the target thread. > > But I don't have any improvement proposal. Performance is probably not a > concern, here. So it's ok. > > > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it > has non-globally escaping objects and other frames if they have arg escaping > ones. Good. > > It's not specifically the top frame, but the frame that is accessed. > > > src/hotspot/share/runtime/deoptimization.cpp > > Object deoptimization. I have more comments and proposals, here. > > First of all, handling recursive and waiting locks in relock_objects is tricky, > but looks correct. > > Comments are sufficient to understand why things are done as they are > implemented. > > > BiasedLocking related parts are complex, but we may get rid of them in the > future (with BiasedLocking removal). > > Anyway, looks correct, too. > > > Typo in comment: "regularily" => "regularly" > > > Deoptimization::fetch_unroll_info_helper is the only place where > _jvmti_deferred_updates get deallocated (except JavaThread destructor). > But I think we always go through it, so I can't see a memory leak or such kind > of issues. > > That's correct. The compiled frame for which deferred updates are allocated > is always deoptimized > before (see EscapeBarrier::deoptimize_objects()). This is also asserted in > compiledVFrame::update_deferred_value(). I've added the same assertion > to > Deoptimization::relock_objects(). 
So we can be sure that > _jvmti_deferred_updates are deallocated > again in fetch_unroll_info_helper(). > > > EscapeBarrier::deoptimize_objects: ResourceMark should use > calling_thread(). > > Sure, well spotted! > > > You can use MutexLocker and MonitorLocker with Thread* to save the > Thread::current() call. > > Right, good hint. This was recently introduced with 8235678. I even had to > resolve conflicts. Should > have done this then. > > > I'd make set_objs_are_deoptimized static and remove it from the > EscapeBarrier interface because I think it shouldn't be used outside of > EscapeBarrier::deoptimize_objects. > > Done. > > > Typo in comment: "we must only deoptimize" => "we only have to > deoptimize" > > Replaced with "[...] we deoptimize iff local objects are passed as args" > > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and > barrier_active() is redundant. Implementation can get moved to hpp file. > > Ok. Done. > > > I'll get back to suspend flags, later. > > > There are weird cases regarding _self_deoptimization_in_progress. > > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. > C can set _self_deoptimization_in_progress while A performs the handshake > for suspending C. I think this doesn't lead to errors, but it's probably not > desired. > > I think it would be better to use only one "wait" call in > sync_and_suspend_one and sync_and_suspend_all. > > You're right. We've discussed that face-to-face, but couldn't find a real issue. 
> But now, thinking again, I reckon I found one: > > 2808 // Sync with other threads that might be doing deoptimizations > 2809 { > 2810 // Need to switch to _thread_blocked for the wait() call > 2811 ThreadBlockInVM tbivm(_calling_thread); > 2812 MonitorLocker ml(EscapeBarrier_lock, > Mutex::_no_safepoint_check_flag); > 2813 while (_self_deoptimization_in_progress) { > 2814 ml.wait(); > 2815 } > 2816 > 2817 if (self_deopt()) { > 2818 _self_deoptimization_in_progress = true; > 2819 } > 2820 > 2821 while (_deoptee_thread->is_ea_obj_deopt_suspend()) { > 2822 ml.wait(); > 2823 } > 2824 > 2825 if (self_deopt()) { > 2826 return; > 2827 } > 2828 > 2829 // set suspend flag for target thread > 2830 _deoptee_thread->set_ea_obj_deopt_flag(); > 2831 } > > - A waits in 2822 > - C is suspended > - B notifies all in resume_one() > - A and C wake up > - C wins over A and sets _self_deoptimization_in_progress = true in 2818 > - C does the self deoptimization > - A executes 2830 _deoptee_thread->set_ea_obj_deopt_flag() > > C will self suspend at some undefined point. The resulting state is illegal. > > > I first thought it'd be better to move ThreadBlockInVM before wait() to > reduce thread state transitions, but that seems to be problematic because > ThreadBlockInVM destructor contains a safepoint check which we shouldn't > do while holding EscapeBarrier_lock. So no change request. > > Yes, would be nice to have the state change only if needed, but for the > reason you mentioned it is > not quite as easy as it seems to be. I experimented as well with a second > lock, but did not succeed. > > > Change in thread_added: > > I think the sequence would be more comprehensive if we waited for > deopt_all_threads in Thread::start and all other places where a new thread > can run into Java code (e.g. JVMTI attach). > > Your version makes new threads come up with suspend flag set. That looks > correct, too. Advantage is that you only have to change one place > (thread_added).
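The race above comes from using two consecutive wait loops, so another thread can interleave between them. The proposed "only one wait call" can be modeled standalone with a single combined predicate. Everything below (names, flags, structure) is an invented illustration using std::condition_variable, not actual EscapeBarrier code:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model: one wait loop re-checks BOTH conditions after every
// wakeup, so no thread can act between "self deopt finished" and
// "deoptee resumed" the way A does in the scenario above.
namespace eb_model {
std::mutex lock;
std::condition_variable cond;
bool self_deopt_in_progress = true;  // models _self_deoptimization_in_progress
bool deoptee_suspended      = true;  // models is_ea_obj_deopt_suspend()
bool suspend_flag_set       = false;

bool run() {
  std::thread barrier([] {
    std::unique_lock<std::mutex> lk(lock);
    // Single combined predicate instead of two consecutive wait loops.
    cond.wait(lk, [] { return !self_deopt_in_progress && !deoptee_suspended; });
    // Safe: both conditions hold atomically under the lock.
    suspend_flag_set = true;
  });
  { std::lock_guard<std::mutex> lk(lock); self_deopt_in_progress = false; }
  cond.notify_all();  // barrier wakes, re-checks, keeps waiting
  { std::lock_guard<std::mutex> lk(lock); deoptee_suspended = false; }
  cond.notify_all();  // now the predicate holds and the barrier proceeds
  barrier.join();
  return suspend_flag_set;
}
}  // namespace eb_model
```

The point of the single predicate is that a wakeup caused by one condition changing never lets the waiter advance past the other condition unchecked.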
It'll be interesting to see what it will look like when we use > async handshakes instead of suspend flags. > > For now, I'm ok with your version. > > I had a version that did what you are suggesting. The current version also has > the advantage that > there are fewer places where a thread has to wait for ongoing object > deoptimization. This means > fewer places where you have to worry about correct thread state > transitions, possible deadlocks, > and if all oops are properly Handle'ed. > > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt- > >is_hidden_from_external_view()). > > Done. > > > Having 4 different deoptimize_objects functions makes it a little hard to > keep an overview of which one is used for what. > > Maybe adding suffixes would help a little bit, but I can also live with what > you have. > > Implementation looks correct to me. > > 2 are internal. I added the suffix _internal to them. This leaves 2 to choose > from. > > > src/hotspot/share/runtime/deoptimization.hpp > > Escape barriers and object deoptimization functions. > > Typo in comment: "helt" => "held" > > Done in place already. > > > src/hotspot/share/runtime/interfaceSupport.cpp > > InterfaceSupport::deoptimizeAllObjects() is only used for > DeoptimizeObjectsALot = 1. > > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad > to have DeoptimizeObjectsALot = 1 in addition. Ok. > > I never used DeoptimizeObjectsALot = 1 that much. It could be more > deterministic in single threaded > scenarios. I wouldn't object to get rid of it though. > > > src/hotspot/share/runtime/stackValue.hpp > > Better reinitialization in StackValue. Good. > > StackValue::obj_is_scalar_replaced() should not return true after calling > set_obj().
> > > src/hotspot/share/runtime/thread.cpp > > src/hotspot/share/runtime/thread.hpp > > src/hotspot/share/runtime/thread.inline.hpp > > wait_for_object_deoptimization, suspend flag, deferred updates and test > feature to deoptimize objects. > > > In the long term, we want to get rid of suspend flags, so it's not so nice to > introduce a new one. But I agree with Götz that it should be acceptable as > temporary solution until async handshakes are available (which takes more > time). So I'm ok with your change. > > I'm keen to build the feature on async handshakes when they arrive. > > > You can use MutexLocker with Thread*. > > Done. > > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class > out of thread.hpp. > > Done. > > > src/hotspot/share/runtime/vframe.cpp > > Added support for entry frame to new_vframe. Ok. > > > > src/hotspot/share/runtime/vframe_hp.cpp > > src/hotspot/share/runtime/vframe_hp.hpp > > > I think code()->as_nmethod() in not_global_escape_in_scope() and > arg_escape() should better be under #ifdef ASSERT or inside the assert > statement (no need for code cache walking in product build). > > Done. > > > jvmtiDeferredLocalVariableSet::update_monitors: > > Please add a comment explaining that owner referenced by original info > may be scalar replaced, but it is deoptimized in the vframe. > > Done. > > -----Original Message----- > From: Doerr, Martin > Sent: Donnerstag, 12. März 2020 17:28 > To: Reingruber, Richard ; 'Robbin Ehn' > ; Lindenmaier, Goetz > ; David Holmes ; > Vladimir Kozlov (vladimir.kozlov at oracle.com) > ; serviceability-dev at openjdk.java.net; > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance > in the Presence of JVMTI Agents > > Hi Richard, > > > I managed to find time for an (almost) complete review of webrev.4. (I'll > review the tests separately.)
> > First of all, the change seems to be of pretty good quality for its significant > complexity. I couldn't find any real bugs. But I'd like to propose minor > improvements. > I'm convinced that it's mature because we did substantial testing. > > I like the new functionality for object deoptimization. It can possibly be > reused for future escape analysis based optimizations. So I appreciate having > it available in the code base. > In addition to that, your change makes the JVMTI implementation better > integrated into the VM. > > > Now to the details: > > > src/hotspot/share/c1/c1_IR.hpp > describe_scope parameters. Ok. > > > src/hotspot/share/ci/ciEnv.cpp > src/hotspot/share/ci/ciEnv.hpp > Fix for JvmtiExport::can_walk_any_space() capability. Ok. > > > src/hotspot/share/code/compiledMethod.cpp > Nice cleanup! > > > src/hotspot/share/code/debugInfoRec.cpp > src/hotspot/share/code/debugInfoRec.hpp > Additional parameters. (Remark: I think "non_global_escape_in_scope" > would read better than "not_global_escape_in_scope", but your version is > consistent with existing code, so no change request from my side.) Ok. > > > src/hotspot/share/code/nmethod.cpp > Nice cleanup! > > > src/hotspot/share/code/pcDesc.hpp > Additional parameters. Ok. > > > src/hotspot/share/code/scopeDesc.cpp > src/hotspot/share/code/scopeDesc.hpp > Improved implementation + additional parameters. Ok. > > > src/hotspot/share/compiler/compileBroker.cpp > src/hotspot/share/compiler/compileBroker.hpp > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a > follow up change together with the test in order to make this webrev > smaller, but since it is included, I'm reviewing everything at once. Not a big > deal.) Ok. > > > src/hotspot/share/jvmci/jvmciCodeInstaller.cpp > Additional parameters. Ok. > > > src/hotspot/share/opto/c2compiler.cpp > Make do_escape_analysis independent of JVMCI capabilities. Nice!
> > > src/hotspot/share/opto/callnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/escape.cpp > Annotation for MachSafePointNodes. Your added functionality looks correct. > But I'd prefer to move the bulky code out of the large function. > I suggest to factor out something like has_not_global_escape and > has_arg_escape. So the code could look like this: > SafePointNode* sfn = sfn_worklist.at(next); > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > if (sfn->is_CallJava()) { > CallJavaNode* call = sfn->as_CallJava(); > call->set_arg_escape(has_arg_escape(call)); > } > This would also allow us to get rid of the found_..._escape_in_args variables > making the loops better readable. > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems > to be the way to do it (there are more such places). So it's ok. > > > src/hotspot/share/opto/machnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/macro.cpp > Allow elimination of non-escaping allocations. Ok. > > > src/hotspot/share/opto/matcher.cpp > src/hotspot/share/opto/output.cpp > Copy attribute / pass parameters. Ok. > > > src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp > Nice cleanup! > > > src/hotspot/share/prims/jvmtiEnv.cpp > src/hotspot/share/prims/jvmtiEnvBase.cpp > Escape barriers + deoptimize objects for target thread. Good. > > > src/hotspot/share/prims/jvmtiImpl.cpp > src/hotspot/share/prims/jvmtiImpl.hpp > The sequence is pretty complex: > VM_GetOrSetLocal element initialization executes EscapeBarrier code which > suspends the target thread (extra VM Operation). > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM > Thread to prepare VM Operation with frame deoptimization). > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which > resumes the target thread. > But I don't have any improvement proposal. Performance is probably not a > concern, here. So it's ok. 
> > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has > non-globally escaping objects and other frames if they have arg escaping > ones. Good. > > > src/hotspot/share/prims/jvmtiTagMap.cpp > Escape barriers + deoptimize objects for all threads. Ok. > > > src/hotspot/share/prims/whitebox.cpp > Added WB_IsFrameDeoptimized to API. Ok. > > > src/hotspot/share/runtime/deoptimization.cpp > Object deoptimization. I have more comments and proposals, here. > First of all, handling recursive and waiting locks in relock_objects is tricky, but > looks correct. > Comments are sufficient to understand why things are done as they are > implemented. > > BiasedLocking related parts are complex, but we may get rid of them in the > future (with BiasedLocking removal). > Anyway, looks correct, too. > > Typo in comment: "regularily" => "regularly" > > Deoptimization::fetch_unroll_info_helper is the only place where > _jvmti_deferred_updates get deallocated (except JavaThread destructor). > But I think we always go through it, so I can't see a memory leak or such kind > of issues. > > EscapeBarrier::deoptimize_objects: ResourceMark should use > calling_thread(). > > You can use MutexLocker and MonitorLocker with Thread* to save the > Thread::current() call. > > I'd make set_objs_are_deoptimized static and remove it from the > EscapeBarrier interface because I think it shouldn't be used outside of > EscapeBarrier::deoptimize_objects. > > Typo in comment: "we must only deoptimize" => "we only have to > deoptimize" > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and > barrier_active() is redundant. Implementation can get moved to hpp file. > > I'll get back to suspend flags, later. > > There are weird cases regarding _self_deoptimization_in_progress. > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C > can set _self_deoptimization_in_progress while A performs the handshake > for suspending C. 
I think this doesn't lead to errors, but it's probably not > desired. > I think it would be better to use only one "wait" call in > sync_and_suspend_one and sync_and_suspend_all. > > I first thought it'd be better to move ThreadBlockInVM before wait() to > reduce thread state transitions, but that seems to be problematic because > ThreadBlockInVM destructor contains a safepoint check which we shouldn't > do while holding EscapeBarrier_lock. So no change request. > > Change in thread_added: > I think the sequence would be more comprehensive if we waited for > deopt_all_threads in Thread::start and all other places where a new thread > can run into Java code (e.g. JVMTI attach). > Your version makes new threads come up with suspend flag set. That looks > correct, too. Advantage is that you only have to change one place > (thread_added). It'll be interesting to see what it will look like when we use > async handshakes instead of suspend flags. > For now, I'm ok with your version. > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt- > >is_hidden_from_external_view()). > > Having 4 different deoptimize_objects functions makes it a little hard to keep > an overview of which one is used for what. > Maybe adding suffixes would help a little bit, but I can also live with what you > have. > Implementation looks correct to me. > > > src/hotspot/share/runtime/deoptimization.hpp > Escape barriers and object deoptimization functions. > Typo in comment: "helt" => "held" > > > src/hotspot/share/runtime/globals.hpp > Addition of develop flag DeoptimizeObjectsALotInterval. Ok. > > > src/hotspot/share/runtime/interfaceSupport.cpp > InterfaceSupport::deoptimizeAllObjects() is only used for > DeoptimizeObjectsALot = 1. > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad > to have DeoptimizeObjectsALot = 1 in addition. Ok. > > > src/hotspot/share/runtime/interfaceSupport.inline.hpp > Addition of deoptimizeAllObjects. Ok.
> > > src/hotspot/share/runtime/mutexLocker.cpp > src/hotspot/share/runtime/mutexLocker.hpp > Addition of EscapeBarrier_lock. Ok. > > > src/hotspot/share/runtime/objectMonitor.cpp > Make recursion count relock aware. Ok. > > > src/hotspot/share/runtime/stackValue.hpp > Better reinitialization in StackValue. Good. > > > src/hotspot/share/runtime/thread.cpp > src/hotspot/share/runtime/thread.hpp > src/hotspot/share/runtime/thread.inline.hpp > wait_for_object_deoptimization, suspend flag, deferred updates and test > feature to deoptimize objects. > > In the long term, we want to get rid of suspend flags, so it's not so nice to > introduce a new one. But I agree with Götz that it should be acceptable as > temporary solution until async handshakes are available (which takes more > time). So I'm ok with your change. > > You can use MutexLocker with Thread*. > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out > of thread.hpp. > > > src/hotspot/share/runtime/vframe.cpp > Added support for entry frame to new_vframe. Ok. > > > src/hotspot/share/runtime/vframe_hp.cpp > src/hotspot/share/runtime/vframe_hp.hpp > > I think code()->as_nmethod() in not_global_escape_in_scope() and > arg_escape() should better be under #ifdef ASSERT or inside the assert > statement (no need for code cache walking in product build). > > jvmtiDeferredLocalVariableSet::update_monitors: > Please add a comment explaining that owner referenced by original info may > be scalar replaced, but it is deoptimized in the vframe. > > > src/hotspot/share/utilities/macros.hpp > Addition of NOT_COMPILER2_OR_JVMCI_RETURN macros. Ok. > > > test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysi > sEnabled.java > test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnal > ysisEnabled.c > New test. Will review separately. > > > test/jdk/TEST.ROOT > Addition of vm.jvmci as required property. Ok.
> > > test/jdk/com/sun/jdi/EATests.java > test/jdk/com/sun/jdi/EATestsJVMCI.java > New test. Will review separately. > > > test/lib/sun/hotspot/WhiteBox.java > Added isFrameDeoptimized to API. Ok. > > > That was it. Best regards, > Martin > > > > -----Original Message----- > > From: hotspot-compiler-dev > bounces at openjdk.java.net> On Behalf Of Reingruber, Richard > > Sent: Dienstag, 3. März 2020 21:23 > > To: 'Robbin Ehn' ; Lindenmaier, Goetz > > ; David Holmes > ; > > Vladimir Kozlov (vladimir.kozlov at oracle.com) > > ; serviceability-dev at openjdk.java.net; > > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > > dev at openjdk.java.net > > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance in the Presence of JVMTI Agents > > > > Hi Robbin, > > > > > > I understand that Robbin proposed to replace the usage of > > > > _suspend_flag with handshakes. Apparently, async handshakes > > > > are needed to do so. We have been waiting a while for removal > > > > of the _suspend_flag / introduction of async handshakes [2]. > > > > What is the status here? > > > > > I have an old prototype which I would like to continue to work on. > > > So do not assume async handshakes will make 15. > > > Even if it would, I think there is a lot more investigative work to remove > > > _suspend_flag. > > > > Let us know if we can be of any help to you, be it only testing. > > > > > >> Full: > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ > > > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > > You can move both declaration and definition to that file, no need to > > clobber > > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > > > Will do. > > > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in > its > > own > > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > > > You are right. It shouldn't be declared in thread.hpp.
I will look into that. > > > > > Note that we also think we may have a bug in deopt: > > > https://bugs.openjdk.java.net/browse/JDK-8238237 > > > > > I think it would be best, if possible, to push after that is resolved. > > > > Sure. > > > > > Not even nearly a full review :) > > > > I know :) > > > > Anyways, thanks a lot, > > Richard. > > > > > > -----Original Message----- > > From: Robbin Ehn > > Sent: Monday, March 2, 2020 11:17 AM > > To: Lindenmaier, Goetz ; Reingruber, > Richard > > ; David Holmes > ; > > Vladimir Kozlov (vladimir.kozlov at oracle.com) > > ; serviceability-dev at openjdk.java.net; > > hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- > > dev at openjdk.java.net > > Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance > > in the Presence of JVMTI Agents > > > > Hi, > > > > On 2/24/20 5:39 PM, Lindenmaier, Goetz wrote: > > > Hi, > > > > > > I had a look at the progress of this change. Nothing > > > happened since Richard posted his update using more > > > handshakes [1]. > > > But we (SAP) would appreciate a lot if this change could > > > be successfully reviewed and pushed. > > > > > > I think there is basic understanding that this > > > change is helpful. It fixes a number of issues with JVMTI, > > > and will deliver the same performance benefits as EA > > > does in current production mode for debugging scenarios. > > > > > > This is important for us as we run our VMs prepared > > > for debugging in production mode. > > > > > > I understand that Robbin proposed to replace the usage of > > > _suspend_flag with handshakes. Apparently, async handshakes > > > are needed to do so. We have been waiting a while for removal > > > of the _suspend_flag / introduction of async handshakes [2]. > > > What is the status here? > > > > I have an old prototype which I would like to continue to work on. > > So do not assume asynch handshakes will make 15. 
> > Even if it would, I think there is a lot more investigative work to remove > > _suspend_flag. > > > > > > > > I think we should no longer wait, but proceed with > > > this change. We will look into removing the usage of > > > suspend_flag introduced here once it is possible to implement > > > it with handshakes. > > > > Yes, sure. > > > > >> Full: > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ > > > > DeoptimizeObjectsALotThread is only used in compileBroker.cpp. > > You can move both declaration and definition to that file, no need to > clobber > > thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > > > Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its > > own > > hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > > > Note that we also think we may have a bug in deopt: > > https://bugs.openjdk.java.net/browse/JDK-8238237 > > > > I think it would be best, if possible, to push after that is resolved. > > > > Not even nearly a full review :) > > > > Thanks, Robbin > > > > > > >> Incremental: > > >> > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4.inc/ > > >> > > >> I was not able to eliminate the additional suspend flag now. I'll take care > > of this > > >> as soon as the > > >> existing suspend-resume mechanism is reworked. > > >> > > >> Testing: > > >> > > >> Nightly tests @SAP: > > >> > > >> JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015, > > Renaissance > > >> Suite, SAP specific tests > > >> with fastdebug and release builds on all platforms > > >> > > >> Stress testing with DeoptimizeObjectsALot running SPECjvm2008 40x > > parallel > > >> for 24h > > >> > > >> Thanks, Richard. > > >> > > >> > > >> More details on the changes: > > >> > > >> * Hide DeoptimizeObjectsALotThread from external view. > > >> > > >> * Changed EscapeBarrier_lock to be a _safepoint_check_never lock. 
> > >> It used to be _safepoint_check_sometimes, which will be eliminated > > sooner or > > >> later. > > >> I added explicit thread state changes with ThreadBlockInVM to code > > paths > > >> where we can wait() > > >> on EscapeBarrier_lock to become safepoint safe. > > >> > > >> * Use handshake EscapeBarrierSuspendHandshake to suspend target > > threads > > >> instead of vm operation > > >> VM_ThreadSuspendAllForObjDeopt. > > >> > > >> * Removed uses of Threads_lock. When adding a new thread we > suspend > > it iff > > >> EA optimizations are > > >> being reverted. In the previous version we were waiting on > > Threads_lock > > >> while EA optimizations > > >> were reverted. See EscapeBarrier::thread_added(). > > >> > > >> * Made tests require Xmixed compilation mode. > > >> > > >> * Made tests agnostic regarding tiered compilation. > > >> I.e. tc isn't disabled anymore, and the tests can be run with tc enabled > or > > >> disabled. > > >> > > >> * Exercising EATests.java as well with stress test options > > >> DeoptimizeObjectsALot* > > >> Due to the non-deterministic deoptimizations some tests need to be > > skipped. > > >> We do this to prevent bit-rot of the stress test code. > > >> > > >> * Executing EATests.java as well with graal if available. Driver for this is > > >> EATestsJVMCI.java. Graal cannot pass all tests, because it does not > > provide all > > >> the new debug info > > >> (namely not_global_escape_in_scope and arg_escape in > > scopeDesc.hpp). > > >> And graal does not yet support the JVMTI operations force early > return > > and > > >> pop frame. > > >> > > >> * Removed tracing from new jdi tests in EATests.java. Too much trace > > output > > >> before the debugging > > >> connection is established can cause deadlock because output buffers > fill > > up. 
> > >> (See https://bugs.openjdk.java.net/browse/JDK-8173304) > > >> > > >> * Many copyright year changes and smaller clean-up changes of testing > > code > > >> (trailing white-space and > > >> the like). > > >> > > >> > > >> -----Original Message----- > > >> From: David Holmes > > >> Sent: Donnerstag, 19. Dezember 2019 03:12 > > >> To: Reingruber, Richard ; serviceability- > > >> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; > > hotspot- > > >> runtime-dev at openjdk.java.net; Vladimir Kozlov > > (vladimir.kozlov at oracle.com) > > >> > > >> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance in > > >> the Presence of JVMTI Agents > > >> > > >> Hi Richard, > > >> > > >> I think my issue is with the way EliminateNestedLocks works so I'm going > > >> to look into that more deeply. > > >> > > >> Thanks for the explanations. > > >> > > >> David > > >> > > >> On 18/12/2019 12:47 am, Reingruber, Richard wrote: > > >>> Hi David, > > >>> > > >>> > > > Some further queries/concerns: > > >>> > > > > > >>> > > > src/hotspot/share/runtime/objectMonitor.cpp > > >>> > > > > > >>> > > > Can you please explain the changes to ObjectMonitor::wait: > > >>> > > > > > >>> > > > ! _recursions = save // restore the old recursion count > > >>> > > > ! + jt->get_and_reset_relock_count_after_wait(); // > > >>> > > > increased by the deferred relock count > > >>> > > > > > >>> > > > what is the "deferred relock count"? I gather it relates to > > >>> > > > > > >>> > > > "The code was extended to be able to deoptimize objects of a > > >>> > > frame that > > >>> > > > is not the top frame and to let another thread than the > owning > > >>> > > thread do > > >>> > > > it." > > >>> > > > > >>> > > Yes, these relate. Currently EA based optimizations are reverted, > > when a > > >> compiled frame is > > >>> > > replaced with corresponding interpreter frames. Part of this is > > relocking > > >> objects with eliminated > > >>> > > locking. 
New with the enhancement is that we do this also just > > before > > >> object references are > > >>> > > acquired through JVMTI. In this case we deoptimize also the > > owning > > >> compiled frame C and we > > >>> > > register deoptimized objects as deferred updates. When control > > returns > > >> to C it gets deoptimized, > > >>> > > we notice that objects are already deoptimized (reallocated and > > >> relocked), so we don't do it again > > >>> > > (relocking twice would be incorrect of course). Deferred updates > > are > > >> copied into the new > > >>> > > interpreter frames. > > >>> > > > > >>> > > Problem: relocking is not possible if the target thread T is waiting > > on the > > >> monitor that needs to > > >>> > > be relocked. This happens only with non-local objects with > > >> EliminateNestedLocks. Instead relocking > > >>> > > is deferred until T owns the monitor again. This is what the piece > of > > >> code above does. > > >>> > > > >>> > Sorry I need some more detail here. How can you wait() on an > > object > > >>> > monitor if the object allocation and/or locking was optimised > away? > > And > > >>> > what is a "non-local object" in this context? Isn't EA restricted to > > >>> > thread-confined objects? > > >>> > > >>> "Non-local object" is an object that escapes its thread. The issue I'm > > >> addressing with the changes > > >>> in ObjectMonitor::wait are almost unrelated to EA. They are caused by > > >> EliminateNestedLocks, where C2 > > >>> eliminates recursive locking of an already owned lock. The lock owning > > object > > >> exists on the heap, it > > >>> is locked and you can call wait() on it. > > >>> > > >>> EliminateLocks is the C2 option that controls lock elimination based on > > EA. > > >> Both optimizations have > > >>> in common that objects with eliminated locking need to be relocked > > when > > >> deoptimizing a frame, > > >>> i.e. when replacing a compiled frame with equivalent interpreter > > >>> frames. 
Deoptimization::relock_objects does that job for /all/ > eliminated > > >> locks in scope. /All/ can > > >>> be a mix of eliminated nested locks and locks of not-escaping objects. > > >>> > > >>> New with the enhancement: I call relock_objects earlier, just before > > objects > > >> potentially > > >>> escape. But then later when the owning compiled frame gets > > deoptimized, I > > >> must not do it again: > > >>> > > >>> See call to EscapeBarrier::objs_are_deoptimized in > deoptimization.cpp: > > >>> > > >>> 373 if ((jvmci_enabled || ((DoEscapeAnalysis || > > EliminateNestedLocks) && > > >> EliminateLocks)) > > >>> 374 && !EscapeBarrier::objs_are_deoptimized(thread, > > deoptee.id())) { > > >>> 375 bool unused; > > >>> 376 eliminate_locks(thread, chunk, realloc_failures, deoptee, > > exec_mode, > > >> unused); > > >>> 377 } > > >>> > > >>> Now when calling relock_objects early it is quite possible that I have to > > relock > > >> an object the > > >>> target thread currently waits for. Obviously I cannot relock in this case, > > >> instead I chose to > > >>> introduce relock_count_after_wait to JavaThread. > > >>> > > >>> > Is it just that some of the locking gets optimized away e.g. > > >>> > > > >>> > synchronised(obj) { > > >>> > synchronised(obj) { > > >>> > synchronised(obj) { > > >>> > obj.wait(); > > >>> > } > > >>> > } > > >>> > } > > >>> > > > >>> > If this is reduced to a form as-if it were a single lock of the monitor > > >>> > (due to EA) and the wait() triggers a JVM TI event which leads to > the > > >>> > escape of "obj" then we need to reconstruct the true lock state, > and > > so > > >>> > when the wait() internally unblocks and reacquires the monitor it > > has to > > >>> > set the true recursion count to 3, not the 1 that it appeared to be > > when > > >>> > wait() was initially called. Is that the scenario? > > >>> > > >>> Kind of... 
except that the locking is not eliminated due to EA and there > is > > no > > >> JVM TI event > > >>> triggered by wait. > > >>> > > >>> Add > > >>> > > >>> LocalObject l1 = new LocalObject(); > > >>> > > >>> in front of the synchronized blocks and assume a JVM TI agent acquires > l1. > > This > > >> triggers the code in > > >>> question. > > >>> > > >>> See that relocking/reallocating is transactional. If it is done then for > /all/ > > >> objects in scope and it is > > >>> done at most once. It wouldn't be quite so easy to split this in relocking > > of > > >> nested/EA-based > > >>> eliminated locks. > > >>> > > >>> > If so I find this truly awful. Anyone using wait() in a realistic form > > >>> > requires a notification and so the object cannot be thread > confined. > > In > > >>> > > >>> It is not thread confined. > > >>> > > >>> > which case I would strongly argue that upon hitting the wait() the > > deopt > > >>> > should occur unconditionally and so the lock state is correct before > > we > > >>> > wait and so we don't need to mess with the recursion count > > internally > > >>> > when we reacquire the monitor. > > >>> > > > >>> > > > > >>> > > > which I don't like the sound of at all when it comes to > > ObjectMonitor > > >>> > > > state. So I'd like to understand in detail exactly what is going > on > > here > > >>> > > > and why. This is a very intrusive change that seems to badly > > break > > >>> > > > encapsulation and impacts future changes to ObjectMonitor > > that are > > >> under > > >>> > > > investigation. > > >>> > > > > >>> > > I would not regard this as breaking encapsulation. Certainly not > > badly. > > >>> > > > > >>> > > I've added a property relock_count_after_wait to JavaThread. > The > > >> property is well > > >>> > > encapsulated. Future ObjectMonitor implementations have to > deal > > with > > >> recursion too. They are free > > >>> > > in choosing a way to do that as long as that property is taken into > > >> account. 
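The nested-locking scenario discussed above can be reproduced at the Java level with a small standalone program. This is only an illustration of the monitor semantics (the class and method names are made up, and whether C2 actually eliminates the inner recursive acquisitions depends on EliminateNestedLocks and on the method getting compiled): when wait() returns, the recursion count must be 3 again, which is exactly what the deferred relock count restores when relocking had to be postponed.

```java
// Sketch of the nested-locking wait() scenario from the discussion.
// The waiter acquires the same monitor three times, then calls wait(),
// which releases the monitor completely and must reacquire it with
// recursion count 3 on return. A driver thread provides the required
// notification. All names here are hypothetical, for illustration only.
public class NestedWait {
    static final Object obj = new Object();
    static boolean waiting = false;

    // Runs the waiter/notifier pair; returns "ok" once the waiter has
    // reacquired and released the monitor after wait().
    static String runScenario() throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (obj) {
                synchronized (obj) {         // recursive acquisition 2
                    synchronized (obj) {     // recursive acquisition 3
                        try {
                            waiting = true;
                            obj.notifyAll(); // let the driver proceed
                            obj.wait();      // releases the monitor fully;
                                             // reacquires with count 3
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }
            }
        });
        waiter.start();
        synchronized (obj) {
            while (!waiting) {
                obj.wait();                  // until the waiter is in wait()
            }
            obj.notifyAll();                 // wake the waiter
        }
        waiter.join();
        return "ok";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runScenario()); // prints: ok
    }
}
```

Note the handshake via the `waiting` flag: the waiter cannot be notified before it is actually in wait(), because it sets the flag and calls wait() without releasing the monitor in between.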
This is hardly a > > >>> > > limitation. > > >>> > > >>> > I do think this badly breaks encapsulation as you have to add a > > callout > > >>> > from the guts of the ObjectMonitor code to reach into the thread > to > > get > > >>> > this lock count adjustment. I understand why you have had to do > > this but > > >>> > I would much rather see a change to the EA optimisation strategy > so > > that > > >>> > this is not needed. > > >>> > > > >>> > > Note also that the property is a straightforward extension of the > > >> existing concept of deferred > > >>> > > local updates. It is embedded into the structure holding them. So > > not > > >> even the footprint of a > > >>> > > JavaThread is enlarged if no deferred updates are generated. > > >>> > > > >>> > [...] > > >>> > > > >>> > > > > >>> > > I'm actually duplicating the existing external suspend mechanism, > > >> because a thread can be > > >>> > > suspended at most once. And hey, I don't like that either! But > it > > >> seems not unlikely that the > > >>> > > duplicate can be removed together with the original and the new > > type > > >> of handshakes that will be > > >>> > > used for thread suspend can be used for object deoptimization > > too. See > > >> today's discussion in > > >>> > > JDK-8227745 [2]. > > >>> > > > >>> > I hope that discussion bears some fruit, at the moment it seems > not > > to > > >>> > be possible to use handshakes here. :( > > >>> > > > >>> > The external suspend mechanism is a royal pain in the proverbial > > that we > > >>> > have to carefully live with. The idea that we're duplicating that for > > >>> > use in another fringe area of functionality does not thrill me at all. > > >>> > > > >>> > To be clear, I understand the problem that exists and that you > wish > > to > > >>> > solve, but for the runtime parts I balk at the complexity cost of > > >>> > solving it. > > >>> > > >>> I know it's complex, but by far no rocket science. 
> > >>> > > >>> Also I find it hard to imagine another fix for JDK-8233915 besides > > changing > > >> the JVM TI specification. > > >>> > > >>> Thanks, Richard. > > >>> > > >>> -----Original Message----- > > >>> From: David Holmes > > >>> Sent: Dienstag, 17. Dezember 2019 08:03 > > >>> To: Reingruber, Richard ; serviceability- > > >> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; > > hotspot- > > >> runtime-dev at openjdk.java.net; Vladimir Kozlov > > (vladimir.kozlov at oracle.com) > > >> > > >>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > Performance > > >> in the Presence of JVMTI Agents > > >>> > > >>> > > >>> > > >>> David > > >>> > > >>> On 17/12/2019 4:57 pm, David Holmes wrote: > > >>>> Hi Richard, > > >>>> > > >>>> On 14/12/2019 5:01 am, Reingruber, Richard wrote: > > >>>>> Hi David, > > >>>>> > > >>>>> > Some further queries/concerns: > > >>>>> > > > >>>>> > src/hotspot/share/runtime/objectMonitor.cpp > > >>>>> > > > >>>>> > Can you please explain the changes to ObjectMonitor::wait: > > >>>>> > > > >>>>> > ! _recursions = save // restore the old recursion count > > >>>>> > ! + jt->get_and_reset_relock_count_after_wait(); // > > >>>>> > increased by the deferred relock count > > >>>>> > > > >>>>> > what is the "deferred relock count"? I gather it relates to > > >>>>> > > > >>>>> > "The code was extended to be able to deoptimize objects of a > > >>>>> frame that > > >>>>> > is not the top frame and to let another thread than the owning > > >>>>> thread do > > >>>>> > it." > > >>>>> > > >>>>> Yes, these relate. Currently EA based optimizations are reverted, > > when > > >>>>> a compiled frame is replaced > > >>>>> with corresponding interpreter frames. Part of this is relocking > > >>>>> objects with eliminated > > >>>>> locking. 
New with the enhancement is that we do this also just > before > > >>>>> object references are acquired > > >>>>> through JVMTI. In this case we deoptimize also the owning compiled > > >>>>> frame C and we register > > >>>>> deoptimized objects as deferred updates. When control returns to > C > > it > > >>>>> gets deoptimized, we notice > > >>>>> that objects are already deoptimized (reallocated and relocked), so > > we > > >>>>> don't do it again (relocking > > >>>>> twice would be incorrect of course). Deferred updates are copied > into > > >>>>> the new interpreter frames. > > >>>>> > > >>>>> Problem: relocking is not possible if the target thread T is waiting > > >>>>> on the monitor that needs to be > > >>>>> relocked. This happens only with non-local objects with > > >>>>> EliminateNestedLocks. Instead relocking is > > >>>>> deferred until T owns the monitor again. This is what the piece of > > >>>>> code above does. > > >>>> > > >>>> Sorry I need some more detail here. How can you wait() on an object > > >>>> monitor if the object allocation and/or locking was optimised away? > > And > > >>>> what is a "non-local object" in this context? Isn't EA restricted to > > >>>> thread-confined objects? > > >>>> > > >>>> Is it just that some of the locking gets optimized away e.g. > > >>>> > > >>>> synchronised(obj) { > > >>>> synchronised(obj) { > > >>>> synchronised(obj) { > > >>>> obj.wait(); > > >>>> } > > >>>> } > > >>>> } > > >>>> > > >>>> If this is reduced to a form as-if it were a single lock of the monitor > > >>>> (due to EA) and the wait() triggers a JVM TI event which leads to the > > >>>> escape of "obj" then we need to reconstruct the true lock state, and > so > > >>>> when the wait() internally unblocks and reacquires the monitor it has > to > > >>>> set the true recursion count to 3, not the 1 that it appeared to be > when > > >>>> wait() was initially called. Is that the scenario? > > >>>> > > >>>> If so I find this truly awful. 
Anyone using wait() in a realistic form > > >>>> requires a notification and so the object cannot be thread confined. > In > > >>>> which case I would strongly argue that upon hitting the wait() the > > deopt > > >>>> should occur unconditionally and so the lock state is correct before > we > > >>>> wait and so we don't need to mess with the recursion count internally > > >>>> when we reacquire the monitor. > > >>>> > > >>>>> > > >>>>> > which I don't like the sound of at all when it comes to > > >>>>> ObjectMonitor > > >>>>> > state. So I'd like to understand in detail exactly what is going > > >>>>> on here > > >>>>> > and why. This is a very intrusive change that seems to badly > > break > > >>>>> > encapsulation and impacts future changes to ObjectMonitor > that > > >>>>> are under > > >>>>> > investigation. > > >>>>> > > >>>>> I would not regard this as breaking encapsulation. Certainly not > badly. > > >>>>> > > >>>>> I've added a property relock_count_after_wait to JavaThread. The > > >>>>> property is well > > >>>>> encapsulated. Future ObjectMonitor implementations have to deal > > with > > >>>>> recursion too. They are free in > > >>>>> choosing a way to do that as long as that property is taken into > > >>>>> account. This is hardly a > > >>>>> limitation. > > >>>> > > >>>> I do think this badly breaks encapsulation as you have to add a callout > > >>>> from the guts of the ObjectMonitor code to reach into the thread to > > get > > >>>> this lock count adjustment. I understand why you have had to do this > > but > > >>>> I would much rather see a change to the EA optimisation strategy so > > that > > >>>> this is not needed. > > >>>> > > >>>>> Note also that the property is a straightforward extension of the > > >>>>> existing concept of deferred > > >>>>> local updates. It is embedded into the structure holding them. So > not > > >>>>> even the footprint of a > > >>>>> JavaThread is enlarged if no deferred updates are generated. 
> > >>>>> > > >>>>> > --- > > >>>>> > > > >>>>> > src/hotspot/share/runtime/thread.cpp > > >>>>> > > > >>>>> > Can you please explain why > > >>>>> JavaThread::wait_for_object_deoptimization > > >>>>> > has to be handcrafted in this way rather than using proper > > >>>>> transitions. > > >>>>> > > > >>>>> > > >>>>> I wrote wait_for_object_deoptimization taking > > >>>>> JavaThread::java_suspend_self_with_safepoint_check > > >>>>> as template. So in short: for the same reasons :) > > >>>>> > > >>>>> Threads reach both methods as part of thread state transitions, > > >>>>> therefore special handling is > > >>>>> required to change thread state on top of ongoing transitions. > > >>>>> > > >>>>> > We got rid of "deopt suspend" some time ago and it is > disturbing > > >>>>> to see > > >>>>> > it being added back (effectively). This seems like it may be > > >>>>> something > > >>>>> > that handshakes could be used for. > > >>>>> > > >>>>> Deopt suspend used to be something rather different with a similar > > >>>>> name[1]. It is not being added back. > > >>>> > > >>>> I stand corrected. Despite comments in the code to the contrary > > >>>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot > of > > >>>> cleanup in this area 13 years ago :) > > >>>> > > >>>>> > > >>>>> I'm actually duplicating the existing external suspend mechanism, > > >>>>> because a thread can be suspended > > >>>>> at most once. And hey, I don't like that either! But it seems not > > >>>>> unlikely that the duplicate can > > >>>>> be removed together with the original and the new type of > > handshakes > > >>>>> that will be used for > > >>>>> thread suspend can be used for object deoptimization too. See > > today's > > >>>>> discussion in JDK-8227745 [2]. > > >>>> > > >>>> I hope that discussion bears some fruit, at the moment it seems not > to > > >>>> be possible to use handshakes here. 
:( > > >>>> > > >>>> The external suspend mechanism is a royal pain in the proverbial that > > we > > >>>> have to carefully live with. The idea that we're duplicating that for > > >>>> use in another fringe area of functionality does not thrill me at all. > > >>>> > > >>>> To be clear, I understand the problem that exists and that you wish to > > >>>> solve, but for the runtime parts I balk at the complexity cost of > > >>>> solving it. > > >>>> > > >>>> Thanks, > > >>>> David > > >>>> ----- > > >>>> > > >>>>> Thanks, Richard. > > >>>>> > > >>>>> [1] Deopt suspend was something like an async. handshake for > > >>>>> architectures with register windows, > > >>>>> where patching the return pc for deoptimization of a compiled > > >>>>> frame was racy if the owner thread > > >>>>> was in native code. Instead a "deopt" suspend flag was set on > > >>>>> which the thread patched its own > > >>>>> frame upon return from native. So no thread was suspended. It > > got > > >>>>> its name only from the name of > > >>>>> the flags. > > >>>>> > > >>>>> [2] Discussion about using handshakes to sync. with the target > thread: > > >>>>> > > >>>>> https://bugs.openjdk.java.net/browse/JDK- > > >> > > > 8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.syst > > e > > >> m.issuetabpanels:comment-tabpanel#comment-14306727 > > >>>>> > > >>>>> > > >>>>> -----Original Message----- > > >>>>> From: David Holmes > > >>>>> Sent: Freitag, 13. 
Dezember 2019 00:56 > > >>>>> To: Reingruber, Richard ; > > >>>>> serviceability-dev at openjdk.java.net; > > >>>>> hotspot-compiler-dev at openjdk.java.net; > > >>>>> hotspot-runtime-dev at openjdk.java.net > > >>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > >>>>> Performance in the Presence of JVMTI Agents > > >>>>> > > >>>>> Hi Richard, > > >>>>> > > >>>>> Some further queries/concerns: > > >>>>> > > >>>>> src/hotspot/share/runtime/objectMonitor.cpp > > >>>>> > > >>>>> Can you please explain the changes to ObjectMonitor::wait: > > >>>>> > > >>>>> ! _recursions = save // restore the old recursion count > > >>>>> ! + jt->get_and_reset_relock_count_after_wait(); // > > >>>>> increased by the deferred relock count > > >>>>> > > >>>>> what is the "deferred relock count"? I gather it relates to > > >>>>> > > >>>>> "The code was extended to be able to deoptimize objects of a > frame > > that > > >>>>> is not the top frame and to let another thread than the owning > thread > > do > > >>>>> it." > > >>>>> > > >>>>> which I don't like the sound of at all when it comes to ObjectMonitor > > >>>>> state. So I'd like to understand in detail exactly what is going on here > > >>>>> and why. This is a very intrusive change that seems to badly break > > >>>>> encapsulation and impacts future changes to ObjectMonitor that > are > > under > > >>>>> investigation. > > >>>>> > > >>>>> --- > > >>>>> > > >>>>> src/hotspot/share/runtime/thread.cpp > > >>>>> > > >>>>> Can you please explain why > > JavaThread::wait_for_object_deoptimization > > >>>>> has to be handcrafted in this way rather than using proper > transitions. > > >>>>> > > >>>>> We got rid of "deopt suspend" some time ago and it is disturbing to > > see > > >>>>> it being added back (effectively). This seems like it may be > something > > >>>>> that handshakes could be used for. 
> > >>>>> > > >>>>> Thanks, > > >>>>> David > > >>>>> ----- > > >>>>> > > >>>>> On 12/12/2019 7:02 am, David Holmes wrote: > > >>>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote: > > >>>>>>> Hi David, > > >>>>>>> > > >>>>>>> > Most of the details here are in areas I can comment on in > > detail, > > >>>>>>> but I > > >>>>>>> > did take an initial general look at things. > > >>>>>>> > > >>>>>>> Thanks for taking the time! > > >>>>>> > > >>>>>> Apologies the above should read: > > >>>>>> > > >>>>>> "Most of the details here are in areas I *can't* comment on in > detail > > >>>>>> ..." > > >>>>>> > > >>>>>> David > > >>>>>> > > >>>>>>> > The only thing that jumped out at me is that I think the > > >>>>>>> > DeoptimizeObjectsALotThread should be a hidden thread. > > >>>>>>> > > > >>>>>>> > + bool is_hidden_from_external_view() const { return true; > } > > >>>>>>> > > >>>>>>> Yes, it should. Will add the method like above. > > >>>>>>> > > >>>>>>> > Also I don't see any testing of the > > DeoptimizeObjectsALotThread. > > >>>>>>> Without > > >>>>>>> > active testing this will just bit-rot. > > >>>>>>> > > >>>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger > > >>>>>>> workload. I will add a minimal test > > >>>>>>> to keep it fresh. > > >>>>>>> > > >>>>>>> > Also on the tests I don't understand your @requires clause: > > >>>>>>> > > > >>>>>>> > @requires ((vm.compMode != "Xcomp") & > > vm.compiler2.enabled > > >> & > > >>>>>>> > (vm.opt.TieredCompilation != true)) > > >>>>>>> > > > >>>>>>> > This seems to require that TieredCompilation is disabled, but > > >>>>>>> tiered is > > >>>>>>> > our normal mode of operation. ?? > > >>>>>>> > > > >>>>>>> > > >>>>>>> I removed the clause. 
I guess I wanted to target the tests towards > > the > > >>>>>>> code they are supposed to > > >>>>>>> test, and it's easier to analyze failures w/o tiered compilation and > > >>>>>>> with just one compiler thread. > > >>>>>>> > > >>>>>>> Additionally I will make use of > > >>>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the > tests. > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Richard. > > >>>>>>> > > >>>>>>> -----Original Message----- > > >>>>>>> From: David Holmes > > >>>>>>> Sent: Mittwoch, 11. Dezember 2019 08:03 > > >>>>>>> To: Reingruber, Richard ; > > >>>>>>> serviceability-dev at openjdk.java.net; > > >>>>>>> hotspot-compiler-dev at openjdk.java.net; > > >>>>>>> hotspot-runtime-dev at openjdk.java.net > > >>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better > > >>>>>>> Performance in the Presence of JVMTI Agents > > >>>>>>> > > >>>>>>> Hi Richard, > > >>>>>>> > > >>>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote: > > >>>>>>>> Hi, > > >>>>>>>> > > >>>>>>>> I would like to get reviews please for > > >>>>>>>> > > >>>>>>>> > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/ > > >>>>>>>> > > >>>>>>>> Corresponding RFE: > > >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745 > > >>>>>>>> > > >>>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915 > > >>>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK- > > 8214584 [1] > > >>>>>>>> > > >>>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing > > without > > >>>>>>>> issues (thanks!). In addition the > > >>>>>>>> change is being tested at SAP since I posted the first RFR some > > >>>>>>>> months ago. > > >>>>>>>> > > >>>>>>>> The intention of this enhancement is to benefit performance > wise > > from > > >>>>>>>> escape analysis even if JVMTI > > >>>>>>>> agents request capabilities that allow them to access local > variable > > >>>>>>>> values. E.g. 
if you start-up > > >>>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, > > then > > >>>>>>>> escape analysis is disabled right > > >>>>>>>> from the beginning, well before a debugger attaches -- if ever > one > > >>>>>>>> should do so. With the > > >>>>>>>> enhancement, escape analysis will remain enabled until and > after > > a > > >>>>>>>> debugger attaches. EA based > > >>>>>>>> optimizations are reverted just before an agent acquires the > > >>>>>>>> reference to an object. In the JBS item > > >>>>>>>> you'll find more details. > > >>>>>>> > > >>>>>>> Most of the details here are in areas I can comment on in detail, > but > > I > > >>>>>>> did take an initial general look at things. > > >>>>>>> > > >>>>>>> The only thing that jumped out at me is that I think the > > >>>>>>> DeoptimizeObjectsALotThread should be a hidden thread. > > >>>>>>> > > >>>>>>> + bool is_hidden_from_external_view() const { return true; } > > >>>>>>> > > >>>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread. > > >>>>>>> Without > > >>>>>>> active testing this will just bit-rot. > > >>>>>>> > > >>>>>>> Also on the tests I don't understand your @requires clause: > > >>>>>>> > > >>>>>>> @requires ((vm.compMode != "Xcomp") & > > vm.compiler2.enabled & > > >>>>>>> (vm.opt.TieredCompilation != true)) > > >>>>>>> > > >>>>>>> This seems to require that TieredCompilation is disabled, but > tiered > > is > > >>>>>>> our normal mode of operation. ?? > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> David > > >>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Richard. 
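The kind of code that benefits from keeping escape analysis enabled under a JVMTI agent is, for example, synchronization on an object that never escapes its method. The sketch below is only an illustration (class and method names are made up, and whether C2 actually scalar-replaces the object and elides the lock depends on DoEscapeAnalysis/EliminateLocks and on the method getting compiled):

```java
// A lock on a non-escaping object: with escape analysis C2 can scalar
// replace 'box' and eliminate the synchronization entirely. Before the
// enhancement discussed here, merely starting the VM with a JDWP agent
// disabled the optimization; with it, the optimization stays enabled
// and is only reverted when an agent acquires a reference to such an
// object. All names are hypothetical, for illustration only.
public class LocalLock {
    static int sum(int[] a) {
        Object box = new Object();   // never escapes sum()
        int s = 0;
        synchronized (box) {         // candidate for lock elimination
            for (int v : a) {
                s += v;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4})); // prints: 10
    }
}
```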
> > >>>>>>>> > > >>>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745 > > >>>>>>>> > > >> > > > http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.pa > > tc > > >> h > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> From richard.reingruber at sap.com Wed Apr 1 06:19:10 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Wed, 1 Apr 2020 06:19:10 +0000 Subject: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents In-Reply-To: <0a07f87e-ede1-edbd-c754-e7df884e0545@oracle.com> References: <1f8a3c7a-fa0f-b5b2-4a8a-7d3d8dbbe1b5@oracle.com> <4b56a45c-a14c-6f74-2bfd-25deaabe8201@oracle.com> <5271429a-481d-ddb9-99dc-b3f6670fcc0b@oracle.com> <0a07f87e-ede1-edbd-c754-e7df884e0545@oracle.com> Message-ID: > Thanks for cleaning up thread.hpp! Thanks for providing the feedback! I just noticed that the forward declaration of class jvmtiDeferredLocalVariableSet is not required anymore. Will remove it in the next webrev. Hope to get some more (partial) reviews. Thanks, Richard. -----Original Message----- From: Robbin Ehn Sent: Dienstag, 31. März 2020 16:21 To: Reingruber, Richard ; Doerr, Martin ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents Thanks for cleaning up thread.hpp! /Robbin On 2020-03-30 10:31, Reingruber, Richard wrote: > Hi, > > this is webrev.5 based on Robbin's feedback and Martin's review - thanks! :) > > The change affects jvmti, hotspot and c2. Partial reviews are very welcome too. > > Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/ > Delta: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5.inc/ > > Robbin, Martin, please let me know, if anything shouldn't be quite as you wanted it. 
Also find my > comments on your feedback below. > > Robbin, can I count you as Reviewer for the runtime part? > > Thanks, Richard. > > -- > >> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >> You can move both declaration and definition to that file, no need to clobber >> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) > > Done. > >> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its own >> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. > > I moved JvmtiDeferredUpdates to vframe_hp.hpp where preexisting jvmtiDeferredLocalVariableSet is > declared. > >> src/hotspot/share/code/compiledMethod.cpp >> Nice cleanup! > > Thanks :) > >> src/hotspot/share/code/debugInfoRec.cpp >> src/hotspot/share/code/debugInfoRec.hpp >> Additional parameters. (Remark: I think "non_global_escape_in_scope" would read better than "not_global_escape_in_scope", but your version is consistent with existing code, so no change request from my side.) Ok. > > I've been thinking about this too and finally stayed with not_global_escape_in_scope. It's supposed > to mean an object whose escape state is not GlobalEscape is in scope. > >> src/hotspot/share/compiler/compileBroker.cpp >> src/hotspot/share/compiler/compileBroker.hpp >> Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a follow up change together with the test in order to make this webrev smaller, but since it is included, I'm reviewing everything at once. Not a big deal.) Ok. > > Yes, the change would be a little smaller. And if it helps I'll split it off. In general I prefer > patches that bring along a suitable amount of tests. > >> src/hotspot/share/opto/c2compiler.cpp >> Make do_escape_analysis independent of JVMCI capabilities. Nice! > > It is the main goal of the enhancement. It is done for C2, but could be done for JVMCI compilers > with just a small effort as well. 
> >> src/hotspot/share/opto/escape.cpp >> Annotation for MachSafePointNodes. Your added functionality looks correct. >> But I'd prefer to move the bulky code out of the large function. >> I suggest to factor out something like has_not_global_escape and has_arg_escape. So the code could look like this: >> SafePointNode* sfn = sfn_worklist.at(next); >> sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); >> if (sfn->is_CallJava()) { >> CallJavaNode* call = sfn->as_CallJava(); >> call->set_arg_escape(has_arg_escape(call)); >> } >> This would also allow us to get rid of the found_..._escape_in_args variables making the loops better readable. > > Done. > >> It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to be the way to do it (there are more such places). So it's ok. > > Yeah. I copied the snippet. > >> src/hotspot/share/prims/jvmtiImpl.cpp >> src/hotspot/share/prims/jvmtiImpl.hpp >> The sequence is pretty complex: >> VM_GetOrSetLocal element initialization executes EscapeBarrier code which suspends the target thread (extra VM Operation). > > Note that the target threads have to be suspended already for VM_GetOrSetLocal*. So it's mainly the > synchronization effect of EscapeBarrier::sync_and_suspend_one() that is required here. Also no extra > _handshake_ is executed, since sync_and_suspend_one() will find the target threads already > suspended. > >> VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to prepare VM Operation with frame deoptimization). >> VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which resumes the target thread. >> But I don't have any improvement proposal. Performance is probably not a concern, here. So it's ok. > >> VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has non-globally escaping objects and other frames if they have arg escaping ones. Good. > > It's not specifically the top frame, but the frame that is accessed. 
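[Editorial note] The VM_GetOrSetLocal sequence discussed above — suspend the target in the EscapeBarrier constructor, deoptimize its objects in doit_prologue, resume in the destructor — is essentially an RAII pattern. The following is a toy Java model of just that ordering guarantee; all names are invented for illustration, this is not HotSpot code:

```java
// Models the lifecycle only: suspend the target thread on construction,
// deoptimize its objects while it is stopped, resume it when the scope
// is left. The StringBuilder log records the ordering.
class EscapeBarrierScope implements AutoCloseable {
    private final StringBuilder log;

    EscapeBarrierScope(StringBuilder log) {
        this.log = log;
        log.append("suspend;");   // models sync_and_suspend_one()
    }

    void deoptimizeObjects() {
        log.append("deopt;");     // models the work done in doit_prologue()
    }

    @Override
    public void close() {
        log.append("resume;");    // models resume_one() in the destructor
    }
}
```

Used with try-with-resources, the resume happens even if the deoptimization step fails, which mirrors the guarantee the C++ destructor provides.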
> >> src/hotspot/share/runtime/deoptimization.cpp >> Object deoptimization. I have more comments and proposals, here. >> First of all, handling recursive and waiting locks in relock_objects is tricky, but looks correct. >> Comments are sufficient to understand why things are done as they are implemented. > >> BiasedLocking related parts are complex, but we may get rid of them in the future (with BiasedLocking removal). >> Anyway, looks correct, too. > >> Typo in comment: "regularily" => "regularly" > >> Deoptimization::fetch_unroll_info_helper is the only place where _jvmti_deferred_updates get deallocated (except JavaThread destructor). But I think we always go through it, so I can't see a memory leak or such kind of issues. > > That's correct. The compiled frame for which deferred updates are allocated is always deoptimized > before (see EscapeBarrier::deoptimize_objects()). This is also asserted in > compiledVFrame::update_deferred_value(). I've added the same assertion to > Deoptimization::relock_objects(). So we can be sure that _jvmti_deferred_updates are deallocated > again in fetch_unroll_info_helper(). > >> EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread(). > > Sure, well spotted! > >> You can use MutexLocker and MonitorLocker with Thread* to save the Thread::current() call. > > Right, good hint. This was recently introduced with 8235678. I even had to resolve conflicts. Should > have done this then. > >> I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier interface because I think it shouldn't be used outside of EscapeBarrier::deoptimize_objects. > > Done. > >> Typo in comment: "we must only deoptimize" => "we only have to deoptimize" > > Replaced with "[...] we deoptimize iff local objects are passed as args" > >> "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and barrier_active() is redundant. Implementation can get moved to hpp file. > > Ok. Done. 
> >> I'll get back to suspend flags, later. > >> There are weird cases regarding _self_deoptimization_in_progress. >> Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can set _self_deoptimization_in_progress while A performs the handshake for suspending C. I think this doesn't lead to errors, but it's probably not desired. >> I think it would be better to use only one "wait" call in sync_and_suspend_one and sync_and_suspend_all. > > You're right. We've discussed that face-to-face, but couldn't find a real issue. But now, thinking again, I reckon I found one: > > 2808 // Sync with other threads that might be doing deoptimizations > 2809 { > 2810 // Need to switch to _thread_blocked for the wait() call > 2811 ThreadBlockInVM tbivm(_calling_thread); > 2812 MonitorLocker ml(EscapeBarrier_lock, Mutex::_no_safepoint_check_flag); > 2813 while (_self_deoptimization_in_progress) { > 2814 ml.wait(); > 2815 } > 2816 > 2817 if (self_deopt()) { > 2818 _self_deoptimization_in_progress = true; > 2819 } > 2820 > 2821 while (_deoptee_thread->is_ea_obj_deopt_suspend()) { > 2822 ml.wait(); > 2823 } > 2824 > 2825 if (self_deopt()) { > 2826 return; > 2827 } > 2828 > 2829 // set suspend flag for target thread > 2830 _deoptee_thread->set_ea_obj_deopt_flag(); > 2831 } > > - A waits in 2822 > - C is suspended > - B notifies all in resume_one() > - A and C wake up > - C wins over A and sets _self_deoptimization_in_progress = true in 2818 > - C does the self deoptimization > - A executes 2830 _deoptee_thread->set_ea_obj_deopt_flag() > > C will self suspend at some undefined point. The resulting state is illegal. > >> I first thought it'd be better to move ThreadBlockInVM before wait() to reduce thread state transitions, but that seems to be problematic because ThreadBlockInVM destructor contains a safepoint check which we shouldn't do while holding EscapeBarrier_lock. So no change request. 
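[Editorial note] The single-wait fix Martin proposes can be modeled outside HotSpot with plain Java monitors (class and field names here are invented; the real code uses a MonitorLocker on EscapeBarrier_lock). The point is that a woken thread re-checks both conditions in one loop before it either claims the self-deoptimization flag or sets the suspend flag, so the interleaving above — C claiming the self-deopt while A still goes on to set the suspend flag — cannot occur:

```java
// Toy model (not HotSpot code) of merging the two wait loops into one.
class EscapeBarrierModel {
    private boolean selfDeoptInProgress = false;
    private boolean deopteeSuspended = false;

    synchronized void enter(boolean selfDeopt) {
        // Single wait loop: after notifyAll() a woken thread cannot act
        // on a stale observation of either flag.
        while (selfDeoptInProgress || deopteeSuspended) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // ignore in this toy model
            }
        }
        if (selfDeopt) {
            selfDeoptInProgress = true;
        } else {
            deopteeSuspended = true;   // corresponds to set_ea_obj_deopt_flag()
        }
    }

    synchronized void leave(boolean selfDeopt) {
        if (selfDeopt) {
            selfDeoptInProgress = false;
        } else {
            deopteeSuspended = false;
        }
        notifyAll();                   // corresponds to resume_one()
    }

    synchronized boolean invariantHolds() {
        // The buggy interleaving ends with both flags effectively set;
        // with a single wait loop that state is unreachable.
        return !(selfDeoptInProgress && deopteeSuspended);
    }
}
```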
> > Yes, would be nice to have the state change only if needed, but for the reason you mentioned it is > not quite as easy as it seems to be. I experimented as well with a second lock, but did not succeed. > >> Change in thread_added: >> I think the sequence would be more comprehensible if we waited for deopt_all_threads in Thread::start and all other places where a new thread can run into Java code (e.g. JVMTI attach). >> Your version makes new threads come up with suspend flag set. That looks correct, too. Advantage is that you only have to change one place (thread_added). It'll be interesting to see what it will look like when we use async handshakes instead of suspend flags. >> For now, I'm ok with your version. > > I had a version that did what you are suggesting. The current version also has the advantage that > there are fewer places where a thread has to wait for ongoing object deoptimization. This means > fewer places where you have to worry about correct thread state transitions, possible deadlocks, > and if all oops are properly Handle'ed. > >> I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt->is_hidden_from_external_view()). > > Done. > >> Having 4 different deoptimize_objects functions makes it a little hard to keep an overview of which one is used for what. >> Maybe adding suffixes would help a little bit, but I can also live with what you have. >> Implementation looks correct to me. > > 2 are internal. I added the suffix _internal to them. This leaves 2 to choose from. > >> src/hotspot/share/runtime/deoptimization.hpp >> Escape barriers and object deoptimization functions. >> Typo in comment: "helt" => "held" > > Done in place already. > >> src/hotspot/share/runtime/interfaceSupport.cpp >> InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot = 1. >> I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad to have DeoptimizeObjectsALot = 1 in addition. Ok. 
> > I never used DeoptimizeObjectsALot = 1 that much. It could be more deterministic in single threaded > scenarios. I wouldn't object to get rid of it though. > >> src/hotspot/share/runtime/stackValue.hpp >> Better reinitialization in StackValue. Good. > > StackValue::obj_is_scalar_replaced() should not return true after calling set_obj(). > >> src/hotspot/share/runtime/thread.cpp >> src/hotspot/share/runtime/thread.hpp >> src/hotspot/share/runtime/thread.inline.hpp >> wait_for_object_deoptimization, suspend flag, deferred updates and test feature to deoptimize objects. > >> In the long term, we want to get rid of suspend flags, so it's not so nice to introduce a new one. But I agree with Götz that it should be acceptable as a temporary solution until async handshakes are available (which takes more time). So I'm ok with your change. > > I'm keen to build the feature on async handshakes when they arrive. > >> You can use MutexLocker with Thread*. > > Done. > >> JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out of thread.hpp. > > Done. > >> src/hotspot/share/runtime/vframe.cpp >> Added support for entry frame to new_vframe. Ok. > > >> src/hotspot/share/runtime/vframe_hp.cpp >> src/hotspot/share/runtime/vframe_hp.hpp > >> I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() should better be under #ifdef ASSERT or inside the assert statement (no need for code cache walking in product build). > > Done. > >> jvmtiDeferredLocalVariableSet::update_monitors: >> Please add a comment explaining that owner referenced by original info may be scalar replaced, but it is deoptimized in the vframe. > > Done. > > -----Original Message----- > From: Doerr, Martin > Sent: Donnerstag, 12. 
März 2020 17:28 > To: Reingruber, Richard ; 'Robbin Ehn' ; Lindenmaier, Goetz ; David Holmes ; Vladimir Kozlov (vladimir.kozlov at oracle.com) ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net > Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents > > Hi Richard, > > > I managed to find time for an (almost) complete review of webrev.4. (I'll review the tests separately.) > > First of all, the change seems to be of pretty good quality for its significant complexity. I couldn't find any real bugs. But I'd like to propose minor improvements. > I'm convinced that it's mature because we did substantial testing. > > I like the new functionality for object deoptimization. It can possibly be reused for future escape analysis based optimizations. So I appreciate having it available in the code base. > In addition to that, your change makes the JVMTI implementation better integrated into the VM. > > > Now to the details: > > > src/hotspot/share/c1/c1_IR.hpp > describe_scope parameters. Ok. > > > src/hotspot/share/ci/ciEnv.cpp > src/hotspot/share/ci/ciEnv.hpp > Fix for JvmtiExport::can_walk_any_space() capability. Ok. > > > src/hotspot/share/code/compiledMethod.cpp > Nice cleanup! > > > src/hotspot/share/code/debugInfoRec.cpp > src/hotspot/share/code/debugInfoRec.hpp > Additional parameters. (Remark: I think "non_global_escape_in_scope" would read better than "not_global_escape_in_scope", but your version is consistent with existing code, so no change request from my side.) Ok. > > > src/hotspot/share/code/nmethod.cpp > Nice cleanup! > > > src/hotspot/share/code/pcDesc.hpp > Additional parameters. Ok. > > > src/hotspot/share/code/scopeDesc.cpp > src/hotspot/share/code/scopeDesc.hpp > Improved implementation + additional parameters. Ok. 
> > > src/hotspot/share/compiler/compileBroker.cpp > src/hotspot/share/compiler/compileBroker.hpp > Extra thread for DeoptimizeObjectsALot. (Remark: I would have put it into a follow up change together with the test in order to make this webrev smaller, but since it is included, I'm reviewing everything at once. Not a big deal.) Ok. > > > src/hotspot/share/jvmci/jvmciCodeInstaller.cpp > Additional parameters. Ok. > > > src/hotspot/share/opto/c2compiler.cpp > Make do_escape_analysis independent of JVMCI capabilities. Nice! > > > src/hotspot/share/opto/callnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/escape.cpp > Annotation for MachSafePointNodes. Your added functionality looks correct. > But I'd prefer to move the bulky code out of the large function. > I suggest to factor out something like has_not_global_escape and has_arg_escape. So the code could look like this: > SafePointNode* sfn = sfn_worklist.at(next); > sfn->set_not_global_escape_in_scope(has_not_global_escape(sfn)); > if (sfn->is_CallJava()) { > CallJavaNode* call = sfn->as_CallJava(); > call->set_arg_escape(has_arg_escape(call)); > } > This would also allow us to get rid of the found_..._escape_in_args variables making the loops better readable. > > It's kind of ugly to use strcmp to recognize uncommon trap, but that seems to be the way to do it (there are more such places). So it's ok. > > > src/hotspot/share/opto/machnode.hpp > Additional fields for MachSafePointNodes. Ok. > > > src/hotspot/share/opto/macro.cpp > Allow elimination of non-escaping allocations. Ok. > > > src/hotspot/share/opto/matcher.cpp > src/hotspot/share/opto/output.cpp > Copy attribute / pass parameters. Ok. > > > src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp > Nice cleanup! > > > src/hotspot/share/prims/jvmtiEnv.cpp > src/hotspot/share/prims/jvmtiEnvBase.cpp > Escape barriers + deoptimize objects for target thread. Good. 
> > > src/hotspot/share/prims/jvmtiImpl.cpp > src/hotspot/share/prims/jvmtiImpl.hpp > The sequence is pretty complex: > VM_GetOrSetLocal element initialization executes EscapeBarrier code which suspends the target thread (extra VM Operation). > VM_GetOrSetLocal::doit_prologue performs object deoptimization (by VM Thread to prepare VM Operation with frame deoptimization). > VM_GetOrSetLocal destructor implicitly calls EscapeBarrier destructor which resumes the target thread. > But I don't have any improvement proposal. Performance is probably not a concern, here. So it's ok. > > VM_GetOrSetLocal::deoptimize_objects deoptimizes the top frame if it has non-globally escaping objects and other frames if they have arg escaping ones. Good. > > > src/hotspot/share/prims/jvmtiTagMap.cpp > Escape barriers + deoptimize objects for all threads. Ok. > > > src/hotspot/share/prims/whitebox.cpp > Added WB_IsFrameDeoptimized to API. Ok. > > > src/hotspot/share/runtime/deoptimization.cpp > Object deoptimization. I have more comments and proposals, here. > First of all, handling recursive and waiting locks in relock_objects is tricky, but looks correct. > Comments are sufficient to understand why things are done as they are implemented. > > BiasedLocking related parts are complex, but we may get rid of them in the future (with BiasedLocking removal). > Anyway, looks correct, too. > > Typo in comment: "regularily" => "regularly" > > Deoptimization::fetch_unroll_info_helper is the only place where _jvmti_deferred_updates get deallocated (except JavaThread destructor). But I think we always go through it, so I can't see a memory leak or such kind of issues. > > EscapeBarrier::deoptimize_objects: ResourceMark should use calling_thread(). > > You can use MutexLocker and MonitorLocker with Thread* to save the Thread::current() call. 
> > I'd make set_objs_are_deoptimized static and remove it from the EscapeBarrier interface because I think it shouldn't be used outside of EscapeBarrier::deoptimize_objects. > > Typo in comment: "we must only deoptimize" => "we only have to deoptimize" > > "bool EscapeBarrier::deoptimize_objects(intptr_t* fr_id)" is trivial and barrier_active() is redundant. Implementation can get moved to hpp file. > > I'll get back to suspend flags, later. > > There are weird cases regarding _self_deoptimization_in_progress. > Assume we have 3 threads A, B and C. A deopts C, B deopts C, C deopts C. C can set _self_deoptimization_in_progress while A performs the handshake for suspending C. I think this doesn't lead to errors, but it's probably not desired. > I think it would be better to use only one "wait" call in sync_and_suspend_one and sync_and_suspend_all. > > I first thought it'd be better to move ThreadBlockInVM before wait() to reduce thread state transitions, but that seems to be problematic because ThreadBlockInVM destructor contains a safepoint check which we shouldn't do while holding EscapeBarrier_lock. So no change request. > > Change in thread_added: > I think the sequence would be more comprehensible if we waited for deopt_all_threads in Thread::start and all other places where a new thread can run into Java code (e.g. JVMTI attach). > Your version makes new threads come up with suspend flag set. That looks correct, too. Advantage is that you only have to change one place (thread_added). It'll be interesting to see what it will look like when we use async handshakes instead of suspend flags. > For now, I'm ok with your version. > > I'd only move MutexLocker ml(EscapeBarrier_lock...) after if (!jt->is_hidden_from_external_view()). > > Having 4 different deoptimize_objects functions makes it a little hard to keep an overview of which one is used for what. > Maybe adding suffixes would help a little bit, but I can also live with what you have. 
> Implementation looks correct to me. > > > src/hotspot/share/runtime/deoptimization.hpp > Escape barriers and object deoptimization functions. > Typo in comment: "helt" => "held" > > > src/hotspot/share/runtime/globals.hpp > Addition of develop flag DeoptimizeObjectsALotInterval. Ok. > > > src/hotspot/share/runtime/interfaceSupport.cpp > InterfaceSupport::deoptimizeAllObjects() is only used for DeoptimizeObjectsALot = 1. > I think DeoptimizeObjectsALot = 2 is more important, but I think it's not bad to have DeoptimizeObjectsALot = 1 in addition. Ok. > > > src/hotspot/share/runtime/interfaceSupport.inline.hpp > Addition of deoptimizeAllObjects. Ok. > > > src/hotspot/share/runtime/mutexLocker.cpp > src/hotspot/share/runtime/mutexLocker.hpp > Addition of EscapeBarrier_lock. Ok. > > > src/hotspot/share/runtime/objectMonitor.cpp > Make recursion count relock aware. Ok. > > > src/hotspot/share/runtime/stackValue.hpp > Better reinitialization in StackValue. Good. > > > src/hotspot/share/runtime/thread.cpp > src/hotspot/share/runtime/thread.hpp > src/hotspot/share/runtime/thread.inline.hpp > wait_for_object_deoptimization, suspend flag, deferred updates and test feature to deoptimize objects. > > In the long term, we want to get rid of suspend flags, so it's not so nice to introduce a new one. But I agree with Götz that it should be acceptable as a temporary solution until async handshakes are available (which takes more time). So I'm ok with your change. > > You can use MutexLocker with Thread*. > > JVMTIDeferredUpdates: I agree with Robbin. It'd be nice to move the class out of thread.hpp. > > > src/hotspot/share/runtime/vframe.cpp > Added support for entry frame to new_vframe. Ok. > > > src/hotspot/share/runtime/vframe_hp.cpp > src/hotspot/share/runtime/vframe_hp.hpp > > I think code()->as_nmethod() in not_global_escape_in_scope() and arg_escape() should better be under #ifdef ASSERT or inside the assert statement (no need for code cache walking in product build). 
> > jvmtiDeferredLocalVariableSet::update_monitors: > Please add a comment explaining that owner referenced by original info may be scalar replaced, but it is deoptimized in the vframe. > > > src/hotspot/share/utilities/macros.hpp > Addition of NOT_COMPILER2_OR_JVMCI_RETURN macros. Ok. > > > test/hotspot/jtreg/serviceability/jvmti/Heap/IterateHeapWithEscapeAnalysisEnabled.java > test/hotspot/jtreg/serviceability/jvmti/Heap/libIterateHeapWithEscapeAnalysisEnabled.c > New test. Will review separately. > > > test/jdk/TEST.ROOT > Addition of vm.jvmci as required property. Ok. > > > test/jdk/com/sun/jdi/EATests.java > test/jdk/com/sun/jdi/EATestsJVMCI.java > New test. Will review separately. > > > test/lib/sun/hotspot/WhiteBox.java > Added isFrameDeoptimized to API. Ok. > > > That was it. Best regards, > Martin > > >> -----Original Message----- >> From: hotspot-compiler-dev > bounces at openjdk.java.net> On Behalf Of Reingruber, Richard >> Sent: Dienstag, 3. März 2020 21:23 >> To: 'Robbin Ehn' ; Lindenmaier, Goetz >> ; David Holmes ; >> Vladimir Kozlov (vladimir.kozlov at oracle.com) >> ; serviceability-dev at openjdk.java.net; >> hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- >> dev at openjdk.java.net >> Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance in the Presence of JVMTI Agents >> >> Hi Robbin, >> >>>> I understand that Robbin proposed to replace the usage of >>>> _suspend_flag with handshakes. Apparently, async handshakes >>>> are needed to do so. We have been waiting a while for removal >>>> of the _suspend_flag / introduction of async handshakes [2]. >>>> What is the status here? >> >>> I have an old prototype which I would like to continue to work on. >>> So do not assume asynch handshakes will make 15. >>> Even if it would, I think there is a lot more investigative work to remove >>> _suspend_flag. >> >> Let us know if we can be of any help to you, be it only testing. 
>> >>>>> Full: >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ >> >>> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >>> You can move both declaration and definition to that file, no need to >> clobber >>> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) >> >> Will do. >> >>> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in it's >> own >>> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. >> >> You are right. It shouldn't be declared in thread.hpp. I will look into that. >> >>> Note that we also think we may have a bug in deopt: >>> https://bugs.openjdk.java.net/browse/JDK-8238237 >> >>> I think it would be best, if possible, to push after that is resolved. >> >> Sure. >> >>> Not even nearly a full review :) >> >> I know :) >> >> Anyways, thanks a lot, >> Richard. >> >> >> -----Original Message----- >> From: Robbin Ehn >> Sent: Monday, March 2, 2020 11:17 AM >> To: Lindenmaier, Goetz ; Reingruber, Richard >> ; David Holmes ; >> Vladimir Kozlov (vladimir.kozlov at oracle.com) >> ; serviceability-dev at openjdk.java.net; >> hotspot-compiler-dev at openjdk.java.net; hotspot-runtime- >> dev at openjdk.java.net >> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance >> in the Presence of JVMTI Agents >> >> Hi, >> >> On 2/24/20 5:39 PM, Lindenmaier, Goetz wrote: >>> Hi, >>> >>> I had a look at the progress of this change. Nothing >>> happened since Richard posted his update using more >>> handshakes [1]. >>> But we (SAP) would appreciate a lot if this change could >>> be successfully reviewed and pushed. >>> >>> I think there is basic understanding that this >>> change is helpful. It fixes a number of issues with JVMTI, >>> and will deliver the same performance benefits as EA >>> does in current production mode for debugging scenarios. >>> >>> This is important for us as we run our VMs prepared >>> for debugging in production mode. 
>>> >>> I understand that Robbin proposed to replace the usage of >>> _suspend_flag with handshakes. Apparently, async handshakes >>> are needed to do so. We have been waiting a while for removal >>> of the _suspend_flag / introduction of async handshakes [2]. >>> What is the status here? >> >> I have an old prototype which I would like to continue to work on. >> So do not assume asynch handshakes will make 15. >> Even if it would, I think there is a lot more investigative work to remove >> _suspend_flag. >>> >>> I think we should no longer wait, but proceed with >>> this change. We will look into removing the usage of >>> suspend_flag introduced here once it is possible to implement >>> it with handshakes. >> >> Yes, sure. >> >>>> Full: http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4/ >> >> DeoptimizeObjectsALotThread is only used in compileBroker.cpp. >> You can move both declaration and definition to that file, no need to clobber >> thread.[c|h]pp. (and the static function deopt_objs_alot_thread_entry) >> >> Does JvmtiDeferredUpdates really need to be in thread.hpp, can't be in its >> own >> hpp file? It doesn't seem right to add JVM TI classes into thread.hpp. >> >> Note that we also think we may have a bug in deopt: >> https://bugs.openjdk.java.net/browse/JDK-8238237 >> >> I think it would be best, if possible, to push after that is resolved. >> >> Not even nearly a full review :) >> >> Thanks, Robbin >> >> >>>> Incremental: >>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.4.inc/ >>>> >>>> I was not able to eliminate the additional suspend flag now. I'll take care >> of this >>>> as soon as the >>>> existing suspend-resume-mechanism is reworked. 
>>>> >>>> Testing: >>>> >>>> Nightly tests @SAP: >>>> >>>> JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015, >> Renaissance >>>> Suite, SAP specific tests >>>> with fastdebug and release builds on all platforms >>>> >>>> Stress testing with DeoptimizeObjectsALot running SPECjvm2008 40x >> parallel >>>> for 24h >>>> >>>> Thanks, Richard. >>>> >>>> >>>> More details on the changes: >>>> >>>> * Hide DeoptimizeObjectsALotThread from external view. >>>> >>>> * Changed EscapeBarrier_lock to be a _safepoint_check_never lock. >>>> It used to be _safepoint_check_sometimes, which will be eliminated >> sooner or >>>> later. >>>> I added explicit thread state changes with ThreadBlockInVM to code >> paths >>>> where we can wait() >>>> on EscapeBarrier_lock to become safepoint safe. >>>> >>>> * Use handshake EscapeBarrierSuspendHandshake to suspend target >> threads >>>> instead of vm operation >>>> VM_ThreadSuspendAllForObjDeopt. >>>> >>>> * Removed uses of Threads_lock. When adding a new thread we suspend >> it iff >>>> EA optimizations are >>>> being reverted. In the previous version we were waiting on >> Threads_lock >>>> while EA optimizations >>>> were reverted. See EscapeBarrier::thread_added(). >>>> >>>> * Made tests require Xmixed compilation mode. >>>> >>>> * Made tests agnostic regarding tiered compilation. >>>> I.e. tc isn't disabled anymore, and the tests can be run with tc enabled or >>>> disabled. >>>> >>>> * Exercising EATests.java as well with stress test options >>>> DeoptimizeObjectsALot* >>>> Due to the non-deterministic deoptimizations some tests need to be >> skipped. >>>> We do this to prevent bit-rot of the stress test code. >>>> >>>> * Executing EATests.java as well with graal if available. Driver for this is >>>> EATestsJVMCI.java. Graal cannot pass all tests, because it does not >> provide all >>>> the new debug info >>>> (namely not_global_escape_in_scope and arg_escape in >> scopeDesc.hpp). 
>>>> And graal does not yet support the JVMTI operations force early return >> and >>>> pop frame. >>>> >>>> * Removed tracing from new jdi tests in EATests.java. Too much trace >> output >>>> before the debugging >>>> connection is established can cause deadlock because output buffers fill >> up. >>>> (See https://bugs.openjdk.java.net/browse/JDK-8173304) >>>> >>>> * Many copyright year changes and smaller clean-up changes of testing >> code >>>> (trailing white-space and >>>> the like). >>>> >>>> >>>> -----Original Message----- >>>> From: David Holmes >>>> Sent: Donnerstag, 19. Dezember 2019 03:12 >>>> To: Reingruber, Richard ; serviceability- >>>> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; >> hotspot- >>>> runtime-dev at openjdk.java.net; Vladimir Kozlov >> (vladimir.kozlov at oracle.com) >>>> >>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance in >>>> the Presence of JVMTI Agents >>>> >>>> Hi Richard, >>>> >>>> I think my issue is with the way EliminateNestedLocks works so I'm going >>>> to look into that more deeply. >>>> >>>> Thanks for the explanations. >>>> >>>> David >>>> >>>> On 18/12/2019 12:47 am, Reingruber, Richard wrote: >>>>> Hi David, >>>>> >>>>> > > > Some further queries/concerns: >>>>> > > > >>>>> > > > src/hotspot/share/runtime/objectMonitor.cpp >>>>> > > > >>>>> > > > Can you please explain the changes to ObjectMonitor::wait: >>>>> > > > >>>>> > > > ! _recursions = save // restore the old recursion count >>>>> > > > ! + jt->get_and_reset_relock_count_after_wait(); // >>>>> > > > increased by the deferred relock count >>>>> > > > >>>>> > > > what is the "deferred relock count"? I gather it relates to >>>>> > > > >>>>> > > > "The code was extended to be able to deoptimize objects of a >>>>> > > frame that >>>>> > > > is not the top frame and to let another thread than the owning >>>>> > > thread do >>>>> > > > it." >>>>> > > >>>>> > > Yes, these relate. 
Currently EA based optimizations are reverted, >> when a >>>> compiled frame is >>>>> > > replaced with corresponding interpreter frames. Part of this is >> relocking >>>> objects with eliminated >>>>> > > locking. New with the enhancement is that we do this also just >> before >>>> object references are >>>>> > > acquired through JVMTI. In this case we deoptimize also the >> owning >>>> compiled frame C and we >>>>> > > register deoptimized objects as deferred updates. When control >> returns >>>> to C it gets deoptimized, >>>>> > > we notice that objects are already deoptimized (reallocated and >>>> relocked), so we don't do it again >>>>> > > (relocking twice would be incorrect of course). Deferred updates >> are >>>> copied into the new >>>>> > > interpreter frames. >>>>> > > >>>>> > > Problem: relocking is not possible if the target thread T is waiting >> on the >>>> monitor that needs to >>>>> > > be relocked. This happens only with non-local objects with >>>> EliminateNestedLocks. Instead relocking >>>>> > > is deferred until T owns the monitor again. This is what the piece of >>>> code above does. >>>>> > >>>>> > Sorry I need some more detail here. How can you wait() on an >> object >>>>> > monitor if the object allocation and/or locking was optimised away? >> And >>>>> > what is a "non-local object" in this context? Isn't EA restricted to >>>>> > thread-confined objects? >>>>> >>>>> "Non-local object" is an object that escapes its thread. The issue I'm >>>> addressing with the changes >>>>> in ObjectMonitor::wait are almost unrelated to EA. They are caused by >>>> EliminateNestedLocks, where C2 >>>>> eliminates recursive locking of an already owned lock. The lock owning >> object >>>> exists on the heap, it >>>>> is locked and you can call wait() on it. >>>>> >>>>> EliminateLocks is the C2 option that controls lock elimination based on >> EA. 
>>>> Both optimizations have >>>>> in common that objects with eliminated locking need to be relocked >> when >>>> deoptimizing a frame, >>>>> i.e. when replacing a compiled frame with equivalent interpreter >>>>> frames. Deoptimization::relock_objects does that job for /all/ eliminated >>>> locks in scope. /All/ can >>>>> be a mix of eliminated nested locks and locks of not-escaping objects. >>>>> >>>>> New with the enhancement: I call relock_objects earlier, just before >> objects >>>> potentially >>>>> escape. But then later when the owning compiled frame gets >> deoptimized, I >>>> must not do it again: >>>>> >>>>> See call to EscapeBarrier::objs_are_deoptimized in deoptimization.cpp: >>>>> >>>>> 373 if ((jvmci_enabled || ((DoEscapeAnalysis || >> EliminateNestedLocks) && >>>> EliminateLocks)) >>>>> 374 && !EscapeBarrier::objs_are_deoptimized(thread, >> deoptee.id())) { >>>>> 375 bool unused; >>>>> 376 eliminate_locks(thread, chunk, realloc_failures, deoptee, >> exec_mode, >>>> unused); >>>>> 377 } >>>>> >>>>> Now when calling relock_objects early it is quite possible that I have to >> relock >>>> an object the >>>>> target thread currently waits for. Obviously I cannot relock in this case, >>>> instead I chose to >>>>> introduce relock_count_after_wait to JavaThread. >>>>> > Is it just that some of the locking gets optimized away e.g. >>>>> > >>>>> > synchronised(obj) { >>>>> > synchronised(obj) { >>>>> > synchronised(obj) { >>>>> > obj.wait(); >>>>> > } >>>>> > } >>>>> > } >>>>> > >>>>> > If this is reduced to a form as-if it were a single lock of the monitor >>>>> > (due to EA) and the wait() triggers a JVM TI event which leads to the >>>>> > escape of "obj" then we need to reconstruct the true lock state, and >> so >>>>> > when the wait() internally unblocks and reacquires the monitor it >> has to >>>>> > set the true recursion count to 3, not the 1 that it appeared to be >> when >>>>> > wait() was initially called. Is that the scenario?
>>>>> >>>>> Kind of... except that the locking is not eliminated due to EA and there is >> no >>>> JVM TI event >>>>> triggered by wait. >>>>> >>>>> Add >>>>> >>>>> LocalObject l1 = new LocalObject(); >>>>> >>>>> in front of the synchronized blocks and assume a JVM TI agent acquires l1. >> This >>>> triggers the code in >>>>> question. >>>>> >>>>> See that relocking/reallocating is transactional. If it is done, it is done for /all/ >>>> objects in scope, and at most once. It wouldn't be quite so easy to split this into relocking >> of >>>> nested/EA-based >>>>> eliminated locks. >>>>> >>>>> > If so I find this truly awful. Anyone using wait() in a realistic form >>>>> > requires a notification and so the object cannot be thread confined. >> In >>>>> >>>>> It is not thread confined. >>>>> >>>>> > which case I would strongly argue that upon hitting the wait() the >> deopt >>>>> > should occur unconditionally and so the lock state is correct before >> we >>>>> > wait and so we don't need to mess with the recursion count >> internally >>>>> > when we reacquire the monitor. >>>>> > >>>>> > > >>>>> > > > which I don't like the sound of at all when it comes to >> ObjectMonitor >>>>> > > > state. So I'd like to understand in detail exactly what is going on >> here >>>>> > > > and why. This is a very intrusive change that seems to badly >> break >>>>> > > > encapsulation and impacts future changes to ObjectMonitor >> that are >>>> under >>>>> > > > investigation. >>>>> > > >>>>> > > I would not regard this as breaking encapsulation. Certainly not >> badly. >>>>> > > >>>>> > > I've added a property relock_count_after_wait to JavaThread. The >>>> property is well >>>>> > > encapsulated. Future ObjectMonitor implementations have to deal >> with >>>> recursion too. They are free >>>>> > > in choosing a way to do that as long as that property is taken into >>>> account. This is hardly a >>>>> > > limitation.
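For readers following the thread, the scenario under discussion can be sketched as a runnable example. This is illustrative only: the class and thread names are invented, and whether C2 actually eliminates the inner locks depends on how the method is compiled. With EliminateNestedLocks the two inner, recursive acquisitions can be elided in compiled code, so after a deoptimization the true recursion count of 3 must be restored before wait() reacquires the monitor.

```java
// Illustrative sketch only: names are invented; the point is the monitor
// recursion structure, not HotSpot internals.
public class NestedWait {
    public static void main(String[] args) throws InterruptedException {
        final Object obj = new Object();
        // The notifier cannot enter the monitor until wait() releases it
        // below, so the notification cannot be lost.
        Thread notifier = new Thread(() -> {
            synchronized (obj) {
                obj.notify();
            }
        });
        synchronized (obj) {             // recursion count 1
            synchronized (obj) {         // count 2 -- candidate for elimination
                synchronized (obj) {     // count 3 -- candidate for elimination
                    notifier.start();
                    obj.wait();          // releases the monitor completely,
                                         // must reacquire with count 3
                }
            }
        }
        notifier.join();
        System.out.println(Thread.holdsLock(obj));
    }
}
```

Running it simply demonstrates that wait() gives up the monitor regardless of the recursion depth and that all three blocks unwind normally afterwards.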
>>>>> > >>>>> > I do think this badly breaks encapsulation as you have to add a >> callout >>>>> > from the guts of the ObjectMonitor code to reach into the thread to >> get >>>>> > this lock count adjustment. I understand why you have had to do >> this but >>>>> > I would much rather see a change to the EA optimisation strategy so >> that >>>>> > this is not needed. >>>>> > >>>>> > > Note also that the property is a straightforward extension of the >>>> existing concept of deferred >>>>> > > local updates. It is embedded into the structure holding them. So >> not >>>> even the footprint of a >>>>> > > JavaThread is enlarged if no deferred updates are generated. >>>>> > >>>>> > [...] >>>>> > >>>>> > > >>>>> > > I'm actually duplicating the existing external suspend mechanism, >>>> because a thread can be >>>>> > > suspended at most once. And hey, I don't like that either! But it >>>> seems not unlikely that the >>>>> > > duplicate can be removed together with the original and the new >> type >>>> of handshakes that will be >>>>> > > used for thread suspend can be used for object deoptimization >> too. See >>>> today's discussion in >>>>> > > JDK-8227745 [2]. >>>>> > >>>>> > I hope that discussion bears some fruit, at the moment it seems not >> to >>>>> > be possible to use handshakes here. :( >>>>> > >>>>> > The external suspend mechanism is a royal pain in the proverbial >> that we >>>>> > have to carefully live with. The idea that we're duplicating that for >>>>> > use in another fringe area of functionality does not thrill me at all. >>>>> > >>>>> > To be clear, I understand the problem that exists and that you wish >> to >>>>> > solve, but for the runtime parts I balk at the complexity cost of >>>>> > solving it. >>>>> >>>>> I know it's complex, but by far no rocket science. >>>>> >>>>> Also I find it hard to imagine another fix for JDK-8233915 besides >> changing >>>> the JVM TI specification. >>>>> >>>>> Thanks, Richard.
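A minimal sketch of the get-and-reset idiom behind the relock_count_after_wait property may help here. It is written in Java with invented names; the real field lives in HotSpot's C++ JavaThread and is consumed in ObjectMonitor::wait when the monitor is reacquired.

```java
// Sketch with invented names: the deferred relock count is accumulated
// while the target thread waits, and consumed exactly once on reacquire.
class RelockBookkeeping {
    private int relockCountAfterWait = 0;

    // Called on behalf of a target thread that is currently waiting on the
    // monitor being relocked, so the relock must be deferred.
    void deferRelock(int eliminatedRecursions) {
        relockCountAfterWait += eliminatedRecursions;
    }

    // Called once when wait() reacquires the monitor; resetting the count
    // ensures a later wakeup does not apply the adjustment twice.
    int getAndResetRelockCountAfterWait() {
        int n = relockCountAfterWait;
        relockCountAfterWait = 0;
        return n;
    }

    public static void main(String[] args) {
        RelockBookkeeping jt = new RelockBookkeeping();
        jt.deferRelock(2);              // two eliminated recursive locks
        int save = 1;                   // recursion count observed at wait()
        int recursions = save + jt.getAndResetRelockCountAfterWait();
        System.out.println(recursions); // true recursion count to restore
        System.out.println(jt.getAndResetRelockCountAfterWait());
    }
}
```

The second read returning zero is the property David's review is probing: the adjustment is a one-shot transfer, not persistent state inside the monitor.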
>>>>> >>>>> -----Original Message----- >>>>> From: David Holmes >>>>> Sent: Dienstag, 17. Dezember 2019 08:03 >>>>> To: Reingruber, Richard ; serviceability- >>>> dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; >> hotspot- >>>> runtime-dev at openjdk.java.net; Vladimir Kozlov >> (vladimir.kozlov at oracle.com) >>>> >>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >> Performance >>>> in the Presence of JVMTI Agents >>>>> >>>>> >>>>> >>>>> David >>>>> >>>>> On 17/12/2019 4:57 pm, David Holmes wrote: >>>>>> Hi Richard, >>>>>> >>>>>> On 14/12/2019 5:01 am, Reingruber, Richard wrote: >>>>>>> Hi David, >>>>>>> >>>>>>> > Some further queries/concerns: >>>>>>> > >>>>>>> > src/hotspot/share/runtime/objectMonitor.cpp >>>>>>> > >>>>>>> > Can you please explain the changes to ObjectMonitor::wait: >>>>>>> > >>>>>>> > ! _recursions = save // restore the old recursion count >>>>>>> > ! + jt->get_and_reset_relock_count_after_wait(); // >>>>>>> > increased by the deferred relock count >>>>>>> > >>>>>>> > what is the "deferred relock count"? I gather it relates to >>>>>>> > >>>>>>> > "The code was extended to be able to deoptimize objects of a >>>>>>> frame that >>>>>>> > is not the top frame and to let another thread than the owning >>>>>>> thread do >>>>>>> > it." >>>>>>> >>>>>>> Yes, these relate. Currently EA based optimizations are reverted, >> when >>>>>>> a compiled frame is replaced >>>>>>> with corresponding interpreter frames. Part of this is relocking >>>>>>> objects with eliminated >>>>>>> locking. New with the enhancement is that we do this also just before >>>>>>> object references are acquired >>>>>>> through JVMTI. In this case we deoptimize also the owning compiled >>>>>>> frame C and we register >>>>>>> deoptimized objects as deferred updates.
When control returns to C >> it >>>>>>> gets deoptimized, we notice >>>>>>> that objects are already deoptimized (reallocated and relocked), so >> we >>>>>>> don't do it again (relocking >>>>>>> twice would be incorrect of course). Deferred updates are copied into >>>>>>> the new interpreter frames. >>>>>>> >>>>>>> Problem: relocking is not possible if the target thread T is waiting >>>>>>> on the monitor that needs to be >>>>>>> relocked. This happens only with non-local objects with >>>>>>> EliminateNestedLocks. Instead relocking is >>>>>>> deferred until T owns the monitor again. This is what the piece of >>>>>>> code above does. >>>>>> >>>>>> Sorry I need some more detail here. How can you wait() on an object >>>>>> monitor if the object allocation and/or locking was optimised away? >> And >>>>>> what is a "non-local object" in this context? Isn't EA restricted to >>>>>> thread-confined objects? >>>>>> >>>>>> Is it just that some of the locking gets optimized away e.g. >>>>>> >>>>>> synchronised(obj) { >>>>>> synchronised(obj) { >>>>>> synchronised(obj) { >>>>>> obj.wait(); >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> If this is reduced to a form as-if it were a single lock of the monitor >>>>>> (due to EA) and the wait() triggers a JVM TI event which leads to the >>>>>> escape of "obj" then we need to reconstruct the true lock state, and so >>>>>> when the wait() internally unblocks and reacquires the monitor it has to >>>>>> set the true recursion count to 3, not the 1 that it appeared to be when >>>>>> wait() was initially called. Is that the scenario? >>>>>> >>>>>> If so I find this truly awful. Anyone using wait() in a realistic form >>>>>> requires a notification and so the object cannot be thread confined.
In >>>>>> which case I would strongly argue that upon hitting the wait() the >> deopt >>>>>> should occur unconditionally and so the lock state is correct before we >>>>>> wait and so we don't need to mess with the recursion count internally >>>>>> when we reacquire the monitor. >>>>>> >>>>>>> >>>>>>> > which I don't like the sound of at all when it comes to >>>>>>> ObjectMonitor >>>>>>> > state. So I'd like to understand in detail exactly what is going >>>>>>> on here >>>>>>> > and why. This is a very intrusive change that seems to badly >> break >>>>>>> > encapsulation and impacts future changes to ObjectMonitor that >>>>>>> are under >>>>>>> > investigation. >>>>>>> >>>>>>> I would not regard this as breaking encapsulation. Certainly not badly. >>>>>>> >>>>>>> I've added a property relock_count_after_wait to JavaThread. The >>>>>>> property is well >>>>>>> encapsulated. Future ObjectMonitor implementations have to deal >> with >>>>>>> recursion too. They are free in >>>>>>> choosing a way to do that as long as that property is taken into >>>>>>> account. This is hardly a >>>>>>> limitation. >>>>>> >>>>>> I do think this badly breaks encapsulation as you have to add a callout >>>>>> from the guts of the ObjectMonitor code to reach into the thread to >> get >>>>>> this lock count adjustment. I understand why you have had to do this >> but >>>>>> I would much rather see a change to the EA optimisation strategy so >> that >>>>>> this is not needed. >>>>>> >>>>>>> Note also that the property is a straightforward extension of the >>>>>>> existing concept of deferred >>>>>>> local updates. It is embedded into the structure holding them. So not >>>>>>> even the footprint of a >>>>>>> JavaThread is enlarged if no deferred updates are generated. >>>>>>> >>>>>>> > --- >>>>>>> > >>>>>>> > src/hotspot/share/runtime/thread.cpp >>>>>>> > >>>>>>> > Can you please explain why >>>>>>> JavaThread::wait_for_object_deoptimization >>>>>>>
> has to be handcrafted in this way rather than using proper >>>>>>> transitions. >>>>>>> > >>>>>>> >>>>>>> I wrote wait_for_object_deoptimization taking >>>>>>> JavaThread::java_suspend_self_with_safepoint_check >>>>>>> as template. So in short: for the same reasons :) >>>>>>> >>>>>>> Threads reach both methods as part of thread state transitions, >>>>>>> therefore special handling is >>>>>>> required to change thread state on top of ongoing transitions. >>>>>>> >>>>>>> > We got rid of "deopt suspend" some time ago and it is disturbing >>>>>>> to see >>>>>>> > it being added back (effectively). This seems like it may be >>>>>>> something >>>>>>> > that handshakes could be used for. >>>>>>> >>>>>>> Deopt suspend used to be something rather different with a similar >>>>>>> name[1]. It is not being added back. >>>>>> >>>>>> I stand corrected. Despite comments in the code to the contrary >>>>>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot of >>>>>> cleanup in this area 13 years ago :) >>>>>> >>>>>>> >>>>>>> I'm actually duplicating the existing external suspend mechanism, >>>>>>> because a thread can be suspended >>>>>>> at most once. And hey, I don't like that either! But it seems not >>>>>>> unlikely that the duplicate can >>>>>>> be removed together with the original and the new type of >> handshakes >>>>>>> that will be used for >>>>>>> thread suspend can be used for object deoptimization too. See >> today's >>>>>>> discussion in JDK-8227745 [2]. >>>>>> >>>>>> I hope that discussion bears some fruit, at the moment it seems not to >>>>>> be possible to use handshakes here. :( >>>>>> >>>>>> The external suspend mechanism is a royal pain in the proverbial that >> we >>>>>> have to carefully live with. The idea that we're duplicating that for >>>>>> use in another fringe area of functionality does not thrill me at all.
>>>>>> >>>>>> To be clear, I understand the problem that exists and that you wish to >>>>>> solve, but for the runtime parts I balk at the complexity cost of >>>>>> solving it. >>>>>> >>>>>> Thanks, >>>>>> David >>>>>> ----- >>>>>> >>>>>>> Thanks, Richard. >>>>>>> >>>>>>> [1] Deopt suspend was something like an async. handshake for >>>>>>> architectures with register windows, >>>>>>> where patching the return pc for deoptimization of a compiled >>>>>>> frame was racy if the owner thread >>>>>>> was in native code. Instead a "deopt" suspend flag was set on >>>>>>> which the thread patched its own >>>>>>> frame upon return from native. So no thread was suspended. It >> got >>>>>>> its name only from the name of >>>>>>> the flags. >>>>>>> >>>>>>> [2] Discussion about using handshakes to sync. with the target thread: >>>>>>> >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14306727 >>>>>>> >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: David Holmes >>>>>>> Sent: Freitag, 13. Dezember 2019 00:56 >>>>>>> To: Reingruber, Richard ; >>>>>>> serviceability-dev at openjdk.java.net; >>>>>>> hotspot-compiler-dev at openjdk.java.net; >>>>>>> hotspot-runtime-dev at openjdk.java.net >>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >>>>>>> Performance in the Presence of JVMTI Agents >>>>>>> >>>>>>> Hi Richard, >>>>>>> >>>>>>> Some further queries/concerns: >>>>>>> >>>>>>> src/hotspot/share/runtime/objectMonitor.cpp >>>>>>> >>>>>>> Can you please explain the changes to ObjectMonitor::wait: >>>>>>> >>>>>>> ! _recursions = save // restore the old recursion count >>>>>>> ! + jt->get_and_reset_relock_count_after_wait(); // >>>>>>> increased by the deferred relock count >>>>>>> >>>>>>> what is the "deferred relock count"?
I gather it relates to >>>>>>> >>>>>>> "The code was extended to be able to deoptimize objects of a frame >> that >>>>>>> is not the top frame and to let another thread than the owning thread >> do >>>>>>> it." >>>>>>> >>>>>>> which I don't like the sound of at all when it comes to ObjectMonitor >>>>>>> state. So I'd like to understand in detail exactly what is going on here >>>>>>> and why. This is a very intrusive change that seems to badly break >>>>>>> encapsulation and impacts future changes to ObjectMonitor that are >> under >>>>>>> investigation. >>>>>>> >>>>>>> --- >>>>>>> >>>>>>> src/hotspot/share/runtime/thread.cpp >>>>>>> >>>>>>> Can you please explain why >> JavaThread::wait_for_object_deoptimization >>>>>>> has to be handcrafted in this way rather than using proper transitions. >>>>>>> >>>>>>> We got rid of "deopt suspend" some time ago and it is disturbing to >> see >>>>>>> it being added back (effectively). This seems like it may be something >>>>>>> that handshakes could be used for. >>>>>>> >>>>>>> Thanks, >>>>>>> David >>>>>>> ----- >>>>>>> >>>>>>> On 12/12/2019 7:02 am, David Holmes wrote: >>>>>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote: >>>>>>>>> Hi David, >>>>>>>>> >>>>>>>>> > Most of the details here are in areas I can comment on in >> detail, >>>>>>>>> but I >>>>>>>>> > did take an initial general look at things. >>>>>>>>> >>>>>>>>> Thanks for taking the time! >>>>>>>> >>>>>>>> Apologies the above should read: >>>>>>>> >>>>>>>> "Most of the details here are in areas I *can't* comment on in detail >>>>>>>> ..." >>>>>>>> >>>>>>>> David >>>>>>>> >>>>>>>>> > The only thing that jumped out at me is that I think the >>>>>>>>> > DeoptimizeObjectsALotThread should be a hidden thread. >>>>>>>>> > >>>>>>>>> > + bool is_hidden_from_external_view() const { return true; } >>>>>>>>> >>>>>>>>> Yes, it should. Will add the method like above. >>>>>>>>>
> Also I don't see any testing of the >> DeoptimizeObjectsALotThread. >>>>>>>>> Without >>>>>>>>> > active testing this will just bit-rot. >>>>>>>>> >>>>>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger >>>>>>>>> workload. I will add a minimal test >>>>>>>>> to keep it fresh. >>>>>>>>> >>>>>>>>> > Also on the tests I don't understand your @requires clause: >>>>>>>>> > >>>>>>>>> > @requires ((vm.compMode != "Xcomp") & >> vm.compiler2.enabled >>>> & >>>>>>>>> > (vm.opt.TieredCompilation != true)) >>>>>>>>> > >>>>>>>>> > This seems to require that TieredCompilation is disabled, but >>>>>>>>> tiered is >>>>>>>>> > our normal mode of operation. ?? >>>>>>>>> > >>>>>>>>> >>>>>>>>> I removed the clause. I guess I wanted to target the tests towards >> the >>>>>>>>> code they are supposed to >>>>>>>>> test, and it's easier to analyze failures w/o tiered compilation and >>>>>>>>> with just one compiler thread. >>>>>>>>> >>>>>>>>> Additionally I will make use of >>>>>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Richard. >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: David Holmes >>>>>>>>> Sent: Mittwoch, 11.
Dezember 2019 08:03 >>>>>>>>> To: Reingruber, Richard ; >>>>>>>>> serviceability-dev at openjdk.java.net; >>>>>>>>> hotspot-compiler-dev at openjdk.java.net; >>>>>>>>> hotspot-runtime-dev at openjdk.java.net >>>>>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better >>>>>>>>> Performance in the Presence of JVMTI Agents >>>>>>>>> >>>>>>>>> Hi Richard, >>>>>>>>> >>>>>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I would like to get reviews please for >>>>>>>>>> >>>>>>>>>> >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/ >>>>>>>>>> >>>>>>>>>> Corresponding RFE: >>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745 >>>>>>>>>> >>>>>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915 >>>>>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK- >> 8214584 [1] >>>>>>>>>> >>>>>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing >> without >>>>>>>>>> issues (thanks!). In addition the >>>>>>>>>> change is being tested at SAP since I posted the first RFR some >>>>>>>>>> months ago. >>>>>>>>>> >>>>>>>>>> The intention of this enhancement is to benefit performance wise >> from >>>>>>>>>> escape analysis even if JVMTI >>>>>>>>>> agents request capabilities that allow them to access local variable >>>>>>>>>> values. E.g. if you start-up >>>>>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, >> then >>>>>>>>>> escape analysis is disabled right >>>>>>>>>> from the beginning, well before a debugger attaches -- if ever one >>>>>>>>>> should do so. With the >>>>>>>>>> enhancement, escape analysis will remain enabled until and after >> a >>>>>>>>>> debugger attaches. EA based >>>>>>>>>> optimizations are reverted just before an agent acquires the >>>>>>>>>> reference to an object. In the JBS item >>>>>>>>>> you'll find more details. 
>>>>>>>>> >>>>>>>>> Most of the details here are in areas I can comment on in detail, but >> I >>>>>>>>> did take an initial general look at things. >>>>>>>>> >>>>>>>>> The only thing that jumped out at me is that I think the >>>>>>>>> DeoptimizeObjectsALotThread should be a hidden thread. >>>>>>>>> >>>>>>>>> + bool is_hidden_from_external_view() const { return true; } >>>>>>>>> >>>>>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread. >>>>>>>>> Without >>>>>>>>> active testing this will just bit-rot. >>>>>>>>> >>>>>>>>> Also on the tests I don't understand your @requires clause: >>>>>>>>> >>>>>>>>> @requires ((vm.compMode != "Xcomp") & >> vm.compiler2.enabled & >>>>>>>>> (vm.opt.TieredCompilation != true)) >>>>>>>>> >>>>>>>>> This seems to require that TieredCompilation is disabled, but tiered >> is >>>>>>>>> our normal mode of operation. ?? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> David >>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Richard. >>>>>>>>>> >>>>>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745 >>>>>>>>>> >>>> >> http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.patch >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> From tobias.hartmann at oracle.com Wed Apr 1 06:26:47 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 1 Apr 2020 08:26:47 +0200 Subject: [15] RFR(S): 8241909: Remove useless code cache lookup in frame::patch_pc In-Reply-To: <76b44f19-0c1d-3efb-e922-4a108b136b52@oracle.com> References: <39c001b5-e39e-8e3a-c74a-cd2d35dabf5c@oracle.com> <76b44f19-0c1d-3efb-e922-4a108b136b52@oracle.com> Message-ID: <6c0db3fe-e14d-3693-46dc-a187f264dc47@oracle.com> Vladimir, Dean, thanks for the review! Best regards, Tobias On 31.03.20 21:36, Dean Long wrote: > +1 > > dl > > On 3/31/20 10:42 AM, Vladimir Kozlov wrote: >> Good.
>> >> thanks, >> Vladimir >> >> On 3/31/20 1:45 AM, Tobias Hartmann wrote: >>> Hi, >>> >>> please review the following patch: >>> https://bugs.openjdk.java.net/browse/JDK-8241909 >>> http://cr.openjdk.java.net/~thartmann/8241909/webrev.00/ >>> >>> The code cache lookup in frame::patch_pc [1] is useless because the method is only called from >>> frame::deoptimize and vframeArrayElement::unpack_on_stack where pc is always part of _cb. >>> >>> If the method is called from frame::deoptimize [2], pc is either _cb->deopt_mh_handler_begin() or >>> _cb->deopt_handler_begin(). Both are part of _cb. >>> >>> If the method is called from vframeArrayElement::unpack_on_stack [3], _frame is an interpreter frame >>> and therefore _frame._cb is the interpreter buffer blob. pc is only set in this method and always >>> points to an interpreter entry which is part of the interpreter buffer blob. >>> >>> Thanks, >>> Tobias >>> >>> [1] http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/cpu/x86/frame_x86.cpp#l265 >>> [2] http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/share/runtime/frame.cpp#l287 >>> [3] >>> http://hg.openjdk.java.net/jdk/jdk/file/ee44884f3ab8/src/hotspot/share/runtime/vframeArray.cpp#l303 >>> > From aph at redhat.com Wed Apr 1 08:54:52 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 1 Apr 2020 09:54:52 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <2ce24736-9b5c-5c23-bfde-14067d6d6b0d@redhat.com> Message-ID: On 4/1/20 3:05 AM, Pengfei Li wrote: > In my patch, the newly added instruction UADDLP supports T2S but doesn't support T2D. So I changed the value range to 0 - 3, where 3 means all arrangements are accepted now. That's why the value for parameter "accepted" of NEGR is promoted from 2 to 3 now. I see. OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Wed Apr 1 10:22:23 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 1 Apr 2020 11:22:23 +0100 Subject: [8u] RFR: 8237951: CTW: C2 compilation fails with "malformed control flow" In-Reply-To: <871rp8ek1x.fsf@redhat.com> References: <871rp8ek1x.fsf@redhat.com> Message-ID: <51b56814-c654-beaf-f4d3-0e952ff337fa@redhat.com> On 3/31/20 2:22 PM, Roland Westrelin wrote: > The patch from the fix applies cleanly but it relies on > Node::find_out_with() that's missing from 8. The backport below cherry > picks that method from 8066312 (Add new Node* Node::find_out(int opc) > method). > > http://cr.openjdk.java.net/~roland/8237951.8u/webrev.00/ OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From erik.osterlund at oracle.com Wed Apr 1 10:24:20 2020 From: erik.osterlund at oracle.com (Erik Österlund) Date: Wed, 1 Apr 2020 12:24:20 +0200 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> Message-ID: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Hi Vladimir, On 2020-03-30 21:14, Vladimir Kozlov wrote: > But you at least can do static check at the beginning of method: > > int MachNode::pd_alignment_required() const { > if (VM_Version::has_intel_jcc_erratum()) { > PhaseOutput* output = Compile::current()->output(); > Block* block = output->block(); > int index = output->index(); > assert(output->mach() == this, "incorrect iterator state in > PhaseOutput"); >
if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { > // Conservatively add worst case padding. We assume that > relocInfo::addr_unit() is 1 on x86. > return IntelJccErratum::largest_jcc_size() + 1; > } > } > return 1; > } That is equivalent to the compiler. I verified that by disassembling the release bits before and after your suggestion, and it is instruction by instruction the same. In both cases it first checks if VM_Version::has_intel_jcc_erratum(), and if not, returns before even building a frame. I'd rather keep the not nested variant because it is equivalent, yet easier to read. >> >>> In compute_padding() reads done under check so I have less concerns >>> about it. But I also don't get why you use saved _mach instead of >>> using MachNode 'this'. >> >> Good point. I changed to this + an assert checking that they are >> indeed the same. > > Why do you need Output._mach at all if you use it only in this assert? > Even logically it looks strange. In what case it could be different? It should never be different; that was the point. The index and mach node exposed by the iterator are related and refer to the same entity. So if you use the exposed index in code in a mach node, you must know that this mach node is the same mach node that the index refers to, and it is. The assert was meant to enforce it so that if you were to call either the alignment or padding function in a new context, for whatever reason, and don't happen to know that you can't do that without having a consistent iteration state, you would immediately catch that in the assertions, instead of getting strange silent logic errors.
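Erik's observation that the flattened guard compiles to the same early-exit code as the nested form can be illustrated outside HotSpot with a small, self-contained example. This is an illustrative Java sketch with invented names and constants (the real code is the C++ MachNode::pd_alignment_required discussed above); it only demonstrates that the two shapes are behaviorally identical for every input.

```java
// Illustrative only: LARGEST_JCC_SIZE stands in for
// IntelJccErratum::largest_jcc_size(); the two methods encode the nested
// and the flattened variant of the same guard.
public class GuardShapes {
    static final int LARGEST_JCC_SIZE = 6; // invented constant

    static int nested(boolean erratum, boolean jccBranch) {
        if (erratum) {
            if (jccBranch) {
                return LARGEST_JCC_SIZE + 1; // worst-case padding
            }
        }
        return 1;
    }

    static int flattened(boolean erratum, boolean jccBranch) {
        if (!erratum) {
            return 1; // static check up front, cheap early exit
        }
        return jccBranch ? LARGEST_JCC_SIZE + 1 : 1;
    }

    public static void main(String[] args) {
        boolean allAgree = true;
        for (boolean e : new boolean[] {false, true}) {
            for (boolean b : new boolean[] {false, true}) {
                allAgree &= nested(e, b) == flattened(e, b);
            }
        }
        System.out.println(allAgree);
    }
}
```

Since both shapes return the same value for all inputs, the choice between them is purely one of readability, which is the basis of Erik's preference for the non-nested variant.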
Having said that, I am okay with removing _mach if you prefer having one seat belt less, it is up to you: http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Incremental: http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ Thanks, /Erik > Thanks, > Vladimir > >> >> Here is an updated webrev with your concerns and Vladimir Ivanov's >> concerns addressed: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >> >> Incremental: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >> >> Thanks, >> /Erik >> >>> Thanks, >>> Vladimir >>> >>>> >>>>> In pd_alignment_required() you implicitly use knowledge that >>>>> relocInfo::addr_unit() on x86 is 1. >>>>> At least add comment about that. >>>> >>>> I can add a comment about that. >>>> >>>> New webrev: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 3/23/20 6:09 AM, Erik Österlund wrote: >>>>>> Hi, >>>>>> >>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>> with the IntelJccErratum mitigation, >>>>>> which is ifdef:ed in shared code. It should move to >>>>>> platform-specific code. >>>>>> >>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>> allows hiding the Intel-specific code >>>>>> completely in x86-specific files.
>>>>>> >>>>>> Webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>> >>>>>> Bug: >>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>> >> From jatin.bhateja at intel.com Wed Apr 1 18:23:29 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Wed, 1 Apr 2020 18:23:29 +0000 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com>, Message-ID: Hi Vladimir, Please find an updated unified patch at the following link. http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ This removes Optimized NotV handling for AVX3, as suggested it will be brought via vectorIntrinsics branch. Thanks for your help in shaping up this patch, please let me know if there are other comments. Best Regards, Jatin ________________________________________ From: Bhateja, Jatin Sent: Wednesday, March 25, 2020 12:14 PM To: Vladimir Ivanov Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction Hi Vladimir, I have placed updated patch at following links:- 1) Optimized NotV handling: http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ 2) Changes for MacroLogic opt: http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ Kindly review and let me know your feedback. 
Thanks, Jatin > -----Original Message----- > From: Vladimir Ivanov > Sent: Wednesday, March 25, 2020 12:33 AM > To: Bhateja, Jatin > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > > Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Jatin, > > I tried to submit the patches for testing, but windows-x64 build failed with the > following errors: > > src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not > evaluate to a constant > src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read > of a variable outside its lifetime > src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' > src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int > ['function']' is not assignable > > Best regards, > Vladimir Ivanov > > On 24.03.2020 10:34, Bhateja, Jatin wrote: > > Hi Vladimir, > > > > Thanks for your comments , I have split the original patch into two sub- > patches. > > > > 1) Optimized NotV handling: > > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > > > 2) Changes for MacroLogic opt: > > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ > > > > Added a new flag "UseVectorMacroLogic" which guards MacroLogic > optimization. > > > > Kindly review and let me know your feedback. > > > > Best Regards, > > Jatin > > > >> -----Original Message----- > >> From: Vladimir Ivanov > >> Sent: Tuesday, March 17, 2020 4:31 PM > >> To: Bhateja, Jatin ; hotspot-compiler- > >> dev at openjdk.java.net > >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >> Instruction > >> > >> > >>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ > >> > >> Very nice contribution, Jatin! > >> > >> Some comments after a brief review pass: > >> > >> * Please, contribute NotV part separately. > >> > >> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) > >> transformation during GVN instead? 
> >> > >> * As of now, vector nodes are only produced by SuperWord > >> analysis. It makes sense to limit new optimization pass to SuperWord > >> pass only (probably, introduce a new dedicated Phase ). Once Vector > >> API is available, it can be extended to cases when vector nodes are > >> present > >> (C->max_vector_size() > 0). > >> > >> * There are more efficient ways to produce a vector of all-1s [1] [2]. > >> > >> Best regards, > >> Vladimir Ivanov > >> > >> [1] > >> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 > >> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc > >> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ > >> 1-efficiently > >> > >> [2] > >> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 > >> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI > >> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ > >> value-to-all-one-bits > >> > >>> > >>> A new optimization pass has been added post Auto-Vectorization which > >> folds expression tree involving vector boolean logic operations > >> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. > >>> Optimization pass has following stages: > >>> > >>> 1. Collection stage : > >>> * This performs a DFS traversal over Ideal Graph and collects the root > >> nodes of all vector logic expression trees. > >>> 2. Processing stage: > >>> * Performs a bottom up traversal over expression tree and > >> simultaneously folds specific DAG patterns involving Boolean logic > >> parent and child nodes. > >>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. > >>> * Folding is performed under a constraint on the total number of > inputs > >> which a MacroLogic node can have, in this case it's 3. > >>> * A partition is created around a DAG pattern involving logic parent > and > >> one or two logic child node, it encapsulate the nodes in post-order fashion. 
> >>> * This partition is then evaluated by traversing over the nodes, > assigning > >> boolean values to its inputs and performing operations over them > >> based on its Opcode. Node along with its computed result is stored in > >> a map which is accessed during the evaluation of its user/parent node. > >>> * Post-evaluation a MacroLogic node is created which is equivalent to > a > >> three input truth-table. Expression tree leaf level inputs along with > >> result of its evaluation are the inputs fed to this new node. > >>> * Entire expression tree is eventually subsumed/replaced by newly > >> create MacroLogic node. > >>> > >>> > >>> Following are the JMH benchmarks results with and without changes. > >>> > >>> Without Changes: > >>> > >>> Benchmark (VECLEN) Mode Cnt Score Error Units > >>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s > >>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s > >>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s > >>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s > >>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s > >>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s > >>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s > >>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s > >>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s > >>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s > >>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s > >>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s > >>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s > >>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s > >>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s > >>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s > >>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s > >>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s > >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s > >>> MacroLogicOpt.workload3_caller 
2048 thrpt 75.086 ops/s > >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s > >>> > >>> With Changes: > >>> > >>> Benchmark (VECLEN) Mode Cnt Score Error Units > >>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s > >>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s > >>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s > >>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s > >>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s > >>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s > >>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s > >>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s > >>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s > >>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s > >>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s > >>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s > >>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s > >>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s > >>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s > >>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s > >>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s > >>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s > >>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s > >>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s > >>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s > >>> > >>> Please review the patch. > >>> > >>> Best Regards, > >>> Jatin > >>> > >>> [1] Section 17.7 : > >>> https://urldefense.com/v3/__https://software.intel.com/sites/default > >>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG > >>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ > >>> architectures-optimization-manual.pdf > >>> From daniel.daugherty at oracle.com Wed Apr 1 18:45:48 2020 From: daniel.daugherty at oracle.com (Daniel D. 
Daugherty) Date: Wed, 1 Apr 2020 14:45:48 -0400 Subject: RFR: 8241234: Unify monitor enter/exit runtime entries In-Reply-To: References: <222D2846-F6AE-4D5B-B41F-F976D90E329C@oracle.com> <91eeada8-e05f-bc73-b029-94e169216a56@oracle.com> <534b8cf7-cd8c-565b-5163-09a216d4f94e@oracle.com> <904faf68-4fff-f1b8-2fb8-48d65f282fa2@oracle.com> Message-ID: <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> Hi Yudi, I grabbed a copy of this patch: http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/open.patch pushed it into my jdk-15+16 baseline and ran it thru a single cycle of my regular stress kit (~24 hours). There were no failures which matches my jdk-15+16 baseline stress testing (~72 hours, no failures). I also ran it through my ObjectMonitor inflation stress kit for ~24 hours and there were no failures there either. Dan On 3/30/20 10:20 AM, Daniel D. Daugherty wrote: > On 3/30/20 10:15 AM, Yudi Zheng wrote: >> Hi Daniel, >> >> Thanks for the review! I have uploaded a new version with your >> comments addressed: >> http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/ >> >>> src/hotspot/share/runtime/sharedRuntime.hpp >>> Please don't forget to update the copyright year before you push. >> Fixed. >> >>> src/hotspot/share/runtime/sharedRuntime.cpp >>> L2104: ObjectSynchronizer::exit(obj, lock, THREAD); >>> The use of 'THREAD' here and 'TRAPS' in the function itself >>> stand out more now, but that's something for me to clean up. >> Also, I noticed that C2 was using CHECK >>> ObjectSynchronizer::enter(h_obj, lock, CHECK); >> While C1 and JVMCI were using THREAD: >>> ObjectSynchronizer::enter(h_obj, lock->lock(), THREAD); >> I have no idea when to use what, and hope unifying to the C2 entries >> would help. >> Let me know if there is something I should address in this patch. >> Otherwise, I would >> rather leave it to the expert, i.e., you ;) > > Yes, please leave it for me to clean up. 
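On the CHECK vs. THREAD question raised above: in HotSpot's exception macros, both forms pass the current thread to a TRAPS callee; the difference is that CHECK additionally expands into an immediate return from the caller when the callee left a pending exception, while THREAD leaves that check to the caller. A simplified, self-contained illustration follows — the types and helpers below are stand-ins for exposition, not the real macros from utilities/exceptions.hpp:

```cpp
// Stand-in for a JavaThread with a pending-exception flag.
struct FakeThread {
  bool pending_exception = false;
};

// A callee in TRAPS style: it may leave a pending exception on the thread.
void may_throw(int v, FakeThread* thread) {
  if (v < 0) {
    thread->pending_exception = true;
  }
}

// "THREAD" style call site: execution continues after the callee returns,
// so the caller is responsible for testing the pending-exception state.
int call_with_thread(int v, FakeThread* thread) {
  may_throw(v, thread);
  int result = 42;  // this line runs even if an exception is pending
  return thread->pending_exception ? -1 : result;
}

// "CHECK" style call site: the macro expands to pass THREAD and then
// return from the caller immediately when an exception is pending.
int call_with_check(int v, FakeThread* thread) {
  may_throw(v, thread);
  if (thread->pending_exception) {  // this test is what CHECK inserts
    return -1;
  }
  return 42;
}
```

Both call sites behave the same here because the "THREAD" variant checks explicitly; the bug risk Dan alludes to is a "THREAD" caller that forgets the explicit check.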
> > >>> src/hotspot/share/c1/c1_Runtime1.cpp >>> old L718: assert(thread == JavaThread::current(), "threads >>> must correspond"); >>> Removed in favor of the assert in >>> SharedRuntime::monitor_enter_helper(). >>> Okay that makes sense. >>> >>> old L721: EXCEPTION_MARK; >>> Removed in favor of the same in >>> SharedRuntime::monitor_enter_helper(). >>> Okay that makes sense. >>> >>> src/hotspot/share/jvmci/jvmciRuntime.cpp >>> old L403: assert(thread == JavaThread::current(), "threads >>> must correspond"); >>> old L406: EXCEPTION_MARK; >>> Same as for c1_Runtime1.cpp >> I assume I don't need to do anything regarding the comments above. > > Correct. Just observations on the old code. > > >>> L390: TRACE_jvmci_3("%s: entered locking slow case with >>> obj="... >>> L394: TRACE_jvmci_3("%s: exiting locking slow with obj=" >>> L417: TRACE_jvmci_3("%s: exited locking slow case with obj=" >>> But this is no longer the "slow" case so I'm a bit confused. >>> >>> Update: I see there's a comment about the tracing being >>> removed. >>> I have no opinion on that since it is JVM/CI code, but the >>> word >>> "slow" needs to be adjusted if you keep it. >> I removed all the tracing code. > > Thanks for cleaning that up. > > Dan > >> >> Many thanks, >> Yudi > From igor.ignatyev at oracle.com Wed Apr 1 19:13:09 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 1 Apr 2020 12:13:09 -0700 Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: Hi Evgeny, (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) overall that looks good to me. 
one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. > The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? // you can use tierN jobs which include both jdk and hs parts. Thanks, -- Igor > On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: > > > Hi, > > > Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 > > Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ > > > The bug had been created as a request to simplify investigation for compiler control tests failures. > I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. > The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. > > The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > > Please review, > /Evgeny Nikitin. From tom.rodriguez at oracle.com Wed Apr 1 19:56:54 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Wed, 1 Apr 2020 12:56:54 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log Message-ID: http://cr.openjdk.java.net/~never/8191930/webrev https://bugs.openjdk.java.net/browse/JDK-8191930 This was something that was fixed in 8 but never made it into 9+ I think because the code moved after 8. Tested by forcing a bailout with the problematic string and inspecting the resulting xml. 
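For context on 8191930: the compile log is an XML document, so a bailout message containing characters such as '<' or '&' makes the log unparseable unless the message is escaped before being written. A hedged sketch of the kind of escaping involved — this helper is illustrative only and is not the code from the webrev, which uses HotSpot's own string handling:

```cpp
#include <string>

// Illustrative helper, not the actual fix: escape the five XML special
// characters so a bailout message can be embedded in a compile-log
// attribute without producing unparseable output.
static std::string xml_escape(const std::string& s) {
  std::string out;
  for (char ch : s) {
    switch (ch) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&apos;"; break;
      default:   out += ch;       break;
    }
  }
  return out;
}
```

Escaping '&' must happen in a single pass as above (or first, if done with repeated substitutions), otherwise already-emitted entities would be double-escaped.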
From nils.eliasson at oracle.com Wed Apr 1 20:06:28 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Wed, 1 Apr 2020 22:06:28 +0200 Subject: RFR(S): 8241556: Memory leak if -XX:CompileCommand is set In-Reply-To: References: Message-ID: Hi Man, Your fix looks good. Thanks for fixing! Reviewed. Best regards, Nils Eliasson On 2020-03-25 00:21, Man Cao wrote: > Hi all, > > Could I have reviews for this fix for a memory leak? This memory leak is > pretty significant in production, and it took us weeks to identify the root > cause. > Webrev: https://cr.openjdk.java.net/~manc/8241556/webrev.00/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8241556 > > A more elegant fix would be to use automatic allocation/deallocation on the > char*. Unfortunately std::string and std::unique_ptr are both unavailable in > HotSpot. > > -Man From yudi.zheng at oracle.com Wed Apr 1 20:23:02 2020 From: yudi.zheng at oracle.com (Yudi Zheng) Date: Wed, 1 Apr 2020 22:23:02 +0200 Subject: RFR: 8241234: Unify monitor enter/exit runtime entries In-Reply-To: <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> References: <222D2846-F6AE-4D5B-B41F-F976D90E329C@oracle.com> <91eeada8-e05f-bc73-b029-94e169216a56@oracle.com> <534b8cf7-cd8c-565b-5163-09a216d4f94e@oracle.com> <904faf68-4fff-f1b8-2fb8-48d65f282fa2@oracle.com> <09be678a-2742-4ab4-2e91-8cb7cef2c811@oracle.com> Message-ID: Hi Dan, Thanks a lot for stress testing this patch! I will push this as soon as I get green lights from the mach5 tests. Best regards, -Yudi > On 1 Apr 2020, at 20:45, Daniel D. Daugherty wrote: > > Hi Yudi, > > I grabbed a copy of this patch: > > http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/open.patch > > pushed it into my jdk-15+16 baseline and ran it thru a single cycle of > my regular stress kit (~24 hours). There were no failures which matches > my jdk-15+16 baseline stress testing (~72 hours, no failures). 
> > I also ran it through my ObjectMonitor inflation stress kit for ~24 > hours and there were no failures there either. > > Dan > > > On 3/30/20 10:20 AM, Daniel D. Daugherty wrote: >> On 3/30/20 10:15 AM, Yudi Zheng wrote: >>> Hi Daniel, >>> >>> Thanks for the review! I have uploaded a new version with your comments addressed: >>> http://cr.openjdk.java.net/~yzheng/8241234/webrev.04/ >>> >>>> src/hotspot/share/runtime/sharedRuntime.hpp >>>> Please don't forget to update the copyright year before you push. >>> Fixed. >>> >>>> src/hotspot/share/runtime/sharedRuntime.cpp >>>> L2104: ObjectSynchronizer::exit(obj, lock, THREAD); >>>> The use of 'THREAD' here and 'TRAPS' in the function itself >>>> stand out more now, but that's something for me to clean up. >>> Also, I noticed that C2 was using CHECK >>>> ObjectSynchronizer::enter(h_obj, lock, CHECK); >>> While C1 and JVMCI were using THREAD: >>>> ObjectSynchronizer::enter(h_obj, lock->lock(), THREAD); >>> I have no idea when to use what, and hope unifying to the C2 entries would help. >>> Let me know if there is something I should address in this patch. Otherwise, I would >>> rather leave it to the expert, i.e., you ;) >> >> Yes, please leave it for me to clean up. >> >> >>>> src/hotspot/share/c1/c1_Runtime1.cpp >>>> old L718: assert(thread == JavaThread::current(), "threads must correspond"); >>>> Removed in favor of the assert in SharedRuntime::monitor_enter_helper(). >>>> Okay that makes sense. >>>> >>>> old L721: EXCEPTION_MARK; >>>> Removed in favor of the same in SharedRuntime::monitor_enter_helper(). >>>> Okay that makes sense. >>>> >>>> src/hotspot/share/jvmci/jvmciRuntime.cpp >>>> old L403: assert(thread == JavaThread::current(), "threads must correspond"); >>>> old L406: EXCEPTION_MARK; >>>> Same as for c1_Runtime1.cpp >>> I assume I don't need to do anything regarding the comments above. >> >> Correct. Just observations on the old code. 
>> >> >>>> L390: TRACE_jvmci_3("%s: entered locking slow case with obj="... >>>> L394: TRACE_jvmci_3("%s: exiting locking slow with obj=" >>>> L417: TRACE_jvmci_3("%s: exited locking slow case with obj=" >>>> But this is no longer the "slow" case so I'm a bit confused. >>>> >>>> Update: I see there's a comment about the tracing being removed. >>>> I have no opinion on that since it is JVM/CI code, but the word >>>> "slow" needs to be adjusted if you keep it. >>> I removed all the tracing code. >> >> Thanks for cleaning that up. >> >> Dan >> >>> >>> Many thanks, >>> Yudi >> > From vladimir.x.ivanov at oracle.com Wed Apr 1 20:25:48 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Wed, 1 Apr 2020 23:25:48 +0300 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: Hi Jatin, > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ Looks good. I'll submit it for testing. FTR, in the longer term I'd like to see the dedicated pass go away and the optimization migrated to GVN. I don't see any special requirements which justify additional complexity from a separate pass. Best regards, Vladimir Ivanov > This removes Optimized NotV handling for AVX3, as suggested it will be > brought via vectorIntrinsics branch. > > Thanks for your help in shaping up this patch, please let me know if there > are other comments. 
> > Best Regards, > Jatin > ________________________________________ > From: Bhateja, Jatin > Sent: Wednesday, March 25, 2020 12:14 PM > To: Vladimir Ivanov > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Vladimir, > > I have placed updated patch at following links:- > > 1) Optimized NotV handling: > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > 2) Changes for MacroLogic opt: > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > > Kindly review and let me know your feedback. > > Thanks, > Jatin > >> -----Original Message----- >> From: Vladimir Ivanov >> Sent: Wednesday, March 25, 2020 12:33 AM >> To: Bhateja, Jatin >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction >> >> Hi Jatin, >> >> I tried to submit the patches for testing, but windows-x64 build failed with the >> following errors: >> >> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not >> evaluate to a constant >> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read >> of a variable outside its lifetime >> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >> ['function']' is not assignable >> >> Best regards, >> Vladimir Ivanov >> >> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>> Hi Vladimir, >>> >>> Thanks for your comments , I have split the original patch into two sub- >> patches. >>> >>> 1) Optimized NotV handling: >>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>> >>> 2) Changes for MacroLogic opt: >>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>> >>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >> optimization. 
>>> >>> Kindly review and let me know your feedback. >>> >>> Best Regards, >>> Jatin >>> >>>> -----Original Message----- >>>> From: Vladimir Ivanov >>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>> To: Bhateja, Jatin ; hotspot-compiler- >>>> dev at openjdk.java.net >>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>> Instruction >>>> >>>> >>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>> >>>> Very nice contribution, Jatin! >>>> >>>> Some comments after a brief review pass: >>>> >>>> * Please, contribute NotV part separately. >>>> >>>> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>> transformation during GVN instead? >>>> >>>> * As of now, vector nodes are only produced by SuperWord >>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>> API is available, it can be extended to cases when vector nodes are >>>> present >>>> (C->max_vector_size() > 0). >>>> >>>> * There are more efficient ways to produce a vector of all-1s [1] [2]. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> [1] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>> 1-efficiently >>>> >>>> [2] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>> value-to-all-one-bits >>>> >>>>> >>>>> A new optimization pass has been added post Auto-Vectorization which >>>> folds expression tree involving vector boolean logic operations >>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>> Optimization pass has following stages: >>>>> >>>>> 1. 
Collection stage : >>>>> * This performs a DFS traversal over Ideal Graph and collects the root >>>> nodes of all vector logic expression trees. >>>>> 2. Processing stage: >>>>> * Performs a bottom up traversal over expression tree and >>>> simultaneously folds specific DAG patterns involving Boolean logic >>>> parent and child nodes. >>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. >>>>> * Folding is performed under a constraint on the total number of >> inputs >>>> which a MacroLogic node can have, in this case it's 3. >>>>> * A partition is created around a DAG pattern involving logic parent >> and >>>> one or two logic child node, it encapsulate the nodes in post-order fashion. >>>>> * This partition is then evaluated by traversing over the nodes, >> assigning >>>> boolean values to its inputs and performing operations over them >>>> based on its Opcode. Node along with its computed result is stored in >>>> a map which is accessed during the evaluation of its user/parent node. >>>>> * Post-evaluation a MacroLogic node is created which is equivalent to >> a >>>> three input truth-table. Expression tree leaf level inputs along with >>>> result of its evaluation are the inputs fed to this new node. >>>>> * Entire expression tree is eventually subsumed/replaced by newly >>>> create MacroLogic node. >>>>> >>>>> >>>>> Following are the JMH benchmarks results with and without changes. 
>>>>> >>>>> Without Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.086 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s >>>>> >>>>> With Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 
323.925 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s >>>>> >>>>> Please review the patch. >>>>> >>>>> Best Regards, >>>>> Jatin >>>>> >>>>> [1] Section 17.7 : >>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>> architectures-optimization-manual.pdf >>>>> From ioi.lam at oracle.com Thu Apr 2 00:00:21 2020 From: ioi.lam at oracle.com (Ioi Lam) Date: Wed, 1 Apr 2020 17:00:21 -0700 Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: <3bbe30fd-aae0-f55f-15f4-6a92ef918617@oracle.com> On 4/1/20 12:13 PM, Igor Ignatyev wrote: > Hi Evgeny, > > (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) > > overall that looks good to me. one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. 
> FYI, We've been doing a similar thing with all the CDS tests -- all the logs from ProcessTools are saved, and we print out the name of stdout/stderr files in the .jtr files. It's been very valuable in diagnosing failures. Command line: [/home/iklam/jdk/bld/fre-fastdebug/images/jdk/bin/java -cp /jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d:/jdk2/fre/open/test/hotspot/jtreg/runtime/cds/appcds:/jdk2/tmp/jtreg/work/classes/13/test/lib:/jdk/tools/jtreg/5.0-b01/lib/javatest.jar:/jdk/tools/jtreg/5.0-b01/lib/jtreg.jar -XX:MaxRAM=8g -cp /jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d/hello.jar -Xshare:dump -Xlog:cds -XX:SharedArchiveFile=/jdk2/tmp/jtreg/work/scratch/2/appcds-23h24m40s432.jsa -XX:ExtraSharedClassListFile=/jdk2/tmp/jtreg/work/classes/13/runtime/cds/appcds/HelloTest.d/HelloTest-test.classlist ] [2020-04-01T06:24:40.530164Z] Gathering output for process 22666 [ELAPSED: 3068 ms] [logging stdout to HelloTest-0000-dump.stdout] [logging stderr to HelloTest-0000-dump.stderr] Thanks - Ioi >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? > // you can use tierN jobs which include both jdk and hs parts. > > Thanks, > -- Igor > >> On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: >> >> >> Hi, >> >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 >> >> Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ >> >> >> The bug had been created as a request to simplify investigation for compiler control tests failures. >> I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. >> The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. 
>> >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). >> >> Please review, >> /Evgeny Nikitin. From david.holmes at oracle.com Thu Apr 2 00:07:31 2020 From: david.holmes at oracle.com (David Holmes) Date: Wed, 1 Apr 2020 17:07:31 -0700 (PDT) Subject: RFR(XS): 8174768: Make ProcessTools print executed process output into a separate file In-Reply-To: References: Message-ID: <70147008-45b8-0b7f-6691-50f8429c5369@oracle.com> Thanks for sharing this Igor! I'm not at all sure this is generally what we want for every single test that uses ProcessTools! But I'm willing to see it trialed. Evgeny: Please run full tier testing at least to tier 6 and ideally beyond before pushing this. There are potential implications for temporary (and more permanent) disk usage as well as additional time needed to write files out to disk. (Hopefully these are generally small enough that this doesn't make a noticeable difference.) Thanks, David On 2/04/2020 5:13 am, Igor Ignatyev wrote: > Hi Evgeny, > > (widening the audience, given this affects not just hotspot compiler, but hotspot tests as well as core libs tests in general) > > overall that looks good to me. one suggestion, for the ease of failure analysis it might be worth to print out the names of created files, although this might potentially clutter the output, I don't think it'll be a problem given we already print out things like 'Gathering output for process ...' , 'Waiting for completion...' in LazyOutputBuffer. > >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). > this doesn't include any of hotspot tiers, could you please also run hs-tier1--4? > // you can use tierN jobs which include both jdk and hs parts. 
> > Thanks, > -- Igor > >> On Mar 30, 2020, at 3:55 AM, Evgeny Nikitin wrote: >> >> >> Hi, >> >> >> Bug: https://bugs.openjdk.java.net/browse/JDK-8174768 >> >> Webrev: http://cr.openjdk.java.net/~iignatyev/enikitin/8174768/webrev.00/ >> >> >> The bug had been created as a request to simplify investigation for compiler control tests failures. >> I found the functionality pretty generic and useful and made ProcessTools dump output as well as some diagnostic information for every executed process into a separate file. >> The diagnostic information contains cmdline, exit code, stdout and stderr. The output files are named like 'pid--output.log'. >> >> The change has been tested via a mach5 test runs (jdk-tier1 through 4) on the 4 common platforms (linux-x64, windows-x64, macosx-x64, sparcv9). >> >> Please review, >> /Evgeny Nikitin. > From vladimir.kozlov at oracle.com Thu Apr 2 02:57:24 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 1 Apr 2020 19:57:24 -0700 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: On 4/1/20 3:24 AM, Erik Österlund wrote: > Hi Vladimir, > > On 2020-03-30 21:14, Vladimir Kozlov wrote: >> But you at least can do static check at the beginning of method: >> >> int MachNode::pd_alignment_required() const { >> if (VM_Version::has_intel_jcc_erratum()) { >> PhaseOutput* output = Compile::current()->output(); >> Block* block = output->block(); >> int index = output->index(); >> assert(output->mach() == this, "incorrect iterator state in PhaseOutput"); >>
if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { >> // Conservatively add worst case padding. We assume that relocInfo::addr_unit() is 1 on x86. >> return IntelJccErratum::largest_jcc_size() + 1; >> } >> } >> return 1; >> } > > That is equivalent to the compiler. I verified that by disassembling the release bits before > and after your suggestion, and it is instruction by instruction the same. In both cases it > first checks if VM_Version::has_intel_jcc_erratum(), and if not, returns before even building > a frame. I'd rather keep the not nested variant because it is equivalent, yet easier to read. I have reservations about this statement, which may not be true for all C++ compilers we use, but I will not insist on refactoring it. > >>> >>>> In compute_padding() reads are done under the check so I have less concerns about it. But I also don't get why you use saved >>>> _mach instead of using MachNode 'this'. >>> >>> Good point. I changed to this + an assert checking that they are indeed the same. >> >> Why do you need Output._mach at all if you use it only in this assert? Even logically it looks strange. In what case >> it could be different? > > It should never be different; that was the point. The index and mach node exposed by the > iterator are related and refer to the same entity. So if you use the exposed index in code > in a mach node, you must know that this mach node is the same mach node that the index refers > to, and it is. The assert was meant to enforce it so that if you were to call either the > alignment or padding function in a new context, for whatever reason, and don't happen to know > that you can't do that without having a consistent iteration state, you would immediately catch > that in the assertions, instead of getting strange silent logic errors. > > Having said that, I am okay with removing _mach if you prefer having one seat belt less, it is up to you: > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Okay. 
Good. Thanks, Vladimir > > Incremental: > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ > > Thanks, > /Erik > >> Thanks, >> Vladimir >> >>> >>> Here is an updated webrev with your concerns and Vladimir Ivanov's concerns addressed: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>> >>> Incremental: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>> >>> Thanks, >>> /Erik >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> In pd_alignment_required() you implicitly use knowledge that relocInfo::addr_unit() on x86 is 1. >>>>>> At least add comment about that. >>>>> >>>>> I can add a comment about that. >>>>> >>>>> New webrev: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>> >>>>> Incremental: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>> >>>>> Thanks, >>>>> /Erik >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> There is some platform-specific code in PhaseOutput that deals with the IntelJccErratum mitigation, >>>>>>> which is ifdef:ed in shared code. It should move to platform-specific code. >>>>>>> >>>>>>> This patch exposes the iteration state of PhaseOutput, which allows hiding the Intel-specific code >>>>>>> completely in x86-specific files. >>>>>>> >>>>>>> Webrev: >>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>> >>>>>>> Bug: >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>> >>>>>>> Thanks, >>>>>>> /Erik >>>>> >>> > From vladimir.kozlov at oracle.com Thu Apr 2 03:37:51 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 1 Apr 2020 20:37:51 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log In-Reply-To: References: Message-ID: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> Looks good. 
Thanks, Vladimir On 4/1/20 12:56 PM, Tom Rodriguez wrote: > http://cr.openjdk.java.net/~never/8191930/webrev > https://bugs.openjdk.java.net/browse/JDK-8191930 > > This was something that was fixed in 8 but never made it into 9+, I think because the code moved after 8. Tested by > forcing a bailout with the problematic string and inspecting the resulting xml. > > From nils.eliasson at oracle.com Thu Apr 2 09:28:45 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Thu, 2 Apr 2020 11:28:45 +0200 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: <869882d4-eb5a-d765-92d9-49cd389e3366@oracle.com> Hi Jatin, The patch is nice and clean. Reviewed. Best regards Nils Eliasson On 2020-04-01 20:23, Bhateja, Jatin wrote: > Hi Vladimir, > > Please find an updated unified patch at the following link. > > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > This removes the optimized NotV handling for AVX3; as suggested, it will be > brought in via the vectorIntrinsics branch. > > Thanks for your help in shaping up this patch; please let me know if there > are other comments. > > Best Regards, > Jatin > ________________________________________ > From: Bhateja, Jatin > Sent: Wednesday, March 25, 2020 12:14 PM > To: Vladimir Ivanov > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > Hi Vladimir, > > I have placed the updated patch at the following links: > > 1) Optimized NotV handling: > http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > > 2) Changes for MacroLogic opt: > http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > > Kindly review and let me know your feedback. 
> > Thanks, > Jatin > >> -----Original Message----- >> From: Vladimir Ivanov >> Sent: Wednesday, March 25, 2020 12:33 AM >> To: Bhateja, Jatin >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> >> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction >> >> Hi Jatin, >> >> I tried to submit the patches for testing, but windows-x64 build failed with the >> following errors: >> >> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did not >> evaluate to a constant >> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by a read >> of a variable outside its lifetime >> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >> ['function']' is not assignable >> >> Best regards, >> Vladimir Ivanov >> >> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>> Hi Vladimir, >>> >>> Thanks for your comments , I have split the original patch into two sub- >> patches. >>> 1) Optimized NotV handling: >>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>> >>> 2) Changes for MacroLogic opt: >>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>> >>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >> optimization. >>> Kindly review and let me know your feedback. >>> >>> Best Regards, >>> Jatin >>> >>>> -----Original Message----- >>>> From: Vladimir Ivanov >>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>> To: Bhateja, Jatin ; hotspot-compiler- >>>> dev at openjdk.java.net >>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>> Instruction >>>> >>>> >>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>> Very nice contribution, Jatin! >>>> >>>> Some comments after a brief review pass: >>>> >>>> * Please, contribute NotV part separately. >>>> >>>> * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>> transformation during GVN instead? 
>>>> >>>> * As of now, vector nodes are only produced by SuperWord >>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>> API is available, it can be extended to cases when vector nodes are >>>> present >>>> (C->max_vector_size() > 0). >>>> >>>> * There are more efficient ways to produce a vector of all-1s [1] [2]. >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> [1] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>> 1-efficiently >>>> >>>> [2] >>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>> value-to-all-one-bits >>>> >>>>> A new optimization pass has been added post Auto-Vectorization which >>>> folds expression tree involving vector boolean logic operations >>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>> Optimization pass has following stages: >>>>> >>>>> 1. Collection stage : >>>>> * This performs a DFS traversal over Ideal Graph and collects the root >>>> nodes of all vector logic expression trees. >>>>> 2. Processing stage: >>>>> * Performs a bottom up traversal over expression tree and >>>> simultaneously folds specific DAG patterns involving Boolean logic >>>> parent and child nodes. >>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding. >>>>> * Folding is performed under a constraint on the total number of >> inputs >>>> which a MacroLogic node can have, in this case it's 3. >>>>> * A partition is created around a DAG pattern involving logic parent >> and >>>> one or two logic child node, it encapsulate the nodes in post-order fashion. 
>>>>> * This partition is then evaluated by traversing over the nodes, >> assigning >>>> boolean values to its inputs and performing operations over them >>>> based on its Opcode. Node along with its computed result is stored in >>>> a map which is accessed during the evaluation of its user/parent node. >>>>> * Post-evaluation a MacroLogic node is created which is equivalent to >> a >>>> three input truth-table. Expression tree leaf level inputs along with >>>> result of its evaluation are the inputs fed to this new node. >>>>> * Entire expression tree is eventually subsumed/replaced by newly >>>> create MacroLogic node. >>>>> >>>>> Following are the JMH benchmarks results with and without changes. >>>>> >>>>> Without Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 2904.480 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2219.252 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 1507.267 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 860.926 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 470.163 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 246.608 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 108.031 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 344.633 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 209.818 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 111.678 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 53.360 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 27.888 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 12.103 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 6.018 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3110.669 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 1996.861 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 870.166 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 389.629 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.203 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 
thrpt 75.086 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.576 ops/s >>>>> >>>>> With Changes: >>>>> >>>>> Benchmark (VECLEN) Mode Cnt Score Error Units >>>>> MacroLogicOpt.workload1_caller 64 thrpt 3306.670 ops/s >>>>> MacroLogicOpt.workload1_caller 128 thrpt 2936.851 ops/s >>>>> MacroLogicOpt.workload1_caller 256 thrpt 2413.827 ops/s >>>>> MacroLogicOpt.workload1_caller 512 thrpt 1440.291 ops/s >>>>> MacroLogicOpt.workload1_caller 1024 thrpt 707.576 ops/s >>>>> MacroLogicOpt.workload1_caller 2048 thrpt 384.863 ops/s >>>>> MacroLogicOpt.workload1_caller 4096 thrpt 132.753 ops/s >>>>> MacroLogicOpt.workload2_caller 64 thrpt 450.856 ops/s >>>>> MacroLogicOpt.workload2_caller 128 thrpt 323.925 ops/s >>>>> MacroLogicOpt.workload2_caller 256 thrpt 135.191 ops/s >>>>> MacroLogicOpt.workload2_caller 512 thrpt 69.424 ops/s >>>>> MacroLogicOpt.workload2_caller 1024 thrpt 35.744 ops/s >>>>> MacroLogicOpt.workload2_caller 2048 thrpt 14.168 ops/s >>>>> MacroLogicOpt.workload2_caller 4096 thrpt 7.245 ops/s >>>>> MacroLogicOpt.workload3_caller 64 thrpt 3333.550 ops/s >>>>> MacroLogicOpt.workload3_caller 128 thrpt 2269.428 ops/s >>>>> MacroLogicOpt.workload3_caller 256 thrpt 995.691 ops/s >>>>> MacroLogicOpt.workload3_caller 512 thrpt 412.452 ops/s >>>>> MacroLogicOpt.workload3_caller 1024 thrpt 151.157 ops/s >>>>> MacroLogicOpt.workload3_caller 2048 thrpt 75.079 ops/s >>>>> MacroLogicOpt.workload3_caller 4096 thrpt 37.158 ops/s >>>>> >>>>> Please review the patch. 
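The three-input truth table that a MacroLogic node encapsulates (the imm8 operand of the AVX-512 vpternlog instruction) can be computed by evaluating the boolean expression bitwise over the standard input patterns 0xF0, 0xCC and 0xAA. A minimal sketch, with illustrative names that are not from the patch:

```java
public class MacroLogicTruthTable {
    // Standard per-bit input patterns: bit i of each constant gives the
    // value of inputs A, B, C for input combination i (0..7).
    static final int A = 0xF0, B = 0xCC, C = 0xAA;

    // imm8 encoding the function (A & B) | C, evaluated bitwise
    // over all 8 input combinations at once.
    static int immediateForAandBorC() {
        return ((A & B) | C) & 0xFF;
    }

    // The (XorV v, -1) -> (NotV v) rewrite relies on this scalar identity:
    static boolean xorAllOnesIsNot(int v) {
        return (v ^ -1) == ~v;
    }

    public static void main(String[] args) {
        System.out.printf("imm8 = 0x%02X%n", immediateForAandBorC()); // imm8 = 0xEA
        System.out.println(xorAllOnesIsNot(0x12345678));              // true
    }
}
```

The same scheme extends to any expression over three leaf inputs: substitute the pattern constants for the leaves, fold the operators, and the low 8 bits of the result are the truth-table immediate.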
>>>>> >>>>> Best Regards, >>>>> Jatin >>>>> >>>>> [1] Section 17.7 : >>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>> architectures-optimization-manual.pdf >>>>> From vladimir.x.ivanov at oracle.com Thu Apr 2 09:31:10 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 2 Apr 2020 12:31:10 +0300 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> > http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ Looks good. Best regards, Vladimir Ivanov >>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>> concerns addressed: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>> >>> Incremental: >>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>> >>> Thanks, >>> /Erik >>> >>>> Thanks, >>>> Vladimir >>>> >>>>> >>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>> At least add comment about that. >>>>> >>>>> I can add a comment about that. 
>>>>> >>>>> New webrev: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>> >>>>> Incremental: >>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>> >>>>> Thanks, >>>>> /Erik >>>>> >>>>>> Thanks, >>>>>> Vladimir >>>>>> >>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>> with the IntelJccErratum mitigation, >>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>> platform-specific code. >>>>>>> >>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>> allows hiding the Intel-specific code >>>>>>> completely in x86-specific files. >>>>>>> >>>>>>> Webrev: >>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>> >>>>>>> Bug: >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>> >>>>>>> Thanks, >>>>>>> /Erik >>>>> >>> > From erik.osterlund at oracle.com Thu Apr 2 09:36:57 2020 From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=) Date: Thu, 2 Apr 2020 11:36:57 +0200 Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> Message-ID: <9d6a7ba1-3d0c-fdb2-1e79-01ae0e8058cf@oracle.com> Hi Vladimir, Thanks for the review. /Erik On 2020-04-02 04:57, Vladimir Kozlov wrote: > On 4/1/20 3:24 AM, Erik ?sterlund wrote: >> Hi Vladimir, >> >> On 2020-03-30 21:14, Vladimir Kozlov wrote: >>> But you at least can do static check at the beginning of method: >>> >>> int MachNode::pd_alignment_required() const { >>> ? if (VM_Version::has_intel_jcc_erratum()) { >>> ??? PhaseOutput* output = Compile::current()->output(); >>> ??? 
Block* block = output->block(); >>> ??? int index = output->index(); >>> ??? assert(output->mach() == this, "incorrect iterator state in >>> PhaseOutput"); >>> ??? if (IntelJccErratum::is_jcc_erratum_branch(block, this, index)) { >>> ????? // Conservatively add worst case padding. We assume that >>> relocInfo::addr_unit() is 1 on x86. >>> ????? return IntelJccErratum::largest_jcc_size() + 1; >>> ??? } >>> ? } >>> ? return 1; >>> } >> >> That is equivalent to the compiler. I verified that by disassembling >> the release bits before >> and after your suggestion, and it is instruction by instruction the >> same. In both cases it >> first checks ifVM_Version::has_intel_jcc_erratum(), and if not, >> returns before even building >> a frame. I'd rather keep the not nested variant because it is >> equivalent, yet easier to read. > > I have reservation about this statement which may not true for all C++ > compilers we use but I will not insist on refactoring it. > >> >>>> >>>>> In compute_padding() reads done under check so I have less >>>>> concerns about it. But I also don't get why you use saved _mach >>>>> instead of using MachNode 'this'. >>>> >>>> Good point. I changed to this + an assert checking that they are >>>> indeed the same. >>> >>> Why do you need Output._mach at all if you use it only in this >>> assert? Even logically it looks strange. In what case it could be >>> different? >> >> It should never be different; that was the point. The index and mach >> node exposed by the >> iterator are related and refer to the same entity. So if you use the >> exposed index in code >> in a mach node, you must know that this mach node is the same mach >> node that the index refers >> to, and it is. 
The assert was meant to enforce it so that if you were >> to call either the >> alignment or padding function in a new context, for whatever reason, >> and don't happen to know >> that you can't do that without having a consistent iteration state, >> you would immediately catch >> that in the assertions, instead of getting strange silent logic errors. >> >> Having said that, I am okay with removing _mach if you prefer having >> one seat belt less, it is up to you: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ > > Okay. Good. > > Thanks, > Vladimir > >> >> Incremental: >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02_03/ >> >> Thanks, >> /Erik >> >>> Thanks, >>> Vladimir >>> >>>> >>>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>>> concerns addressed: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>>> >>>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>>> At least add comment about that. >>>>>> >>>>>> I can add a comment about that. >>>>>> >>>>>> New webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>>> >>>>>> Incremental: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>>>> >>>>>>> Thanks, >>>>>>> Vladimir >>>>>>> >>>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>>> with the IntelJccErratum mitigation, >>>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>>> platform-specific code. >>>>>>>> >>>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>>> allows hiding the Intel-specific code >>>>>>>> completely in x86-specific files. 
>>>>>>>> >>>>>>>> Webrev: >>>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>>> >>>>>>>> Bug: >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> /Erik >>>>>> >>>> >> From erik.osterlund at oracle.com Thu Apr 2 09:37:09 2020 From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=) Date: Thu, 2 Apr 2020 09:37:09 +0000 (UTC) Subject: RFR: 8241438: Move IntelJccErratum mitigation code to platform-specific code In-Reply-To: <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> References: <14fe1c02-520b-f5d7-5c66-dc35d63f0a0d@oracle.com> <19c75204-d036-4768-686e-834995c5e21f@oracle.com> <73e13987-92c9-c189-657e-0abe7a69ecaa@oracle.com> <7833a4ea-613a-4b4c-da23-4b3e064b2f1c@oracle.com> <7f4033fa-d380-582e-43cc-343aa6d0fe1c@oracle.com> <2c038184-d7c0-94f1-47d1-56444c8b243d@oracle.com> <92c54ba2-4c40-b28b-b687-19f7c2ae38c7@oracle.com> Message-ID: <66550012-d164-855f-7d45-087a1151cc0a@oracle.com> Hi Vladimir, Thanks for the review. /Erik On 2020-04-02 11:31, Vladimir Ivanov wrote: > >> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.03/ > > Looks good. > > Best regards, > Vladimir Ivanov > >>>> Here is an updated webrev with your concerns and Vladimir Ivanov's >>>> concerns addressed: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.02/ >>>> >>>> Incremental: >>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01_02/ >>>> >>>> Thanks, >>>> /Erik >>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>>> >>>>>>> In pd_alignment_required() you implicitly use knowledge that >>>>>>> relocInfo::addr_unit() on x86 is 1. >>>>>>> At least add comment about that. >>>>>> >>>>>> I can add a comment about that. 
>>>>>> >>>>>> New webrev: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.01/ >>>>>> >>>>>> Incremental: >>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00_01/ >>>>>> >>>>>> Thanks, >>>>>> /Erik >>>>>> >>>>>>> Thanks, >>>>>>> Vladimir >>>>>>> >>>>>>> On 3/23/20 6:09 AM, Erik ?sterlund wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> There is some platform-specific code in PhaseOutput that deals >>>>>>>> with the IntelJccErratum mitigation, >>>>>>>> which is ifdef:ed in shared code. It should move to >>>>>>>> platform-specific code. >>>>>>>> >>>>>>>> This patch exposes the iteration state of PhaseOutput, which >>>>>>>> allows hiding the Intel-specific code >>>>>>>> completely in x86-specific files. >>>>>>>> >>>>>>>> Webrev: >>>>>>>> http://cr.openjdk.java.net/~eosterlund/8241438/webrev.00/ >>>>>>>> >>>>>>>> Bug: >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241438 >>>>>>>> >>>>>>>> Thanks, >>>>>>>> /Erik >>>>>> >>>> >> From vladimir.x.ivanov at oracle.com Thu Apr 2 10:14:53 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 2 Apr 2020 13:14:53 +0300 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> Message-ID: <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> >> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > Looks good. I'll submit it for testing. Test results are clean. Best regards, Vladimir Ivanov >> This removes Optimized NotV handling for AVX3, as suggested it will be >> brought via vectorIntrinsics branch. >> >> Thanks for your help in shaping up this patch, please let me know if >> there >> are other comments. 
>> >> Best Regards, >> Jatin >> ________________________________________ >> From: Bhateja, Jatin >> Sent: Wednesday, March 25, 2020 12:14 PM >> To: Vladimir Ivanov >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >> Instruction >> >> Hi Vladimir, >> >> I have placed updated patch at following links:- >> >> ? 1)? Optimized NotV handling: >> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >> >> ? 2)? Changes for MacroLogic opt: >> ? http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ >> >> Kindly review and let me know your feedback. >> >> Thanks, >> Jatin >> >>> -----Original Message----- >>> From: Vladimir Ivanov >>> Sent: Wednesday, March 25, 2020 12:33 AM >>> To: Bhateja, Jatin >>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya >>> >>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>> Instruction >>> >>> Hi Jatin, >>> >>> I tried to submit the patches for testing, but windows-x64 build >>> failed with the >>> following errors: >>> >>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression did >>> not >>> evaluate to a constant >>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused by >>> a read >>> of a variable outside its lifetime >>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' >>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type 'int >>> ['function']' is not assignable >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> On 24.03.2020 10:34, Bhateja, Jatin wrote: >>>> Hi Vladimir, >>>> >>>> Thanks for your comments , I have split the original patch into two >>>> sub- >>> patches. >>>> >>>> 1)? Optimized NotV handling: >>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ >>>> >>>> 2)? 
Changes for MacroLogic opt: >>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ >>>> >>>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic >>> optimization. >>>> >>>> Kindly review and let me know your feedback. >>>> >>>> Best Regards, >>>> Jatin >>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Tuesday, March 17, 2020 4:31 PM >>>>> To: Bhateja, Jatin ; hotspot-compiler- >>>>> dev at openjdk.java.net >>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic >>>>> Instruction >>>>> >>>>> >>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ >>>>> >>>>> Very nice contribution, Jatin! >>>>> >>>>> Some comments after a brief review pass: >>>>> >>>>> ???? * Please, contribute NotV part separately. >>>>> >>>>> ???? * Why don't you perform (XorV v 0xFF..FF) => (NotV v) >>>>> transformation during GVN instead? >>>>> >>>>> ???? * As of now, vector nodes are only produced by SuperWord >>>>> analysis. It makes sense to limit new optimization pass to SuperWord >>>>> pass only (probably, introduce a new dedicated Phase ). Once Vector >>>>> API is available, it can be extended to cases when vector nodes are >>>>> present >>>>> (C->max_vector_size() > 0). >>>>> >>>>> ???? * There are more efficient ways to produce a vector of all-1s >>>>> [1] [2]. 
>>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> [1] >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45105 >>>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3DgJgc >>>>> qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ >>>>> 1-efficiently >>>>> >>>>> [2] >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37469 >>>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllGQTI >>>>> _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ >>>>> value-to-all-one-bits >>>>> >>>>>> >>>>>> A new optimization pass has been added post Auto-Vectorization which >>>>> folds expression tree involving vector boolean logic operations >>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. >>>>>> Optimization pass has following stages: >>>>>> >>>>>> ???? 1.? Collection stage : >>>>>> ??????? *?? This performs a DFS traversal over Ideal Graph and >>>>>> collects the root >>>>> nodes of all vector logic expression trees. >>>>>> ???? 2.? Processing stage: >>>>>> ??????? *?? Performs a bottom up traversal over expression tree and >>>>> simultaneously folds specific DAG patterns involving Boolean logic >>>>> parent and child nodes. >>>>>> ??????? *?? Transforms (XORV INP , -1) -> (NOTV INP) to promote >>>>>> logic folding. >>>>>> ??????? *?? Folding is performed under a constraint on the total >>>>>> number of >>> inputs >>>>> which a MacroLogic node can have, in this case it's 3. >>>>>> ??????? *?? A partition is created around a DAG pattern involving >>>>>> logic parent >>> and >>>>> one or two logic child node, it encapsulate the nodes in post-order >>>>> fashion. >>>>>> ??????? *?? This partition is then evaluated by traversing over >>>>>> the nodes, >>> assigning >>>>> boolean values to its inputs and performing operations over them >>>>> based on its Opcode. Node along with its computed result is stored in >>>>> a map which is accessed during the evaluation of its user/parent node. >>>>>> ??????? *?? 
Post-evaluation a MacroLogic node is created which is >>>>>> equivalent to >>> a >>>>> three input truth-table. Expression tree leaf level inputs along with >>>>> result of its evaluation are the inputs fed to this new node. >>>>>> ??????? *?? Entire expression tree is eventually subsumed/replaced >>>>>> by newly >>>>> create MacroLogic node. >>>>>> >>>>>> >>>>>> Following are the JMH benchmarks results with and without changes. >>>>>> >>>>>> Without Changes: >>>>>> >>>>>> Benchmark??????????????????????????? (VECLEN)?? Mode? Cnt >>>>>> Score?? Error? Units >>>>>> MacroLogicOpt.workload1_caller???????????? 64? thrpt >>>>>> 2904.480????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 128? thrpt >>>>>> 2219.252????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 256? thrpt >>>>>> 1507.267????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 512? thrpt >>>>>> 860.926????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 1024? thrpt >>>>>> 470.163????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 2048? thrpt >>>>>> 246.608????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 4096? thrpt >>>>>> 108.031????????? ops/s >>>>>> MacroLogicOpt.workload2_caller???????????? 64? thrpt >>>>>> 344.633????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 128? thrpt >>>>>> 209.818????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 256? thrpt >>>>>> 111.678????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 512? thrpt >>>>>> 53.360????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 1024? thrpt >>>>>> 27.888????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 2048? thrpt >>>>>> 12.103????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 4096? thrpt >>>>>> 6.018????????? ops/s >>>>>> MacroLogicOpt.workload3_caller???????????? 64? thrpt >>>>>> 3110.669????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 128? thrpt >>>>>> 1996.861????????? 
ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 256? thrpt >>>>>> 870.166????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 512? thrpt >>>>>> 389.629????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 1024? thrpt >>>>>> 151.203????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 2048? thrpt >>>>>> 75.086????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 4096? thrpt >>>>>> 37.576????????? ops/s >>>>>> >>>>>> With Changes: >>>>>> >>>>>> Benchmark??????????????????????????? (VECLEN)?? Mode? Cnt >>>>>> Score?? Error? Units >>>>>> MacroLogicOpt.workload1_caller???????????? 64? thrpt >>>>>> 3306.670????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 128? thrpt >>>>>> 2936.851????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 256? thrpt >>>>>> 2413.827????????? ops/s >>>>>> MacroLogicOpt.workload1_caller??????????? 512? thrpt >>>>>> 1440.291????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 1024? thrpt >>>>>> 707.576????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 2048? thrpt >>>>>> 384.863????????? ops/s >>>>>> MacroLogicOpt.workload1_caller?????????? 4096? thrpt >>>>>> 132.753????????? ops/s >>>>>> MacroLogicOpt.workload2_caller???????????? 64? thrpt >>>>>> 450.856????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 128? thrpt >>>>>> 323.925????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 256? thrpt >>>>>> 135.191????????? ops/s >>>>>> MacroLogicOpt.workload2_caller??????????? 512? thrpt >>>>>> 69.424????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 1024? thrpt >>>>>> 35.744????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 2048? thrpt >>>>>> 14.168????????? ops/s >>>>>> MacroLogicOpt.workload2_caller?????????? 4096? thrpt >>>>>> 7.245????????? ops/s >>>>>> MacroLogicOpt.workload3_caller???????????? 64? thrpt >>>>>> 3333.550????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 128? 
thrpt >>>>>> 2269.428????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 256? thrpt >>>>>> 995.691????????? ops/s >>>>>> MacroLogicOpt.workload3_caller??????????? 512? thrpt >>>>>> 412.452????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 1024? thrpt >>>>>> 151.157????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 2048? thrpt >>>>>> 75.079????????? ops/s >>>>>> MacroLogicOpt.workload3_caller?????????? 4096? thrpt >>>>>> 37.158????????? ops/s >>>>>> >>>>>> Please review the patch. >>>>>> >>>>>> Best Regards, >>>>>> Jatin >>>>>> >>>>>> [1] Section 17.7 : >>>>>> https://urldefense.com/v3/__https://software.intel.com/sites/default >>>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG >>>>>> QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ >>>>>> architectures-optimization-manual.pdf >>>>>> From rwestrel at redhat.com Thu Apr 2 14:14:04 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:14:04 +0200 Subject: RFR(S): 8239072: subtype check macro node causes node budget to be exhausted In-Reply-To: <736f1832-b44c-162d-35fb-fbad07a84c39@oracle.com> References: <87d09llldp.fsf@redhat.com> <62ef48e0-fae8-38cc-7a48-2deb0f054cdd@oracle.com> <87v9nbjilg.fsf@redhat.com> <63a6f167-1d5b-7624-b4e6-0f2b89707b00@oracle.com> <3d3cc6d7-6b6c-ebbc-d28e-7350c50c5f58@oracle.com> <875zekewj7.fsf@redhat.com> <736f1832-b44c-162d-35fb-fbad07a84c39@oracle.com> Message-ID: <87mu7uc6vn.fsf@redhat.com> Thanks for the reviews, Vladimir & Vladimir. Roland. 
From HORIE at jp.ibm.com Thu Apr 2 14:27:10 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Thu, 2 Apr 2020 23:27:10 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> References: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, I'm not a reviewer, but I can run your benchmark in my local P9 node if you share it. Best regards, Michihiro ----- Original message ----- From: Corey Ashford Sent by: "hotspot-compiler-dev" To: hotspot-compiler-dev at openjdk.java.net Cc: Subject: [EXTERNAL] RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Tue, Mar 31, 2020 7:52 AM Hello, This is my first OpenJDK patch for review. It increases the performance of byte reversal for Integer.reverseBytes() and Long.reverseBytes() on Power9 via its VSX xxbrw and xxbrd vector instructions. https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.openjdk.java.net_browse_JDK-2D8241874&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=oecsIpYF-cifqq2i1JEH0Q&m=Q0ug0imG7nRw-N8m1U0RobPS3M9D2mmT8nY3GnID3io&s=TXqhnYzhTVyILKGJBOpWSmqe-iP6ixmCAqwxYT19K8E&e= https://urldefense.proofpoint.com/v2/url?u=http-3A__cr.openjdk.java.net_-7Egromero_8241874_v1_&d=DwICaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=oecsIpYF-cifqq2i1JEH0Q&m=Q0ug0imG7nRw-N8m1U0RobPS3M9D2mmT8nY3GnID3io&s=1elFXKQoR_CB9mG6g4TM0z5-Da27XveB77RBXKwQi3I&e= I have tested on Power9 and see a 38%+ performance improvement on Long.reverseBytes() and 15%+ on Integer.reverseBytes(). (I add the + because the benchmark code has a fair amount of fixed overhead). Testing on Power8 reveals no regressions. I believe the patch itself is pretty self-explanatory. It adds definitions for four instructions that are needed to get the data in and out of the vector registers, and to perform the reversal operation, and it adds the instructs to use them.
Also VM_Version::initialize() autodetects that the instructions are available, and warns for trying to set the UseVectorByteReverseInstructionsPPC64 flag on earlier Power processors that don't possess these PowerISA 3.0 instructions. Thanks to Michihiro Horie, Jose Ricardo Ziviani, and Gustav Romero for their help! Please review this patch. Thanks for your consideration, Corey Ashford From rwestrel at redhat.com Thu Apr 2 14:35:42 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:35:42 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: <87tv2ef536.fsf@redhat.com> References: <87tv2ef536.fsf@redhat.com> Message-ID: <87k12yc5vl.fsf@redhat.com> > http://cr.openjdk.java.net/~roland/8241041/webrev.00/ Anyone else for this? Roland. From rwestrel at redhat.com Thu Apr 2 14:36:29 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 02 Apr 2020 16:36:29 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: <874kubfked.fsf@redhat.com> References: <874kubfked.fsf@redhat.com> Message-ID: <87h7y2c5ua.fsf@redhat.com> > This is required to backport 8237086 (assert(is_MachReturn()) running > CTW with fix for JDK-8231291). > > Original bug: > https://bugs.openjdk.java.net/browse/JDK-8217230 > http://hg.openjdk.java.net/jdk/jdk12/rev/1b292ae4eb50 > > Original patch does not apply cleanly to 11u because context changed in > compile.hpp. Patch is otherwise identical. > > 11u webrev: > http://cr.openjdk.java.net/~roland/8217230.11u/webrev.00/ > > Testing: x86_64 build, tier1 + tier2 Anyone for this review? Roland. 
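A side note on the reverseBytes thread above: the Power9 xxbrd instruction performs the same byte reversal that Long.reverseBytes() specifies, which can also be written out with plain shifts and masks. The sketch below is illustrative only (the class name is made up); it is not the intrinsic or the proposed patch, just the scalar dataflow the single vector instruction replaces:

```java
class ReverseBytesSketch {
    // Byte-swap a 64-bit value with shifts and masks (what xxbrd does in one instruction).
    static long reverse64(long v) {
        // Swap adjacent bytes within each 16-bit unit.
        v = ((v & 0x00FF00FF00FF00FFL) << 8)  | ((v >>> 8)  & 0x00FF00FF00FF00FFL);
        // Swap 16-bit units within each 32-bit word.
        v = ((v & 0x0000FFFF0000FFFFL) << 16) | ((v >>> 16) & 0x0000FFFF0000FFFFL);
        // Swap the two 32-bit halves.
        return (v << 32) | (v >>> 32);
    }

    public static void main(String[] args) {
        long x = 0x1122334455667788L;
        System.out.printf("%016x%n", ReverseBytesSketch.reverse64(x));      // prints 8877665544332211
        System.out.println(ReverseBytesSketch.reverse64(x) == Long.reverseBytes(x)); // prints true
    }
}
```

Each of the three steps halves the distance bytes still have to travel — adjacent bytes, then 16-bit units, then 32-bit halves — which is the log2(8)-step shuffle that a dedicated byte-reverse instruction collapses into one operation.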
From jatin.bhateja at intel.com Thu Apr 2 17:09:16 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Thu, 2 Apr 2020 17:09:16 +0000 Subject: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction In-Reply-To: <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> References: <373cb1c9-80f3-ec08-7d43-cbd3202bc134@oracle.com> <72e4bd89-3f56-2894-dced-dd5f3f06e66e@oracle.com> Message-ID: Thanks Nils , Vladimir. Changes have been pushed. http://hg.openjdk.java.net/jdk/jdk/rev/29d878d3af35 Best Regards, Jatin > -----Original Message----- > From: Vladimir Ivanov > Sent: Thursday, April 2, 2020 3:45 PM > To: Bhateja, Jatin > Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > > Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic Instruction > > > >> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.05/ > > > > Looks good. I'll submit it for testing. > > Test results are clean. > > Best regards, > Vladimir Ivanov > > >> This removes Optimized NotV handling for AVX3, as suggested it will > >> be brought via vectorIntrinsics branch. > >> > >> Thanks for your help in shaping up this patch, please let me know if > >> there are other comments. > >> > >> Best Regards, > >> Jatin > >> ________________________________________ > >> From: Bhateja, Jatin > >> Sent: Wednesday, March 25, 2020 12:14 PM > >> To: Vladimir Ivanov > >> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > >> Subject: RE: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >> Instruction > >> > >> Hi Vladimir, > >> > >> I have placed updated patch at following links:- > >> > >> ? 1)? Optimized NotV handling: > >> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > >> > >> ? 2)? Changes for MacroLogic opt: > >> ? http://cr.openjdk.java.net/~jbhateja/8241040/webrev.03_over_notV/ > >> > >> Kindly review and let me know your feedback. 
> >> > >> Thanks, > >> Jatin > >> > >>> -----Original Message----- > >>> From: Vladimir Ivanov > >>> Sent: Wednesday, March 25, 2020 12:33 AM > >>> To: Bhateja, Jatin > >>> Cc: hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya > >>> > >>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >>> Instruction > >>> > >>> Hi Jatin, > >>> > >>> I tried to submit the patches for testing, but windows-x64 build > >>> failed with the following errors: > >>> > >>> src/hotspot/share/opto/compile.cpp(2345): error C2131: expression > >>> did not evaluate to a constant > >>> src/hotspot/share/opto/compile.cpp(2345): note: failure was caused > >>> by a read of a variable outside its lifetime > >>> src/hotspot/share/opto/compile.cpp(2345): note: see usage of 'partition' > >>> src/hotspot/share/opto/compile.cpp(2404): error C3863: array type > >>> 'int ['function']' is not assignable > >>> > >>> Best regards, > >>> Vladimir Ivanov > >>> > >>> On 24.03.2020 10:34, Bhateja, Jatin wrote: > >>>> Hi Vladimir, > >>>> > >>>> Thanks for your comments , I have split the original patch into two > >>>> sub- > >>> patches. > >>>> > >>>> 1)? Optimized NotV handling: > >>>> http://cr.openjdk.java.net/~jbhateja/8241484/webrev.01_notV/ > >>>> > >>>> 2)? Changes for MacroLogic opt: > >>>> http://cr.openjdk.java.net/~jbhateja/8241040/webrev.02_over_notV/ > >>>> > >>>> Added a new flag "UseVectorMacroLogic" which guards MacroLogic > >>> optimization. > >>>> > >>>> Kindly review and let me know your feedback. > >>>> > >>>> Best Regards, > >>>> Jatin > >>>> > >>>>> -----Original Message----- > >>>>> From: Vladimir Ivanov > >>>>> Sent: Tuesday, March 17, 2020 4:31 PM > >>>>> To: Bhateja, Jatin ; hotspot-compiler- > >>>>> dev at openjdk.java.net > >>>>> Subject: Re: RFR[M] : 8241040 : Support for AVX-512 Ternary Logic > >>>>> Instruction > >>>>> > >>>>> > >>>>>> Path : http://cr.openjdk.java.net/~jbhateja/8241040/webrev.01/ > >>>>> > >>>>> Very nice contribution, Jatin! 
> >>>>> > >>>>> Some comments after a brief review pass: > >>>>> > >>>>> ???? * Please, contribute NotV part separately. > >>>>> > >>>>> ???? * Why don't you perform (XorV v 0xFF..FF) => (NotV v) > >>>>> transformation during GVN instead? > >>>>> > >>>>> ???? * As of now, vector nodes are only produced by SuperWord > >>>>> analysis. It makes sense to limit new optimization pass to > >>>>> SuperWord pass only (probably, introduce a new dedicated Phase ). > >>>>> Once Vector API is available, it can be extended to cases when > >>>>> vector nodes are present > >>>>> (C->max_vector_size() > 0). > >>>>> > >>>>> ???? * There are more efficient ways to produce a vector of all-1s > >>>>> [1] [2]. > >>>>> > >>>>> Best regards, > >>>>> Vladimir Ivanov > >>>>> > >>>>> [1] > >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/45 > >>>>> 105 > >>>>> 164/set-all-bits-in-cpu-register-to-__;!!GqivPVa7Brio!MlFds91TF3Dg > >>>>> Jgc qfllGQTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZDsDgcGo$ > >>>>> 1-efficiently > >>>>> > >>>>> [2] > >>>>> https://urldefense.com/v3/__https://stackoverflow.com/questions/37 > >>>>> 469 > >>>>> 930/fastest-way-to-set-m256-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqfllG > >>>>> QTI _RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZyDyHLYM$ > >>>>> value-to-all-one-bits > >>>>> > >>>>>> > >>>>>> A new optimization pass has been added post Auto-Vectorization > >>>>>> which > >>>>> folds expression tree involving vector boolean logic operations > >>>>> (ANDV/ORV/NOTV/XORV) into a MacroLogic node. > >>>>>> Optimization pass has following stages: > >>>>>> > >>>>>> ???? 1.? Collection stage : > >>>>>> ??????? *?? This performs a DFS traversal over Ideal Graph and > >>>>>> collects the root > >>>>> nodes of all vector logic expression trees. > >>>>>> ???? 2.? Processing stage: > >>>>>> ??????? *?? 
Performs a bottom up traversal over expression tree and simultaneously folds specific DAG patterns involving Boolean logic parent and child nodes.
> >>>>>> * Transforms (XORV INP , -1) -> (NOTV INP) to promote logic folding.
> >>>>>> * Folding is performed under a constraint on the total number of inputs which a MacroLogic node can have, in this case it's 3.
> >>>>>> * A partition is created around a DAG pattern involving logic parent and one or two logic child node, it encapsulate the nodes in post-order fashion.
> >>>>>> * This partition is then evaluated by traversing over the nodes, assigning boolean values to its inputs and performing operations over them based on its Opcode. Node along with its computed result is stored in a map which is accessed during the evaluation of its user/parent node.
> >>>>>> * Post-evaluation a MacroLogic node is created which is equivalent to a three input truth-table. Expression tree leaf level inputs along with result of its evaluation are the inputs fed to this new node.
> >>>>>> * Entire expression tree is eventually subsumed/replaced by newly create MacroLogic node.
> >>>>>>
> >>>>>> Following are the JMH benchmarks results with and without changes.
> >>>>>>
> >>>>>> Without Changes:
> >>>>>>
> >>>>>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller             64  thrpt        2904.480  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            128  thrpt        2219.252  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            256  thrpt        1507.267  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            512  thrpt         860.926  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           1024  thrpt         470.163  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           2048  thrpt         246.608  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           4096  thrpt         108.031  ops/s
> >>>>>> MacroLogicOpt.workload2_caller             64  thrpt         344.633  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            128  thrpt         209.818  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            256  thrpt         111.678  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            512  thrpt          53.360  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           1024  thrpt          27.888  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           2048  thrpt          12.103  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           4096  thrpt           6.018  ops/s
> >>>>>> MacroLogicOpt.workload3_caller             64  thrpt        3110.669  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            128  thrpt        1996.861  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            256  thrpt         870.166  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            512  thrpt         389.629  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           1024  thrpt         151.203  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           2048  thrpt          75.086  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           4096  thrpt          37.576  ops/s
> >>>>>>
> >>>>>> With Changes:
> >>>>>>
> >>>>>> Benchmark                            (VECLEN)   Mode  Cnt     Score   Error  Units
> >>>>>> MacroLogicOpt.workload1_caller             64  thrpt        3306.670  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            128  thrpt        2936.851  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            256  thrpt        2413.827  ops/s
> >>>>>> MacroLogicOpt.workload1_caller            512  thrpt        1440.291  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           1024  thrpt         707.576  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           2048  thrpt         384.863  ops/s
> >>>>>> MacroLogicOpt.workload1_caller           4096  thrpt         132.753  ops/s
> >>>>>> MacroLogicOpt.workload2_caller             64  thrpt         450.856  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            128  thrpt         323.925  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            256  thrpt         135.191  ops/s
> >>>>>> MacroLogicOpt.workload2_caller            512  thrpt          69.424  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           1024  thrpt          35.744  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           2048  thrpt          14.168  ops/s
> >>>>>> MacroLogicOpt.workload2_caller           4096  thrpt           7.245  ops/s
> >>>>>> MacroLogicOpt.workload3_caller             64  thrpt        3333.550  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            128  thrpt        2269.428  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            256  thrpt         995.691  ops/s
> >>>>>> MacroLogicOpt.workload3_caller            512  thrpt         412.452  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           1024  thrpt         151.157  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           2048  thrpt          75.079  ops/s
> >>>>>> MacroLogicOpt.workload3_caller           4096  thrpt          37.158  ops/s
> >>>>>>
> >>>>>> Please review the patch.
> >>>>>> > >>>>>> Best Regards, > >>>>>> Jatin > >>>>>> > >>>>>> [1] Section 17.7 : > >>>>>> https://urldefense.com/v3/__https://software.intel.com/sites/defa > >>>>>> ult > >>>>>> /files/managed/9e/bc/64-ia-32-__;!!GqivPVa7Brio!MlFds91TF3DgJgcqf > >>>>>> llG QTI_RakrAkQOtkS55W-_GnxBn24dcdHvIdHOIYQLslpZHxy2FdU$ > >>>>>> architectures-optimization-manual.pdf > >>>>>> From tom.rodriguez at oracle.com Thu Apr 2 17:58:09 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Thu, 2 Apr 2020 10:58:09 -0700 Subject: RFR(XS) 8191930: [Graal] emits unparseable XML into compile log In-Reply-To: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> References: <6cb3928e-56e7-6fae-18e7-802792d4a6a7@oracle.com> Message-ID: <51121a1b-8c2a-dbf5-286f-a7815fac064b@oracle.com> Thanks! tom Vladimir Kozlov wrote on 4/1/20 8:37 PM: > Looks good. > > Thanks, > Vladimir > > On 4/1/20 12:56 PM, Tom Rodriguez wrote: >> http://cr.openjdk.java.net/~never/8191930/webrev >> https://bugs.openjdk.java.net/browse/JDK-8191930 >> >> This was something that was fixed in 8 but never made it into 9+ I >> think because the code moved after 8.? Tested by forcing a bailout >> with the problematic string and inspecting the resulting xml. >> >> From tom.rodriguez at oracle.com Thu Apr 2 19:12:39 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Thu, 2 Apr 2020 12:12:39 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives Message-ID: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> http://cr.openjdk.java.net/~never/8231756/webrev https://bugs.openjdk.java.net/browse/JDK-8231756 This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new unit tests exercise the deoptimization. mach5 testing is in progress. 
tom From nils.eliasson at oracle.com Thu Apr 2 19:37:56 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Thu, 2 Apr 2020 21:37:56 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: <87k12yc5vl.fsf@redhat.com> References: <87tv2ef536.fsf@redhat.com> <87k12yc5vl.fsf@redhat.com> Message-ID: Looks good! Review. Best regards, Nils Eliasson On 2020-04-02 16:35, Roland Westrelin wrote: >> http://cr.openjdk.java.net/~roland/8241041/webrev.00/ > Anyone else for this? > > Roland. > From cjashfor at linux.ibm.com Thu Apr 2 23:07:31 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Thu, 2 Apr 2020 16:07:31 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine.

Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong
{
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) + " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt
{
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) + " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From ningsheng.jian at arm.com Fri Apr 3 02:41:04 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 10:41:04 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: Message-ID: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Hi Pengfei, On 3/31/20 5:32 PM, Pengfei Li wrote: > Hi, > > Please help review this another missing node support for AArch64.
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 > Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ > Just took a close look before pushing your code, and I think this line can be removed? + effect(TEMP_DEF dst); Thanks, Ningsheng From Pengfei.Li at arm.com Fri Apr 3 05:48:05 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Fri, 3 Apr 2020 05:48:05 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <110347ce-0629-c5ff-d072-080094570f09@arm.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: Hi, > Just took a close look before pushing your code, and I think this line can be > removed? > > + effect(TEMP_DEF dst); Yes, thanks for pointing out. It is redundant since I don't use temps this time. I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ -- Thanks, Pengfei From shade at redhat.com Fri Apr 3 07:30:24 2020 From: shade at redhat.com (Aleksey Shipilev) Date: Fri, 3 Apr 2020 09:30:24 +0200 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 Message-ID: Build bug: https://bugs.openjdk.java.net/browse/JDK-8242073 immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from x86_64.ad helps. 
Matched the operand block order and internal format of x86_32.ad with this patch:

diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad
--- a/src/hotspot/cpu/x86/x86_32.ad	Fri Apr 03 07:27:53 2020 +0100
+++ b/src/hotspot/cpu/x86/x86_32.ad	Fri Apr 03 09:29:33 2020 +0200
@@ -3367,10 +3367,19 @@
   op_cost(5);
   format %{ %}
   interface(CONST_INTER);
 %}
 
+operand immU8() %{
+  predicate((0 <= n->get_int()) && (n->get_int() <= 255));
+  match(ConI);
+
+  op_cost(5);
+  format %{ %}
+  interface(CONST_INTER);
+%}
+
 operand immI16() %{
   predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767));
   match(ConI);
 
   op_cost(10);

Testing: x86_32 build -- Thanks, -Aleksey From rwestrel at redhat.com Fri Apr 3 07:51:37 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 03 Apr 2020 09:51:37 +0200 Subject: RFR(S): 8241041: C2: "assert((Value(phase) == t) || (t != TypeInt::CC_GT && t != TypeInt::CC_EQ)) failed: missing Value() optimization" still happens after fix for 8239335 In-Reply-To: References: <87tv2ef536.fsf@redhat.com> <87k12yc5vl.fsf@redhat.com> Message-ID: <87eet5c8hi.fsf@redhat.com> Thanks for the review, Nils! Roland. From manc at google.com Fri Apr 3 08:42:53 2020 From: manc at google.com (Man Cao) Date: Fri, 3 Apr 2020 01:42:53 -0700 Subject: RFR(S): 8241556: Memory leak if -XX:CompileCommand is set In-Reply-To: References: Message-ID: Thanks for the reviews! -Man From rwestrel at redhat.com Fri Apr 3 08:55:10 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 03 Apr 2020 10:55:10 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost Message-ID: <878sjdc5jl.fsf@redhat.com> http://cr.openjdk.java.net/~roland/8241900/webrev.00/ When a loop is unswitched, the now redundant test in the loop bodies is changed so it always fails or succeeds. Data nodes that are control dependent on the test become control dependent on the dominating control. In the test case: 1) the loop is unswitched once.
The test that's hoisted is:

if (o3 != null) {

2) the loop is unswitched a second time. This time, the hoisted test is:

if (o != null) {

3) that test has a control dependent CastPP. That CastPP becomes dependent on the dominating test:

if (o2 == null) {

that test never fails so it's compiled as a test + uncommon trap

4) partial peeling is applied

The chain of tests is now:

if (array[1] != null) { // hoisted o3 != null by unswitching
  if (objectField != null) { // hoisted o != null by unswitching
    if (array[1] != null) { // peeled o2 == null
      // CastPP on objectField is here

5) because the 3rd test is identical to the first one this becomes:

if (array[1] != null) { // hoisted o3 != null by unswitching
  // CastPP on objectField is here
  if (objectField != null) { // hoisted o != null by unswitching

So the CastPP bypasses the null check on its input and so a dependent load can flow above the null check. The fix I propose is to keep the dependence on the hoisted test on loop unswitching by using dominated_by() instead of short_circuit_if(). This way on steps 2) and 3) above, the CastPP is made dependent on the hoisted test so reordering of the CastPP with its null check can't happen. Roland. From aph at redhat.com Fri Apr 3 08:56:34 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 3 Apr 2020 09:56:34 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: On 4/3/20 6:48 AM, Pengfei Li wrote: > Yes, thanks for pointing out. It is redundant since I don't use temps this time. > > I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ Please push. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From ningsheng.jian at arm.com Fri Apr 3 09:11:15 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 17:11:15 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: <6c0bcfbd-118c-3fa7-96f7-7e832314a05c@arm.com> On 4/3/20 4:56 PM, Andrew Haley wrote: > On 4/3/20 6:48 AM, Pengfei Li wrote: >> Yes, thanks for pointing out. It is redundant since I don't use temps this time. >> >> I've updated and rebased the patch. See http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.02/ > > Please push. > Pushed. Thanks, Ningsheng From adinn at redhat.com Fri Apr 3 09:13:40 2020 From: adinn at redhat.com (Andrew Dinn) Date: Fri, 3 Apr 2020 10:13:40 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <110347ce-0629-c5ff-d072-080094570f09@arm.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: On 03/04/2020 03:41, Ningsheng Jian wrote: > Hi Pengfei, > > On 3/31/20 5:32 PM, Pengfei Li wrote: >> Hi, >> >> Please help review this another missing node support for AArch64. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >> > > Just took a close look before pushing your code, and I think this line > can be removed? > > +? effect(TEMP_DEF dst); Strictly, I think this is correct but I don't think it matters. I believe this usage is meant to identify a case where a generated multi-instruction sequence uses the output register (i.e. dst = target of Set) both as an output in the final instruction and as an intermediate scratch register in intervening instructions. That is the case for both these rules. 
The only way that might make a difference is if the back end were able to interleave instructions in other generated sequences with the instructions generated by this rule during instruction scheduling (or, say, via peephole rules). However, I don't believe that can happen given the current adlc code and AArch64 rules. n.b. there are several other examples of TEMP_DEF use in aarch64.ad. I am not sure that they are the only ones where a dst register is used as both output and intermediary (we will only find out by carefully eyeballing every rule). regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill From aph at redhat.com Fri Apr 3 09:22:30 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 3 Apr 2020 10:22:30 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> Message-ID: <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> On 4/3/20 10:13 AM, Andrew Dinn wrote: > On 03/04/2020 03:41, Ningsheng Jian wrote: >> Hi Pengfei, >> >> On 3/31/20 5:32 PM, Pengfei Li wrote: >>> Hi, >>> >>> Please help review this another missing node support for AArch64. >>> >>> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >>> >> >> Just took a close look before pushing your code, and I think this line >> can be removed? >> >> +  effect(TEMP_DEF dst); > Strictly, I think this is correct but I don't think it matters. > > I believe this usage is meant to identify a case where a generated > multi-instruction sequence uses the output register (i.e. dst = target > of Set) both as an output in the final instruction and as an > intermediate scratch register in intervening instructions. That is the > case for both these rules.
More simply, it prevents the situation where the same register is used as both an output and an input. With these patterns that doesn't matter. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vladimir.x.ivanov at oracle.com Fri Apr 3 09:27:02 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Fri, 3 Apr 2020 12:27:02 +0300 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 In-Reply-To: References: Message-ID: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> Looks good and trivial. Best regards, Vladimir Ivanov On 03.04.2020 10:30, Aleksey Shipilev wrote:
> Build bug:
> https://bugs.openjdk.java.net/browse/JDK-8242073
>
> immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from
> x86_64.ad helps. Matched the operand block order and internal format of x86_32.ad with this patch:
>
> diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad
> --- a/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 07:27:53 2020 +0100
> +++ b/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 09:29:33 2020 +0200
> @@ -3367,10 +3367,19 @@
>    op_cost(5);
>    format %{ %}
>    interface(CONST_INTER);
>  %}
>
> +operand immU8() %{
> +  predicate((0 <= n->get_int()) && (n->get_int() <= 255));
> +  match(ConI);
> +
> +  op_cost(5);
> +  format %{ %}
> +  interface(CONST_INTER);
> +%}
> +
>  operand immI16() %{
>    predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767));
>    match(ConI);
>
>    op_cost(10);
>
> Testing: x86_32 build
>
From shade at redhat.com Fri Apr 3 09:43:20 2020 From: shade at redhat.com (Aleksey Shipilev) Date: Fri, 3 Apr 2020 11:43:20 +0200 Subject: RFR (XS) 8242073: x86_32 build failure after JDK-8241040 In-Reply-To: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> References: <88a77b32-7b4d-087d-3fe3-6fa154156e92@oracle.com> Message-ID: <8d690500-67db-8c9d-424d-a836f9d49a61@redhat.com> Thanks, pushed.
On 4/3/20 11:27 AM, Vladimir Ivanov wrote: > Looks good and trivial. > > Best regards, > Vladimir Ivanov > > On 03.04.2020 10:30, Aleksey Shipilev wrote: >> Build bug: >> https://bugs.openjdk.java.net/browse/JDK-8242073 >> >> immU8 is undefined in x86_32.ad, so new matchers in x86.ad fail. Copying immU8 definition from >> x86_64.ad helps. Matched the operand block order and internal format of x86_32.ad with this patch: >> >> diff -r f50a7df94744 src/hotspot/cpu/x86/x86_32.ad >> --- a/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 07:27:53 2020 +0100 >> +++ b/src/hotspot/cpu/x86/x86_32.ad Fri Apr 03 09:29:33 2020 +0200 >> @@ -3367,10 +3367,19 @@ >> op_cost(5); >> format %{ %} >> interface(CONST_INTER); >> %} >> >> +operand immU8() %{ >> + predicate((0 <= n->get_int()) && (n->get_int() <= 255)); >> + match(ConI); >> + >> + op_cost(5); >> + format %{ %} >> + interface(CONST_INTER); >> +%} >> + >> operand immI16() %{ >> predicate((-32768 <= n->get_int()) && (n->get_int() <= 32767)); >> match(ConI); >> >> op_cost(10); >> >> Testing: x86_32 build >> > -- Thanks, -Aleksey From ningsheng.jian at arm.com Fri Apr 3 10:00:38 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Fri, 3 Apr 2020 18:00:38 +0800 Subject: [aarch64-port-dev ] RFR(S): 8241475: AArch64: Add missing support for PopCountVI node In-Reply-To: <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> References: <110347ce-0629-c5ff-d072-080094570f09@arm.com> <9b007363-0380-3d6a-8df6-f0afca4c50d5@redhat.com> Message-ID: <34dcff53-5afc-29c2-6086-e0d66882026c@arm.com> On 4/3/20 5:22 PM, Andrew Haley wrote: > On 4/3/20 10:13 AM, Andrew Dinn wrote: >> On 03/04/2020 03:41, Ningsheng Jian wrote: >>> Hi Pengfei, >>> >>> On 3/31/20 5:32 PM, Pengfei Li wrote: >>>> Hi, >>>> >>>> Please help review this another missing node support for AArch64. 
>>>> >>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8241475 >>>> Webrev: http://cr.openjdk.java.net/~pli/rfr/8241475/webrev.01/ >>>> >>> >>> Just took a close look before pushing your code, and I think this line >>> can be removed? >>> >>> + effect(TEMP_DEF dst); >> Strictly, I think this is correct but I don't think it matters. >> >> I believe this usage is meant to identify a case where a generated >> multi-instruction sequence uses the output register (i.e. dst = target >> of Set) both as an output in the final instruction and as an >> intermediate scratch register in intervening instructions. That is the >> case for both these rules. > > More simply, it prevents the situation where the same register is used as both > an output and an input. With these patterns that doesn't matter. > Yeah, in this code block dst and src don't need to be different regs. Thanks, Ningsheng From Yang.Zhang at arm.com Fri Apr 3 10:49:06 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 3 Apr 2020 10:49:06 +0000 Subject: RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I Message-ID: Hi, Could you please help to review this patch? In the original reduce_add2I, dst may be the same as tmp2, which may produce an incorrect result. The code format of some reduction operation instructs is also cleaned up.
JBS: https://bugs.openjdk.java.net/browse/JDK-8241911 Webrev: http://cr.openjdk.java.net/~yzhang/8241911/webrev.00/ Regards Yang From tobias.hartmann at oracle.com Fri Apr 3 11:21:25 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 13:21:25 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR Message-ID: Hi, please review the following patch that removes some dead code: https://bugs.openjdk.java.net/browse/JDK-8242090 http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ Thanks, Tobias From claes.redestad at oracle.com Fri Apr 3 11:54:50 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Fri, 3 Apr 2020 13:54:50 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: Message-ID: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> Looks good to me! /Claes On 2020-04-03 13:21, Tobias Hartmann wrote: > Hi, > > please review the following patch that removes some dead code: > https://bugs.openjdk.java.net/browse/JDK-8242090 > http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ > > Thanks, > Tobias > From tobias.hartmann at oracle.com Fri Apr 3 11:59:36 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 13:59:36 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> Message-ID: <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Thanks Claes! Best regards, Tobias On 03.04.20 13:54, Claes Redestad wrote: > Looks good to me! 
> > /Claes > > On 2020-04-03 13:21, Tobias Hartmann wrote: >> Hi, >> >> please review the following patch that removes some dead code: >> https://bugs.openjdk.java.net/browse/JDK-8242090 >> http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ >> >> Thanks, >> Tobias >> From tobias.hartmann at oracle.com Fri Apr 3 13:41:29 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 15:41:29 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 Message-ID: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8241997 http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ When merging the fix for JDK-8238759 [1] into the Valhalla repo, we've noticed that some of our tests started to fail because their C2 IR matching rules detected that cloned, non-escaping array allocations are no longer scalar replaced (for example, [2]). The problem is that the scalar replacement code still expects ArrayCopyNode::Dest to be an AddPNode.
From tobias.hartmann at oracle.com Fri Apr 3 14:08:28 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 16:08:28 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Message-ID: Claes pointed out that lir_word_align is unused as well: http://cr.openjdk.java.net/~thartmann/8242090/webrev.01/ Thanks, Tobias On 03.04.20 13:59, Tobias Hartmann wrote: > Thanks Claes! > > Best regards, > Tobias > > On 03.04.20 13:54, Claes Redestad wrote: >> Looks good to me! >> >> /Claes >> >> On 2020-04-03 13:21, Tobias Hartmann wrote: >>> Hi, >>> >>> please review the following patch that removes some dead code: >>> https://bugs.openjdk.java.net/browse/JDK-8242090 >>> http://cr.openjdk.java.net/~thartmann/8242090/webrev.00/ >>> >>> Thanks, >>> Tobias >>> From tobias.hartmann at oracle.com Fri Apr 3 14:08:49 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 3 Apr 2020 16:08:49 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <875zegd5r0.fsf@redhat.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <875zegd5r0.fsf@redhat.com> Message-ID: <0e1907ea-db37-d354-ee76-f6cf8ca0af0a@oracle.com> Thanks Roland! Best regards, Tobias On 03.04.20 16:05, Roland Westrelin wrote: > >> http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ > > Looks good to me. > > Roland. 
> From claes.redestad at oracle.com Fri Apr 3 14:15:22 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Fri, 3 Apr 2020 16:15:22 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> Message-ID: <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> On 2020-04-03 16:08, Tobias Hartmann wrote: > Claes pointed out that lir_word_align is unused as well: > http://cr.openjdk.java.net/~thartmann/8242090/webrev.01/ Looks good, lir_fpop_raw also looked unused, but seems to be used on x86_32 only. I'm not sure it's worth the trouble guarding its use with X86 && NOT_LP64..? /Claes From nils.eliasson at oracle.com Fri Apr 3 15:29:07 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Fri, 3 Apr 2020 17:29:07 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> Message-ID: <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Hi, Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What are the additional changes in PhaseMacroExpand::process_users_of_allocation and PhaseMacroExpand::can_eliminate_allocation motivated by? Regards, Nils On 2020-04-03 15:41, Tobias Hartmann wrote: > Hi, > > please review the following patch: > https://bugs.openjdk.java.net/browse/JDK-8241997 > http://cr.openjdk.java.net/~thartmann/8241997/webrev.00/ > > When merging the fix for JDK-8238759 [1] into the Valhalla repo, we've noticed that some of our test > started to fail because their C2 IR matching rules detected that cloned, non-escaping array > allocations are no longer scalar replaced (for example, [2]). > > The problem is that the scalar replacement code still expects ArrayCopyNode::Dest to be an AddPNode. 
> I've verified that my fix re-enables scalar replacement. The related Valhalla tests now pass. > > Thanks, > Tobias > > [1] https://bugs.openjdk.java.net/browse/JDK-8238759 > [2] > http://hg.openjdk.java.net/valhalla/valhalla/file/00010b44d679/test/hotspot/jtreg/compiler/valhalla/valuetypes/TestArrays.java#l672 From vladimir.kozlov at oracle.com Fri Apr 3 17:31:32 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 3 Apr 2020 10:31:32 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> Message-ID: <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> Hi Tom, I looked at the testing results and one test fails consistently: compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java Vladimir K On 4/2/20 12:12 PM, Tom Rodriguez wrote: > http://cr.openjdk.java.net/~never/8231756/webrev > https://bugs.openjdk.java.net/browse/JDK-8231756 > > This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way > that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new unit > tests exercise the deoptimization. mach5 testing is in progress.
> > tom From tom.rodriguez at oracle.com Fri Apr 3 19:37:49 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Fri, 3 Apr 2020 12:37:49 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> Message-ID: <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> Vladimir Kozlov wrote on 4/3/20 10:31 AM: > Hi Tom, > > I looked on testing results and one test fails consistently: > > compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem unrelated to me. tom > > > Vladimir K > > On 4/2/20 12:12 PM, Tom Rodriguez wrote: >> http://cr.openjdk.java.net/~never/8231756/webrev >> https://bugs.openjdk.java.net/browse/JDK-8231756 >> >> This adds support for deoptimizing with non-byte primitive values >> stored on top of a byte array, similarly to the way that a double or >> long can be stored on top of 2 int fields. More detail is provided in >> the bug report and new unit tests exercise the deoptimization. mach5 >> testing is in progress.
>> tom From vladimir.x.ivanov at oracle.com Fri Apr 3 23:12:30 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Sat, 4 Apr 2020 02:12:30 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes Message-ID: Hi, Following up on review requests of API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API: http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.) Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in C2 JIT-compiler. As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fallback behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)." The rest of JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values. It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available.
Vector API talk from JVMLS'18 [5] contains a brief overview of the JVM implementation and some details. Complete implementation resides in vector-unstable branch of panama/dev repository [6]. Now to gory details (the patch is split in multiple "sub-webrevs"): =========================================================== (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ Ideal vector nodes for new operations introduced by Vector API. (Platform-specific back end support will be posted for review separately). =========================================================== (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ JVM Java interface (VectorSupport) and intrinsic support in C2. Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before EA pass. Vectors have 2-level on-heap representation: a primitive array is used as the backing storage for the vector value and it is encapsulated in a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains an int[8] instance which is used to store the vector value). Unless VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with VectorBox node and using it during expansion. Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over. =========================================================== (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].)
The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or are spilled on the stack and it allows avoiding boxing/unboxing when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata and all the heavy-lifting happens only when rematerialization takes place. The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point thus enabling original instance to go away.) =========================================================== (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ HotSpot changes for jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-module jdk.incubator.vector" on the command line to be able to use it. When user does that, JVM automatically enables Vector API support. It improves usability (user doesn't need to separately "open" the API and enable JVM support) while minimizing risks of destabilization from new code when the API is not used. That's it! Will be happy to answer any questions. And thanks in advance for any feedback!
Best regards, Vladimir Ivanov [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html [2] https://openjdk.java.net/jeps/338 [3] https://openjdk.java.net/projects/panama/ [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From forax at univ-mlv.fr Fri Apr 3 23:31:11 2020 From: forax at univ-mlv.fr (Remi Forax) Date: Sat, 4 Apr 2020 01:31:11 +0200 (CEST) Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: Message-ID: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> [...] > (4) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ > > HotSpot changes for jdk.incubator.vector module. Vector support is > makred experimental and turned off by default. JEP 338 proposes the API > to be released as an incubator module, so a user has to specify > "--add-module jdk.incubator.vector" on the command line to be able to > use it. Typo, it's --add-modules > When user does that, JVM automatically enables Vector API support. > It improves usability (user doesn't need to separately "open" the API > and enable JVM support) while minimizing risks of destabilitzation from > new code when the API is not used. Question: what if I declare a module-info that requires "jdk.incubator.vector"? In that case, I don't have to add --add-modules jdk.incubator.vector on the command line, but will the VM still enable the Vector API support?
regards, Rémi > > Best regards, > Vladimir Ivanov > > [0] > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html > > [1] > https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html > > [2] https://openjdk.java.net/jeps/338 > > [3] https://openjdk.java.net/projects/panama/ > > [4] > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html > > [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf > > [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 > > $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.x.ivanov at oracle.com Fri Apr 3 23:52:03 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Sat, 4 Apr 2020 02:52:03 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> Message-ID: <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> > Typo, it's --add-modules Good catch, Remi. Thanks for the correction. > >> When user does that, JVM automatically enables Vector API support. >> It improves usability (user doesn't need to separately "open" the API >> and enable JVM support) while minimizing risks of destabilitzation from >> new code when the API is not used. > > Question, what if i declare a module-info that requires "jdk.incubator.vector", because in that case, i don't have to add --add-modules jdk.incubator.vector on the command line, but does the VM will enable the Vector API support ? Good point. JEP 11: "Incubator Modules" [1] states the following: "Applications on the class path must use the --add-modules command-line option to request that an incubator module be resolved.
Applications developed as modules can specify requires or requires transitive dependences upon an incubator module directly." Current implementation doesn't distinguish whether the module is resolved for an application on the class path or by another module, so JVM support will be enabled by default in both cases. Do you see any problems with that? Best regards, Vladimir Ivanov [1] https://openjdk.java.net/jeps/11 >> That's it! Will be happy to answer any questions. >> >> And thanks in advance for any feedback! > > regards, > Rémi > >> >> Best regards, >> Vladimir Ivanov >> >> [0] >> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >> >> [1] >> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >> >> [2] https://openjdk.java.net/jeps/338 >> >> [3] https://openjdk.java.net/projects/panama/ >> >> [4] >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >> >> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >> >> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >> >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From sandhya.viswanathan at intel.com Sat Apr 4 00:16:57 2020 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Sat, 4 Apr 2020 00:16:57 +0000 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes Message-ID: Hi, Following up on review requests of API [0], Java implementation [1] and General Hotspot changes [2] for Vector API, here's a request for review of x86 backend changes required for supporting the API: JEP: https://openjdk.java.net/jeps/338 JBS: https://bugs.openjdk.java.net/browse/JDK-8223347 Webrev: http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/ Complete implementation resides in vector-unstable branch of panama/dev repository [3].
Looking forward to your feedback. Best Regards, Sandhya [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html [1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html [3] https://openjdk.java.net/projects/panama/ $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.kozlov at oracle.com Sat Apr 4 00:41:46 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 3 Apr 2020 17:41:46 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> Message-ID: <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> I think new code in deoptimize.cpp should be JVMCI specific. I filed 8242150 for the serviceability test failures in testing. It seems caused by recent changes. It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: #ifdef _LP64 jlong res = (jlong) *((jlong *) &val); #else #ifdef SPARC // For SPARC we have to swap high and low words. We haven't supported such a configuration in eons. I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it should be guarded by #if. Thanks, Vladimir On 4/3/20 12:37 PM, Tom Rodriguez wrote: > > > Vladimir Kozlov wrote on 4/3/20 10:31 AM: >> Hi Tom, >> >> I looked on testing results and one test fails consistently: >> >> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java > > Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem > unrelated to me.
> > tom > >> >> >> Vladimir K >> >> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>> http://cr.openjdk.java.net/~never/8231756/webrev >>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>> >>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the way >>> that a double or long can be stored on top of 2 int fields.? More detail is provided in the bug report and new unit >>> tests exercise the deoptimization.? mach5 testing is in progress. >>> >>> tom From forax at univ-mlv.fr Sat Apr 4 12:18:34 2020 From: forax at univ-mlv.fr (forax at univ-mlv.fr) Date: Sat, 4 Apr 2020 14:18:34 +0200 (CEST) Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> Message-ID: <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> ----- Mail original ----- > De: "Vladimir Ivanov" > ?: "Remi Forax" > Cc: "hotspot-dev" , "hotspot compiler" , > "panama-dev at openjdk.java.net'" > Envoy?: Samedi 4 Avril 2020 01:52:03 > Objet: Re: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes >> Typo, it's --add-modules > > Good catch, Remi. Thanks for the correction. > >> >>> When user does that, JVM automatically enables Vector API support. >>> It improves usability (user doesn't need to separately "open" the API >>> and enable JVM support) while minimizing risks of destabilitzation from >>> new code when the API is not used. >> >> Question, what if i declare a module-info that requires "jdk.incubator.vector", >> because in that case, i don't have to add --add-modules jdk.incubator.vector on >> the command line, but does the VM will enable the Vector API support ? > > Good point. 
JEP 11: "Incubator Modules" [1] states the following: > > "Applications on the class path must use the --add-modules command-line > option to request that an incubator module be resolved. Applications > developed as modules can specify requires or requires transitive > dependences upon an incubator module directly." > > Current implementation doesn't distinguish whether the module is > resolved for an application on the class path or by another module, so > JVM support will be enabled by default in both cases. Do you see any > problems with that? So the VM support is enabled either because there is an explicit --add-modules or because the module is transitively reachable from the root modules. It means that it doesn't work if the module jdk.incubator.vector is loaded using a ModuleLayer. Users have to use -XX:+EnableVectorSupport in that case. regards, Rémi > > Best regards, > Vladimir Ivanov > > [1] https://openjdk.java.net/jeps/11 > >>> That's it! Will be happy to answer any questions. >>> >>> And thanks in advance for any feedback!
>> >> regards, >> R?mi >> >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> [0] >>> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >>> >>> [1] >>> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >>> >>> [2] https://openjdk.java.net/jeps/338 >>> >>> [3] https://openjdk.java.net/projects/panama/ >>> >>> [4] >>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >>> >>> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >>> >>> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >>> > >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From Alan.Bateman at oracle.com Sat Apr 4 12:37:29 2020 From: Alan.Bateman at oracle.com (Alan Bateman) Date: Sat, 4 Apr 2020 13:37:29 +0100 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> Message-ID: On 04/04/2020 13:18, forax at univ-mlv.fr wrote: > : > So the VM supports is enabled either because there is an explicit --add-modules or because the module is transitively reachable from the root modules. > It means that it doesn't work if the module jdk.incubator.vector is loaded using a ModuleLayer. Users has to use XX:+EnableVectorSupport in that case. > Is jdk.incubator.vector is mapped to the boot loader? If so then it can't be loaded into a child layer. 
-Alan From tobias.hartmann at oracle.com Mon Apr 6 06:10:54 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:10:54 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> Message-ID: <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> On 03.04.20 16:15, Claes Redestad wrote: > lir_fpop_raw also looked unused, but seems to be used on x86_32 only. > I'm not sure it's worth the trouble guarding its use with X86 && > NOT_LP64..? I gave it a quick try but I don't think it's worth sprinkling additional #ifdefs into the enum and the shared code in c1_LinearScan.cpp. I've simply removed the unused fpop_raw() method: http://cr.openjdk.java.net/~thartmann/8242090/webrev.02/ Best regards, Tobias From tobias.hartmann at oracle.com Mon Apr 6 06:23:40 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:23:40 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Message-ID: Hi Nils, thanks for the review! On 03.04.20 17:29, Nils Eliasson wrote: > Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What > are the additional changes in PhaseMacroExpand::process_users_of_allocation and > PhaseMacroExpand::can_eliminate_allocation motivated by? Changes in 'can_eliminate_allocation' - line 675: Check is always false since an allocation result is not connected to a clonebasic through an AddP anymore. - line 686: Instead, clonebasic is now directly connected to the allocation through the ArrayCopyNode::Dest input. 
Changes to 'process_users_of_allocation': - line 970: This is a bit hard to follow in the webrev. I've moved the clonebasic handling from the use->is_AddP() branch to the use->is_ArrayCopy() branch, again because the clonebasic is now directly connected through the result cast and not indirectly through an AddP. Best regards, Tobias From tobias.hartmann at oracle.com Mon Apr 6 06:34:45 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 08:34:45 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: <87h7y2c5ua.fsf@redhat.com> References: <874kubfked.fsf@redhat.com> <87h7y2c5ua.fsf@redhat.com> Message-ID: Hi Roland, looks good. Best regards, Tobias On 02.04.20 16:36, Roland Westrelin wrote: > >> This is required to backport 8237086 (assert(is_MachReturn()) running >> CTW with fix for JDK-8231291). >> >> Original bug: >> https://bugs.openjdk.java.net/browse/JDK-8217230 >> http://hg.openjdk.java.net/jdk/jdk12/rev/1b292ae4eb50 >> >> Original patch does not apply cleanly to 11u because context changed in >> compile.hpp. Patch is otherwise identical. >> >> 11u webrev: >> http://cr.openjdk.java.net/~roland/8217230.11u/webrev.00/ >> >> Testing: x86_64 build, tier1 + tier2 > > Anyone for this review? > > Roland. > From rwestrel at redhat.com Mon Apr 6 07:17:15 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Mon, 06 Apr 2020 09:17:15 +0200 Subject: [11u] 8217230: assert(t == t_no_spec) failure in NodeHash::check_no_speculative_types() In-Reply-To: References: <874kubfked.fsf@redhat.com> <87h7y2c5ua.fsf@redhat.com> Message-ID: <87369hccck.fsf@redhat.com> Thanks for the review. Roland. 
From nils.eliasson at oracle.com Mon Apr 6 07:23:50 2020 From: nils.eliasson at oracle.com (Nils Eliasson) Date: Mon, 6 Apr 2020 09:23:50 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> Message-ID: <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> Thanks for the explanation. I think there will be more opportunities for cleaning up cloning optimizations. The array-clone should just be the special case of an acopy where the full array is copied and which can't fault on an index or type check. Your change fixes a performance issue I have seen, but I didn't understand that I had caused it :) Best regards, // Nils On 2020-04-06 08:23, Tobias Hartmann wrote: > Hi Nils, > > thanks for the review! > > On 03.04.20 17:29, Nils Eliasson wrote: >> Nice find - but not all changes in macro.cpp seems related to what was caused by JDK-8238759. What >> are the additional changes in PhaseMacroExpand::process_users_of_allocation and >> PhaseMacroExpand::can_eliminate_allocation motivated by? > Changes in 'can_eliminate_allocation' > - line 675: Check is always false since an allocation result is not connected to a clonebasic > through an AddP anymore. > - line 686: Instead, clonebasic is now directly connected to the allocation through the > ArrayCopyNode::Dest input.
> > Best regards, > Tobias From tobias.hartmann at oracle.com Mon Apr 6 07:31:25 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 09:31:25 +0200 Subject: [15] RFR(S): 8241997: Scalar replacement of cloned array is broken after JDK-8238759 In-Reply-To: <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> References: <8a6527a1-f8dc-0306-58ea-93bbc4671775@oracle.com> <902ed245-ee7e-a98e-cf36-ff96bff79245@oracle.com> <45926bbf-4388-8dbd-9e32-2d1ea00b7e5d@oracle.com> Message-ID: <9105edc0-5f4f-bb4d-40db-e610828b204a@oracle.com> Hi Nils, On 06.04.20 09:23, Nils Eliasson wrote: > I think there will be more opportunities for cleaning up cloning optimizations. The array-clone > should just be the special case of an acopy where the full array is copied and which can't fault on > an index or type check. Yes, we should try to get rid of most of the remaining is_clonebasic special-casing. > Your change fixes a performance issue I have seen, but I didn't understand that I had caused it :) Okay, great! :) Thanks, Tobias From tobias.hartmann at oracle.com Mon Apr 6 07:48:50 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 09:48:50 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <878sjdc5jl.fsf@redhat.com> References: <878sjdc5jl.fsf@redhat.com> Message-ID: <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> Hi Roland, On 03.04.20 10:55, Roland Westrelin wrote: > The fix I propose is to keep the dependence on the hoisted test on loop > unswitching by using dominated_by() instead of short_circuit_if(). This > way on step 2) 3) above, the CastPP is made dependent on the hoisted > test so reordering of the CastPP with its null check can't happen. This seems reasonable but I'm wondering if that doesn't enable incorrect re-ordering of dependent data nodes with other tests in-between the original and the hoisted test?
I.e., without your fix, data nodes are made dependent on the test "just above" the unswitched test. With your fix, they are dependent on the hoisted test outside of the loop body. Please add the appropriate affects versions to the bug. Also, please add a link to the JBS bug to your RFRs. Best regards, Tobias From vladimir.x.ivanov at oracle.com Mon Apr 6 08:02:10 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 6 Apr 2020 11:02:10 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: <579633672.1221535.1585956671384.JavaMail.zimbra@u-pem.fr> <4e61bd37-8f8d-1f35-c1db-7826db2b0f53@oracle.com> <801330324.1421417.1586002714453.JavaMail.zimbra@u-pem.fr> Message-ID: >> So the VM support is enabled either because there is an explicit >> --add-modules or because the module is transitively reachable from the >> root modules. >> It means that it doesn't work if the module jdk.incubator.vector is >> loaded using a ModuleLayer. Users have to use -XX:+EnableVectorSupport >> in that case. >> > Is jdk.incubator.vector mapped to the boot loader? If so then it > can't be loaded into a child layer. Yes, jdk.incubator.vector is a boot module. The reason to put it there is so that the JVM can trust final instance fields. Since the module extensively uses VM annotations, it has to be either a boot or a platform module in order to have access to them, but in the case of a platform module the existing logic for trusting final instance fields doesn't work and all such fields would have to be marked as @Stable instead.
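The boot-loader point above can be checked from plain Java. A small sketch, using java.base as a stand-in because jdk.incubator.vector may not be present in a given JDK build: modules defined to the boot loader report a null class loader and are resolved into the boot layer, never a child layer.

```java
public class BootModuleCheck {
    public static void main(String[] args) {
        // java.base stands in for jdk.incubator.vector here; both are
        // boot modules, so the same checks apply.
        Module base = Object.class.getModule();
        System.out.println(base.getName());                // java.base
        System.out.println(base.getClassLoader() == null); // true: defined to the boot loader
        // Boot modules live in the boot layer.
        System.out.println(ModuleLayer.boot().findModule("java.base").isPresent()); // true
    }
}
```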
Best regards, Vladimir Ivanov From rwestrel at redhat.com Mon Apr 6 08:34:42 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Mon, 06 Apr 2020 10:34:42 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> Message-ID: <87zhbpau71.fsf@redhat.com> Hi Tobias, Thanks for looking at this. > This seems reasonable but I'm wondering if that doesn't enable incorrect re-ordering of dependent > data nodes with other tests in-between the original and the hoisted test? I.e., without your fix, > data nodes are made dependent on the test "just above" the unswitched test. With your fix, they are > dependent on the hoisted test outside of the loop body. I've been wondering about that too but couldn't find a scenario where it would go wrong. dominated_by() is what's used when an if is replaced by a dominating if with the same condition in PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: we add a dominating if, and then remove the loop copies because they are redundant. > Please add the appropriate affects versions to the bug. Also, please add a link to the JBS bug to > your RFRs. Sorry about that, I keep forgetting. Roland. From tobias.hartmann at oracle.com Mon Apr 6 08:51:53 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 10:51:53 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <87zhbpau71.fsf@redhat.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> Message-ID: <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> On 06.04.20 10:34, Roland Westrelin wrote: > I've been wondering about that too but couldn't find a scenario where it > would go wrong.
dominated_by() is what's used when an if is replaced by a > dominating if with the same condition in > PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: > we add a dominating if, and then remove the loop copies because they are > redundant. Right, I couldn't find such a scenario either and, as you've pointed out, the same problem would exist at other places as well. Looks good. Best regards, Tobias From claes.redestad at oracle.com Mon Apr 6 10:08:58 2020 From: claes.redestad at oracle.com (Claes Redestad) Date: Mon, 6 Apr 2020 12:08:58 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> Message-ID: On 2020-04-06 08:10, Tobias Hartmann wrote: > > > On 03.04.20 16:15, Claes Redestad wrote: >> lir_fpop_raw also looked unused, but seems to be used on x86_32 only. >> I'm not sure it's worth the trouble guarding its use with X86 && >> NOT_LP64..? > > I gave it a quick try but I don't think it's worth sprinkling additional #ifdefs into the enum and > the shared code in c1_LinearScan.cpp. I've simply removed the unused fpop_raw() method: > http://cr.openjdk.java.net/~thartmann/8242090/webrev.02/ Still looks good (and trivial).
/Claes > > Best regards, > Tobias > From tobias.hartmann at oracle.com Mon Apr 6 10:10:20 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 6 Apr 2020 12:10:20 +0200 Subject: [15] RFR(T): 8242090: Remove dead code from c1_LIR In-Reply-To: References: <84a5e6e1-7441-002d-7564-94d9eb1d4f31@oracle.com> <40e00a69-7b3e-bef2-8b84-a71659611bc3@oracle.com> <9ea01eb1-6498-7bdf-416f-675afd621110@oracle.com> <3d836412-09dc-d67b-7839-22942808fe65@oracle.com> Message-ID: <2af9fa0b-9a40-c83f-7736-a33f16d76483@oracle.com> On 06.04.20 12:08, Claes Redestad wrote: > Still looks good (and trivial). Thanks again! Pushed. Best regards, Tobias From vladimir.x.ivanov at oracle.com Mon Apr 6 13:38:12 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 6 Apr 2020 16:38:12 +0300 Subject: Polymorphic Guarded Inlining in C2 In-Reply-To: <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com> <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> Message-ID: <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> I see 2 directions (mostly independent) to proceed: (1) use existing profiling info only; and (2) when more profile info is available. I suggest exploring them independently. There's enough profiling data available to introduce a polymorphic case with 2 major receivers ("2-poly"), and it'll complete the matrix of possible shapes. Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic shapes: "N-morphic" and "N-poly". The only difference between them is what happens on the fallback path - deopt / uncommon trap or a virtual call. Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which should be extended to 2 cases, leading to 2 parameters: an aggregated major receiver percentage and a minimum individual percentage. Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which aligns 2-poly with the bimorphic case.
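The shapes being enumerated here correspond to how many distinct receiver types the interpreter records at a call site (up to TypeProfileWidth). A hedged Java sketch of a site that profiles as bimorphic; the type names are invented for illustration, not taken from the patch:

```java
// A call site with exactly two hot receiver types: with the default
// TypeProfileWidth=2 this profiles as bimorphic, and C2 can guard and
// inline both targets (UseBimorphicInlining). Add a third receiver
// type and the profile overflows, making the site megamorphic with a
// virtual-call fallback.
interface Shape {
    double area();
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

final class Circle implements Shape {
    final double radius;
    Circle(double radius) { this.radius = radius; }
    public double area() { return Math.PI * radius * radius; }
}

public class MorphismDemo {
    static double total(Shape[] shapes) {
        double t = 0;
        for (Shape s : shapes) {
            t += s.area(); // bimorphic invokeinterface site
        }
        return t;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2.0), new Square(3.0), new Circle(1.0) };
        System.out.println(total(shapes)); // 13.0 + Math.PI
    }
}
```

Note the call is an invokeinterface, which is the case the last paragraph below suggests treating separately from invokevirtual.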
And, as I mentioned before, IMO it's promising to distinguish invokevirtual and invokeinterface cases. So, additional flag to control that would be useful. Regarding N-poly/N-morphic case, they can be generalized from 2-poly/bi-morphic cases. I believe experiments on 2-poly will provide useful insights on N-poly/N-morphic, so it makes sense to start with 2-poly first. Best regards, Vladimir Ivanov On 01.04.2020 01:29, Vladimir Kozlov wrote: > Looks like graphs were stripped from email. I put them on GitHub: > > > > > > > > > Also Vladimir Ivanov forwarded me data he collected. > > His next data shows that profiling is not "free". Vladimir I. limited to > tier3 (-XX:TieredStopAtLevel=3, C1 compilation with profiling code) to > show that profiling code with TPW=8 is slower. Note, with 4 tiers this > may not visible because execution will be switched to C2 compiled code > (without profiling code). > > > > > > > Next data collected for proposed patch. Vladimir I. collected data for > several flags configurations. > Next graphs are for one of settings:' -XX:+UsePolymorphicInlining > -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4' > > > > > > > It has mixed data but most benchmarks are not affected. Which means we > need to spend more time on proposed changes. > > Vladimir K > > On 3/31/20 10:39 AM, Vladimir Kozlov wrote: >> I start loking on it. >> >> I think ideally TypeProfileWidth should be per call site and not per >> method - and it will require more complicated implementation (an other >> RFE). But for experiments I think setting it to 8 (or higher) for all >> methods is okay. >> >> Note, more profiling lines per each call site is cost few Mb in >> CodeCache (overestimation 20K nmethods * 10 call sites * 6 * 8 bytes) >> vs very complicated code to have dynamic number of lines. 
>> >> I think we should first investigate best heuristics for inlining vs >> direct call vs vcall vs uncommmont traps for polymorphic cases and >> worry about memory and time consumption during profiling later. >> >> I did some performance runs with latest JDK 15 for TypeProfileWidth=8 >> vs =2 and don't see much difference for spec benchmarks (see attached >> graph - grey dots mean no significant difference). But there are >> regressions (red dots) for Renessance which includes some modern >> benchmarks. >> >> I will work his week to get similar data with Ludovic's patch. >> >> I am for incremental approach. I think we can start/push based on what >> Ludovic is currently suggesting (do more processing for TPW > 2) while >> preserving current default behaviour (for TPW <= 2). But only if it >> gives improvements in these benchmarks. We use these benchmarks as >> criteria for JDK releases. >> >> Regards, >> Vladimir >> >> On 3/20/20 4:52 PM, Ludovic Henry wrote: >>> Hi Vladimir, >>> >>> As requested offline, please find following the latest version of the >>> patch. Contrary to what was discussed >>> initially, I haven't done the work to support per-method >>> TypeProfileWidth, as that requires to extend the >>> existing CompilerDirectives to be available to the Interpreter. For >>> me to achieve that work, I would need >>> guidance on how to approach the problem, and what your expectations are. >>> >>> Thank you, >>> >>> -- >>> Ludovic >>> >>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> index 4ed93169c7..bad9cddf20 100644 >>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>> @@ -1731,7 +1731,7 @@ void >>> InterpreterMacroAssembler::record_item_in_profile_helper(Register >>> item, Reg >>> ??????????? Label found_null; >>> ??????????? jccb(Assembler::zero, found_null); >>> ??????????? // Item did not match any saved item and there is no >>> empty row for it. 
>>> -????????? // Increment total counter to indicate polymorphic case. >>> +????????? // Increment total counter to indicate megamorphic case. >>> ??????????? increment_mdp_data_at(mdp, non_profiled_offset); >>> ??????????? jmp(done); >>> ??????????? bind(found_null); >>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>> b/src/hotspot/share/ci/ciCallProfile.hpp >>> index 73854806ed..c5030149bf 100644 >>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>> @@ -38,7 +38,8 @@ private: >>> ??? friend class ciMethod; >>> ??? friend class ciMethodHandle; >>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>> +? bool _is_megamorphic;????????? // whether the call site is >>> megamorphic >>> ??? int? _limit;??????????????? // number of receivers have been >>> determined >>> ??? int? _morphism;???????????? // determined call site's morphism >>> ??? int? _count;??????????????? // # times has this call been executed >>> @@ -47,6 +48,8 @@ private: >>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>> ??? ciCallProfile() { >>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>> can't be smaller than TypeProfileWidth"); >>> +??? _is_megamorphic = false; >>> ????? _limit = 0; >>> ????? _morphism??? = 0; >>> ????? _count = -1; >>> @@ -58,6 +61,8 @@ private: >>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>> ? public: >>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>> + >>> ??? // Note:? The following predicates return false for invalid >>> profiles: >>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>> ??? int?????? morphism() const????????? 
{ return _morphism; } >>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>> b/src/hotspot/share/ci/ciMethod.cpp >>> index d771be8dac..c190919708 100644 >>> --- a/src/hotspot/share/ci/ciMethod.cpp >>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int >>> bci) { >>> ??????????? // If we extend profiling to record methods, >>> ??????????? // we will set result._method also. >>> ????????? } >>> -??????? // Determine call site's morphism. >>> +??????? // Determine call site's megamorphism. >>> ????????? // The call site count is 0 with known morphism (only 1 or >>> 2 receivers) >>> ????????? // or < 0 in the case of a type check failure for >>> checkcast, aastore, instanceof. >>> -??????? // The call site count is > 0 in the case of a polymorphic >>> virtual call. >>> +??????? // The call site count is > 0 in the case of a megamorphic >>> virtual call. >>> ????????? if (morphism > 0 && morphism == result._limit) { >>> ???????????? // The morphism <= MorphismLimit. >>> -?????????? if ((morphism >> -?????????????? (morphism == ciCallProfile::MorphismLimit && count == >>> 0)) { >>> +?????????? if ((morphism >> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>> ? #ifdef ASSERT >>> ?????????????? if (count > 0) { >>> ???????????????? this->print_short_name(tty); >>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>> ???????????????? this->print_codes(); >>> -?????????????? assert(false, "this call site should not be >>> polymorphic"); >>> +?????????????? assert(false, "this call site should not be >>> megamorphic"); >>> ?????????????? } >>> ? #endif >>> -???????????? result._morphism = morphism; >>> +?????????? } else { >>> +????????????? result._is_megamorphic = true; >>> ???????????? } >>> ????????? } >>> +??????? result._morphism = morphism; >>> ????????? // Make the count consistent if this is a call profile. If >>> count is >>> ????????? 
// zero or less, presume that this is a typecheck profile and >>> ????????? // do nothing.? Otherwise, increase count to be the sum of all >>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* >>> receiver, int receiver_count) { >>> ??? } >>> ??? _receiver[i] = receiver; >>> ??? _receiver_count[i] = receiver_count; >>> -? if (_limit < MorphismLimit) _limit++; >>> +? if (_limit < TypeProfileWidth) _limit++; >>> ? } >>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>> b/src/hotspot/share/opto/c2_globals.hpp >>> index d605bdb7bd..e4a5e7ea8b 100644 >>> --- a/src/hotspot/share/opto/c2_globals.hpp >>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>> @@ -389,9 +389,16 @@ >>> ??? product(bool, UseBimorphicInlining, >>> true,???????????????????????????????? \ >>> ??????????? "Profiling based inlining for two >>> receivers")???????????????????? \ >>> >>> \ >>> +? product(bool, UsePolymorphicInlining, >>> true,?????????????????????????????? \ >>> +????????? "Profiling based inlining for two or more >>> receivers")???????????? \ >>> + >>> \ >>> ??? product(bool, UseOnlyInlinedBimorphic, >>> true,????????????????????????????? \ >>> ??????????? "Don't use BimorphicInlining if can't inline a second >>> method")??? \ >>> >>> \ >>> +? product(bool, UseOnlyInlinedPolymorphic, >>> true,??????????????????????????? \ >>> +????????? "Don't use PolymorphicInlining if can't inline a secondary >>> "????? \ >>> + >>> "method")???????????????????????????????????????????????????????? \ >>> + >>> \ >>> ??? product(bool, InsertMemBarAfterArraycopy, >>> true,?????????????????????????? \ >>> ??????????? "Insert memory barrier after arraycopy >>> call")???????????????????? \ >>> >>> \ >>> @@ -645,6 +652,10 @@ >>> ??????????? "% of major receiver type to all profiled >>> receivers")???????????? \ >>> ??????????? range(0, >>> 100)???????????????????????????????????????????????????? \ >>> >>> \ >>> +? product(intx, TypeProfileMinimumReceiverPercent, >>> 20,????????????????????? 
\ >>> +????????? "minimum % of receiver type to all profiled >>> receivers")?????????? \ >>> +????????? range(0, >>> 100)???????????????????????????????????????????????????? \ >>> + >>> \ >>> ??? diagnostic(bool, PrintIntrinsics, >>> false,????????????????????????????????? \ >>> ??????????? "prints attempted and successful inlining of >>> intrinsics")???????? \ >>> >>> \ >>> diff --git a/src/hotspot/share/opto/doCall.cpp >>> b/src/hotspot/share/opto/doCall.cpp >>> index 44ab387ac8..dba2b114c6 100644 >>> --- a/src/hotspot/share/opto/doCall.cpp >>> +++ b/src/hotspot/share/opto/doCall.cpp >>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* >>> callee, int vtable_index, bool >>> ??? // See how many times this site has been invoked. >>> ??? int site_count = profile.count(); >>> -? int receiver_count = -1; >>> -? if (call_does_dispatch && UseTypeProfile && >>> profile.has_receiver(0)) { >>> -??? // Receivers in the profile structure are ordered by call counts >>> -??? // so that the most called (major) receiver is profile.receiver(0). >>> -??? receiver_count = profile.receiver_count(0); >>> -? } >>> ??? CompileLog* log = this->log(); >>> ??? if (log != NULL) { >>> -??? int rid = (receiver_count >= 0)? >>> log->identify(profile.receiver(0)): -1; >>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>> log->identify(profile.receiver(1)):-1; >>> +??? int* rids; >>> +??? if (call_does_dispatch) { >>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>> +????? for (int i = 0; i < TypeProfileWidth && >>> profile.has_receiver(i); i++) { >>> +??????? rids[i] = log->identify(profile.receiver(i)); >>> +????? } >>> +??? } >>> ????? log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>> ????????????????????? log->identify(callee), site_count, prof_factor); >>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>> ????? if (allow_inline)???? log->print(" inline='1'"); >>> -??? if (receiver_count >= 0) { >>> -????? 
log->print(" receiver='%d' receiver_count='%d'", rid, >>> receiver_count); >>> -????? if (profile.has_receiver(1)) { >>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, >>> profile.receiver_count(1)); >>> +??? if (call_does_dispatch) { >>> +????? log->print(" virtual='1'"); >>> +????? for (int i = 0; i < TypeProfileWidth && >>> profile.has_receiver(i); i++) { >>> +??????? if (i == 0) { >>> +????????? log->print(" receiver='%d' receiver_count='%d' >>> receiver_prob='%f'", rids[i], profile.receiver_count(i), >>> profile.receiver_prob(i)); >>> +??????? } else { >>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' >>> receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>> +??????? } >>> ??????? } >>> ????? } >>> ????? if (callee->is_method_handle_intrinsic()) { >>> @@ -205,92 +207,112 @@ CallGenerator* >>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>> ??????? // The major receiver's count >= >>> TypeProfileMajorReceiverPercent of site_count. >>> ??????? bool have_major_receiver = profile.has_receiver(0) && >>> (100.*profile.receiver_prob(0) >= >>> (float)TypeProfileMajorReceiverPercent); >>> -????? ciMethod* receiver_method = NULL; >>> ??????? int morphism = profile.morphism(); >>> + >>> +????? int width = morphism > 0 ? morphism : 1; >>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, >>> width); >>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>> +????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, >>> width); >>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>> + >>> ??????? if (speculative_receiver_type != NULL) { >>> ????????? if (!too_many_traps_or_recompiles(caller, bci, >>> Deoptimization::Reason_speculate_class_check)) { >>> ??????????? // We have a speculative type, we should be able to resolve >>> ??????????? 
// the call. We do that before looking at the profiling at >>> -????????? // this invoke because it may lead to bimorphic inlining >>> which >>> +????????? // this invoke because it may lead to polymorphic inlining >>> which >>> ??????????? // a speculative type should help us avoid. >>> -????????? receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> speculative_receiver_type); >>> -????????? if (receiver_method == NULL) { >>> +????????? receiver_methods[0] = >>> callee->resolve_invoke(jvms->method()->holder(), >>> + >>> speculative_receiver_type); >>> +????????? if (receiver_methods[0] == NULL) { >>> ????????????? speculative_receiver_type = NULL; >>> ??????????? } else { >>> ????????????? morphism = 1; >>> ??????????? } >>> ????????? } else { >>> ??????????? // speculation failed before. Use profiling at the call >>> -????????? // (could allow bimorphic inlining for instance). >>> +????????? // (could allow polymorphic inlining for instance). >>> ??????????? speculative_receiver_type = NULL; >>> ????????? } >>> ??????? } >>> -????? if (receiver_method == NULL && >>> -????????? (have_major_receiver || morphism == 1 || >>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>> -??????? // receiver_method = profile.method(); >>> -??????? // Profiles do not suggest methods now.? Look it up in the >>> major receiver. >>> -??????? receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> profile.receiver(0)); >>> -????? } >>> -????? if (receiver_method != NULL) { >>> -??????? // The single majority receiver sufficiently outweighs the >>> minority. >>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, >>> prof_factor); >>> -??????? if (hit_cg != NULL) { >>> -????????? // Look up second receiver. >>> -????????? CallGenerator* next_hit_cg = NULL; >>> -????????? ciMethod* next_receiver_method = NULL; >>> -????????? 
if (morphism == 2 && UseBimorphicInlining) { >>> -??????????? next_receiver_method = >>> callee->resolve_invoke(jvms->method()->holder(), >>> - >>> profile.receiver(1)); >>> -??????????? if (next_receiver_method != NULL) { >>> -????????????? next_hit_cg = this->call_generator(next_receiver_method, >>> -????????????????????????????????? vtable_index, !call_does_dispatch, >>> jvms, >>> -????????????????????????????????? allow_inline, prof_factor); >>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>> -????????????????? // Skip if we can't inline second receiver's method >>> -????????????????? next_hit_cg = NULL; >>> -????????????? } >>> -??????????? } >>> -????????? } >>> -????????? CallGenerator* miss_cg; >>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>> -?????????????????????????????????????????????? ? >>> Deoptimization::Reason_bimorphic >>> -?????????????????????????????????????????????? : >>> Deoptimization::reason_class_check(speculative_receiver_type != NULL)); >>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != >>> NULL)) && >>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>> -???????????? ) { >>> -??????????? // Generate uncommon trap for class check failure path >>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>> +????? bool removed_cgs = false; >>> +????? // Look up receivers. >>> +????? for (int i = 0; i < morphism; i++) { >>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && >>> !UsePolymorphicInlining)) { >>> +????????? break; >>> +??????? } >>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>> +????????? 
receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                     profile.receiver(i));
>>> +        }
>>> +        if (receiver_methods[i] != NULL) {
>>> +          bool allow_inline;
>>> +          if (speculative_receiver_type != NULL) {
>>> +            allow_inline = true;
>>>           } else {
>>> -            // Generate virtual call for class check failure path
>>> -            // in case of polymorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +            allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>>>           }
>>> -          if (miss_cg != NULL) {
>>> -            if (next_hit_cg != NULL) {
>>> -              assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>> -              // We don't need to record dependency on a receiver here and below.
>>> -              // Whenever we inline, the dependency is added by Parse::Parse().
>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>> -            }
>>> -            if (miss_cg != NULL) {
>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>> -              if (cg != NULL)  return cg;
>>> +          hit_cgs[i] = this->call_generator(receiver_methods[i],
>>> +                                  vtable_index, !call_does_dispatch, jvms,
>>> +                                  allow_inline, prof_factor);
>>> +          if (hit_cgs[i] != NULL) {
>>> +            if (speculative_receiver_type != NULL) {
>>> +              // Do nothing if it's a speculative type
>>> +            } else if (bytecode == Bytecodes::_invokeinterface) {
>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>> +            } else if (!have_major_receiver) {
>>> +              // Do nothing if there is no major receiver
>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>> +              // Skip if we can't inline receiver's method
>>> +              hit_cgs[i] = NULL;
>>> +              removed_cgs = true;
>>>               }
>>>             }
>>>           }
>>>         }
>>> +
>>> +      // Generate the fallback path
>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>> +                                            ? Deoptimization::Reason_polymorphic
>>> +                                            : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>> +      if (log != NULL) {
>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>> +                      log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>> +      }
>>> +      CallGenerator* miss_cg;
>>> +      if (!disable_trap) {
>>> +        // Generate uncommon trap for class check failure path
>>> +        // in case of polymorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> +                    Deoptimization::Action_maybe_recompile);
>>> +      } else {
>>> +        // Generate virtual call for class check failure path
>>> +        // in case of megamorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +      }
>>> +
>>> +      // Generate the guards
>>> +      CallGenerator* cg = NULL;
>>> +      if (speculative_receiver_type != NULL) {
>>> +        if (hit_cgs[0] != NULL) {
>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>> +          // We don't need to record dependency on a receiver here and below.
>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>> +        }
>>> +      } else {
>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>> +          if (hit_cgs[i] != NULL) {
>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>> +          }
>>> +        }
>>> +        cg = miss_cg;
>>> +      }
>>> +      if (cg != NULL)  return cg;
>>>     }
>>>
>>>     // If there is only one implementor of this interface then we
>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>> index 11df15e004..2d14b52854 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>     "class_check",
>>>     "array_check",
>>>     "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>     "profile_predicate",
>>>     "unloaded",
>>>     "uninitialized",
>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>> index 1cfff5394e..c1eb998aba 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>       Reason_class_check,           // saw unexpected object class (@bci)
>>>       Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>       Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>   #if INCLUDE_JVMCI
>>>       Reason_unreached0             = Reason_null_assert,
>>>       Reason_type_checked_inlining  = Reason_intrinsic,
>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>   #endif
>>>       Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>> index 94b544824e..ee761626c4 100644
>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass>  KlassHashtableEntry;
>>>      declare_constant(Deoptimization::Reason_class_check)                     \
>>>      declare_constant(Deoptimization::Reason_array_check)                     \
>>>      declare_constant(Deoptimization::Reason_intrinsic)                       \
>>> -    declare_constant(Deoptimization::Reason_bimorphic)                       \
>>> +    declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>      declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>      declare_constant(Deoptimization::Reason_unloaded)                        \
>>>      declare_constant(Deoptimization::Reason_uninitialized)                   \
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>>> various TypeProfileWidth values. The results are:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The main thing I observe is that there isn't a linear (or even any apparent)
>>> correlation between the number of guards generated (guided by
>>> TypeProfileWidth) and the time taken.
>>>
>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>> to 1 and 8.
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Ludovic Henry
>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> I did a rerun of the following benchmark with various configurations:
>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>
>>> The results are as follows:
>>>
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The HotSpot logs (with generated assembly) are available at:
>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>
>>> The main takeaway from that experiment is that direct calls w/o inlining
>>> are faster than indirect calls for icalls but slower for vcalls, and that
>>> inlining is always faster than direct calls.
>>>
>>> (I fully understand this applies mainly to this microbenchmark, and we
>>> need to validate on larger benchmarks. I'm working on that next. However,
>>> it clearly shows gains on a pathological case.)
>>>
>>> Next, I want to figure out at how many guards the direct-call regresses
>>> compared to the indirect-call in the vcall case, and I want to run larger
>>> benchmarks. Any particular ones you would like to see run? I am planning
>>> on doing SPECjbb2015 first.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>> Sent: Monday, March 2, 2020 4:20 PM
>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> Sorry for the long delay in response, I was at multiple conferences over
>>> the past few weeks. I'm back at the office now and fully focused on
>>> making progress on this.
>>>
>>>>> Possible avenues of improvement I can see are:
>>>>>     - Gather all the types in an unbounded list so we can know which
>>>>> ones are the most frequent. It is unlikely to help with Java as, in
>>>>> the general case, there are only a few types present at call-sites.
>>>>> It could, however, be particularly helpful for languages that tend to
>>>>> have many types at call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>> numbers.
>>>
>>> I agree that it isn't very practical. It can be useful in the case where
>>> there are many types at a call-site, and the first ones end up not being
>>> frequent enough to mandate a guard. This is clearly an edge-case, and I
>>> don't think we should optimize for it.
>>>
>>>>> In what we have today, some of the worst-case scenarios are the
>>>>> following:
>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>>> first and second types are types A and B, and the other type(s) is(are)
>>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>>> used in the initialization path (i.e. only a few times) and the other
>>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>>> never considered for inlining - because it was never recorded during
>>>>> profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically free some space by removing elements with lower
>>>> frequencies) and give new types a chance to be profiled?
>>>
>>> Doing that reliably relies on the assumption that we know what the shape
>>> of the workload is going to be in future iterations. Otherwise, how could
>>> you guarantee that a type that's not currently frequent will not be in
>>> the future, and that the information that you remove now will not be
>>> important later. This is an assumption that, IMO, is worse than missing
>>> types which are hot later in the execution, for two reasons: 1. it's no
>>> better, and 2. it's a lot less intuitive and harder to debug/understand
>>> than a straightforward "overflow".
>>>
>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>>> have the first type A with 49% probability, the second type B with 49%
>>>>> probability, and the other types with 2% probability. Even though A and
>>>>> B are the two hottest paths, it does not generate guards because
>>>>> neither is a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>
>>> It will not necessarily cause twice as much inlining because of
>>> late-inlining. Like you point out later, it will generate a direct call
>>> in case there isn't room for more inlinable code.
>>>
>>>> Also, does it make sense to increase the morphism factor even if
>>>> inlining doesn't happen?
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else if (recv.klass == C2) { // >>0%
>>>>       m2(); // direct call
>>>>    } else { // >0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> vs
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else { // >>0%
>>>>       m(); // virtual call
>>>>    }
>>>
>>> There is the advantage that modern CPUs are better at predicting
>>> instruction-branches than data-branches. These guards will then allow the
>>> CPU to make better decisions, allowing for better superscalar execution,
>>> memory prefetching, etc.
>>>
>>> This, IMO, makes sense for warm calls, especially since the cost is a
>>> guard + a call, which is much lower than an inlined method, but brings
>>> benefits over an indirect call.
>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
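
For readers who want to see the two code shapes quoted above side by side as runnable code, here is a Java-level sketch. It is only an analogy: the class names A1/A2/A3 and the method foo are invented, and C2's real guards compare the receiver's klass word for exact equality rather than using instanceof as below.

```java
// Illustrative sketch of the two dispatch shapes discussed in the thread.
// A1/A2/A3 and foo() are hypothetical stand-ins; the real guards are
// emitted by C2 against the receiver's klass word, not via instanceof.
interface A { int foo(); }
class A1 implements A { public int foo() { return 1; } }
class A2 implements A { public int foo() { return 2; } }
class A3 implements A { public int foo() { return 3; } }

public class GuardShapes {
    // Shape 1: two guards - the hottest receiver is inlined, the second
    // becomes a direct call, everything else falls back to a virtual call.
    static int callBimorphic(A recv) {
        if (recv instanceof A1) {
            return 1;                    // body of A1.foo() inlined
        } else if (recv instanceof A2) {
            return ((A2) recv).foo();    // devirtualized direct call
        } else {
            return recv.foo();           // virtual call fallback
        }
    }

    // Shape 2: single guard for the major receiver, virtual call otherwise.
    static int callMonomorphic(A recv) {
        if (recv instanceof A1) {
            return 1;                    // body of A1.foo() inlined
        } else {
            return recv.foo();           // virtual call fallback
        }
    }

    public static void main(String[] args) {
        A[] objs = { new A1(), new A2(), new A3() };
        for (A o : objs) {
            // Both shapes must preserve the semantics of a plain virtual call.
            if (callBimorphic(o) != o.foo() || callMonomorphic(o) != o.foo())
                throw new AssertionError("guarded dispatch changed semantics");
        }
    }
}
```

The two shapes return identical results for every receiver; they differ only in how many receivers are handled by a guard before reaching the fallback, which is exactly the trade-off the morphism-factor discussion is about.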
>>>
>>> TypeProfileMajorReceiverPercent is only used today when you have a
>>> megamorphic call-site (aka more types than TypeProfileWidth) but still
>>> one type receiving more than N% of the calls. By reducing the value, you
>>> would not increase the number of guards, but lower the threshold at
>>> which you generate the 1st guard in a megamorphic case.
>>>
>>>>>>         - for the N-morphic case, what's the negative effect
>>>>>> (quantitative) of the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a
>>>>> limited and stable set of types in the early stages of the Tiered
>>>>> Compilation pipeline (making us generate N-morphic guards), and we
>>>>> suddenly observe a new type. AFAIU, this is precisely what deopt is
>>>>> for.
>>>>
>>>> I should have added "... compared to the N-polymorphic case". My
>>>> intuition is the higher the morphism factor, the fewer the benefits of
>>>> deopt (compared to a call) are. It would be very good to validate it
>>>> with some benchmarks (both micro- and larger ones).
>>>
>>> I agree that what you are describing makes sense as well. To reduce the
>>> cost of deopt here, having a TypeProfileMinimumReceiverPercent helps.
>>> That is because if any type is seen less than this specific frequency,
>>> then it won't generate a guard, leading to an indirect call in the
>>> fallback case.
>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>>> it as soon as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>> It turns out the guard is only generated once, meaning that if we ever
>>> hit it then we generate an indirect call.
>>>
>>> We also only generate the trap iff all the guards are hot (inlined) or
>>> warm (direct call), so any of the following cases triggers the creation
>>> of an indirect call over a trap:
>>>   - we hit the trap once before
>>>   - one or more guards are cold (aka not inlinable even with
>>> late-inlining)
>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is the callee should benefit much
>>>> more from inlining than the caller it is inlined into (the caller sees
>>>> multiple callee candidates and has to merge the results while each
>>>> callee observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what
>>>> kind of code patterns should we look for to guide inlining decisions?
>>>
>>> Any pattern that would benefit from other optimizations (escape
>>> analysis, dead code elimination, constant propagation, etc.) is good,
>>> but short of shadowing statically what all these optimizations do, I
>>> can't see an easy way to do it.
>>>
>>> That is where late-inlining, or more advanced dynamic heuristics like
>>> the one you can find in Graal EE, is worthwhile.
>>>
>>>> Regarding experiments to try first, here are some ideas I find
>>>> promising:
>>>>
>>>>      * measure the cost of additional profiling
>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>
>>> I am running the following JMH microbenchmark:
>>>
>>>      public final static int N = 100_000_000;
>>>
>>>      @State(Scope.Benchmark)
>>>      public static class TypeProfileWidthOverheadBenchmarkState {
>>>          public A[] objs = new A[N];
>>>
>>>          @Setup
>>>          public void setup() throws Exception {
>>>              for (int i = 0; i < objs.length; ++i) {
>>>                  switch (i % 8) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  case 2: objs[i] = new A3(); break;
>>>                  case 3: objs[i] = new A4(); break;
>>>                  case 4: objs[i] = new A5(); break;
>>>                  case 5: objs[i] = new A6(); break;
>>>                  case 6: objs[i] = new A7(); break;
>>>                  case 7: objs[i] = new A8(); break;
>>>                  }
>>>              }
>>>          }
>>>      }
>>>
>>>      @Benchmark @OperationsPerInvocation(N)
>>>      public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>          A[] objs = state.objs;
>>>          for (int i = 0; i < objs.length; ++i) {
>>>              objs[i].foo(i, blackhole);
>>>          }
>>>      }
>>>
>>> And I am running with the following JVM parameters:
>>>
>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000
>>> -XX:Tier3CompileThreshold=200000000
>>> -XX:Tier3InvocationThreshold=200000000
>>> -XX:Tier3BackEdgeThreshold=200000000
>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000
>>> -XX:Tier3CompileThreshold=200000000
>>> -XX:Tier3InvocationThreshold=200000000
>>> -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> I observe no statistically significant difference in ops/s between
>>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no
>>> significant difference in the resulting analysis using Intel VTune.
>>>
>>> I verified that the benchmark never goes beyond Tier-0 with
>>> -XX:+PrintCompilation.
>>>
>>>>      * N-morphic vs N-polymorphic (N>=2):
>>>>        - how much does deopt help compared to a virtual call on the
>>>> fallback path?
>>>
>>> I have done the following microbenchmark, but I am not sure that it's
>>> going to fully answer the question you are raising here.
>>>
>>>      public final static int N = 100_000_000;
>>>
>>>      @State(Scope.Benchmark)
>>>      public static class PolymorphicDeoptBenchmarkState {
>>>          public A[] objs = new A[N];
>>>
>>>          @Setup
>>>          public void setup() throws Exception {
>>>              int cutoff1 = (int)(objs.length * .90);
>>>              int cutoff2 = (int)(objs.length * .95);
>>>              for (int i = 0; i < cutoff1; ++i) {
>>>                  switch (i % 2) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  }
>>>              }
>>>              for (int i = cutoff1; i < cutoff2; ++i) {
>>>                  switch (i % 4) {
>>>                  case 0: objs[i] = new A1(); break;
>>>                  case 1: objs[i] = new A2(); break;
>>>                  case 2:
>>>                  case 3: objs[i] = new A3(); break;
>>>                  }
>>>              }
>>>              for (int i = cutoff2; i < objs.length; ++i) {
>>>                  switch (i % 4) {
>>>                  case 0:
>>>                  case 1: objs[i] = new A3(); break;
>>>                  case 2:
>>>                  case 3: objs[i] = new A4(); break;
>>>                  }
>>>              }
>>>          }
>>>      }
>>>
>>>      @Benchmark @OperationsPerInvocation(N)
>>>      public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>          A[] objs = state.objs;
>>>          for (int i = 0; i < objs.length; ++i) {
>>>              objs[i].foo(i, blackhole);
>>>          }
>>>      }
>>>
>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>> -XX:-PolyGuardDisableTrap to force enable/disable the trap in the
>>> fallback.
>>>
>>> For that kind of case, a visitor pattern is what I expect to
>>> profit/suffer the most from a deopt or virtual call in the fallback
>>> path. Would you know of such a benchmark that heavily relies on this
>>> pattern, and that I could readily reuse?
>>>
>>>>      * inlining vs devirtualization
>>>>        - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>        - measure separately the effects of devirtualization and
>>>> inlining
>>>
>>> For that one, I reused the first microbenchmark I mentioned above, and
>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>> direct call or inline.
>>>
>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining
>>> (aka inlined) vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining
>>> (aka direct call).
>>>
>>> This benchmark hasn't been run in the best possible conditions (on my
>>> dev machine, in WSL), but it gives a strong indication that even a
>>> direct call has a non-negligible impact, and that inlining leads to
>>> better results (again, in this microbenchmark).
>>>
>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find
>>> anything that would be readily available from the Interpreter. Would you
>>> have any pointers to a pre-existing feature that required this specific
>>> kind of plumbing? I would otherwise find myself in need of making
>>> CompilerDirectives available from the Interpreter, and that is something
>>> outside of my current expertise (always happy to learn, but I will need
>>> some pointers!).
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov
>>> Sent: Thursday, February 20, 2020 9:00 AM
>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> [...]
>>>
>>>> Thanks for this explanation, it makes it a lot clearer what the cases
>>>> and your concerns are. To rephrase in my own words, what you are
>>>> interested in is not this change in particular, but more the
>>>> possibility that this change provides and how to take it to the next
>>>> step, correct?
>>>
>>> Yes, it's a good summary.
>>>
>>> [...]
>>>
>>>>>         - affects profiling strategy: majority of receivers vs
>>>>> complete list of receiver types observed;
>>>> Today, we only use the N first receivers when the number of types does
>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>> Possible avenues of improvement I can see are:
>>>>     - Gather all the types in an unbounded list so we can know which
>>>> ones are the most frequent. It is unlikely to help with Java as, in the
>>>> general case, there are only a few types present at call-sites. It
>>>> could, however, be particularly helpful for languages that tend to have
>>>> many types at call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>> numbers.
>>>
>>>>    - Use the existing types to generate guards for those types we know
>>>> are common enough. Then use the types which are hot or warm, even in
>>>> case of a megamorphic call-site. It would be a simple iteration of what
>>>> we have nowadays.
>>>
>>>> In what we have today, some of the worst-case scenarios are the
>>>> following:
>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>> first and second types are types A and B, and the other type(s) is(are)
>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>> used in the initialization path (i.e. only a few times) and the other
>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>> never considered for inlining - because it was never recorded during
>>>> profiling.
>>>
>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>> periodically free some space by removing elements with lower
>>> frequencies) and give new types a chance to be profiled?
>>>
>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>> have the first type A with 49% probability, the second type B with 49%
>>>> probability, and the other types with 2% probability. Even though A and
>>>> B are the two hottest paths, it does not generate guards because
>>>> neither is a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>>
>>> Also, does it make sense to increase the morphism factor even if
>>> inlining doesn't happen?
>>>
>>>    if (recv.klass == C1) {  // >>0%
>>>       ... inlined ...
>>>    } else if (recv.klass == C2) { // >>0%
>>>       m2(); // direct call
>>>    } else { // >0%
>>>       m(); // virtual call
>>>    }
>>>
>>> vs
>>>
>>>    if (recv.klass == C1) {  // >>0%
>>>       ... inlined ...
>>>    } else { // >>0%
>>>       m(); // virtual call
>>>    }
>>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>>
>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>> code shapes comparison: how much does devirtualization help?
>>>
>>> Otherwise, enabling the 2-polymorphic shape becomes feasible only if
>>> both cases are inlined.
>>>
>>>>>         - for the N-morphic case, what's the negative effect
>>>>> (quantitative) of the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a
>>>> limited and stable set of types in the early stages of the Tiered
>>>> Compilation pipeline (making us generate N-morphic guards), and we
>>>> suddenly observe a new type. AFAIU, this is precisely what deopt is
>>>> for.
>>>
>>> I should have added "... compared to the N-polymorphic case". My
>>> intuition is the higher the morphism factor, the fewer the benefits of
>>> deopt (compared to a call) are. It would be very good to validate it
>>> with some benchmarks (both micro- and larger ones).
>>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>> it as soon as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>>>>      * invokevirtual vs invokeinterface call sites
>>>>>        - different cost models;
>>>>>        - interfaces are harder to optimize, but opportunities for
>>>>> strength-reduction from interface to virtual calls exist;
>>>> From the profiling information and the inlining mechanism point of
>>>> view, whether it is an invokevirtual or an invokeinterface doesn't
>>>> change anything.
>>>>
>>>> Are you saying that we have more to gain from generating a guard for
>>>> invokeinterface over invokevirtual because the fall-back of the
>>>> invokeinterface is much more expensive?
>>>
>>> Yes, that's the question: if we see an improvement, how much does
>>> devirtualization contribute to that?
>>>
>>> (If we add a type-guarded direct call, but there's no inlining
>>> happening, the inline cache effectively strength-reduces a virtual call
>>> to a direct call.)
>>>
>>> Considering the current implementation of virtual and interface calls
>>> (vtables vs itables), the cost model is very different.
>>>
>>> For vtable calls, it doesn't look too appealing to introduce large
>>> inline caches for individual receiver types since a call through a
>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>> address).
>>>
>>> For itable calls it can be a big win in some situations: itable lookup
>>> iterates over the Klass::_secondary_supers array and it can become quite
>>> costly. For example, some Scala workloads experience significant
>>> overheads from megamorphic calls.
>>>
>>> If we see an improvement on some benchmark, it would be very useful to
>>> be able to determine (quantitatively) how much inlining and
>>> devirtualization each contribute.
>>>
>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>> implementation [4] which brings interface calls close to virtual calls.
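
To make the vtable/itable asymmetry described above concrete, here is a toy Java model; every name in it is invented for illustration and none of it matches HotSpot's actual metadata layout. The point is only that a virtual dispatch indexes straight into a per-class table, while an interface dispatch first has to search for the right interface table, akin to the Klass::_secondary_supers walk mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Toy model of vtable vs itable lookup cost. All names are invented
// stand-ins and do not reflect HotSpot's real metadata structures.
public class DispatchModel {
    static class InterfaceTable {
        final Class<?> iface;
        final IntUnaryOperator[] methods;
        InterfaceTable(Class<?> iface, IntUnaryOperator[] methods) {
            this.iface = iface;
            this.methods = methods;
        }
    }

    static class KlassMeta {
        IntUnaryOperator[] vtable = new IntUnaryOperator[0];
        List<InterfaceTable> itables = new ArrayList<>();

        // "Virtual" dispatch: a constant-time index into the table,
        // modeling the short chain of dependent loads of a vtable stub.
        IntUnaryOperator vcall(int index) {
            return vtable[index];
        }

        // "Interface" dispatch: a linear scan to locate the interface
        // before indexing, modeling the itable-stub search.
        IntUnaryOperator icall(Class<?> iface, int index) {
            for (InterfaceTable t : itables) {
                if (t.iface == iface) return t.methods[index];
            }
            throw new IncompatibleClassChangeError("class does not implement " + iface);
        }
    }

    public static void main(String[] args) {
        KlassMeta k = new KlassMeta();
        k.vtable = new IntUnaryOperator[] { x -> x + 1 };
        k.itables.add(new InterfaceTable(Comparable.class,
                new IntUnaryOperator[] { x -> x * 2 }));
        // Same call semantics, different lookup cost.
        if (k.vcall(0).applyAsInt(41) != 42) throw new AssertionError();
        if (k.icall(Comparable.class, 0).applyAsInt(21) != 42) throw new AssertionError();
    }
}
```

In this model the itable scan cost grows with the number of interface tables on the class, which is why devirtualizing megamorphic interface calls can pay off even without inlining, as the thread discusses.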
>>> So, if it turns out that devirtualization (and not inlining) of >>> interface calls is what contributes the most, then speeding up >>> megamorphic interface calls becomes a more attractive alternative. >>> >>>>> ???? * inlining heuristics >>>>> ??????? - devirtualization vs inlining >>>>> ????????? - how much benefit from expanding a call site >>>>> (devirtualize more >>>>> cases) without inlining? should differ for virtual & interface cases >>>> I'm also writing a JMH benchmark for this case, and I'll share it as >>>> soon >>>> as I have it reliably reproducing the issue you describe. >>> >>> Also, I think it's important to have a knob to control it (inline vs >>> devirtualize). It'll enable experiments with larger benchmarks. >>> >>>>> ??????? - diminishing returns with increase in number of cases >>>>> ??????? - expanding a single call site leads to more code, but >>>>> frequencies >>>>> stay the same => colder code >>>>> ??????? - based on profiling info (types + frequencies), dynamically >>>>> choose morphism factor on per-call site basis? >>>> That is where I propose to have a lower receiver probability at >>>> which we'll >>>> stop adding more guards. I am experimenting with a global flag with >>>> a default >>>> value of 10%. >>>>> ??????? - what optimization opportunities to look for? it looks >>>>> like in >>>>> general callees should benefit more than the caller (due to merges >>>>> after >>>>> the call site) >>>> Could you please expand your concern or provide an example. >>> >>> It was more about opportunities for future explorations. I don't think >>> we have to act on it right away. >>> >>> As with "deopt vs call", my guess is callee should benefit much more >>> from inlining than the caller it is inlined into (caller sees multiple >>> callee candidates and has to merge the results while each callee >>> observes the full context and can benefit from it). 
>>> >>> If we can run some sort of static analysis on callee bytecode, what kind >>> of code patterns should we look for to guide inlining decisions? >>> >>> >>> ? >> What's your take on it? Any other ideas? >>> ? > >>> ? > We don't know what we don't know. We need first to improve the >>> logging and >>> ? > debugging output of uncommon traps for polymorphic call-sites. >>> Then, we >>> ? > need to gather data about the different cases you talked about. >>> ? > >>> ? > We also need to have some microbenchmarks to validate some of the >>> questions >>> ? > you are raising, and verify what level of gains we can expect >>> from this >>> ? > optimization. Further validation will be needed on larger >>> benchmarks and >>> ? > real-world applications as well, and that's where, I think, we need >>> to develop >>> ? > logging and debugging for this feature. >>> >>> Yes, sounds good. >>> >>> Regaring experiments to try first, here are some ideas I find promising: >>> >>> ???? * measure the cost of additional profiling >>> ???????? -XX:TypeProfileWidth=N without changing compilers >>> >>> ???? * N-morphic vs N-polymorphic (N>=2): >>> ?????? - how much deopt helps compared to a virtual call on fallback >>> path? >>> >>> ???? * inlining vs devirtualization >>> ?????? - a knob to control inlining in N-morphic/N-polymorphic cases >>> ?????? 
- measure separately the effects of devirtualization and inlining
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>
>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov
>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>> To: Ludovic Henry ; John Rose
>>>> ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> I fully agree that it's premature to discuss how default behavior
should
>>>> be changed since much more data is needed to be able to proceed with the
>>>> decision. But considering the ultimate goal is to actually improve
>>>> relevant heuristics (and effectively change the default behavior), it's
>>>> the right time to discuss what kind of experiments are needed to gather
>>>> enough data for further analysis.
>>>>
>>>> Though different shapes do look very similar at first, the shape of the
>>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>>> cases are distinct: uncommon traps are effectively exits and can
>>>> significantly simplify the CFG, while calls can return and have to be
>>>> merged back.
>>>>
>>>> The polymorphic shape is stable (no deopts/recompiles involved), but
>>>> doesn't simplify the CFG around the call site.
>>>>
>>>> The monomorphic shape gives more optimization opportunities, but deopts are
>>>> highly undesirable due to associated costs.
>>>>
>>>> For example:
>>>>
>>>>     if (recv.klass != C) { deopt(); }
>>>>     C.m(recv);
>>>>
>>>>     // recv.klass == C - exact type
>>>>     // return value == C.m(recv)
>>>>
>>>> vs
>>>>
>>>>     if (recv.klass == C) {
>>>>       C.m(recv);
>>>>     } else {
>>>>       I.m(recv);
>>>>     }
>>>>
>>>>     // recv.klass <: I - subtype
>>>>     // return value is a phi merging C.m() & I.m() where I.m() is
>>>> completely opaque.
>>>>
>>>> The monomorphic shape can degenerate into the polymorphic one (too many
>>>> recompiles), but that's a forced move to stabilize the behavior and avoid a
>>>> vicious recompilation cycle (which is *very* expensive). (Another alternative
>>>> is to leave the deopt as is - set the deopt action to "none" - but that's
>>>> usually a much worse decision.)
>>>>
>>>> And that's the reason why the monomorphic shape requires a unique receiver
>>>> type in the profile while the polymorphic shape works with the major receiver
>>>> type and probabilities.
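[Editor's note] The two compiled shapes above can be mimicked in plain Java to make the trade-off concrete. This is a hedged sketch, not HotSpot output: `deopt()` is modeled as a thrown exception, and `C`, `D`, `I` are stand-ins for the profiled receiver classes in the example.

```java
// Hand-written analogue of the two compiled shapes discussed above.
interface I { int m(); }
class C implements I { public int m() { return 1; } }
class D implements I { public int m() { return 2; } }

class GuardShapes {
    // Monomorphic shape: guard + deopt. Past the guard the receiver
    // type is exact, so there is nothing to merge after the call.
    static int monomorphic(I recv) {
        if (!(recv instanceof C)) {
            throw new IllegalStateException("deopt"); // invalidate + reinterpret
        }
        return ((C) recv).m(); // recv.klass == C - exact type
    }

    // Polymorphic shape: guard + virtual fallback. Stable (no deopt),
    // but the result is a phi merging the inlined path with an opaque call.
    static int polymorphic(I recv) {
        if (recv instanceof C) {
            return ((C) recv).m(); // exact type on this path: inlinable
        } else {
            return recv.m();       // opaque virtual call, merged back
        }
    }
}
```

The monomorphic variant trades stability for precision: any receiver other than `C` pays the full cost of the "deopt", while the polymorphic variant always completes but forces a merge of both results.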
>>>> >>>> >>>> Considering further steps, IMO for experimental purposes a single knob >>>> won't cut it: there are multiple degrees of freedom which may play >>>> important role in building accurate performance model. I'm not yet >>>> convinced it's all about inlining and narrowing the scope of discussion >>>> specifically to type profile width doesn't help. >>>> >>>> I'd like to see more knobs introduced before we start conducting >>>> extensive experiments. So, let's discuss what other information we can >>>> benefit from. >>>> >>>> I mentioned some possible options in the previous email. I find the >>>> following aspects important for future discussion: >>>> >>>> ???? * shape of fallback path >>>> ??????? - what to generalize: 2- to N-morphic vs 1- to N-polymorphic; >>>> ??????? - affects profiling strategy: majority of receivers vs complete >>>> list of receiver types observed; >>>> ??????? - for N-morphic case what's the negative effect >>>> (quantitative) of >>>> the deopt? >>>> >>>> ???? * invokevirtual vs invokeinterface call sites >>>> ??????? - different cost models; >>>> ??????? - interfaces are harder to optimize, but opportunities for >>>> strength-reduction from interface to virtual calls exist; >>>> >>>> ???? * inlining heuristics >>>> ??????? - devirtualization vs inlining >>>> ????????? - how much benefit from expanding a call site >>>> (devirtualize more >>>> cases) without inlining? should differ for virtual & interface cases >>>> ??????? - diminishing returns with increase in number of cases >>>> ??????? - expanding a single call site leads to more code, but >>>> frequencies >>>> stay the same => colder code >>>> ??????? - based on profiling info (types + frequencies), dynamically >>>> choose morphism factor on per-call site basis? >>>> ??????? - what optimization opportunities to look for? it looks like in >>>> general callees should benefit more than the caller (due to merges >>>> after >>>> the call site) >>>> >>>> What's your take on it? 
Any other ideas? >>>> >>>> Best regards, >>>> Vladimir Ivanov >>>> >>>> On 11.02.2020 02:42, Ludovic Henry wrote: >>>>> Hello, >>>>> Thank you very much, John and Vladimir, for your feedback. >>>>> First, I want to stress out that this patch does not change the >>>>> default. It is still bi-morphic guarded inlining by default. This >>>>> patch, however, provides you the ability to configure the JVM to go >>>>> for N-morphic guarded inlining, with N being controlled by the >>>>> -XX:TypeProfileWidth configuration knob. I understand there are >>>>> shortcomings with the specifics of this approach so I'll work on >>>>> fixing those. However, I would want this discussion to focus on >>>>> this *configurable* feature and not on changing the default. The >>>>> latter, I think, should be discussed as part of another, more >>>>> extended running discussion, since, as you pointed out, it has far >>>>> more reaching consequences that are merely improving a >>>>> micro-benchmark. >>>>> >>>>> Now to answer some of your specific questions. >>>>> >>>>>> >>>>>> I haven't looked through the patch in details, but here are some >>>>>> thoughts. >>>>>> >>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It >>>>>> seems you try to generalize (b) which becomes: >>>>>> >>>>>> ????? if (recv.klass == K1) { >>>>> m1(...); // either inline or a direct call >>>>>> ????? } else if (recv.klass == K2) { >>>>> m2(...); // either inline or a direct call >>>>>> ????? ... >>>>>> ????? } else if (recv.klass == Kn) { >>>>> mn(...); // either inline or a direct call >>>>>> ????? } else { >>>>> deopt(); // invalidate + reinterpret >>>>>> ????? } >>>>> >>>>> The general shape that exist currently in tip is: >>>>> >>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>> if (recv.klass == K1) { >>>>> ???? 
m1(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>> UseBimorphicInlining && !is_cold >>>>> else if (recv.klass == K2) { >>>>> ???? m2(.); // either inline or a direct call >>>>> } >>>>> else { >>>>> ???? // if (!too_many_traps_or_deopt()) >>>>> ???? deopt(); // invalidate + reinterpret >>>>> ???? // else >>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>> } >>>>> There is no particular distinction between Bimorphic, Polymorphic, >>>>> and Megamorphic. The latter relates more to the fallback rather >>>>> than the guards. What this change brings is more guards for >>>>> N-morphic call-sites with N > 2. But it doesn't change why and how >>>>> these guards are generated (or at least, that is not the intention). >>>>> The general shape that this change proposes is: >>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>> if (recv.klass == K1) { >>>>> ???? m1(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>> (UseBimorphicInlining || UsePolymorphicInling) >>>>> && !is_cold >>>>> else if (recv.klass == K2) { >>>>> ???? m2(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && >>>>> UsePolymorphicInling && !is_cold >>>>> else if (recv.klass == K3) { >>>>> ???? m3(.); // either inline or a direct call >>>>> } >>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && >>>>> UsePolymorphicInling && !is_cold >>>>> else if (recv.klass == K4) { >>>>> ???? m4(.); // either inline or a direct call >>>>> } >>>>> else { >>>>> ???? // if (!too_many_traps_or_deopt()) >>>>> ???? deopt(); // invalidate + reinterpret >>>>> ???? // else >>>>> ???? 
invokeinterface A.foo(.); // virtual call with Inline Cache >>>>> } >>>>> You can observe that the condition to create the guards is no >>>>> different; only the total number increases based on >>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>> Question #1: what if you generalize polymorphic shape instead and >>>>>> allow multiple major receivers? Deoptimizing (and then >>>>>> recompiling) look less beneficial the higher morphism is >>>>>> (especially considering the inlining on all paths becomes less >>>>>> likely as well). So, having a virtual call (which becomes less >>>>>> likely due to lower frequency) on the fallback path may be a >>>>>> better option. >>>>> I agree with this statement in the general sense. However, in >>>>> practice, it depends on the specifics of each application. That is >>>>> why the degree of polymorphism needs to rely on a configuration >>>>> knob, and not pre-determined on a set of benchmarks. I agree with >>>>> the proposal to have this knob as a per-method knob, instead of a >>>>> global knob. >>>>> As for the impact of a higher morphism, I expect deoptimizations to >>>>> happen less often as more guards are generated, leading to a lower >>>>> probability of reaching the fallback path, leading to less uncommon >>>>> trap/deoptimizations. Moreover, the fallback is already going to be >>>>> a virtual call in case we hit the uncommon trap too often (using >>>>> too_many_traps_or_recompiles). >>>>>> Question #2: it would be very interesting to understand what >>>>>> exactly contributes the most to performance improvements? Is it >>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual >>>>>> call)? How much come from optimizing interface calls (itable vs >>>>>> vtable stubs)? >>>>> Devirtualization in itself (direct vs. indirect call) is not the >>>>> *primary* source of the gain. 
The gain comes from the additional >>>>> optimizations that are applied by C2 when increasing the scope/size >>>>> of the code compiled via inlining. >>>>> In the case of warm code that's not inlined as part of incremental >>>>> inlining, the call is a direct call rather than an indirect call. I >>>>> haven't measured it, but I expect performance to be positively >>>>> impacted because of the better ability of modern CPUs to correctly >>>>> predict instruction branches (a direct call) rather than data >>>>> branches (an indirect call). >>>>>> Deciding how to spend inlining budget on multiple targets with >>>>>> moderate frequency can be hard, so it makes sense to consider >>>>>> expanding 3/4/mega-morphic call sites in post-parse phase (during >>>>>> incremental inlining). >>>>> Incremental inlining is already integrated with the existing >>>>> solution. In the case of a hot or warm call, in case of failure to >>>>> inline, it generates a direct call. You still have the guards, >>>>> reducing the cost of an indirect call, but without the cost of the >>>>> inlined code. >>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>> I'll come back to you with some results. >>>>>> Getting answers to those (and similar) questions should give us >>>>>> much more insights what is actually happening in practice. >>>>>> >>>>>> Speaking of the first deliverables, it would be good to introduce >>>>>> a new experimental mode to be able to easily conduct such >>>>>> experiments with product binaries and I'd like to see the patch >>>>>> evolving in that direction. It'll enable us to gather important >>>>>> data to guide our decisions about how to enhance the heuristics in >>>>>> the product. >>>>> This patch does not change the default shape of the generated code >>>>> with bimorphic guarded inlining, because the default value of >>>>> TypeProfileWidth is 2. 
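[Editor's note] The role TypeProfileWidth plays here, and the cost question raised in Question #3, are easier to reason about with a toy model of the per-call-site receiver profile. The sketch below is a deliberate simplification, not HotSpot's actual `ReceiverTypeData` layout: each call site records at most `width` (receiver class, count) rows, and receivers seen after the table fills are only counted in aggregate, which is exactly the type information the compiler loses when the width is too small, and the extra footprint that grows as it is raised.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy per-call-site receiver-type profile with a configurable width.
class TypeProfile {
    final int width;                          // analogue of TypeProfileWidth
    final Map<Class<?>, Long> rows = new LinkedHashMap<>();
    long overflow;                            // receivers seen after the table filled

    TypeProfile(int width) { this.width = width; }

    void record(Object recv) {
        Class<?> k = recv.getClass();
        Long count = rows.get(k);
        if (count != null) {
            rows.put(k, count + 1);           // known receiver: bump its row
        } else if (rows.size() < width) {
            rows.put(k, 1L);                  // free row: start tracking this type
        } else {
            overflow++;                       // table full: type identity is lost
        }
    }

    int morphism() { return rows.size(); }
}
```

With `width = 2`, a site that actually sees three receiver types reports a morphism of 2 and a non-zero overflow, so a compiler reading this profile could not emit a guard for the third type even if it is frequent.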
If your concern is that TypeProfileWidth is >>>>> used for other purposes and that I should add a dedicated knob to >>>>> control the maximum morphism of these guards, then I agree. I am >>>>> using TypeProfileWidth because it's the available and more >>>>> straightforward knob today. >>>>> Overall, this change does not propose to go from bimorphic to >>>>> N-morphic by default (with N between 0 and 8). This change focuses >>>>> on using an existing knob (TypeProfileWidth) to open the >>>>> possibility for N-morphic guarded inlining. I would want the >>>>> discussion to change the default to be part of a separate RFR, to >>>>> separate the feature change discussion from the default change >>>>> discussion. >>>>>> Such optimizations are usually not unqualified wins because of >>>>>> highly "non-linear" or "non-local" effects, where a local change >>>>>> in one direction might couple to nearby change in a different >>>>>> direction, with a net change that's "wrong", due to side effects >>>>>> rolling out from the "good" change. (I'm talking about side >>>>>> effects in our IR graph shaping heuristics, not memory side effects.) >>>>>> >>>>>> One out of many such "wrong" changes is a local optimization which >>>>>> expands code on a medium-hot path, which has the side effect of >>>>>> making a containing block of code larger than convenient.? Three >>>>>> ways of being "larger than convenient" are a. the object code of >>>>>> some containing loop doesn't fit as well in the instruction >>>>>> memory, b. the total IR size tips over some budgetary limit which >>>>>> causes further IR creation to be throttled (or the whole graph to >>>>>> be thrown away!), or c. some loop gains additional branch >>>>>> structure that impedes the optimization of the loop, where an out >>>>>> of line call would not. 
>>>>>> >>>>>> My overall point here is that an eager expansion of IR that is >>>>>> locally "better" (we might even say "optimal") with respect to the >>>>>> specific path under consideration hurts the optimization of nearby >>>>>> paths which are more important. >>>>> I generally agree with this statement and explanation. Again, it is >>>>> not the intention of this patch to change the default number of >>>>> guards for polymorphic call-sites, but it is to give users the >>>>> ability to optimize the code generation of their JVM to their >>>>> application. >>>>> Since I am relying on the existing inlining infrastructure, late >>>>> inlining and hot/warm/cold call generators allows to have a >>>>> "best-of-both-world" approach: it inlines code in the hot guards, >>>>> it direct calls or inline (if inlining thresholds permits) the >>>>> method in the warm guards, and it doesn't even generate the guard >>>>> in the cold guards. The question here is, then how do you define >>>>> hot, warm, and cold. As discussed above, I want to explore using a >>>>> low-threshold even to try to generate a guard (at least 10% of >>>>> calls are to this specific receiver). >>>>> On the overhead of adding more guards, I see this change as >>>>> beneficial because it removes an arbitrary limit on what code can >>>>> be inlined. For example, if you have a call-site with 3 types, each >>>>> with a hit probability of 30%, then with a maximum limit of 2 types >>>>> (with bimorphic guarded inlining), only the first 2 types are >>>>> guarded and inlined. That is despite an apparent gain in guarding >>>>> and inlining against the 3 types. >>>>> I agree we want to have guardrails to avoid worst-case >>>>> degradations. It is my understanding that the existing inlining >>>>> infrastructure (with late inlining, for example) provides many >>>>> safeguards already, and it is up to this change not to abuse these. 
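[Editor's note] The 3-receivers-at-30% example above reduces to simple arithmetic: the share of calls still taking the fallback path is one minus the probability mass covered by emitted guards. A hedged sketch of that calculation follows; the per-receiver probability cutoff models the 10% threshold proposed in the thread, not an existing HotSpot flag.

```java
// Fraction of calls left on the virtual-call fallback path, given
// per-receiver probabilities, a guard-count limit, and a minimum
// per-receiver probability below which no guard is emitted.
class FallbackMass {
    static double fallback(double[] probs, int maxGuards, double minProb) {
        // probs must be sorted by descending frequency, as in the profile.
        double covered = 0.0;
        int guards = 0;
        for (double p : probs) {
            if (guards == maxGuards || p < minProb) break; // stop emitting guards
            covered += p;
            guards++;
        }
        return 1.0 - covered; // probability mass that reaches the fallback
    }
}
```

With probabilities {0.30, 0.30, 0.30}, a bimorphic limit leaves 40% of calls on the virtual fallback, while allowing a third guard leaves only the residual 10%, which is the gain the example argues for.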
>>>>>> (It clearly doesn't work to tell an impacted customer, well, you >>>>>> may get a 5% loss, but the micro created to test this thing shows >>>>>> a 20% gain, and all the functional tests pass.) >>>>>> >>>>>> This leads me to the following suggestion:? Your code is a very >>>>>> good POC, and deserves more work, and the next step in that work >>>>>> is probably looking for and thinking about performance >>>>>> regressions, and figuring out how to throttle this thing. >>>>> Here again, I want that feature to be behind a configuration knob, >>>>> and then discuss in a future RFR to change the default. >>>>>> A specific next step would be to make the throttling of this >>>>>> feature be controllable. MorphismLimit should be a global on its >>>>>> own.? And it should be configurable through the CompilerOracle per >>>>>> method.? (See similar code for similar throttles.)? And it should >>>>>> be more sensitive to the hotness of the overall call and of the >>>>>> various slices of the call's profile.? (I notice with suspicion >>>>>> that the comment "The single majority receiver sufficiently >>>>>> outweighs the minority" is missing in the changed code.)? And, if >>>>>> the change is as disruptive to heuristics as I suspect it *might* >>>>>> be, the call site itself *might* need some kind of dynamic >>>>>> feedback which says, after some deopt or reprofiling, "take it >>>>>> easy here, try plan B." That last point is just speculation, but I >>>>>> threw it in to show the kinds of measures we *sometimes* have to >>>>>> take in avoiding "side effects" to our locally pleasant >>>>>> optimizations. >>>>> I'll add this per-method knob on the CompilerOracle in the next >>>>> iteration of this patch. >>>>>> But, let me repeat: I'm glad to see this experiment. And very, >>>>>> very glad to see all the cool stuff that is coming out of your >>>>>> work-group.? Welcome to the adventure! 
>>>>> For future improvements, I will keep focusing on inlining as I see >>>>> it as the door opener to many more optimizations in C2. I am still >>>>> learning at what can be done to reduce the size of the inlined code >>>>> by, for example, applying specific optimizations that simplify the >>>>> CG (like dead-code elimination or constant propagation) before >>>>> inlining the code. As you said, we are not short of ideas on *how* >>>>> to improve it, but we have to be very wary of *what impact* it'll >>>>> have on real-world applications. We're working with internal >>>>> customers to figure that out, and we'll share them as soon as we >>>>> are ready with benchmarks for those use-case patterns. >>>>> What I am working on now is: >>>>> ??? - Add a per-method flag through CompilerOracle >>>>> ??? - Add a threshold on the probability of a receiver to generate >>>>> a guard (I am thinking of 10%, i.e., if a receiver is observed less >>>>> than 1 in every 10 calls, then don't generate a guard and use the >>>>> fallback) >>>>> ??? - Check the overhead of increasing TypeProfileWidth on >>>>> profiling speed (in the interpreter and level #3 code) >>>>> Thank you, and looking forward to the next review (I expect to post >>>>> the next iteration of the patch today or tomorrow). >>>>> -- >>>>> Ludovic >>>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>> To: Ludovic Henry ; >>>>> hotspot-compiler-dev at openjdk.java.net >>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>> >>>>> Very interesting results, Ludovic! 
>>>>>
>>>>>> The image can be found at
>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>
>>>>>
>>>>> Can you elaborate on the experiment itself, please? In particular, what
>>>>> does PERCENTILES actually mean?
>>>>>
>>>>> I haven't looked through the patch in detail, but here are some
>>>>> thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems
>>>>> you try to generalize (b) which becomes:
>>>>>
>>>>>       if (recv.klass == K1) {
>>>>>          m1(...); // either inline or a direct call
>>>>>       } else if (recv.klass == K2) {
>>>>>          m2(...); // either inline or a direct call
>>>>>       ...
>>>>>       } else if (recv.klass == Kn) {
>>>>>          mn(...); // either inline or a direct call
>>>>>       } else {
>>>>>          deopt(); // invalidate + reinterpret
>>>>>       }
>>>>>
>>>>> Question #1: what if you generalize the polymorphic shape instead and allow
>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less
>>>>> beneficial the higher the morphism is (especially considering the inlining
>>>>> on all paths becomes less likely as well).
So, having a virtual call >>>>> (which becomes less likely due to lower frequency) on the fallback >>>>> path >>>>> may be a better option. >>>>> >>>>> >>>>> Question #2: it would be very interesting to understand what exactly >>>>> contributes the most to performance improvements? Is it inlining? Or >>>>> maybe devirtualization (avoid the cost of virtual call)? How much come >>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>> >>>>> Deciding how to spend inlining budget on multiple targets with >>>>> moderate >>>>> frequency can be hard, so it makes sense to consider expanding >>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental >>>>> inlining). >>>>> >>>>> >>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>> (interpreter and level #3 code) and dynamic footprint? >>>>> >>>>> >>>>> Getting answers to those (and similar) questions should give us much >>>>> more insights what is actually happening in practice. >>>>> >>>>> Speaking of the first deliverables, it would be good to introduce a >>>>> new >>>>> experimental mode to be able to easily conduct such experiments with >>>>> product binaries and I'd like to see the patch evolving in that >>>>> direction. It'll enable us to gather important data to guide our >>>>> decisions about how to enhance the heuristics in the product. >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> [1] (a) Monomorphic: >>>>> ????? if (recv.klass == K1) { >>>>> ???????? m1(...); // either inline or a direct call >>>>> ????? } else { >>>>> ???????? deopt(); // invalidate + reinterpret >>>>> ????? } >>>>> >>>>> ????? (b) Bimorphic: >>>>> ????? if (recv.klass == K1) { >>>>> ???????? m1(...); // either inline or a direct call >>>>> ????? } else if (recv.klass == K2) { >>>>> ???????? m2(...); // either inline or a direct call >>>>> ????? } else { >>>>> ???????? deopt(); // invalidate + reinterpret >>>>> ????? } >>>>> >>>>> ????? (c) Polymorphic: >>>>> ????? 
if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>>          m1(...); // either inline or a direct call
>>>>>       } else {
>>>>>          K.m(); // virtual call
>>>>>       }
>>>>>
>>>>>       (d) Megamorphic:
>>>>>       K.m(); // virtual (K is either concrete or interface class)
>>>>>
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: hotspot-compiler-dev
>>>>>> On Behalf Of
>>>>>> Ludovic Henry
>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In our ongoing search for better performance, I've looked at
>>>>>> inlining and, more specifically, at polymorphic guarded inlining.
>>>>>> Today in HotSpot, the maximum number of type guards at any
>>>>>> call site is two - with bimorphic guarded inlining. However, Graal
>>>>>> and Zing have observed great results from increasing that limit.
>>>>>>
>>>>>> You'll find below a patch that makes the number of type guards
>>>>>> configurable with the `TypeProfileWidth` global.
>>>>>>
>>>>>> Testing:
>>>>>> Passing tier1 on Linux and Windows, plus other large applications
>>>>>> (through the Adopt testing scripts)
>>>>>>
>>>>>> Benchmarking:
>>>>>> To get data, we ran a benchmark against Apache Pinot and observed
>>>>>> the following results:
>>>>>>
>>>>>> [benchmark chart attachment not preserved in the archive]
>>>>>>
>>>>>> We observe close to 20% improvement on this sample benchmark with
>>>>>> a morphism (=width) of 3 or 4. We are currently validating these
>>>>>> numbers on a more extensive set of benchmarks and platforms, and
>>>>>> I'll share them as soon as we have them.
>>>>>>
>>>>>> I am happy to provide more information, just let me know if you
>>>>>> have any questions.
>>>>>> >>>>>> Thank you, >>>>>> >>>>>> -- >>>>>> Ludovic >>>>>> >>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> index 73854806ed..845070fbe1 100644 >>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>> @@ -38,7 +38,7 @@ private: >>>>>> ?????? friend class ciMethod; >>>>>> ?????? friend class ciMethodHandle; >>>>>> >>>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care >>>>>> about >>>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care >>>>>> about >>>>>> ?????? int? _limit;??????????????? // number of receivers have >>>>>> been determined >>>>>> ?????? int? _morphism;???????????? // determined call site's morphism >>>>>> ?????? int? _count;??????????????? // # times has this call been >>>>>> executed >>>>>> @@ -47,6 +47,7 @@ private: >>>>>> ?????? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>>>> >>>>>> ?????? ciCallProfile() { >>>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>>> can't be smaller than TypeProfileWidth"); >>>>>> ???????? _limit = 0; >>>>>> ???????? _morphism??? = 0; >>>>>> ???????? _count = -1; >>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>> @@ -496,9 +496,7 @@ ciCallProfile >>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>> ?????????? // Every profiled call site has a counter. >>>>>> ?????????? int count = >>>>>> check_overflow(data->as_CounterData()->count(), >>>>>> java_code_at_bci(bci)); >>>>>> >>>>>> -????? if (!data->is_ReceiverTypeData()) { >>>>>> -??????? result._receiver_count[0] = 0;? // that's a definite zero >>>>>> -????? } else { // ReceiverTypeData is a subclass of CounterData >>>>>> +????? 
if (data->is_ReceiverTypeData()) {
>>>>>>             ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>>             // In addition, virtual call sites have receiver type information
>>>>>>             int receivers_count_total = 0;
>>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>>>               // is recorded or an associated counter is incremented, but not both. With
>>>>>>               // tiered compilation, however, both can happen due to the interpreter and
>>>>>>               // C1 profiling invocations differently. Address that inconsistency here.
>>>>>> -          if (morphism == 1 && count > 0) {
>>>>>> +          if (morphism >= 1 && count > 0) {
>>>>>>               epsilon = count;
>>>>>>               count = 0;
>>>>>>             }
>>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>>>             // If we extend profiling to record methods,
>>>>>>             // we will set result._method also.
>>>>>>           }
>>>>>> +        result._morphism = morphism;
>>>>>>           // Determine call site's morphism.
>>>>>>           // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>>>>           // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>>>>>           // The call site count is > 0 in the case of a polymorphic virtual call.
>>>>>> -        if (morphism > 0 && morphism == result._limit) {
>>>>>> -           // The morphism <= MorphismLimit.
>>>>>> -           if ((morphism <  ciCallProfile::MorphismLimit) ||
>>>>>> -               (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>>>>> +        assert(result._morphism == result._limit, "");
>>>>>> #ifdef ASSERT
>>>>>> +        if (result._morphism > 0) {
>>>>>> +           // The morphism <= TypeProfileWidth.
>>>>>> +           if ((result._morphism <  TypeProfileWidth) ||
>>>>>> +               (result._morphism == TypeProfileWidth && count == 0)) {
>>>>>>               if (count > 0) {
>>>>>>                 this->print_short_name(tty);
>>>>>>                 tty->print_cr(" @ bci:%d", bci);
>>>>>>                 this->print_codes();
>>>>>>                 assert(false, "this call site should not be polymorphic");
>>>>>>               }
>>>>>> -#endif
>>>>>> -            result._morphism = morphism;
>>>>>>             }
>>>>>>           }
>>>>>> +#endif
>>>>>>           // Make the count consistent if this is a call profile. If count is
>>>>>>           // zero or less, presume that this is a typecheck profile and
>>>>>>           // do nothing.  Otherwise, increase count to be the sum of all
>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>>>>     }
>>>>>>     _receiver[i] = receiver;
>>>>>>     _receiver_count[i] = receiver_count;
>>>>>> -  if (_limit < MorphismLimit) _limit++;
>>>>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>>>> }
>>>>>>
>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> @@ -389,9 +389,16 @@
>>>>>>     product(bool, UseBimorphicInlining, true,                                \
>>>>>>             "Profiling based inlining for two receivers")                    \
>>>>>>                                                                              \
>>>>>> +  product(bool, UsePolymorphicInlining, true,                              \
>>>>>> +          "Profiling based inlining for two or more receivers")            \
>>>>>> +                                                                           \
>>>>>>     product(bool, UseOnlyInlinedBimorphic, true,                             \
>>>>>>             "Don't use BimorphicInlining if can't inline a second method")   \
>>>>>>                                                                              \
>>>>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                           \
>>>>>> +          "Don't use PolymorphicInlining if can't inline a non-major "     \
>>>>>> +          "receiver's method")                                             \
>>>>>> +                                                                           \
>>>>>>     product(bool, InsertMemBarAfterArraycopy, true,                          \
>>>>>>             "Insert memory barrier after arraycopy call")                    \
>>>>>>                                                                              \
>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>
>>>>>>     // See how many times this site has been invoked.
>>>>>>     int site_count = profile.count();
>>>>>> -  int receiver_count = -1;
>>>>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>>>>> -    // Receivers in the profile structure are ordered by call counts
>>>>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>>>>> -    receiver_count = profile.receiver_count(0);
>>>>>> -  }
>>>>>>
>>>>>>     CompileLog* log = this->log();
>>>>>>     if (log != NULL) {
>>>>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>>>>> +    ResourceMark rm;
>>>>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>> +      rids[i] = log->identify(profile.receiver(i));
>>>>>> +    }
>>>>>>     log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>>>                     log->identify(callee), site_count, prof_factor);
>>>>>>     if (call_does_dispatch)  log->print(" virtual='1'");
>>>>>>     if (allow_inline)     log->print(" inline='1'");
>>>>>> -    if (receiver_count >= 0) {
>>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>>>>> -      if (profile.has_receiver(1)) {
>>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>> +      if (i == 0) {
>>>>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>>>>> +      } else {
>>>>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>>       }
>>>>>>     }
>>>>>>     if (callee->is_method_handle_intrinsic()) {
>>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>     if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>>>       // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>>>>       bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>>>> -      ciMethod* receiver_method = NULL;
>>>>>>
>>>>>>       int morphism = profile.morphism();
>>>>>> +
>>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>>>>> +
>>>>>>       if (speculative_receiver_type != NULL) {
>>>>>>         if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>>>>           // We have a speculative type, we should be able to resolve
>>>>>>           // the call. We do that before looking at the profiling at
>>>>>> -          // this invoke because it may lead to bimorphic inlining which
>>>>>> +          // this invoke because it may lead to polymorphic inlining which
>>>>>>           // a speculative type should help us avoid.
>>>>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                   speculative_receiver_type);
>>>>>> -          if (receiver_method == NULL) {
>>>>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                       speculative_receiver_type);
>>>>>> +          if (receiver_methods[0] == NULL) {
>>>>>>             speculative_receiver_type = NULL;
>>>>>>           } else {
>>>>>>             morphism = 1;
>>>>>>           }
>>>>>>         } else {
>>>>>>           // speculation failed before. Use profiling at the call
>>>>>> -          // (could allow bimorphic inlining for instance).
>>>>>> +          // (could allow polymorphic inlining for instance).
>>>>>>           speculative_receiver_type = NULL;
>>>>>>         }
>>>>>>       }
>>>>>> -      if (receiver_method == NULL &&
>>>>>> +      if (receiver_methods[0] == NULL &&
>>>>>>           (have_major_receiver || morphism == 1 ||
>>>>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>>>>> -        // receiver_method = profile.method();
>>>>>> +           (morphism == 2 && UseBimorphicInlining) ||
>>>>>> +           (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>> +        assert(profile.has_receiver(0), "no receiver at 0");
>>>>>> +        // receiver_methods[0] = profile.method();
>>>>>>         // Profiles do not suggest methods now.  Look it up in the major receiver.
>>>>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                 profile.receiver(0));
>>>>>> +        receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                     profile.receiver(0));
>>>>>>       }
>>>>>> -      if (receiver_method != NULL) {
>>>>>> -        // The single majority receiver sufficiently outweighs the minority.
>>>>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>>> -        if (hit_cg != NULL) {
>>>>>> -          // Look up second receiver.
>>>>>> -          CallGenerator* next_hit_cg = NULL;
>>>>>> -          ciMethod* next_receiver_method = NULL;
>>>>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>>>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -                                                          profile.receiver(1));
>>>>>> -            if (next_receiver_method != NULL) {
>>>>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>>>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>>>>> -                                  allow_inline, prof_factor);
>>>>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>> -                  // Skip if we can't inline second receiver's method
>>>>>> -                  next_hit_cg = NULL;
>>>>>> +      if (receiver_methods[0] != NULL) {
>>>>>> +        CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>> +        memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>>> +
>>>>>> +        hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>> +                            vtable_index, !call_does_dispatch, jvms,
>>>>>> +                            allow_inline, prof_factor);
>>>>>> +        if (hit_cgs[0] != NULL) {
>>>>>> +          if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>> +            for (int i = 1; i < morphism; i++) {
>>>>>> +              assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>>> +              receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +                                                           profile.receiver(i));
>>>>>> +              if (receiver_methods[i] != NULL) {
>>>>>> +                hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>>> +                                      vtable_index, !call_does_dispatch, jvms,
>>>>>> +                                      allow_inline, prof_factor);
>>>>>> +                if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>> +                    ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>> +                  // Skip if we can't inline non-major receiver's method
>>>>>> +                  hit_cgs[i] = NULL;
>>>>>> +                }
>>>>>>               }
>>>>>>             }
>>>>>>           }
>>>>>>           CallGenerator* miss_cg;
>>>>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>>>>> -                                               ? Deoptimization::Reason_bimorphic
>>>>>> +          Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>> +                                               ? Deoptimization::Reason_polymorphic
>>>>>>                                                 : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>> -             ) {
>>>>>> +          if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>>             // Generate uncommon trap for class check failure path
>>>>>> -            // in case of monomorphic or bimorphic virtual call site.
>>>>>> +            // in case of polymorphic virtual call site.
>>>>>>             miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>>>                         Deoptimization::Action_maybe_recompile);
>>>>>>           } else {
>>>>>>             // Generate virtual call for class check failure path
>>>>>> -            // in case of polymorphic virtual call site.
>>>>>> +            // in case of megamorphic virtual call site.
>>>>>>             miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>>>           }
>>>>>> -          if (miss_cg != NULL) {
>>>>>> -            if (next_hit_cg != NULL) {
>>>>>> +          for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>>> +            if (hit_cgs[i] != NULL) {
>>>>>>               assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>>> +              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>>>               // We don't need to record dependency on a receiver here and below.
>>>>>>               // Whenever we inline, the dependency is added by Parse::Parse().
>>>>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>>> -            }
>>>>>> -            if (miss_cg != NULL) {
>>>>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>> -              if (cg != NULL)  return cg;
>>>>>> +              miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>>>             }
>>>>>>           }
>>>>>> +          if (miss_cg != NULL) {
>>>>>> +            ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>>> +            float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>> +            CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>> +            if (cg != NULL)  return cg;
>>>>>> +          }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> index 11df15e004..2d14b52854 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>>>     "class_check",
>>>>>>     "array_check",
>>>>>>     "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>     "profile_predicate",
>>>>>>     "unloaded",
>>>>>>     "uninitialized",
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>>     Reason_class_check,           // saw unexpected object class (@bci)
>>>>>>     Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>>>>     Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>>>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>>>>
>>>>>> #if INCLUDE_JVMCI
>>>>>>     Reason_unreached0             = Reason_null_assert,
>>>>>>     Reason_type_checked_inlining  = Reason_intrinsic,
>>>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>>> #endif
>>>>>>
>>>>>>     Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> index 94b544824e..ee761626c4 100644
>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>>>     declare_constant(Deoptimization::Reason_class_check)                     \
>>>>>>     declare_constant(Deoptimization::Reason_array_check)                     \
>>>>>>     declare_constant(Deoptimization::Reason_intrinsic)                       \
>>>>>> -   declare_constant(Deoptimization::Reason_bimorphic)                       \
>>>>>> +   declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>>>>     declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>>>>     declare_constant(Deoptimization::Reason_unloaded)                        \
>>>>>>     declare_constant(Deoptimization::Reason_uninitialized)                   \

From viv.desh at gmail.com  Mon Apr  6 18:55:05 2020
From: viv.desh at gmail.com (Vivek Deshpande)
Date: Mon, 6 Apr 2020 11:55:05 -0700
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes
In-Reply-To: 
References: 
Message-ID: 

Hi Sandhya,

I looked at the patch over the weekend. It looks good to me and a lot of work is involved. I have a question: is this patch intended for panama/dev or mainline jdk?

Nit: macroAssembler_x86.cpp has an extra line at 115.

Regards,
Vivek
OpenJDK id: vdeshpande

On Fri, Apr 3, 2020 at 5:18 PM Viswanathan, Sandhya <
sandhya.viswanathan at intel.com> wrote:

> Hi,
>
> Following up on review requests of API [0], Java implementation [1] and
> General HotSpot changes [2] for Vector API, here's a request for review
> of x86 backend changes required for supporting the API:
>
> JEP: https://openjdk.java.net/jeps/338
> JBS: https://bugs.openjdk.java.net/browse/JDK-8223347
> Webrev:
> http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/
>
> Complete implementation resides in vector-unstable branch of
> panama/dev repository [3].
>
> Looking forward to your feedback.
>
> Best Regards,
> Sandhya
>
> [0]
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html
>
> [1]
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html
>
> [2]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html
>
> [3] https://openjdk.java.net/projects/panama/
>
>     $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable
>

--
Thanks and Regards,
Vivek Deshpande
viv.desh at gmail.com

From sandhya.viswanathan at intel.com  Mon Apr  6 19:01:17 2020
From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya)
Date: Mon, 6 Apr 2020 19:01:17 +0000
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes
In-Reply-To: 
References: 
Message-ID: 

Hi Vivek,

Thanks for the feedback. This patch is for mainline jdk.

Best Regards,
Sandhya

From: Vivek Deshpande
Sent: Monday, April 06, 2020 11:55 AM
To: Viswanathan, Sandhya
Cc: hotspot-compiler-dev at openjdk.java.net; core-libs-dev at openjdk.java.net; hotspot-dev
Subject: Re: RFR (XXL): 8223347: Integration of Vector API (Incubator): x86 backend changes

Hi Sandhya,

I looked at the patch over the weekend. It looks good to me and a lot of work is involved. I have a question: is this patch intended for panama/dev or mainline jdk?

Nit: macroAssembler_x86.cpp has an extra line at 115.

Regards,
Vivek
OpenJDK id: vdeshpande

On Fri, Apr 3, 2020 at 5:18 PM Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:

Hi,

Following up on review requests of API [0], Java implementation [1] and
General HotSpot changes [2] for Vector API, here's a request for review
of x86 backend changes required for supporting the API:

JEP: https://openjdk.java.net/jeps/338
JBS: https://bugs.openjdk.java.net/browse/JDK-8223347
Webrev: http://cr.openjdk.java.net/~sviswanathan/VAPI_RFR/x86_webrev/webrev.00/

Complete implementation resides in vector-unstable branch of
panama/dev repository [3].

Looking forward to your feedback.
Best Regards,
Sandhya

[0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html

[1] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/065587.html

[2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037798.html

[3] https://openjdk.java.net/projects/panama/

    $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable

--
Thanks and Regards,
Vivek Deshpande
viv.desh at gmail.com

From ekaterina.pavlova at oracle.com  Tue Apr  7 03:12:49 2020
From: ekaterina.pavlova at oracle.com (Ekaterina Pavlova)
Date: Mon, 6 Apr 2020 20:12:49 -0700
Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes
In-Reply-To: 
References: 
Message-ID: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com>

Hi Vladimir,

what kind of testing has been done to verify these changes?
Taking into account that the changes are quite large and touch shared code,
running the HotSpot compiler and perhaps runtime tiers would be very advisable.

thanks,
-katya

On 4/3/20 4:12 PM, Vladimir Ivanov wrote:
> Hi,
>
> Following up on review requests of API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API:
>
> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/
>
> (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.)
>
> Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in C2 JIT-compiler.
>
> As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fall back behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)."
>
> The rest of JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values.
> It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available.
>
> Vector API talk from JVMLS'18 [5] contains brief overview of JVM implementation and some details.
>
> Complete implementation resides in vector-unstable branch of panama/dev repository [6].
>
> Now to gory details (the patch is split in multiple "sub-webrevs"):
>
> ===========================================================
>
> (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/
>
> Ideal vector nodes for new operations introduced by Vector API.
>
> (Platform-specific back end support will be posted for review separately.)
>
> ===========================================================
>
> (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/
>
> JVM Java interface (VectorSupport) and intrinsic support in C2.
>
> Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by a VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before the EA pass.
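The intrinsic-with-Java-fallback pattern Paul describes in the quoted paragraph can be sketched in plain Java. This is an illustrative sketch only: the class, method name, and signature below are invented for the example and are not the actual jdk.internal.vm.vector.VectorSupport API.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical sketch of an intrinsic candidate whose last argument is the
// scalar fallback. When C2 supports the operation for the given arguments it
// would emit a vector instruction instead; in the interpreter and in C1 the
// Java fallback below always runs, element by element.
public class IntrinsicFallbackSketch {
    static int[] binaryOp(int[] a, int[] b, IntBinaryOperator scalarFallback) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = scalarFallback.applyAsInt(a[i], b[i]); // scalar fallback path
        }
        return r;
    }

    public static void main(String[] args) {
        int[] r = binaryOp(new int[]{1, 2, 3}, new int[]{4, 5, 6}, Integer::sum);
        if (r[0] != 5 || r[1] != 7 || r[2] != 9) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Whether the intrinsic or the fallback runs is transparent to the caller; both must produce the same result, which is what makes the fallback a safe default.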
>
> Vectors have a 2-level on-heap representation: a primitive array is used as the backing storage for the vector value, and it is encapsulated in a typed wrapper (e.g., Int256Vector - a vector of 8 ints - contains an int[8] instance which is used to store the vector value).
>
> Unless a VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with the VectorBox node and using it during expansion.
>
> Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over.
>
> ===========================================================
>
> (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/
>
> Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].)
>
> The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or are spilled on the stack, and this allows boxing/unboxing to be avoided when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata, and all the heavy lifting happens only when rematerialization takes place.
>
> The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing).
>
> (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point thus enabling the original instance to go away.)
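The 2-level representation described above, a typed wrapper encapsulating a primitive array used as backing storage, can be sketched as follows. The class below is a simplified stand-in for illustration, not the real Int256Vector implementation.

```java
// Simplified stand-in for the wrapper-around-array layout described above:
// the vector value lives in a primitive int[8] (256 bits), and the typed
// wrapper is the object that C2's vector box elimination tries to remove.
public class VectorBoxSketch {
    static final class Int256 {
        final int[] vec;                    // level 2: primitive backing storage

        Int256(int[] vec) {                 // level 1: typed wrapper ("box")
            if (vec.length != 8) throw new IllegalArgumentException("need 8 lanes");
            this.vec = vec.clone();
        }

        int lane(int i) { return vec[i]; }  // "unboxing" reads the backing array
    }

    public static void main(String[] args) {
        Int256 v = new Int256(new int[]{0, 1, 2, 3, 4, 5, 6, 7});
        if (v.lane(3) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```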
>
> ===========================================================
>
> (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/
>
> HotSpot changes for the jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-modules jdk.incubator.vector" on the command line to be able to use it.
> When the user does that, the JVM automatically enables Vector API support.
> It improves usability (the user doesn't need to separately "open" the API and enable JVM support) while minimizing risks of destabilization from new code when the API is not used.
>
> That's it! Will be happy to answer any questions.
>
> And thanks in advance for any feedback!
>
> Best regards,
> Vladimir Ivanov
>
> [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html
>
> [2] https://openjdk.java.net/jeps/338
>
> [3] https://openjdk.java.net/projects/panama/
>
> [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html
>
> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf
>
> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9
>
$ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From vladimir.x.ivanov at oracle.com Tue Apr 7 09:39:32 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 7 Apr 2020 12:39:32 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> References: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> Message-ID: <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> Hi Katya, > what kind of testing has been done to verify these changes? > Taking into account the changes are quite large and touch share code > running hs compiler and perhaps runtime tiers would be very advisable. The changes (and previous versions) were tested in 2 modes: * ran through tier1-tier4 with the functionality turned OFF; (also, some previous version went through tier1-tier6 once) * unit tests on Vector API were run on different x86 hardware in the following modes: -XX:UseAVX=[3,2,1,0] -XX:UseSSE=[4,3,2]. Arm engineers tested the version in vector-unstable branch on AArch64 hardware. As of now, the only known test failure is compiler/graalunit/HotspotTest.java in org.graalvm.compiler.hotspot.test.CheckGraalIntrinsics which should be taught about new JVM intrinsics added. Best regards, Vladimir Ivanov > On 4/3/20 4:12 PM, Vladimir Ivanov wrote: >> Hi, >> >> Following up on review requests of API [0] and Java implementation [1] >> for Vector API (JEP 338 [2]), here's a request for review of general >> HotSpot changes (in shared code) required for supporting the API: >> >> >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ >> >> >> (First of all, to set proper expectations: since the JEP is still in >> Candidate state, the intention is to initiate preliminary round(s) of >> review to inform the community and gather feedback before sending out >> final/official RFRs once the JEP is Targeted to a release.) 
>> Vector API (being developed in Project Panama [3]) relies on JVM
>> support to utilize optimal vector hardware instructions at runtime. It
>> interacts with JVM through intrinsics (declared in
>> jdk.internal.vm.vector.VectorSupport [4]) which expose vector
>> operations support in C2 JIT-compiler.
>>
>> As Paul wrote earlier: "A vector intrinsic is an internal low-level
>> vector operation. The last argument to the intrinsic is fall back
>> behavior in Java, implementing the scalar operation over the number of
>> elements held by the vector. Thus, if the intrinsic is not supported
>> in C2 for the other arguments then the Java implementation is executed
>> (the Java implementation is always executed when running in the
>> interpreter or for C1)."
>>
>> The rest of JVM support is about aggressively optimizing vector boxes
>> to minimize (ideally eliminate) the overhead of boxing for vector values.
>> It's a stop-gap solution for the vector box elimination problem until
>> inline classes arrive. Vector classes are value-based and in the
>> longer term will be migrated to inline classes once the support
>> becomes available.
>>
>> Vector API talk from JVMLS'18 [5] contains brief overview of JVM
>> implementation and some details.
>>
>> Complete implementation resides in vector-unstable branch of
>> panama/dev repository [6].
>>
>> Now to gory details (the patch is split in multiple "sub-webrevs"):
>>
>> ===========================================================
>>
>> (1)
>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/
>>
>> Ideal vector nodes for new operations introduced by Vector API.
>>
>> (Platform-specific back end support will be posted for review
>> separately.)
>> >> =========================================================== >> >> (2) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ >> >> >> JVM Java interface (VectorSupport) and intrinsic support in C2. >> >> Vector instances are initially represented as VectorBox macro nodes >> and "unboxing" is represented by VectorUnbox node. It simplifies >> vector box elimination analysis and the nodes are expanded later right >> before EA pass. >> >> Vectors have 2-level on-heap representation: for the vector value >> primitive array is used as a backing storage and it is encapsulated in >> a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a >> int[8] instance which is used to store vector value). >> >> Unless VectorBox node goes away, it needs to be expanded into an >> allocation eventually, but it is a pure node and doesn't have any JVM >> state associated with it. The problem is solved by keeping JVM state >> separately in a VectorBoxAllocate node associated with VectorBox node >> and use it during expansion. >> >> Also, to simplify vector box elimination, inlining of vector reboxing >> calls (VectorSupport::maybeRebox) is delayed until the analysis is over. >> >> =========================================================== >> >> (3) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ >> >> >> Vector box elimination analysis implementation. (Brief overview: >> slides #36-42 [5].) >> >> The main part is devoted to scalarization across safepoints and >> rematerialization support during deoptimization. In C2-generated code >> vector operations work with raw vector values which live in registers >> or spilled on the stack and it allows to avoid boxing/unboxing when a >> vector value is alive across a safepoint. 
As with other values, >> there's just the location of the vector value at the safepoint and >> the vector type information recorded in the relevant nmethod metadata; >> all the heavy lifting happens only when rematerialization takes place. >> >> The analysis preserves object identity invariants except during >> aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). >> >> (Aggressive reboxing is crucial for cases when vectors "escape": it >> allocates a fresh instance at every escape point, thus enabling the >> original instance to go away.) >> >> =========================================================== >> >> (4) >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ >> >> >> HotSpot changes for the jdk.incubator.vector module. Vector support is >> marked experimental and turned off by default. JEP 338 proposes the >> API to be released as an incubator module, so a user has to specify >> "--add-modules jdk.incubator.vector" on the command line to be able to >> use it. >> When the user does that, the JVM automatically enables Vector API support. >> This improves usability (the user doesn't need to separately "open" the API >> and enable JVM support) while minimizing the risk of destabilization >> from new code when the API is not used. >> >> >> That's it! Will be happy to answer any questions. >> >> And thanks in advance for any feedback! 
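The 2-level on-heap representation from (2) can be sketched as follows. This is an illustrative model only (hypothetical class name); the real Int256Vector carries more state and is generated as part of the API.

```java
// Simplified model of the 2-level on-heap representation: the vector value
// lives in a primitive array, encapsulated in a typed wrapper. Boxing thus
// costs two allocations (wrapper + array), which is exactly what the vector
// box elimination analysis tries to remove when the value can stay in
// registers. Not the real Int256Vector, just an illustration.
final class Int256VectorSketch {
    static final int LANES = 8;   // 256 bits / 32 bits per int lane
    private final int[] vec;      // backing storage for the vector value

    Int256VectorSketch(int[] v) {
        if (v.length != LANES) throw new IllegalArgumentException("need " + LANES + " lanes");
        this.vec = v.clone();     // value-based: keep the wrapper effectively immutable
    }

    int lane(int i) { return vec[i]; }
}
```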
>> >> Best regards, >> Vladimir Ivanov >> >> [0] >> https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >> >> >> [1] >> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >> >> >> [2] https://openjdk.java.net/jeps/338 >> >> [3] https://openjdk.java.net/projects/panama/ >> >> [4] >> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >> >> >> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >> >> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >> >> $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable > From vladimir.kozlov at oracle.com Tue Apr 7 17:15:34 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:15:34 -0700 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> Message-ID: <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> I also agree with these changes. And I see that Tobias's testing did not find issues (except a timeout on SPARC). Thanks, Vladimir On 4/6/20 1:51 AM, Tobias Hartmann wrote: > > On 06.04.20 10:34, Roland Westrelin wrote: >> I've been wondering about that too but couldn't find a scenario where it >> would go wrong. dominated_by() is what's used when an if is replaced by a >> dominating if with the same condition in >> PhaseIdealLoop::split_if_with_blocks_post(). Loop unswitching is similar: >> we add a dominating if, and then remove the loop copies because they are >> redundant. > > Right, I couldn't find such a scenario either and, as you've pointed out, the same problem would > exist at other places as well. Looks good. 
> > Best regards, > Tobias > From vladimir.x.ivanov at oracle.com Tue Apr 7 17:29:55 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Tue, 7 Apr 2020 20:29:55 +0300 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher Message-ID: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8242289 Introduce a platform-specific entry point (Matcher::pd_clone_node) and move platform-specific node cloning during matching into it. The Matcher processes every node only once unless it is marked as shared. This is too restrictive in some cases, so the workaround is to explicitly check for particular IR patterns and clone the relevant nodes during the matching phase. As an example, take a look at ShiftCntV. There are match rules like the following in aarch64.ad: match(Set dst (RShiftVB src (RShiftCntV shift))); By default, a RShiftCntV node is matched only once, so when it has multiple users, it will be folded into only one of them; for the rest, the value it produces will be put in a register. To overcome that, the Matcher is taught to detect such a pattern and "clone" the RShiftCntV input every time it matches an RShiftV node. In the case of RShiftCntV, it's arm32/aarch64-specific and other platforms (x86 in particular) don't optimize for it. To avoid polluting shared code (in matcher.cpp) with platform-specific portions, I propose to add Matcher::pd_clone_node and place the platform-specific checks there. Also, as a cleanup, renamed Matcher::clone_address_expressions() to pd_clone_address_expressions() since it's a platform-specific method. Testing: hs-precheckin-comp, hs-tier1, hs-tier2, cross-builds on all affected platforms Thanks! 
Best regards, Vladimir Ivanov From vladimir.kozlov at oracle.com Tue Apr 7 17:43:25 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:43:25 -0700 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher In-Reply-To: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> References: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> Message-ID: Good. Thanks, Vladimir On 4/7/20 10:29 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242289 > > Introduce a platform-specific entry point (Matcher::pd_clone_node) and move platform-specific node cloning during matching into it. > > The Matcher processes every node only once unless it is marked as shared. > This is too restrictive in some cases, so the workaround is to explicitly check for particular IR patterns and clone > the relevant nodes during the matching phase. > > As an example, take a look at ShiftCntV. There are match rules like the following in aarch64.ad: > > match(Set dst (RShiftVB src (RShiftCntV shift))); > > By default, a RShiftCntV node is matched only once, so when it has multiple users, it will be folded into only one of > them; for the rest, the value it produces will be put in a register. To overcome that, the Matcher is taught to detect such > a pattern and "clone" the RShiftCntV input every time it matches an RShiftV node. In the case of RShiftCntV, it's > arm32/aarch64-specific and other platforms (x86 in particular) don't optimize for it. > > To avoid polluting shared code (in matcher.cpp) with platform-specific portions, I propose to add Matcher::pd_clone_node > and place the platform-specific checks there. > > Also, as a cleanup, renamed Matcher::clone_address_expressions() to pd_clone_address_expressions() since it's a > platform-specific method. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2, > cross-builds on all affected platforms > > Thanks! 
> > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Tue Apr 7 17:54:07 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 7 Apr 2020 10:54:07 -0700 Subject: Polymorphic Guarded Inlining in C2 In-Reply-To: <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com> <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com> <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com> Message-ID: <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com> Another thing we can do is collect statistics about how many different receivers can be recorded with a big TypeProfileWidth. My recollection from long ago is that the only case for poly was HashMap usage. It would be nice to collect this data again for modern Java benchmarks. We can use them to see the effects of changes - benchmarks which do not have poly cases are useless in these experiments. On 4/6/20 6:38 AM, Vladimir Ivanov wrote: > I see 2 directions (mostly independent) to proceed: (1) use existing profiling info only; and (2) when more profile info > is available. > > I suggest exploring them independently. > > There's enough profiling data available to introduce a polymorphic case with 2 major receivers ("2-poly"). And it'll > complete the matrix of possible shapes. Please explain how it is different from the current bimorphic case? > > Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic shapes: "N-morphic" and "N-poly". The only > difference between them is what happens on the fallback path - a deopt / uncommon trap or a virtual call. > > Regarding 2-poly, there is TypeProfileMajorReceiverPercent which should be extended to 2 cases, which leads to 2 > parameters: aggregated major receiver percentage and minimum individual percentage. okay > > Also, it makes sense to introduce UseOnlyInlinedPolymorphic which aligns 2-poly with the bimorphic case. > > And, as I mentioned before, IMO it's promising to distinguish invokevirtual and invokeinterface cases. 
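The shapes discussed above (monomorphic, bimorphic, N-poly, megamorphic) follow from how receiver rows fill up at a profiled call site. A toy model of that bookkeeping, assuming a simplified row scheme rather than the real MDO layout (all names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of per-call-site receiver profiling: record up to `width`
// (models -XX:TypeProfileWidth) distinct receiver classes; any further
// receiver finds no empty row and only bumps a total counter, which is
// how the megamorphic case is detected.
class ReceiverProfileSketch {
    final int width;
    final Map<Class<?>, Integer> rows = new LinkedHashMap<>();
    int overflowCount; // receivers that found no free row

    ReceiverProfileSketch(int width) { this.width = width; }

    void record(Object receiver) {
        Class<?> k = receiver.getClass();
        if (rows.containsKey(k)) {
            rows.merge(k, 1, Integer::sum);       // existing row: bump its count
        } else if (rows.size() < width) {
            rows.put(k, 1);                       // claim an empty row
        } else {
            overflowCount++;                      // no empty row: total counter only
        }
    }

    String shape() {
        if (overflowCount > 0) return "megamorphic";
        switch (rows.size()) {
            case 1:  return "monomorphic";
            case 2:  return "bimorphic";
            default: return rows.size() + "-polymorphic";
        }
    }
}
```

With width 2 this reproduces the pre-patch behavior; raising the width is what enables the N-morphic/N-poly shapes discussed above.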
So, an additional > flag to control that would be useful. yes > > Regarding the N-poly/N-morphic cases, they can be generalized from the 2-poly/bimorphic cases. > > I believe experiments on 2-poly will provide useful insights on N-poly/N-morphic, so it makes sense to start with 2-poly > first. Yes Thanks, Vladimir K > > Best regards, > Vladimir Ivanov > > On 01.04.2020 01:29, Vladimir Kozlov wrote: >> Looks like graphs were stripped from the email. I put them on GitHub: >> >> >> >> >> >> Also Vladimir Ivanov forwarded me data he collected. >> >> His next data shows that profiling is not "free". Vladimir I. limited runs to tier3 (-XX:TieredStopAtLevel=3, C1 >> compilation with profiling code) to show that profiling code with TPW=8 is slower. Note, with 4 tiers this may not be >> visible because execution will be switched to C2 compiled code (without profiling code). >> >> >> >> >> The next data was collected for the proposed patch. Vladimir I. collected data for several flag configurations. >> The next graphs are for one of the settings: '-XX:+UsePolymorphicInlining -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4' >> >> >> >> >> It has mixed data, but most benchmarks are not affected, which means we need to spend more time on the proposed changes. >> >> Vladimir K >> >> On 3/31/20 10:39 AM, Vladimir Kozlov wrote: >>> I started looking at it. >>> >>> I think ideally TypeProfileWidth should be per call site and not per method - and it will require a more complicated >>> implementation (another RFE). But for experiments I think setting it to 8 (or higher) for all methods is okay. >>> >>> Note, more profiling lines per call site cost a few MB in the CodeCache (overestimation: 20K nmethods * 10 call >>> sites * 6 * 8 bytes) vs. very complicated code to have a dynamic number of lines. >>> >>> I think we should first investigate the best heuristics for inlining vs direct call vs vcall vs uncommon traps for >>> polymorphic cases and worry about memory and time consumption during profiling later. 
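The inlining-vs-direct-call-vs-vcall trade-off discussed above comes down to the guard chain emitted at a predicted call site: exact-class checks over "inlined" bodies, with a fallback that is either an uncommon trap or a plain virtual call. A toy sketch of that shape (illustrative classes, not C2 code):

```java
// Source-level sketch of guarded dispatch for a profiled call site with
// two recorded receivers. C2 generates the machine-code equivalent of
// this if-chain; here the fallback is a virtual call, where C2 may
// instead emit an uncommon trap when profiling says the site is not
// megamorphic. Illustrative only.
class GuardChainSketch {
    interface Shape { int sides(); }
    static final class Tri   implements Shape { public int sides() { return 3; } }
    static final class Quad  implements Shape { public int sides() { return 4; } }
    static final class Penta implements Shape { public int sides() { return 5; } }

    static int dispatch(Shape s) {
        // guard 0: most frequent profiled receiver, body "inlined"
        if (s.getClass() == Tri.class)  return 3;
        // guard 1: second profiled receiver
        if (s.getClass() == Quad.class) return 4;
        // fallback path: virtual call (or, alternatively, an uncommon trap)
        return s.sides();
    }
}
```

Receivers not covered by a guard (here, Penta) pay for the failed checks plus the fallback, which is why the heuristics above weigh guard count against receiver frequencies.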
>>> I did some performance runs with the latest JDK 15 for TypeProfileWidth=8 vs =2 and didn't see much difference for SPEC >>> benchmarks (see attached graph - grey dots mean no significant difference). But there are regressions (red dots) for >>> Renaissance, which includes some modern benchmarks. >>> >>> I will work this week to get similar data with Ludovic's patch. >>> >>> I am for an incremental approach. I think we can start/push based on what Ludovic is currently suggesting (do more >>> processing for TPW > 2) while preserving the current default behaviour (for TPW <= 2). But only if it gives improvements >>> in these benchmarks. We use these benchmarks as criteria for JDK releases. >>> >>> Regards, >>> Vladimir >>> >>> On 3/20/20 4:52 PM, Ludovic Henry wrote: >>>> Hi Vladimir, >>>> >>>> As requested offline, please find below the latest version of the patch. Contrary to what was discussed >>>> initially, I haven't done the work to support per-method TypeProfileWidth, as that requires extending the >>>> existing CompilerDirectives to be available to the Interpreter. For me to achieve that work, I would need >>>> guidance on how to approach the problem, and what your expectations are. >>>> >>>> Thank you, >>>> >>>> -- >>>> Ludovic >>>> >>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> index 4ed93169c7..bad9cddf20 100644 >>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp >>>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg >>>> Label found_null; >>>> jccb(Assembler::zero, found_null); >>>> // Item did not match any saved item and there is no empty row for it. >>>> - // Increment total counter to indicate polymorphic case. >>>> + // Increment total counter to indicate megamorphic case. >>>> 
increment_mdp_data_at(mdp, non_profiled_offset); >>>> ??????????? jmp(done); >>>> ??????????? bind(found_null); >>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp >>>> index 73854806ed..c5030149bf 100644 >>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>> @@ -38,7 +38,8 @@ private: >>>> ??? friend class ciMethod; >>>> ??? friend class ciMethodHandle; >>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>>> +? bool _is_megamorphic;????????? // whether the call site is megamorphic >>>> ??? int? _limit;??????????????? // number of receivers have been determined >>>> ??? int? _morphism;???????????? // determined call site's morphism >>>> ??? int? _count;??????????????? // # times has this call been executed >>>> @@ -47,6 +48,8 @@ private: >>>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>> ??? ciCallProfile() { >>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth"); >>>> +??? _is_megamorphic = false; >>>> ????? _limit = 0; >>>> ????? _morphism??? = 0; >>>> ????? _count = -1; >>>> @@ -58,6 +61,8 @@ private: >>>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>>> ? public: >>>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>>> + >>>> ??? // Note:? The following predicates return false for invalid profiles: >>>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>>> ??? int?????? morphism() const????????? { return _morphism; } >>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp >>>> index d771be8dac..c190919708 100644 >>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>> ??????????? 
// If we extend profiling to record methods, >>>> ??????????? // we will set result._method also. >>>> ????????? } >>>> -??????? // Determine call site's morphism. >>>> +??????? // Determine call site's megamorphism. >>>> ????????? // The call site count is 0 with known morphism (only 1 or 2 receivers) >>>> ????????? // or < 0 in the case of a type check failure for checkcast, aastore, instanceof. >>>> -??????? // The call site count is > 0 in the case of a polymorphic virtual call. >>>> +??????? // The call site count is > 0 in the case of a megamorphic virtual call. >>>> ????????? if (morphism > 0 && morphism == result._limit) { >>>> ???????????? // The morphism <= MorphismLimit. >>>> -?????????? if ((morphism >>> -?????????????? (morphism == ciCallProfile::MorphismLimit && count == 0)) { >>>> +?????????? if ((morphism >>> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>>> ? #ifdef ASSERT >>>> ?????????????? if (count > 0) { >>>> ???????????????? this->print_short_name(tty); >>>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>>> ???????????????? this->print_codes(); >>>> -?????????????? assert(false, "this call site should not be polymorphic"); >>>> +?????????????? assert(false, "this call site should not be megamorphic"); >>>> ?????????????? } >>>> ? #endif >>>> -???????????? result._morphism = morphism; >>>> +?????????? } else { >>>> +????????????? result._is_megamorphic = true; >>>> ???????????? } >>>> ????????? } >>>> +??????? result._morphism = morphism; >>>> ????????? // Make the count consistent if this is a call profile. If count is >>>> ????????? // zero or less, presume that this is a typecheck profile and >>>> ????????? // do nothing.? Otherwise, increase count to be the sum of all >>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) { >>>> ??? } >>>> ??? _receiver[i] = receiver; >>>> ??? _receiver_count[i] = receiver_count; >>>> -? if (_limit < MorphismLimit) _limit++; >>>> +? 
if (_limit < TypeProfileWidth) _limit++; >>>> ? } >>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp >>>> index d605bdb7bd..e4a5e7ea8b 100644 >>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>> @@ -389,9 +389,16 @@ >>>> ??? product(bool, UseBimorphicInlining, true,???????????????????????????????? \ >>>> ??????????? "Profiling based inlining for two receivers")???????????????????? \ >>>> \ >>>> +? product(bool, UsePolymorphicInlining, true,?????????????????????????????? \ >>>> +????????? "Profiling based inlining for two or more receivers")???????????? \ >>>> + \ >>>> ??? product(bool, UseOnlyInlinedBimorphic, true,????????????????????????????? \ >>>> ??????????? "Don't use BimorphicInlining if can't inline a second method")??? \ >>>> \ >>>> +? product(bool, UseOnlyInlinedPolymorphic, true,??????????????????????????? \ >>>> +????????? "Don't use PolymorphicInlining if can't inline a secondary "????? \ >>>> + "method")???????????????????????????????????????????????????????? \ >>>> + \ >>>> ??? product(bool, InsertMemBarAfterArraycopy, true,?????????????????????????? \ >>>> ??????????? "Insert memory barrier after arraycopy call")???????????????????? \ >>>> \ >>>> @@ -645,6 +652,10 @@ >>>> ??????????? "% of major receiver type to all profiled receivers")???????????? \ >>>> ??????????? range(0, 100)???????????????????????????????????????????????????? \ >>>> \ >>>> +? product(intx, TypeProfileMinimumReceiverPercent, 20,????????????????????? \ >>>> +????????? "minimum % of receiver type to all profiled receivers")?????????? \ >>>> +????????? range(0, 100)???????????????????????????????????????????????????? \ >>>> + \ >>>> ??? diagnostic(bool, PrintIntrinsics, false,????????????????????????????????? \ >>>> ??????????? "prints attempted and successful inlining of intrinsics")???????? 
\ >>>> \ >>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp >>>> index 44ab387ac8..dba2b114c6 100644 >>>> --- a/src/hotspot/share/opto/doCall.cpp >>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>> ??? // See how many times this site has been invoked. >>>> ??? int site_count = profile.count(); >>>> -? int receiver_count = -1; >>>> -? if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) { >>>> -??? // Receivers in the profile structure are ordered by call counts >>>> -??? // so that the most called (major) receiver is profile.receiver(0). >>>> -??? receiver_count = profile.receiver_count(0); >>>> -? } >>>> ??? CompileLog* log = this->log(); >>>> ??? if (log != NULL) { >>>> -??? int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1; >>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1; >>>> +??? int* rids; >>>> +??? if (call_does_dispatch) { >>>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>> +????? for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>> +??????? rids[i] = log->identify(profile.receiver(i)); >>>> +????? } >>>> +??? } >>>> ????? log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>> ????????????????????? log->identify(callee), site_count, prof_factor); >>>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>>> ????? if (allow_inline)???? log->print(" inline='1'"); >>>> -??? if (receiver_count >= 0) { >>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count); >>>> -????? if (profile.has_receiver(1)) { >>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1)); >>>> +??? if (call_does_dispatch) { >>>> +????? log->print(" virtual='1'"); >>>> +????? 
for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>> +??????? if (i == 0) { >>>> +????????? log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), >>>> profile.receiver_prob(i)); >>>> +??????? } else { >>>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>>> +??????? } >>>> ??????? } >>>> ????? } >>>> ????? if (callee->is_method_handle_intrinsic()) { >>>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>> ??????? // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count. >>>> ??????? bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= >>>> (float)TypeProfileMajorReceiverPercent); >>>> -????? ciMethod* receiver_method = NULL; >>>> ??????? int morphism = profile.morphism(); >>>> + >>>> +????? int width = morphism > 0 ? morphism : 1; >>>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width); >>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>>> +????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width); >>>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>>> + >>>> ??????? if (speculative_receiver_type != NULL) { >>>> ????????? if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) { >>>> ??????????? // We have a speculative type, we should be able to resolve >>>> ??????????? // the call. We do that before looking at the profiling at >>>> -????????? // this invoke because it may lead to bimorphic inlining which >>>> +????????? // this invoke because it may lead to polymorphic inlining which >>>> ??????????? // a speculative type should help us avoid. >>>> -????????? 
receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - speculative_receiver_type); >>>> -????????? if (receiver_method == NULL) { >>>> +????????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>> + speculative_receiver_type); >>>> +????????? if (receiver_methods[0] == NULL) { >>>> ????????????? speculative_receiver_type = NULL; >>>> ??????????? } else { >>>> ????????????? morphism = 1; >>>> ??????????? } >>>> ????????? } else { >>>> ??????????? // speculation failed before. Use profiling at the call >>>> -????????? // (could allow bimorphic inlining for instance). >>>> +????????? // (could allow polymorphic inlining for instance). >>>> ??????????? speculative_receiver_type = NULL; >>>> ????????? } >>>> ??????? } >>>> -????? if (receiver_method == NULL && >>>> -????????? (have_major_receiver || morphism == 1 || >>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>> -??????? // receiver_method = profile.method(); >>>> -??????? // Profiles do not suggest methods now.? Look it up in the major receiver. >>>> -??????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - profile.receiver(0)); >>>> -????? } >>>> -????? if (receiver_method != NULL) { >>>> -??????? // The single majority receiver sufficiently outweighs the minority. >>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor); >>>> -??????? if (hit_cg != NULL) { >>>> -????????? // Look up second receiver. >>>> -????????? CallGenerator* next_hit_cg = NULL; >>>> -????????? ciMethod* next_receiver_method = NULL; >>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>> -??????????? next_receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>> - profile.receiver(1)); >>>> -??????????? if (next_receiver_method != NULL) { >>>> -????????????? 
next_hit_cg = this->call_generator(next_receiver_method, >>>> -????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>> -????????????????????????????????? allow_inline, prof_factor); >>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>> -????????????????? // Skip if we can't inline second receiver's method >>>> -????????????????? next_hit_cg = NULL; >>>> -????????????? } >>>> -??????????? } >>>> -????????? } >>>> -????????? CallGenerator* miss_cg; >>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>> -?????????????????????????????????????????????? ? Deoptimization::Reason_bimorphic >>>> -?????????????????????????????????????????????? : Deoptimization::reason_class_check(speculative_receiver_type != >>>> NULL)); >>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) && >>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>> -???????????? ) { >>>> -??????????? // Generate uncommon trap for class check failure path >>>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>>> +????? bool removed_cgs = false; >>>> +????? // Look up receivers. >>>> +????? for (int i = 0; i < morphism; i++) { >>>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) { >>>> +????????? break; >>>> +??????? } >>>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>>> +????????? receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(), >>>> + profile.receiver(i)); >>>> +??????? } >>>> +??????? if (receiver_methods[i] != NULL) { >>>> +????????? bool allow_inline; >>>> +????????? if (speculative_receiver_type != NULL) { >>>> +??????????? allow_inline = true; >>>> ??????????? 
} else { >>>> -??????????? // Generate virtual call for class check failure path >>>> -??????????? // in case of polymorphic virtual call site. >>>> -??????????? miss_cg = CallGenerator::for_virtual_call(callee, vtable_index); >>>> +??????????? allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent; >>>> ??????????? } >>>> -????????? if (miss_cg != NULL) { >>>> -??????????? if (next_hit_cg != NULL) { >>>> -????????????? assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation"); >>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>> -????????????? // We don't need to record dependency on a receiver here and below. >>>> -????????????? // Whenever we inline, the dependency is added by Parse::Parse(). >>>> -????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX); >>>> -??????????? } >>>> -??????????? if (miss_cg != NULL) { >>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, >>>> receiver_count); >>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>> -????????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>> -????????????? if (cg != NULL)? return cg; >>>> +????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>> +??????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>> +??????????????????????????????? allow_inline, prof_factor); >>>> +????????? if (hit_cgs[i] != NULL) { >>>> +??????????? if (speculative_receiver_type != NULL) { >>>> +????????????? // Do nothing if it's a speculative type >>>> +??????????? 
} else if (bytecode == Bytecodes::_invokeinterface) {
>>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>>> +            } else if (!have_major_receiver) {
>>>> +              // Do nothing if there is no major receiver
>>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>>> +              // Skip if we can't inline receiver's method
>>>> +              hit_cgs[i] = NULL;
>>>> +              removed_cgs = true;
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>> +
>>>> +      // Generate the fallback path
>>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>>> +                                            ? Deoptimization::Reason_polymorphic
>>>> +                                            : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>>> +      if (log != NULL) {
>>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>>> +                  log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>>> +      }
>>>> +      CallGenerator* miss_cg;
>>>> +      if (!disable_trap) {
>>>> +        // Generate uncommon trap for class check failure path
>>>> +        // in case of polymorphic virtual call site.
>>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>> +                  Deoptimization::Action_maybe_recompile);
>>>> +      } else {
>>>> +        // Generate virtual call for class check failure path
>>>> +        // in case of megamorphic virtual call site.
>>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>> +      }
>>>> +
>>>> +      // Generate the guards
>>>> +      CallGenerator* cg = NULL;
>>>> +      if (speculative_receiver_type != NULL) {
>>>> +        if (hit_cgs[0] != NULL) {
>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>>> +          // We don't need to record dependency on a receiver here and below.
>>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>>> +        }
>>>> +      } else {
>>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>>> +          if (hit_cgs[i] != NULL) {
>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>>> +          }
>>>> +        }
>>>> +        cg = miss_cg;
>>>> +      }
>>>> +      if (cg != NULL)  return cg;
>>>>     }
>>>>     // If there is only one implementor of this interface then we
>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>> index 11df15e004..2d14b52854 100644
>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>    "class_check",
>>>>    "array_check",
>>>>    "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>    "profile_predicate",
>>>>    "unloaded",
>>>>    "uninitialized",
>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>> index 1cfff5394e..c1eb998aba 100644
>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>      Reason_class_check,           // saw unexpected object class (@bci)
>>>>      Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>>      Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>>  #if INCLUDE_JVMCI
>>>>      Reason_unreached0             = Reason_null_assert,
>>>>      Reason_type_checked_inlining  = Reason_intrinsic,
>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>  #endif
>>>>      Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>> index 94b544824e..ee761626c4 100644
>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry  KlassHashtableEntry;
>>>>   declare_constant(Deoptimization::Reason_class_check) \
>>>>   declare_constant(Deoptimization::Reason_array_check) \
>>>>   declare_constant(Deoptimization::Reason_intrinsic) \
>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>   declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>   declare_constant(Deoptimization::Reason_unloaded) \
>>>>   declare_constant(Deoptimization::Reason_uninitialized) \
>>>>
>>>> -----Original Message-----
>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>>>> various TypeProfileWidth values. The results are:
>>>>
>>>> Benchmark                            Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.835 ±
0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run  thrpt    5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> The main thing I observe is that there isn't a linear (or even any apparent)
>>>> correlation between the number of guards generated (guided by
>>>> TypeProfileWidth) and the time taken.
>>>>
>>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>>> to 1 and 8.
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Ludovic Henry
>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Vladimir,
>>>>
>>>> I did a rerun of the following benchmark with various configurations:
>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>
>>>> The results are as follows:
>>>>
>>>> Benchmark                              Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicVirtualCallBenchmark.run    thrpt    5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> Benchmark                              Mode   Cnt  Score   Error  Units  Configuration
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>> PolymorphicInterfaceCallBenchmark.run  thrpt    5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>
>>>> The Hotspot logs (with generated assembly) are available at:
>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>
>>>> The main takeaway from that experiment is that direct calls w/o inlining are faster
>>>> than indirect calls for icalls but slower for vcalls, and that inlining is always faster
>>>> than direct calls.
>>>>
>>>> (I fully understand this applies mainly to this microbenchmark, and we need to
>>>> validate on larger benchmarks. I'm working on that next. However, it clearly shows
>>>> gains on a pathological case.)
>>>>
>>>> Next, I want to figure out at how many guards the direct call regresses compared
>>>> to the indirect call in the vcall case, and I want to run larger benchmarks. Any
>>>> particular ones you would like to see running? I am planning on doing SPECjbb2015 first.
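
[Editorial note: to make the vcall/icall distinction in the takeaway above concrete, here is a minimal, self-contained Java sketch. It is not taken from the benchmark sources; all class and method names are illustrative. The same receiver dispatched through a class type compiles to invokevirtual (a vtable call), while dispatch through an interface type compiles to invokeinterface (the more expensive itable lookup being devirtualized by the guards).]

```java
// Illustrative only: same receivers, two dispatch kinds.
interface IShape { int area(); }
abstract class Shape implements IShape { }
class Square extends Shape { public int area() { return 4; } }
class Circle extends Shape { public int area() { return 3; } }

public class CallKinds {
    // Dispatch through the class type: compiles to invokevirtual (vcall).
    static int viaClass(Shape s)  { return s.area(); }
    // Dispatch through the interface type: compiles to invokeinterface (icall).
    static int viaIface(IShape s) { return s.area(); }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(), new Circle() };
        int v = 0, i = 0;
        for (Shape s : shapes) {
            v += viaClass(s); // vtable dispatch
            i += viaIface(s); // itable dispatch, same target methods
        }
        System.out.println(v + " " + i); // both sums are equal: 7 7
    }
}
```

Both paths reach the same method bodies; only the dispatch mechanism (and hence the cost of the non-devirtualized fallback) differs, which is what the indirect-call rows of the two tables compare.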
>>>>
>>>> Thank you,
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Vladimir,
>>>>
>>>> Sorry for the long delay in response, I was at multiple conferences over the past few
>>>> weeks. I'm back to the office now and fully focused on getting progress on that.
>>>>
>>>>>> Possible avenues of improvements I can see are:
>>>>>>    - Gather all the types in an unbounded list so we can know which ones
>>>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>>>> case, there are only a few types present at call-sites. It could, however,
>>>>>> be particularly helpful for languages that tend to have many types at
>>>>>> call-sites, like functional languages, for example.
>>>>>
>>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>>>
>>>> I agree that it isn't very practical. It can be useful in the case where there are
>>>> many types at a call-site, and the first ones end up not being frequent enough to
>>>> mandate a guard. This is clearly an edge case, and I don't think we should optimize
>>>> for it.
>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the following:
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>>>> path (i.e.
many times), the latter are never considered for inlining - because
>>>>>> it was never recorded during profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>>> periodically free some space by removing elements with lower frequencies
>>>>> and give new types a chance to be profiled)?
>>>>
>>>> Doing that reliably relies on the assumption that we know what the shape of
>>>> the workload is going to be in future iterations. Otherwise, how could you
>>>> guarantee that a type that's not currently frequent will not be in the future,
>>>> and that the information that you remove now will not be important later? This
>>>> is an assumption that, IMO, is worse than missing types which are hot later in
>>>> the execution, for two reasons: 1. it's no better, and 2. it's a lot less intuitive and
>>>> harder to debug/understand than a straightforward "overflow".
>>>>
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>>>> paths, it does not generate guards because none are a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>
>>>> It will not necessarily cause twice as much inlining because of late-inlining. Like
>>>> you point out later, it will generate a direct call in case there isn't room for more
>>>> inlinable code.
>>>>
>>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>>   if (recv.klass == C1) {  // >>0%
>>>>>     ... inlined ...
>>>>>   } else if (recv.klass == C2) { // >>0%
>>>>>     m2(); // direct call
>>>>>   } else { // >0%
>>>>>     m(); // virtual call
>>>>>   }
>>>>>
>>>>> vs
>>>>>
>>>>>   if (recv.klass == C1) {  // >>0%
>>>>>     ... inlined ...
>>>>>   } else { // >>0%
>>>>>     m(); // virtual call
>>>>>   }
>>>>
>>>> There is the advantage that modern CPUs are better at predicting instruction branches
>>>> than data branches. These guards will then allow the CPU to make better decisions, allowing
>>>> for better superscalar execution, memory prefetching, etc.
>>>>
>>>> This, IMO, makes sense for warm calls, especially since the cost is a guard + a call, which is
>>>> much lower than an inlined method, but brings benefits over an indirect call.
>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
>>>>
>>>> TypeProfileMajorReceiverPercent is only used today when you have a megamorphic
>>>> call-site (aka more types than TypeProfileWidth) but still one type receiving more than
>>>> N% of the calls. By reducing the value, you would not increase the number of guards,
>>>> only lower the threshold at which you generate the 1st guard in a megamorphic case.
>>>>
>>>>>>>       - for the N-morphic case, what's the negative effect (quantitative) of
>>>>>>> the deopt?
>>>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>
>>>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>>>> that the higher the morphism factor, the fewer the benefits of deopt (compared
>>>>> to a call) are. It would be very good to validate it with some
>>>>> benchmarks (both micro- and larger ones).
>>>>
>>>> I agree that what you are describing makes sense as well. To reduce the cost of deopt
>>>> here, having a TypeProfileMinimumReceiverPercent helps. That is because if any type is
>>>> seen less than this specific frequency, then it won't generate a guard, leading to an indirect
>>>> call in the fallback case.
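
[Editorial note: the 49%/49%/2% scenario quoted above can be reproduced outside JMH with a small self-contained sketch. Class names A/A1/A2/A3, the array size, and the exact distribution are illustrative, not taken from the original benchmark. Under the major-receiver heuristic described above, neither A1 nor A2 qualifies as a major receiver, so the call site in the loop stays an indirect call.]

```java
// Hypothetical sketch of a call site with two ~49% receivers and a rare third.
interface A { int foo(int i); }
class A1 implements A { public int foo(int i) { return i + 1; } }
class A2 implements A { public int foo(int i) { return i + 2; } }
class A3 implements A { public int foo(int i) { return i + 3; } }

public class TwoMajorReceivers {
    public static void main(String[] args) {
        A[] objs = new A[100];
        for (int i = 0; i < objs.length; ++i) {
            if (i % 100 < 49)      objs[i] = new A1(); // ~49% of receivers
            else if (i % 100 < 98) objs[i] = new A2(); // ~49% of receivers
            else                   objs[i] = new A3(); // ~2% of receivers
        }
        long sum = 0;
        for (A a : objs) {
            sum += a.foo(1); // polymorphic call site: profile records A1/A2/A3
        }
        System.out.println(sum); // 49*2 + 49*3 + 2*4 = 253
    }
}
```

Scaled up to JMH-sized iteration counts, this is the shape where generating guards for both A1 and A2 (with a virtual-call fallback for the 2% tail) would be expected to help, even though no single type dominates.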
>>>>
>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>>>> as I have something reliably reproducing.
>>>>>
>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>
>>>> It turns out the guard is only generated once, meaning that if we ever hit it then we
>>>> generate an indirect call.
>>>>
>>>> We also only generate the trap iff all the guards are hot (inlined) or warm (direct call),
>>>> so any of the following cases triggers the creation of an indirect call over a trap:
>>>>   - we hit the trap once before
>>>>   - one or more guards are cold (aka not inlinable even with late-inlining)
>>>>
>>>>> It was more about opportunities for future explorations. I don't think
>>>>> we have to act on it right away.
>>>>>
>>>>> As with "deopt vs call", my guess is the callee should benefit much more
>>>>> from inlining than the caller it is inlined into (the caller sees multiple
>>>>> callee candidates and has to merge the results while each callee
>>>>> observes the full context and can benefit from it).
>>>>>
>>>>> If we can run some sort of static analysis on callee bytecode, what kind
>>>>> of code patterns should we look for to guide inlining decisions?
>>>>
>>>> Any pattern that would benefit from other optimizations (escape analysis,
>>>> dead code elimination, constant propagation, etc.) is good, but short of
>>>> shadowing statically what all these optimizations do, I can't see an easy way
>>>> to do it.
>>>>
>>>> That is where late-inlining, or more advanced dynamic heuristics like the ones you
>>>> can find in Graal EE, is worthwhile.
>>>>
>>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>>
>>>>>     * measure the cost of additional profiling
>>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>
>>>> I am running the following JMH microbenchmark:
>>>>
>>>>     public final static int N = 100_000_000;
>>>>
>>>>     @State(Scope.Benchmark)
>>>>     public static class TypeProfileWidthOverheadBenchmarkState {
>>>>         public A[] objs = new A[N];
>>>>
>>>>         @Setup
>>>>         public void setup() throws Exception {
>>>>             for (int i = 0; i < objs.length; ++i) {
>>>>                 switch (i % 8) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 case 2: objs[i] = new A3(); break;
>>>>                 case 3: objs[i] = new A4(); break;
>>>>                 case 4: objs[i] = new A5(); break;
>>>>                 case 5: objs[i] = new A6(); break;
>>>>                 case 6: objs[i] = new A7(); break;
>>>>                 case 7: objs[i] = new A8(); break;
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     @Benchmark @OperationsPerInvocation(N)
>>>>     public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>>         A[] objs = state.objs;
>>>>         for (int i = 0; i < objs.length; ++i) {
>>>>             objs[i].foo(i, blackhole);
>>>>         }
>>>>     }
>>>>
>>>> And I am running with the following JVM parameters:
>>>>
>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>>>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>>>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>
>>>> I observe no statistically significant difference in ops/s
>>>> between TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe
>>>> no significant difference in the resulting analysis using Intel VTune.
>>>>
>>>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>>>
>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>
>>>> I have done the following microbenchmark, but I am not sure that it's
>>>> going to fully answer the question you are raising here.
>>>>
>>>>     public final static int N = 100_000_000;
>>>>
>>>>     @State(Scope.Benchmark)
>>>>     public static class PolymorphicDeoptBenchmarkState {
>>>>         public A[] objs = new A[N];
>>>>
>>>>         @Setup
>>>>         public void setup() throws Exception {
>>>>             int cutoff1 = (int)(objs.length * .90);
>>>>             int cutoff2 = (int)(objs.length * .95);
>>>>             for (int i = 0; i < cutoff1; ++i) {
>>>>                 switch (i % 2) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 }
>>>>             }
>>>>             for (int i = cutoff1; i < cutoff2; ++i) {
>>>>                 switch (i % 4) {
>>>>                 case 0: objs[i] = new A1(); break;
>>>>                 case 1: objs[i] = new A2(); break;
>>>>                 case 2:
>>>>                 case 3: objs[i] = new A3(); break;
>>>>                 }
>>>>             }
>>>>             for (int i = cutoff2; i < objs.length; ++i) {
>>>>                 switch (i % 4) {
>>>>                 case 0:
>>>>                 case 1: objs[i] = new A3(); break;
>>>>                 case 2:
>>>>                 case 3: objs[i] = new A4(); break;
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>>     @Benchmark @OperationsPerInvocation(N)
>>>>     public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>>         A[] objs = state.objs;
>>>>         for (int i = 0; i < objs.length; ++i) {
>>>>             objs[i].foo(i, blackhole);
>>>>         }
>>>>     }
>>>>
>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>>> -XX:-PolyGuardDisableTrap, which force-enables/disables the trap in the
>>>> fallback.
>>>>
>>>> For that kind of case, a visitor pattern is what I expect to most largely
>>>> profit/suffer from a deopt or virtual call in the fallback path. Would you
>>>> know of such a benchmark that heavily relies on this pattern, and that I
>>>> could readily reuse?
>>>>
>>>>>     * inlining vs devirtualization
>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>       - measure separately the effects of devirtualization and inlining
>>>>
>>>> For that one, I reused the first microbenchmark I mentioned above, and
>>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>>> direct call or inline.
>>>>
>>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined)
>>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>>>
>>>> This benchmark hasn't been run in the best possible conditions (on my dev
>>>> machine, in WSL), but it gives a strong indication that even a direct call has a
>>>> non-negligible impact, and that inlining leads to better results (again, in this
>>>> microbenchmark).
>>>>
>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything
>>>> that would be readily available from the Interpreter. Would you have any pointer
>>>> to a pre-existing feature that required this specific kind of plumbing? I would otherwise
>>>> find myself in need of making CompilerDirectives available from the Interpreter, and
>>>> that is something outside of my current expertise (always happy to learn, but I
>>>> will need some pointers!).
>>>>
>>>> Thank you,
>>>>
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov
>>>> Sent: Thursday, February 20, 2020 9:00 AM
>>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> [...]
>>>>
>>>>> Thanks for this explanation, it makes it a lot clearer what the cases and
>>>>> your concerns are. To rephrase in my own words, what you are interested in
>>>>> is not this change in particular, but more the possibility that this change
>>>>> provides and how to take it the next step, correct?
>>>>
>>>> Yes, it's a good summary.
>>>>
>>>> [...]
>>>>
>>>>>>       - affects profiling strategy: majority of receivers vs complete
>>>>>> list of receiver types observed;
>>>>> Today, we only use the N first receivers when the number of types does
>>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>>> Possible avenues of improvements I can see are:
>>>>>    - Gather all the types in an unbounded list so we can know which ones
>>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>>> case, there are only a few types present at call-sites. It could, however,
>>>>> be particularly helpful for languages that tend to have many types at
>>>>> call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>>>
>>>>>   - Use the existing types to generate guards for these types we know are
>>>>> common enough. Then use the types which are hot or warm, even in case of a
>>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>>> nowadays.
>>>>
>>>>> In what we have today, some of the worst-case scenarios are the following:
>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>>> path (i.e.
many times), the latter are never considered for inlining - because
>>>>> it was never recorded during profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically free some space by removing elements with lower frequencies
>>>> and give new types a chance to be profiled)?
>>>>
>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>>> paths, it does not generate guards because none are a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>>
>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>> doesn't happen?
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else if (recv.klass == C2) { // >>0%
>>>>       m2(); // direct call
>>>>    } else { // >0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> vs
>>>>
>>>>    if (recv.klass == C1) {  // >>0%
>>>>       ... inlined ...
>>>>    } else { // >>0%
>>>>       m(); // virtual call
>>>>    }
>>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
>>>>
>>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>>> code shapes comparison: how much does devirtualization help?
>>>>
>>>> Otherwise, enabling the 2-polymorphic shape becomes feasible only if both
>>>> cases are inlined.
>>>>
>>>>>>       - for the N-morphic case, what's the negative effect (quantitative) of
>>>>>> the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>
>>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>>> that the higher the morphism factor, the fewer the benefits of deopt (compared
>>>> to a call) are. It would be very good to validate it with some
>>>> benchmarks (both micro- and larger ones).
>>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>>> as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>
>>>>>>     * invokevirtual vs invokeinterface call sites
>>>>>>       - different cost models;
>>>>>>       - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>> From the profiling information and the inlining mechanism point of view,
>>>>> that it is an invokevirtual or an invokeinterface doesn't change anything
>>>>>
>>>>> Are you saying that we have more to gain from generating a guard for
>>>>> invokeinterface over invokevirtual because the fall-back of the
>>>>> invokeinterface is much more expensive?
>>>>
>>>> Yes, that's the question: if we see an improvement, how much does
>>>> devirtualization contribute to that?
>>>>
>>>> (If we add a type-guarded direct call, but there's no inlining
>>>> happening, an inline cache effectively strength-reduces a virtual call to a
>>>> direct call.)
>>>>
>>>> Considering the current implementation of virtual and interface calls
>>>> (vtables vs itables), the cost model is very different.
>>>>
>>>> For vtable calls, it doesn't look too appealing to introduce large
>>>> inline caches for individual receiver types since a call through a
>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>>> address).
>>>>
>>>> For itable calls it can be a big win in some situations: itable lookup
>>>> iterates over the Klass::_secondary_supers array and it can become quite
>>>> costly. For example, some Scala workloads experience significant
>>>> overheads from megamorphic calls.
>>>>
>>>> If we see an improvement on some benchmark, it would be very useful to
>>>> be able to determine (quantitatively) how much inlining and
>>>> devirtualization each contribute.
>>>>
>>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>>> implementation [4] which brings interface calls close to virtual calls.
>>>> So, if it turns out that devirtualization (and not inlining) of
>>>> interface calls is what contributes the most, then speeding up
>>>> megamorphic interface calls becomes a more attractive alternative.
>>>>
>>>>>>     * inlining heuristics
>>>>>>       - devirtualization vs inlining
>>>>>>         - how much benefit from expanding a call site (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>> I'm also writing a JMH benchmark for this case, and I'll share it as soon
>>>>> as I have it reliably reproducing the issue you describe.
>>>>
>>>> Also, I think it's important to have a knob to control it (inline vs
>>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>>
>>>>>>       - diminishing returns with increase in number of cases
>>>>>>       - expanding a single call site leads to more code, but frequencies
>>>>>> stay the same => colder code
>>>>>>       - based on profiling info (types + frequencies), dynamically
>>>>>> choose morphism factor on per-call site basis?
>>>>> That is where I propose to have a lower receiver probability at which we'll
>>>>> stop adding more guards. I am experimenting with a global flag with a default
>>>>> value of 10%.
>>>>>>       - what optimization opportunities to look for? it looks like in
>>>>>> general callees should benefit more than the caller (due to merges after
>>>>>> the call site)
>>>>> Could you please expand your concern or provide an example.
>>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is the callee should benefit much more
>>>> from inlining than the caller it is inlined into (the caller sees multiple
>>>> callee candidates and has to merge the results while each callee
>>>> observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what kind
>>>> of code patterns should we look for to guide inlining decisions?
>>>>
>>>>  >> What's your take on it? Any other ideas?
>>>>  >
>>>>  > We don't know what we don't know. We need first to improve the logging and
>>>>  > debugging output of uncommon traps for polymorphic call-sites. Then, we
>>>>  > need to gather data about the different cases you talked about.
>>>>  >
>>>>  > We also need to have some microbenchmarks to validate some of the questions
>>>>  > you are raising, and verify what level of gains we can expect from this
>>>>  > optimization. Further validation will be needed on larger benchmarks and
>>>>  > real-world applications as well, and that's where, I think, we need to develop
>>>>  > logging and debugging for this feature.
>>>>
>>>> Yes, sounds good.
>>>>
>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>
>>>>     * measure the cost of additional profiling
>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>
>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>
>>>>     * inlining vs devirtualization
>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>       - measure separately the effects of devirtualization and inlining
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>
>>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>
>>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>
>>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov
>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>> To: Ludovic Henry ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Ludovic,
>>>>>
>>>>> I fully agree that it's premature to discuss
how default behavior should >>>>> be changed since much more data is needed to be able to proceed with the >>>>> decision. But considering the ultimate goal is to actually improve >>>>> relevant heuristics (and effectively change the default behavior), it's >>>>> the right time to discuss what kind of experiments are needed to gather >>>>> enough data for further analysis. >>>>> >>>>> Though different shapes do look very similar at first, the shape of the >>>>> fallback makes a big difference. That's why monomorphic and polymorphic >>>>> cases are distinct: uncommon traps are effectively exits and can >>>>> significantly simplify the CFG while calls can return and have to be merged >>>>> back. >>>>> >>>>> The polymorphic shape is stable (no deopts/recompiles involved), but doesn't >>>>> simplify the CFG around the call site. >>>>> >>>>> The monomorphic shape gives more optimization opportunities, but deopts are >>>>> highly undesirable due to the associated costs. >>>>> >>>>> For example: >>>>> >>>>>     if (recv.klass != C) { deopt(); } >>>>>     C.m(recv); >>>>> >>>>>     // recv.klass == C - exact type >>>>>     // return value == C.m(recv) >>>>> >>>>> vs >>>>> >>>>>     if (recv.klass == C) { >>>>>       C.m(recv); >>>>>     } else { >>>>>       I.m(recv); >>>>>     } >>>>> >>>>>     // recv.klass <: I - subtype >>>>>     // return value is a phi merging C.m() & I.m() where I.m() is >>>>> completely opaque. >>>>> >>>>> The monomorphic shape can degenerate into the polymorphic one (too many recompiles), >>>>> but that's a forced move to stabilize the behavior and avoid a vicious >>>>> recompilation cycle (which is *very* expensive). (Another alternative is >>>>> to leave the deopt as is - set the deopt action to "none" - but that's usually >>>>> a much worse decision.) >>>>> >>>>> And that's the reason why the monomorphic shape requires a unique receiver >>>>> type in the profile while the polymorphic shape works with the major receiver type >>>>> and probabilities. 
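The two shapes contrasted in the example above can be sketched in plain Java. This is a hypothetical illustration of what the compiled code is equivalent to, not actual C2 output; the interface, class names, and the exception standing in for deoptimization are invented for the example.

```java
// Monomorphic vs polymorphic guard shapes, modeled in plain Java.
interface I { int m(); }
final class C implements I { public int m() { return 1; } }
final class D implements I { public int m() { return 2; } }

public class GuardShapes {
    // Monomorphic shape: one type guard, and the fallback "deopts" (the
    // exception stands in for invalidate + reinterpret). Past the guard the
    // receiver type is exact, so the callee can be optimized aggressively.
    static int monomorphic(I recv) {
        if (recv.getClass() != C.class) throw new IllegalStateException("deopt");
        return ((C) recv).m(); // exact type: result shape fully known
    }

    // Polymorphic shape: same guard, but the fallback is a virtual call.
    // No deopt risk, but the result merges an inlined path with an opaque
    // call, so the CFG around the call site stays complex.
    static int polymorphic(I recv) {
        if (recv.getClass() == C.class) return ((C) recv).m(); // inlined fast path
        return recv.m(); // virtual call, opaque to the optimizer
    }

    public static void main(String[] args) {
        System.out.println(monomorphic(new C())); // 1
        System.out.println(polymorphic(new D())); // 2
    }
}
```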
>>>>> >>>>> >>>>> Considering further steps, IMO for experimental purposes a single knob >>>>> won't cut it: there are multiple degrees of freedom which may play an >>>>> important role in building an accurate performance model. I'm not yet >>>>> convinced it's all about inlining, and narrowing the scope of the discussion >>>>> specifically to type profile width doesn't help. >>>>> >>>>> I'd like to see more knobs introduced before we start conducting >>>>> extensive experiments. So, let's discuss what other information we can >>>>> benefit from. >>>>> >>>>> I mentioned some possible options in the previous email. I find the >>>>> following aspects important for future discussion: >>>>> >>>>>   * shape of the fallback path >>>>>      - what to generalize: 2- to N-morphic vs 1- to N-polymorphic; >>>>>      - affects profiling strategy: majority of receivers vs complete >>>>> list of receiver types observed; >>>>>      - for the N-morphic case, what's the negative effect (quantitative) of >>>>> the deopt? >>>>> >>>>>   * invokevirtual vs invokeinterface call sites >>>>>      - different cost models; >>>>>      - interfaces are harder to optimize, but opportunities for >>>>> strength-reduction from interface to virtual calls exist; >>>>> >>>>>   * inlining heuristics >>>>>      - devirtualization vs inlining >>>>>        - how much benefit from expanding a call site (devirtualize more >>>>> cases) without inlining? should differ for virtual & interface cases >>>>>      - diminishing returns with increase in number of cases >>>>>      - expanding a single call site leads to more code, but frequencies >>>>> stay the same => colder code >>>>>      - based on profiling info (types + frequencies), dynamically >>>>> choose the morphism factor on a per-call site basis? >>>>>      - what optimization opportunities to look for? 
it looks like in >>>>> general callees should benefit more than the caller (due to merges after >>>>> the call site) >>>>> >>>>> What's your take on it? Any other ideas? >>>>> >>>>> Best regards, >>>>> Vladimir Ivanov >>>>> >>>>> On 11.02.2020 02:42, Ludovic Henry wrote: >>>>>> Hello, >>>>>> Thank you very much, John and Vladimir, for your feedback. >>>>>> First, I want to stress that this patch does not change the default. It is still bimorphic guarded inlining >>>>>> by default. This patch, however, provides you the ability to configure the JVM to go for N-morphic guarded >>>>>> inlining, with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are >>>>>> shortcomings with the specifics of this approach so I'll work on fixing those. However, I would want this >>>>>> discussion to focus on this *configurable* feature and not on changing the default. The latter, I think, should be >>>>>> discussed as part of another, more extended running discussion, since, as you pointed out, it has far more >>>>>> reaching consequences than merely improving a micro-benchmark. >>>>>> >>>>>> Now to answer some of your specific questions. >>>>>> >>>>>>> >>>>>>> I haven't looked through the patch in details, but here are some thoughts. >>>>>>> >>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes: >>>>>>> >>>>>>>      if (recv.klass == K1) { >>>>>> m1(...); // either inline or a direct call >>>>>>>      } else if (recv.klass == K2) { >>>>>> m2(...); // either inline or a direct call >>>>>>>      ... >>>>>>>      } else if (recv.klass == Kn) { >>>>>> mn(...); // either inline or a direct call >>>>>>>      } else { >>>>>> deopt(); // invalidate + reinterpret >>>>>>>      } >>>>>> >>>>>> The general shape that exists currently in tip is: >>>>>> >>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>> if (recv.klass == K1) { >>>>>>     m1(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold >>>>>> else if (recv.klass == K2) { >>>>>>     m2(.); // either inline or a direct call >>>>>> } >>>>>> else { >>>>>>     // if (!too_many_traps_or_deopt()) >>>>>>     deopt(); // invalidate + reinterpret >>>>>>     // else >>>>>>     invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>> } >>>>>> There is no particular distinction between Bimorphic, Polymorphic, and Megamorphic. The latter relates more to the >>>>>> fallback rather than the guards. What this change brings is more guards for N-morphic call-sites with N > 2. But >>>>>> it doesn't change why and how these guards are generated (or at least, that is not the intention). >>>>>> The general shape that this change proposes is: >>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>> if (recv.klass == K1) { >>>>>>     m1(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining) >>>>>> && !is_cold >>>>>> else if (recv.klass == K2) { >>>>>>     m2(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold >>>>>> else if (recv.klass == K3) { >>>>>>     m3(.); // either inline or a direct call >>>>>> } >>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold >>>>>> else if (recv.klass == K4) { >>>>>>     m4(.); // either inline or a direct call >>>>>> } >>>>>> else { >>>>>>     // if (!too_many_traps_or_deopt()) >>>>>>     deopt(); // invalidate + reinterpret >>>>>>     // else >>>>>>     invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>> } >>>>>> You can observe that the condition to create the guards is no different; only the total number increases based on >>>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>>> Question #1: what if you generalize polymorphic shape instead and allow multiple major receivers? Deoptimizing >>>>>>> (and then recompiling) looks less beneficial the higher morphism is (especially considering the inlining on all >>>>>>> paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower frequency) >>>>>>> on the fallback path may be a better option. >>>>>> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each >>>>>> application. That is why the degree of polymorphism needs to rely on a configuration knob, and not be pre-determined >>>>>> on a set of benchmarks. I agree with the proposal to have this knob as a per-method knob, instead of a global knob. >>>>>> As for the impact of a higher morphism, I expect deoptimizations to happen less often as more guards are >>>>>> generated, leading to a lower probability of reaching the fallback path, leading to fewer uncommon >>>>>> traps/deoptimizations. Moreover, the fallback is already going to be a virtual call in case we hit the uncommon >>>>>> trap too often (using too_many_traps_or_recompiles). >>>>>>> Question #2: it would be very interesting to understand what exactly contributes the most to performance >>>>>>> improvements? Is it inlining? Or maybe devirtualization (avoid the cost of virtual call)? How much come from >>>>>>> optimizing interface calls (itable vs vtable stubs)? >>>>>> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from >>>>>> the additional optimizations that are applied by C2 when increasing the scope/size of the code compiled via inlining. 
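The N-morphic guard chain described in the thread behaves like the following plain-Java model for a hypothetical TypeProfileWidth of 3. The interface, the K1..K4 classes, and the dispatch method are invented for illustration; this is what the generated shape is equivalent to, not generated code itself.

```java
// Model of an N-morphic guard chain (width 3) with a virtual-call fallback.
interface A { int foo(); }
final class K1 implements A { public int foo() { return 1; } }
final class K2 implements A { public int foo() { return 2; } }
final class K3 implements A { public int foo() { return 3; } }
final class K4 implements A { public int foo() { return 4; } }

public class NMorphic {
    static int call(A recv) {
        if (recv.getClass() == K1.class) return ((K1) recv).foo();      // guard 1: inlineable
        else if (recv.getClass() == K2.class) return ((K2) recv).foo(); // guard 2
        else if (recv.getClass() == K3.class) return ((K3) recv).foo(); // guard 3
        else return recv.foo(); // fallback: virtual call (or deopt while traps are rare)
    }

    public static void main(String[] args) {
        System.out.println(call(new K2())); // 2: hits guard 2
        System.out.println(call(new K4())); // 4: takes the fallback path
    }
}
```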
>>>>>> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than >>>>>> an indirect call. I haven't measured it, but I expect performance to be positively impacted because of the better >>>>>> ability of modern CPUs to correctly predict instruction branches (a direct call) rather than data branches (an >>>>>> indirect call). >>>>>>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense >>>>>>> to consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining). >>>>>> Incremental inlining is already integrated with the existing solution. In the case of a hot or warm call, in case >>>>>> of failure to inline, it generates a direct call. You still have the guards, reducing the cost of an indirect >>>>>> call, but without the cost of the inlined code. >>>>>>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic >>>>>>> footprint? >>>>>> I'll come back to you with some results. >>>>>>> Getting answers to those (and similar) questions should give us much more insights what is actually happening in >>>>>>> practice. >>>>>>> >>>>>>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily >>>>>>> conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll >>>>>>> enable us to gather important data to guide our decisions about how to enhance the heuristics in the product. >>>>>> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the >>>>>> default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and >>>>>> that I should add a dedicated knob to control the maximum morphism of these guards, then I agree. 
I am using >>>>>> TypeProfileWidth because it's the available and more straightforward knob today. >>>>>> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This >>>>>> change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded >>>>>> inlining. I would want the discussion to change the default to be part of a separate RFR, to separate the feature >>>>>> change discussion from the default change discussion. >>>>>>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where >>>>>>> a local change in one direction might couple to a nearby change in a different direction, with a net change that's >>>>>>> "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph >>>>>>> shaping heuristics, not memory side effects.) >>>>>>> >>>>>>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has >>>>>>> the side effect of making a containing block of code larger than convenient. Three ways of being "larger than >>>>>>> convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the >>>>>>> total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph >>>>>>> to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop, >>>>>>> where an out-of-line call would not. >>>>>>> >>>>>>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") >>>>>>> with respect to the specific path under consideration hurts the optimization of nearby paths which are more >>>>>>> important. >>>>>> I generally agree with this statement and explanation. Again, it is not the intention of this patch to change the >>>>>> default number of guards for polymorphic call-sites, but it is to give users the ability to optimize the code >>>>>> generation of their JVM for their application. >>>>>> Since I am relying on the existing inlining infrastructure, late inlining and the hot/warm/cold call generators allow us >>>>>> to have a "best-of-both-worlds" approach: it inlines code in the hot guards, it direct-calls or inlines (if inlining >>>>>> thresholds permit) the method in the warm guards, and it doesn't even generate the guard in the cold guards. The >>>>>> question then is how do you define hot, warm, and cold. As discussed above, I want to explore using a >>>>>> low threshold even to try to generate a guard (at least 10% of calls are to this specific receiver). >>>>>> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on >>>>>> what code can be inlined. For example, if you have a call-site with 3 types, each with a hit probability of 30%, >>>>>> then with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and >>>>>> inlined. That is despite an apparent gain in guarding and inlining against all 3 types. >>>>>> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing >>>>>> inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this >>>>>> change not to abuse these. >>>>>>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test >>>>>>> this thing shows a 20% gain, and all the functional tests pass.) >>>>>>> >>>>>>> This leads me to the following suggestion: Your code is a very good POC, and deserves more work, and the next >>>>>>> step in that work is probably looking for and thinking about performance regressions, and figuring out how to >>>>>>> throttle this thing. >>>>>> Here again, I want that feature to be behind a configuration knob, and then discuss in a future RFR changing the >>>>>> default. >>>>>>> A specific next step would be to make the throttling of this feature controllable. MorphismLimit should be a >>>>>>> global on its own. And it should be configurable through the CompilerOracle per method. (See similar code for >>>>>>> similar throttles.) And it should be more sensitive to the hotness of the overall call and of the various slices >>>>>>> of the call's profile. (I notice with suspicion that the comment "The single majority receiver sufficiently >>>>>>> outweighs the minority" is missing in the changed code.) And, if the change is as disruptive to heuristics as I >>>>>>> suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some >>>>>>> deopt or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to >>>>>>> show the kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant >>>>>>> optimizations. >>>>>> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch. >>>>>>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming >>>>>>> out of your work-group. Welcome to the adventure! >>>>>> For future improvements, I will keep focusing on inlining as I see it as the door opener to many more >>>>>> optimizations in C2. I am still learning what can be done to reduce the size of the inlined code by, for >>>>>> example, applying specific optimizations that simplify the CG (like dead-code elimination or constant propagation) >>>>>> before inlining the code. 
As you said, we are not short of ideas on *how* to improve it, but we have to be very >>>>>> wary of *what impact* it'll have on real-world applications. We're working with internal customers to figure that >>>>>> out, and we'll share the results as soon as we are ready with benchmarks for those use-case patterns. >>>>>> What I am working on now is: >>>>>>   - Add a per-method flag through CompilerOracle >>>>>>   - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a >>>>>> receiver is observed less than 1 in every 10 calls, then don't generate a guard and use the fallback) >>>>>>   - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code) >>>>>> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or >>>>>> tomorrow). >>>>>> -- >>>>>> Ludovic >>>>>> >>>>>> -----Original Message----- >>>>>> From: Vladimir Ivanov >>>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>>> To: Ludovic Henry ; hotspot-compiler-dev at openjdk.java.net >>>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>>> >>>>>> Very interesting results, Ludovic! 
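The 10% receiver-probability threshold mentioned in the to-do list above amounts to a guard-selection policy. The sketch below models it; `ReceiverProfile` and `selectGuards` are invented names for illustration, not HotSpot code, and the exact policy in the patch may differ.

```java
import java.util.Comparator;
import java.util.List;

// Pick which profiled receivers get a type guard: sort by observed count,
// drop receivers below the probability threshold, cap at the profile width.
public class GuardSelection {
    record ReceiverProfile(String klass, long count) {}

    static List<String> selectGuards(List<ReceiverProfile> profile,
                                     int typeProfileWidth, double minProb) {
        long total = profile.stream().mapToLong(ReceiverProfile::count).sum();
        return profile.stream()
                .sorted(Comparator.comparingLong(ReceiverProfile::count).reversed())
                .filter(r -> total > 0 && (double) r.count() / total >= minProb)
                .limit(typeProfileWidth)
                .map(ReceiverProfile::klass)
                .toList();
    }

    public static void main(String[] args) {
        var profile = List.of(new ReceiverProfile("A", 35),
                              new ReceiverProfile("B", 33),
                              new ReceiverProfile("C", 30),
                              new ReceiverProfile("D", 2));
        // With width 4 and a 10% threshold, D (2% of calls) gets no guard
        // and is left to the fallback path.
        System.out.println(selectGuards(profile, 4, 0.10)); // [A, B, C]
    }
}
```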
>>>>>> >>>>>>> The image can be found at >>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473 >>>>>>> >>>>>> >>>>>> Can you elaborate on the experiment itself, please? In particular, what >>>>>> does PERCENTILES actually mean? >>>>>> >>>>>> I haven't looked through the patch in details, but here are some thoughts. >>>>>> >>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems >>>>>> you try to generalize (b) which becomes: >>>>>> >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else if (recv.klass == K2) { >>>>>>         m2(...); // either inline or a direct call >>>>>>      ... >>>>>>      } else if (recv.klass == Kn) { >>>>>>         mn(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>> Question #1: what if you generalize polymorphic shape instead and allow >>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less >>>>>> beneficial the higher morphism is (especially considering the inlining >>>>>> on all paths becomes less likely as well). 
So, having a virtual call >>>>>> (which becomes less likely due to lower frequency) on the fallback path >>>>>> may be a better option. >>>>>> >>>>>> >>>>>> Question #2: it would be very interesting to understand what exactly >>>>>> contributes the most to performance improvements? Is it inlining? Or >>>>>> maybe devirtualization (avoid the cost of virtual call)? How much come >>>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>>> >>>>>> Deciding how to spend inlining budget on multiple targets with moderate >>>>>> frequency can be hard, so it makes sense to consider expanding >>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental >>>>>> inlining). >>>>>> >>>>>> >>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>> >>>>>> >>>>>> Getting answers to those (and similar) questions should give us much >>>>>> more insights what is actually happening in practice. >>>>>> >>>>>> Speaking of the first deliverables, it would be good to introduce a new >>>>>> experimental mode to be able to easily conduct such experiments with >>>>>> product binaries and I'd like to see the patch evolving in that >>>>>> direction. It'll enable us to gather important data to guide our >>>>>> decisions about how to enhance the heuristics in the product. >>>>>> >>>>>> Best regards, >>>>>> Vladimir Ivanov >>>>>> >>>>>> [1] (a) Monomorphic: >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>>      (b) Bimorphic: >>>>>>      if (recv.klass == K1) { >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else if (recv.klass == K2) { >>>>>>         m2(...); // either inline or a direct call >>>>>>      } else { >>>>>>         deopt(); // invalidate + reinterpret >>>>>>      } >>>>>> >>>>>>      (c) Polymorphic: >>>>>>      if (recv.klass == K1) { // major receiver (by default, >90%) >>>>>>         m1(...); // either inline or a direct call >>>>>>      } else { >>>>>>         K.m(); // virtual call >>>>>>      } >>>>>> >>>>>>      (d) Megamorphic: >>>>>>      K.m(); // virtual (K is either concrete or interface class) >>>>>> >>>>>>> >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry >>>>>>> Sent: Thursday, February 6, 2020 9:18 AM >>>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2 >>>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> In our ongoing search for improved performance, I've looked at inlining and, more specifically, at polymorphic >>>>>>> guarded inlining. Today in HotSpot, the maximum number of guards for types at any call site is two - with >>>>>>> bimorphic guarded inlining. However, Graal and Zing have observed great results with increasing that limit. >>>>>>> >>>>>>> You'll find below a patch that makes the number of guards for types configurable with the `TypeProfileWidth` >>>>>>> global. >>>>>>> >>>>>>> Testing: >>>>>>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts) >>>>>>> >>>>>>> Benchmarking: >>>>>>> To get data, we run a benchmark against Apache Pinot and observe the following results: >>>>>>> >>>>>>> [cid:image001.png at 01D5D2DB.F5165550] >>>>>>> >>>>>>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are >>>>>>> currently validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as >>>>>>> soon as we have them. >>>>>>> >>>>>>> I am happy to provide more information, just let me know if you have any questions. 
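The too_many_traps_or_recompiles behavior discussed earlier in the thread (deopt on the fallback path until a trap budget is exhausted, then fall back to a virtual call) can be modeled roughly as below. The trap limit, names, and return strings are invented for the sketch; the real mechanism recompiles the method rather than flipping a flag.

```java
// Toy model of the fallback policy at a guarded call site: while uncommon
// traps are rare the fallback "deopts"; once a trap budget is exceeded the
// site switches to a virtual-call fallback and stops deoptimizing.
public class FallbackPolicy {
    static final int TRAP_LIMIT = 2; // invented budget, stands in for PerBytecodeTrapLimit-style logic
    private int traps = 0;
    private boolean virtualFallback = false;

    String dispatch(String klass) {
        if (klass.equals("K1")) return "inlined K1.m()"; // guarded fast path
        if (virtualFallback) return "virtual call";      // stable fallback after recompile
        traps++;
        if (traps >= TRAP_LIMIT) virtualFallback = true; // too many traps: stop deopting
        return "deopt + reinterpret";
    }

    public static void main(String[] args) {
        FallbackPolicy site = new FallbackPolicy();
        System.out.println(site.dispatch("K2")); // deopt + reinterpret
        System.out.println(site.dispatch("K2")); // deopt + reinterpret
        System.out.println(site.dispatch("K2")); // virtual call
    }
}
```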
>>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> index 73854806ed..845070fbe1 100644 >>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>> @@ -38,7 +38,7 @@ private: >>>>>>>    friend class ciMethod; >>>>>>>    friend class ciMethodHandle; >>>>>>> >>>>>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care about >>>>>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care about >>>>>>>    int  _limit;                // number of receivers have been determined >>>>>>>    int  _morphism;             // determined call site's morphism >>>>>>>    int  _count;                // # times has this call been executed >>>>>>> @@ -47,6 +47,7 @@ private: >>>>>>>    ciKlass*  _receiver[MorphismLimit + 1];  // receivers (exact) >>>>>>> >>>>>>>    ciCallProfile() { >>>>>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth"); >>>>>>>      _limit = 0; >>>>>>>      _morphism    = 0; >>>>>>>      _count = -1; >>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp >>>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>       // Every profiled call site has a counter. >>>>>>>       int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci)); >>>>>>> >>>>>>> -      if (!data->is_ReceiverTypeData()) { >>>>>>> -        result._receiver_count[0] = 0;  // that's a definite zero >>>>>>> -      } else { // ReceiverTypeData is a subclass of CounterData >>>>>>> +      if (data->is_ReceiverTypeData()) { >>>>>>>         ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData(); >>>>>>>         // In addition, virtual call sites have receiver type information >>>>>>>         int receivers_count_total = 0; >>>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>           // is recorded or an associated counter is incremented, but not both. With >>>>>>>           // tiered compilation, however, both can happen due to the interpreter and >>>>>>>           // C1 profiling invocations differently. Address that inconsistency here. >>>>>>> -          if (morphism == 1 && count > 0) { >>>>>>> +          if (morphism >= 1 && count > 0) { >>>>>>>             epsilon = count; >>>>>>>             count = 0; >>>>>>>           } >>>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) { >>>>>>>          // If we extend profiling to record methods, >>>>>>>           // we will set result._method also. >>>>>>>         } >>>>>>> +        result._morphism = morphism; >>>>>>>         // Determine call site's morphism. >>>>>>>         // The call site count is 0 with known morphism (only 1 or 2 receivers) >>>>>>>         // or < 0 in the case of a type check failure for checkcast, aastore, instanceof. >>>>>>>         // The call site count is > 0 in the case of a polymorphic virtual call. >>>>>>> -        if (morphism > 0 && morphism == result._limit) { >>>>>>> -           // The morphism <= MorphismLimit. >>>>>>> -           if ((morphism < ciCallProfile::MorphismLimit) || >>>>>>> -               (morphism == ciCallProfile::MorphismLimit && count == 0)) { >>>>>>> +        assert(result._morphism == result._limit, ""); >>>>>>> #ifdef ASSERT >>>>>>> +        if (result._morphism > 0) { >>>>>>> +           // The morphism <= TypeProfileWidth. >>>>>>> +           if ((result._morphism < TypeProfileWidth) || >>>>>>> +               (result._morphism == TypeProfileWidth && count == 0)) { >>>>>>>              if (count > 0) { >>>>>>>                this->print_short_name(tty); >>>>>>>                tty->print_cr(" @ bci:%d", bci); >>>>>>>                this->print_codes(); >>>>>>>                assert(false, "this call site should not be polymorphic"); >>>>>>>              } >>>>>>> -#endif >>>>>>> -             result._morphism = morphism; >>>>>>>            } >>>>>>>          } >>>>>>> +#endif >>>>>>>         // Make the count consistent if this is a call profile. If count is >>>>>>>         // zero or less, presume that this is a typecheck profile and >>>>>>>         // do nothing. Otherwise, increase count to be the sum of all >>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) { >>>>>>>    } >>>>>>>    _receiver[i] = receiver; >>>>>>>    _receiver_count[i] = receiver_count; >>>>>>> -  if (_limit < MorphismLimit) _limit++; >>>>>>> +  if (_limit < TypeProfileWidth) _limit++; >>>>>>> } >>>>>>> >>>>>>> >>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp >>>>>>> index d605bdb7bd..7a8dee43e5 100644 >>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>>>> @@ -389,9 +389,16 @@ >>>>>>>    product(bool, UseBimorphicInlining, true,                                 \ >>>>>>>            "Profiling based inlining for two receivers")                     \ >>>>>>> \ >>>>>>> +  product(bool, UsePolymorphicInlining, true,                               \ >>>>>>> +          "Profiling based inlining for two or more receivers")             \ >>>>>>> + \ >>>>>>>    product(bool, UseOnlyInlinedBimorphic, true,                              \ >>>>>>>            "Don't use BimorphicInlining if can't inline a second method")    \ >>>>>>> \ >>>>>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                            \ >>>>>>> +          
"Don't use PolymorphicInlining if can't inline a non-major "      \ >>>>>>> +          "receiver's method")                                              \ >>>>>>> + \ >>>>>>>    product(bool, InsertMemBarAfterArraycopy, true,                           \ >>>>>>>            "Insert memory barrier after arraycopy call")                     \ >>>>>>> \ >>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp >>>>>>> index 44ab387ac8..6f940209ce 100644 >>>>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>> >>>>>>>    // See how many times this site has been invoked. >>>>>>>    int site_count = profile.count(); >>>>>>> -  int receiver_count = -1; >>>>>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) { >>>>>>> -    // Receivers in the profile structure are ordered by call counts >>>>>>> -    // so that the most called (major) receiver is profile.receiver(0). >>>>>>> -    receiver_count = profile.receiver_count(0); >>>>>>> -  } >>>>>>> >>>>>>>    CompileLog* log = this->log(); >>>>>>>    if (log != NULL) { >>>>>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1; >>>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1; >>>>>>> +    ResourceMark rm; >>>>>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>>>>> +      rids[i] = log->identify(profile.receiver(i)); >>>>>>> +    } >>>>>>>     log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>>>>>                     log->identify(callee), site_count, prof_factor); >>>>>>>     if (call_does_dispatch)  log->print(" virtual='1'"); >>>>>>>     if (allow_inline)     log->print(" inline='1'"); >>>>>>> -    if (receiver_count >= 0) { >>>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count); >>>>>>> -       if (profile.has_receiver(1)) { >>>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1)); >>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) { >>>>>>> +      if (i == 0) { >>>>>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i)); >>>>>>> +      } else { >>>>>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i)); >>>>>>>       } >>>>>>>     } >>>>>>>     if (callee->is_method_handle_intrinsic()) { >>>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>     if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>>>>>       // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count. >>>>>>>       bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= >>>>>>> (float)TypeProfileMajorReceiverPercent); >>>>>>> -      ciMethod* receiver_method = NULL; >>>>>>> >>>>>>>       int morphism = profile.morphism(); >>>>>>> + >>>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism)); >>>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism)); >>>>>>> + >>>>>>>       if (speculative_receiver_type != NULL) { >>>>>>>         if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) { >>>>>>>           // We have a speculative type, we should be able to resolve >>>>>>>           // the call. We do that before looking at the profiling at >>>>>>> -          // this invoke because it may lead to bimorphic inlining which >>>>>>> +          
// this invoke because it may lead to polymorphic inlining which >>>>>>> ?????????????? // a speculative type should help us avoid. >>>>>>> -????????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - speculative_receiver_type); >>>>>>> -????????? if (receiver_method == NULL) { >>>>>>> +????????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + speculative_receiver_type); >>>>>>> +????????? if (receiver_methods[0] == NULL) { >>>>>>> ???????????????? speculative_receiver_type = NULL; >>>>>>> ?????????????? } else { >>>>>>> ???????????????? morphism = 1; >>>>>>> ?????????????? } >>>>>>> ???????????? } else { >>>>>>> ?????????????? // speculation failed before. Use profiling at the call >>>>>>> -????????? // (could allow bimorphic inlining for instance). >>>>>>> +????????? // (could allow polymorphic inlining for instance). >>>>>>> ?????????????? speculative_receiver_type = NULL; >>>>>>> ???????????? } >>>>>>> ?????????? } >>>>>>> -????? if (receiver_method == NULL && >>>>>>> +????? if (receiver_methods[0] == NULL && >>>>>>> ?????????????? (have_major_receiver || morphism == 1 || >>>>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>>>> -??????? // receiver_method = profile.method(); >>>>>>> +?????????? (morphism == 2 && UseBimorphicInlining) || >>>>>>> +?????????? (morphism >= 2 && UsePolymorphicInlining))) { >>>>>>> +??????? assert(profile.has_receiver(0), "no receiver at 0"); >>>>>>> +??????? // receiver_methods[0] = profile.method(); >>>>>>> ???????????? // Profiles do not suggest methods now.? Look it up in the major receiver. >>>>>>> -??????? receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - profile.receiver(0)); >>>>>>> +??????? receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + profile.receiver(0)); >>>>>>> ?????????? } >>>>>>> -????? if (receiver_method != NULL) { >>>>>>> -??????? 
// The single majority receiver sufficiently outweighs the minority. >>>>>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>>>>> -????????????? vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor); >>>>>>> -??????? if (hit_cg != NULL) { >>>>>>> -????????? // Look up second receiver. >>>>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>>>> -??????????? next_receiver_method = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> - profile.receiver(1)); >>>>>>> -??????????? if (next_receiver_method != NULL) { >>>>>>> -????????????? next_hit_cg = this->call_generator(next_receiver_method, >>>>>>> -????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> -????????????????????????????????? allow_inline, prof_factor); >>>>>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>>>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>>>>> -????????????????? // Skip if we can't inline second receiver's method >>>>>>> -????????????????? next_hit_cg = NULL; >>>>>>> +????? if (receiver_methods[0] != NULL) { >>>>>>> +??????? CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism)); >>>>>>> +??????? memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism)); >>>>>>> + >>>>>>> +??????? hit_cgs[0] = this->call_generator(receiver_methods[0], >>>>>>> +??????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> +??????????????????????????? allow_inline, prof_factor); >>>>>>> +??????? if (hit_cgs[0] != NULL) { >>>>>>> +????????? if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) { >>>>>>> +??????????? for (int i = 1; i < morphism; i++) { >>>>>>> +????????????? assert(profile.has_receiver(i), "no receiver at %d", i); >>>>>>> +????????????? 
receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(), >>>>>>> + profile.receiver(i)); >>>>>>> +????????????? if (receiver_methods[i] != NULL) { >>>>>>> +??????????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>>>>> +????????????????????????????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>> +????????????????????????????????????? allow_inline, prof_factor); >>>>>>> +??????????????? if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver && >>>>>>> +??????????????????? ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) { >>>>>>> +????????????????? // Skip if we can't inline non-major receiver's method >>>>>>> +????????????????? hit_cgs[i] = NULL; >>>>>>> +??????????????? } >>>>>>> ?????????????????? } >>>>>>> ???????????????? } >>>>>>> ?????????????? } >>>>>>> ?????????????? CallGenerator* miss_cg; >>>>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>>>> -?????????????????????????????????????????????? ? Deoptimization::Reason_bimorphic >>>>>>> +????????? Deoptimization::DeoptReason reason = (morphism >= 2 >>>>>>> +?????????????????????????????????????????????? ? Deoptimization::Reason_polymorphic >>>>>>> ??????????????????????????????????????????????????? : >>>>>>> Deoptimization::reason_class_check(speculative_receiver_type != NULL)); >>>>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) && >>>>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>>>> -???????????? ) { >>>>>>> +????????? if (!too_many_traps_or_recompiles(caller, bci, reason)) { >>>>>>> ???????????????? // Generate uncommon trap for class check failure path >>>>>>> -??????????? // in case of monomorphic or bimorphic virtual call site. >>>>>>> +??????????? // in case of polymorphic virtual call site. >>>>>>> ???????????????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>>>>> ???????????????????????????? 
Deoptimization::Action_maybe_recompile); >>>>>>> ?????????????? } else { >>>>>>> ???????????????? // Generate virtual call for class check failure path >>>>>>> -??????????? // in case of polymorphic virtual call site. >>>>>>> +??????????? // in case of megamorphic virtual call site. >>>>>>> ???????????????? miss_cg = CallGenerator::for_virtual_call(callee, vtable_index); >>>>>>> ?????????????? } >>>>>>> -????????? if (miss_cg != NULL) { >>>>>>> -??????????? if (next_hit_cg != NULL) { >>>>>>> +????????? for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) { >>>>>>> +??????????? if (hit_cgs[i] != NULL) { >>>>>>> ?????????????????? assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation"); >>>>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>>>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>>>>> +????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], >>>>>>> profile.receiver(i), site_count, profile.receiver_count(i)); >>>>>>> ?????????????????? // We don't need to record dependency on a receiver here and below. >>>>>>> ?????????????????? // Whenever we inline, the dependency is added by Parse::Parse(). >>>>>>> -????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX); >>>>>>> -??????????? } >>>>>>> -??????????? if (miss_cg != NULL) { >>>>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, >>>>>>> site_count, receiver_count); >>>>>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>>>>> -????????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>>>> -????????????? 
if (cg != NULL)? return cg; >>>>>>> +????????????? miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX); >>>>>>> ???????????????? } >>>>>>> ?????????????? } >>>>>>> +????????? if (miss_cg != NULL) { >>>>>>> +??????????? ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0); >>>>>>> +??????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, >>>>>>> site_count, profile.receiver_count(0)); >>>>>>> +??????????? float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0); >>>>>>> +??????????? CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob); >>>>>>> +??????????? if (cg != NULL)? return cg; >>>>>>> +????????? } >>>>>>> ???????????? } >>>>>>> ????????? } >>>>>>> ???????? } >>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> index 11df15e004..2d14b52854 100644 >>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = { >>>>>>> ?????? "class_check", >>>>>>> ?????? "array_check", >>>>>>> ?????? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>> ?????? "profile_predicate", >>>>>>> ?????? "unloaded", >>>>>>> ?????? "uninitialized", >>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> index 1cfff5394e..c1eb998aba 100644 >>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>>>> ???????? Reason_class_check,?????????? 
// saw unexpected object class (@bci) >>>>>>> ???????? Reason_array_check,?????????? // saw unexpected array class (aastore @bci) >>>>>>> ???????? Reason_intrinsic,???????????? // saw unexpected operand to intrinsic (@bci) >>>>>>> -??? Reason_bimorphic,???????????? // saw unexpected object class in bimorphic inlining (@bci) >>>>>>> +??? Reason_polymorphic,?????????? // saw unexpected object class in bimorphic inlining (@bci) >>>>>>> >>>>>>> #if INCLUDE_JVMCI >>>>>>> ???????? Reason_unreached0???????????? = Reason_null_assert, >>>>>>> ???????? Reason_type_checked_inlining? = Reason_intrinsic, >>>>>>> -??? Reason_optimized_type_check?? = Reason_bimorphic, >>>>>>> +??? Reason_optimized_type_check?? = Reason_polymorphic, >>>>>>> #endif >>>>>>> >>>>>>> ???????? Reason_profile_predicate,???? // compiler generated predicate moved from frequent branch in a loop failed >>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> index 94b544824e..ee761626c4 100644 >>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry? 
KlassHashtableEntry;
>>>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>>>

From vladimir.x.ivanov at oracle.com Tue Apr 7 19:31:09 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Tue, 7 Apr 2020 22:31:09 +0300
Subject: Polymorphic Guarded Inlining in C2
In-Reply-To: <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com>
References: <6bbeea49-7335-9640-d524-32fa03968f42@oracle.com>
 <084ea561-6305-a8fd-9d7f-5ba108d41312@oracle.com>
 <6de5487c-c13e-c03e-9d0b-f3093b115daf@oracle.com>
 <0ee0b383-285e-bd93-3490-84ad991b53d1@oracle.com>
Message-ID: <0307f0de-4743-5870-6f83-ce2e88d438b0@oracle.com>

> Another thing we can do is collect statistics about how many
> different receivers can be recorded with a big TypeProfileWidth. My
> recollection from long ago was that the only case for poly was HashMap
> usage. It would be nice to collect this data again for modern Java
> benchmarks. We can use them to see effects of changes - benchmarks
> which do not have poly cases are useless in these experiments.

Yes, such data would be very valuable. The last time I looked at
megamorphic call sites, only a few of the standard benchmarks (SPEC*)
had any in hot code.

Additionally, separating data for virtual and interface calls looks
very useful.

> On 4/6/20 6:38 AM, Vladimir Ivanov wrote:
>> I see 2 directions (mostly independent) to proceed: (1) use existing
>> profiling info only; and (2) when more profile info is available.
>>
>> I suggest to explore them independently.
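The per-call-site receiver statistics discussed here can be mimicked with a toy model in plain Java (purely illustrative - this is not the JVM's MethodData machinery, and the class and field names below are invented): keep at most TypeProfileWidth distinct receiver classes per call site, and once the rows are full, route every further class to a shared "nonprofiled" counter, which is how a megamorphic site shows up in the profile.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of one call site's receiver-type profile: at most
// WIDTH distinct receiver classes get a (klass, count) row; any further
// class only bumps the shared "nonprofiled" counter (megamorphic case).
public class ReceiverProfile {
    static final int WIDTH = 2;                 // like -XX:TypeProfileWidth=2
    final Map<Class<?>, Integer> rows = new LinkedHashMap<>();
    int nonprofiled = 0;

    void record(Object receiver) {
        Class<?> k = receiver.getClass();
        if (rows.containsKey(k)) {
            rows.merge(k, 1, Integer::sum);     // existing row: bump its count
        } else if (rows.size() < WIDTH) {
            rows.put(k, 1);                     // free row: start tracking klass
        } else {
            nonprofiled++;                      // no free row: megamorphic hit
        }
    }

    int morphism() { return rows.size(); }

    public static void main(String[] args) {
        ReceiverProfile p = new ReceiverProfile();
        for (Object o : new Object[] { "a", "b", 1, 2.0 }) p.record(o);
        // Two Strings share one row, Integer takes the second row,
        // Double overflows into the nonprofiled counter.
        System.out.println(p.morphism() + " " + p.nonprofiled); // prints "2 1"
    }
}
```

Counting how often `nonprofiled` is non-zero across a benchmark run is exactly the kind of statistic being asked for above.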
>>
>> There's enough profiling data available to introduce a polymorphic
>> case with 2 major receivers ("2-poly"). And it'll complete the matrix
>> of possible shapes.

> Please explain how it is different from the current bimorphic case?

The bimorphic case is when there are exactly 2 receivers recorded in the
type profile and an uncommon trap is put on the fallback path.

Polymorphic (1-poly) doesn't care about the total number of receivers,
just that one of them is encountered more frequently than the others
(>TypeProfileMajorReceiverPercent). On the fallback path it has a
virtual call. That's the difference from the monomorphic (1-morphic)
case.

What I call 2-poly is when the number of major receivers is increased to
2, but still keeping a virtual call on the fallback path.

So, the only difference between 2-poly and bimorphic is the shape of the
fallback path.

Best regards,
Vladimir Ivanov

>> Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more
>> generic shapes: "N-morphic" and "N-poly". The only difference between
>> them is what happens on the fallback path - deopt / uncommon trap or
>> a virtual call.
>>
>> Regarding 2-poly, there is TypeProfileMajorReceiverPercent which
>> should be extended to 2 cases, which leads to 2 parameters: aggregated
>> major receiver percentage and minimum individual percentage.
>
> okay
>
>> Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which
>> aligns 2-poly with the bimorphic case.
>>
>> And, as I mentioned before, IMO it's promising to distinguish
>> invokevirtual and invokeinterface cases. So, an additional flag to
>> control that would be useful.
>
> yes
>
>> Regarding the N-poly/N-morphic cases, they can be generalized from
>> the 2-poly/bi-morphic cases.
>>
>> I believe experiments on 2-poly will provide useful insights on
>> N-poly/N-morphic, so it makes sense to start with 2-poly first.
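The call-site shapes described above can be sketched with a toy Java model (purely illustrative - the real shapes are C2-generated machine code with class-pointer guards, and every name below is invented): the guards stand in for inlined receivers, and the two possible fallback paths are a deoptimization (modeled here as an exception) versus a plain virtual call.

```java
interface Shape { int area(); }
class Square implements Shape { public int area() { return 4; } }
class Circle implements Shape { public int area() { return 3; } }
class Oval   implements Shape { public int area() { return 5; } }

public class DispatchShapes {
    // Bimorphic: guards for exactly the two recorded receivers, with an
    // uncommon trap (modeled as an exception) on the fallback path.
    static int bimorphic(Shape s) {
        if (s.getClass() == Square.class) return 4; // "inlined" Square.area()
        if (s.getClass() == Circle.class) return 3; // "inlined" Circle.area()
        throw new IllegalStateException("deopt: unexpected receiver");
    }

    // 2-poly: the same two guards, but the fallback is a virtual call,
    // so any further receiver still works without deoptimizing.
    static int twoPoly(Shape s) {
        if (s.getClass() == Square.class) return 4;
        if (s.getClass() == Circle.class) return 3;
        return s.area();                            // virtual-call fallback
    }

    public static void main(String[] args) {
        System.out.println(twoPoly(new Oval()));    // fallback handles Oval
        try {
            bimorphic(new Oval());
        } catch (IllegalStateException e) {
            System.out.println("deopt");            // bimorphic has to bail out
        }
    }
}
```

Under this model the trade-off is visible directly: 2-poly degrades gracefully on a third receiver, while the bimorphic shape trades that robustness for a cheaper fallback that assumes the profile was exhaustive.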
>
> Yes
>
> Thanks,
> Vladimir K
>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 01.04.2020 01:29, Vladimir Kozlov wrote:
>>> Looks like the graphs were stripped from the email. I put them on
>>> GitHub:
>>>
>>>
>>> Also Vladimir Ivanov forwarded me data he collected.
>>>
>>> His next data shows that profiling is not "free". Vladimir I. limited
>>> to tier3 (-XX:TieredStopAtLevel=3, C1 compilation with profiling
>>> code) to show that profiling code with TPW=8 is slower. Note, with 4
>>> tiers this may not be visible because execution will be switched to
>>> C2 compiled code (without profiling code).
>>>
>>>
>>> The next data was collected for the proposed patch. Vladimir I.
>>> collected data for several flag configurations. The next graphs are
>>> for one of the settings: '-XX:+UsePolymorphicInlining
>>> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>>>
>>>
>>> It has mixed data but most benchmarks are not affected. Which means
>>> we need to spend more time on the proposed changes.
>>>
>>> Vladimir K
>>>
>>> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>>>> I started looking at it.
>>>>
>>>> I think ideally TypeProfileWidth should be per call site and not per
>>>> method - and it will require a more complicated implementation
>>>> (another RFE). But for experiments I think setting it to 8 (or
>>>> higher) for all methods is okay.
>>>>
>>>> Note, more profiling lines per call site cost a few MB in the
>>>> CodeCache (overestimation: 20K nmethods * 10 call sites * 6 * 8
>>>> bytes) vs very complicated code to have a dynamic number of lines.
>>>>
>>>> I think we should first investigate the best heuristics for inlining
>>>> vs direct call vs vcall vs uncommon traps for polymorphic cases and
>>>> worry about memory and time consumption during profiling later.
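Kozlov's back-of-the-envelope CodeCache estimate can be checked directly. All four factors are his stated overestimates, not measured values, and the 8 bytes per extra row is his simplification:

```java
// Sanity-check the quoted overestimate: extra footprint when growing
// type profiles from TypeProfileWidth=2 to 8 across all compiled code.
public class ProfileFootprint {
    public static void main(String[] args) {
        long nmethods    = 20_000; // assumed number of compiled methods
        long callSites   = 10;     // assumed virtual call sites per method
        long extraRows   = 6;      // extra profile rows going from TPW 2 to 8
        long bytesPerRow = 8;      // one word per extra row (simplification)
        long total = nmethods * callSites * extraRows * bytesPerRow;
        System.out.println(total + " bytes, ~" + total / (1024 * 1024) + " MiB");
        // prints "9600000 bytes, ~9 MiB"
    }
}
```

So "a few MB" checks out: roughly 9 MiB even with every factor rounded up.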
>>>>
>>>> I did some performance runs with the latest JDK 15 for
>>>> TypeProfileWidth=8 vs =2 and don't see much difference for the SPEC
>>>> benchmarks (see attached graph - grey dots mean no significant
>>>> difference). But there are regressions (red dots) for Renaissance,
>>>> which includes some modern benchmarks.
>>>>
>>>> I will work this week to get similar data with Ludovic's patch.
>>>>
>>>> I am for an incremental approach. I think we can start/push based on
>>>> what Ludovic is currently suggesting (do more processing for TPW >
>>>> 2) while preserving the current default behaviour (for TPW <= 2).
>>>> But only if it gives improvements in these benchmarks. We use these
>>>> benchmarks as criteria for JDK releases.
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>>>> Hi Vladimir,
>>>>>
>>>>> As requested offline, please find following the latest version of
>>>>> the patch. Contrary to what was discussed initially, I haven't done
>>>>> the work to support per-method TypeProfileWidth, as that requires
>>>>> extending the existing CompilerDirectives to be available to the
>>>>> Interpreter. For me to achieve that work, I would need guidance on
>>>>> how to approach the problem, and what your expectations are.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> index 4ed93169c7..bad9cddf20 100644
>>>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> @@ -1731,7 +1731,7 @@ void
>>>>> InterpreterMacroAssembler::record_item_in_profile_helper(Register
>>>>> item, Reg
>>>>>           Label found_null;
>>>>>           jccb(Assembler::zero, found_null);
>>>>>           // Item did not match any saved item and there is no
>>>>> empty row for it.
>>>>> -         
// Increment total counter to indicate megamorphic case. >>>>> ??????????? increment_mdp_data_at(mdp, non_profiled_offset); >>>>> ??????????? jmp(done); >>>>> ??????????? bind(found_null); >>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>> index 73854806ed..c5030149bf 100644 >>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>> @@ -38,7 +38,8 @@ private: >>>>> ??? friend class ciMethod; >>>>> ??? friend class ciMethodHandle; >>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we care >>>>> about >>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we care >>>>> about >>>>> +? bool _is_megamorphic;????????? // whether the call site is >>>>> megamorphic >>>>> ??? int? _limit;??????????????? // number of receivers have been >>>>> determined >>>>> ??? int? _morphism;???????????? // determined call site's morphism >>>>> ??? int? _count;??????????????? // # times has this call been executed >>>>> @@ -47,6 +48,8 @@ private: >>>>> ??? ciKlass*? _receiver[MorphismLimit + 1];? // receivers (exact) >>>>> ??? ciCallProfile() { >>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>> can't be smaller than TypeProfileWidth"); >>>>> +??? _is_megamorphic = false; >>>>> ????? _limit = 0; >>>>> ????? _morphism??? = 0; >>>>> ????? _count = -1; >>>>> @@ -58,6 +61,8 @@ private: >>>>> ??? void add_receiver(ciKlass* receiver, int receiver_count); >>>>> ? public: >>>>> +? bool????? is_megamorphic() const??? { return _is_megamorphic; } >>>>> + >>>>> ??? // Note:? The following predicates return false for invalid >>>>> profiles: >>>>> ??? bool????? has_receiver(int i) const { return _limit > i; } >>>>> ??? int?????? morphism() const????????? 
{ return _morphism; } >>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>> index d771be8dac..c190919708 100644 >>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>> @@ -531,25 +531,27 @@ ciCallProfile >>>>> ciMethod::call_profile_at_bci(int bci) { >>>>> ??????????? // If we extend profiling to record methods, >>>>> ??????????? // we will set result._method also. >>>>> ????????? } >>>>> -??????? // Determine call site's morphism. >>>>> +??????? // Determine call site's megamorphism. >>>>> ????????? // The call site count is 0 with known morphism (only 1 >>>>> or 2 receivers) >>>>> ????????? // or < 0 in the case of a type check failure for >>>>> checkcast, aastore, instanceof. >>>>> -??????? // The call site count is > 0 in the case of a polymorphic >>>>> virtual call. >>>>> +??????? // The call site count is > 0 in the case of a megamorphic >>>>> virtual call. >>>>> ????????? if (morphism > 0 && morphism == result._limit) { >>>>> ???????????? // The morphism <= MorphismLimit. >>>>> -?????????? if ((morphism >>>> -?????????????? (morphism == ciCallProfile::MorphismLimit && count >>>>> == 0)) { >>>>> +?????????? if ((morphism >>>> +?????????????? (morphism == TypeProfileWidth && count == 0)) { >>>>> ? #ifdef ASSERT >>>>> ?????????????? if (count > 0) { >>>>> ???????????????? this->print_short_name(tty); >>>>> ???????????????? tty->print_cr(" @ bci:%d", bci); >>>>> ???????????????? this->print_codes(); >>>>> -?????????????? assert(false, "this call site should not be >>>>> polymorphic"); >>>>> +?????????????? assert(false, "this call site should not be >>>>> megamorphic"); >>>>> ?????????????? } >>>>> ? #endif >>>>> -???????????? result._morphism = morphism; >>>>> +?????????? } else { >>>>> +????????????? result._is_megamorphic = true; >>>>> ???????????? } >>>>> ????????? } >>>>> +??????? result._morphism = morphism; >>>>> ????????? 
// Make the count consistent if this is a call profile. >>>>> If count is >>>>> ????????? // zero or less, presume that this is a typecheck profile >>>>> and >>>>> ????????? // do nothing.? Otherwise, increase count to be the sum >>>>> of all >>>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* >>>>> receiver, int receiver_count) { >>>>> ??? } >>>>> ??? _receiver[i] = receiver; >>>>> ??? _receiver_count[i] = receiver_count; >>>>> -? if (_limit < MorphismLimit) _limit++; >>>>> +? if (_limit < TypeProfileWidth) _limit++; >>>>> ? } >>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>>>> b/src/hotspot/share/opto/c2_globals.hpp >>>>> index d605bdb7bd..e4a5e7ea8b 100644 >>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>> @@ -389,9 +389,16 @@ >>>>> ??? product(bool, UseBimorphicInlining, >>>>> true,???????????????????????????????? \ >>>>> ??????????? "Profiling based inlining for two >>>>> receivers")???????????????????? \ >>>>> \ >>>>> +? product(bool, UsePolymorphicInlining, >>>>> true,?????????????????????????????? \ >>>>> +????????? "Profiling based inlining for two or more >>>>> receivers")???????????? \ >>>>> + \ >>>>> ??? product(bool, UseOnlyInlinedBimorphic, >>>>> true,????????????????????????????? \ >>>>> ??????????? "Don't use BimorphicInlining if can't inline a second >>>>> method")??? \ >>>>> \ >>>>> +? product(bool, UseOnlyInlinedPolymorphic, >>>>> true,??????????????????????????? \ >>>>> +????????? "Don't use PolymorphicInlining if can't inline a >>>>> secondary "????? \ >>>>> + "method")???????????????????????????????????????????????????????? \ >>>>> + \ >>>>> ??? product(bool, InsertMemBarAfterArraycopy, >>>>> true,?????????????????????????? \ >>>>> ??????????? "Insert memory barrier after arraycopy >>>>> call")???????????????????? \ >>>>> \ >>>>> @@ -645,6 +652,10 @@ >>>>> ??????????? "% of major receiver type to all profiled >>>>> receivers")???????????? \ >>>>> ??????????? 
range(0, >>>>> 100)???????????????????????????????????????????????????? \ >>>>> \ >>>>> +? product(intx, TypeProfileMinimumReceiverPercent, >>>>> 20,????????????????????? \ >>>>> +????????? "minimum % of receiver type to all profiled >>>>> receivers")?????????? \ >>>>> +????????? range(0, >>>>> 100)???????????????????????????????????????????????????? \ >>>>> + \ >>>>> ??? diagnostic(bool, PrintIntrinsics, >>>>> false,????????????????????????????????? \ >>>>> ??????????? "prints attempted and successful inlining of >>>>> intrinsics")???????? \ >>>>> \ >>>>> diff --git a/src/hotspot/share/opto/doCall.cpp >>>>> b/src/hotspot/share/opto/doCall.cpp >>>>> index 44ab387ac8..dba2b114c6 100644 >>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>> @@ -83,25 +83,27 @@ CallGenerator* >>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>> ??? // See how many times this site has been invoked. >>>>> ??? int site_count = profile.count(); >>>>> -? int receiver_count = -1; >>>>> -? if (call_does_dispatch && UseTypeProfile && >>>>> profile.has_receiver(0)) { >>>>> -??? // Receivers in the profile structure are ordered by call counts >>>>> -??? // so that the most called (major) receiver is >>>>> profile.receiver(0). >>>>> -??? receiver_count = profile.receiver_count(0); >>>>> -? } >>>>> ??? CompileLog* log = this->log(); >>>>> ??? if (log != NULL) { >>>>> -??? int rid = (receiver_count >= 0)? >>>>> log->identify(profile.receiver(0)): -1; >>>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>>>> log->identify(profile.receiver(1)):-1; >>>>> +??? int* rids; >>>>> +??? if (call_does_dispatch) { >>>>> +????? rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>> +????? for (int i = 0; i < TypeProfileWidth && >>>>> profile.has_receiver(i); i++) { >>>>> +??????? rids[i] = log->identify(profile.receiver(i)); >>>>> +????? } >>>>> +??? } >>>>> ????? 
log->begin_elem("call method='%d' count='%d' prof_factor='%f'", >>>>> ????????????????????? log->identify(callee), site_count, prof_factor); >>>>> -??? if (call_does_dispatch)? log->print(" virtual='1'"); >>>>> ????? if (allow_inline)???? log->print(" inline='1'"); >>>>> -??? if (receiver_count >= 0) { >>>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, >>>>> receiver_count); >>>>> -????? if (profile.has_receiver(1)) { >>>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", r2id, >>>>> profile.receiver_count(1)); >>>>> +??? if (call_does_dispatch) { >>>>> +????? log->print(" virtual='1'"); >>>>> +????? for (int i = 0; i < TypeProfileWidth && >>>>> profile.has_receiver(i); i++) { >>>>> +??????? if (i == 0) { >>>>> +????????? log->print(" receiver='%d' receiver_count='%d' >>>>> receiver_prob='%f'", rids[i], profile.receiver_count(i), >>>>> profile.receiver_prob(i)); >>>>> +??????? } else { >>>>> +????????? log->print(" receiver%d='%d' receiver%d_count='%d' >>>>> receiver%d_prob='%f'", i + 1, rids[i], i + 1, >>>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i)); >>>>> +??????? } >>>>> ??????? } >>>>> ????? } >>>>> ????? if (callee->is_method_handle_intrinsic()) { >>>>> @@ -205,92 +207,112 @@ CallGenerator* >>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>> ????? if (call_does_dispatch && site_count > 0 && UseTypeProfile) { >>>>> ??????? // The major receiver's count >= >>>>> TypeProfileMajorReceiverPercent of site_count. >>>>> ??????? bool have_major_receiver = profile.has_receiver(0) && >>>>> (100.*profile.receiver_prob(0) >= >>>>> (float)TypeProfileMajorReceiverPercent); >>>>> -????? ciMethod* receiver_method = NULL; >>>>> ??????? int morphism = profile.morphism(); >>>>> + >>>>> +????? int width = morphism > 0 ? morphism : 1; >>>>> +????? ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, >>>>> width); >>>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * width); >>>>> +????? 
CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, >>>>> width); >>>>> +????? memset(hit_cgs, 0, sizeof(CallGenerator*) * width); >>>>> + >>>>> ??????? if (speculative_receiver_type != NULL) { >>>>> ????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>> Deoptimization::Reason_speculate_class_check)) { >>>>> ??????????? // We have a speculative type, we should be able to >>>>> resolve >>>>> ??????????? // the call. We do that before looking at the profiling at >>>>> -????????? // this invoke because it may lead to bimorphic inlining >>>>> which >>>>> +????????? // this invoke because it may lead to polymorphic >>>>> inlining which >>>>> ??????????? // a speculative type should help us avoid. >>>>> -????????? receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - speculative_receiver_type); >>>>> -????????? if (receiver_method == NULL) { >>>>> +????????? receiver_methods[0] = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> + speculative_receiver_type); >>>>> +????????? if (receiver_methods[0] == NULL) { >>>>> ????????????? speculative_receiver_type = NULL; >>>>> ??????????? } else { >>>>> ????????????? morphism = 1; >>>>> ??????????? } >>>>> ????????? } else { >>>>> ??????????? // speculation failed before. Use profiling at the call >>>>> -????????? // (could allow bimorphic inlining for instance). >>>>> +????????? // (could allow polymorphic inlining for instance). >>>>> ??????????? speculative_receiver_type = NULL; >>>>> ????????? } >>>>> ??????? } >>>>> -????? if (receiver_method == NULL && >>>>> -????????? (have_major_receiver || morphism == 1 || >>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>> -??????? // receiver_method = profile.method(); >>>>> -??????? // Profiles do not suggest methods now.? Look it up in the >>>>> major receiver. >>>>> -??????? receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - profile.receiver(0)); >>>>> -????? } >>>>> -????? 
if (receiver_method != NULL) { >>>>> -??????? // The single majority receiver sufficiently outweighs the >>>>> minority. >>>>> -??????? CallGenerator* hit_cg = this->call_generator(receiver_method, >>>>> -????????????? vtable_index, !call_does_dispatch, jvms, >>>>> allow_inline, prof_factor); >>>>> -??????? if (hit_cg != NULL) { >>>>> -????????? // Look up second receiver. >>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>> -??????????? next_receiver_method = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> - profile.receiver(1)); >>>>> -??????????? if (next_receiver_method != NULL) { >>>>> -????????????? next_hit_cg = >>>>> this->call_generator(next_receiver_method, >>>>> -????????????????????????????????? vtable_index, >>>>> !call_does_dispatch, jvms, >>>>> -????????????????????????????????? allow_inline, prof_factor); >>>>> -????????????? if (next_hit_cg != NULL && !next_hit_cg->is_inline() && >>>>> -????????????????? have_major_receiver && UseOnlyInlinedBimorphic) { >>>>> -????????????????? // Skip if we can't inline second receiver's method >>>>> -????????????????? next_hit_cg = NULL; >>>>> -????????????? } >>>>> -??????????? } >>>>> -????????? } >>>>> -????????? CallGenerator* miss_cg; >>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>> -?????????????????????????????????????????????? ? >>>>> Deoptimization::Reason_bimorphic >>>>> -?????????????????????????????????????????????? : >>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>> NULL)); >>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg != >>>>> NULL)) && >>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>> -???????????? ) { >>>>> -??????????? // Generate uncommon trap for class check failure path >>>>> -??????????? // in case of monomorphic or bimorphic virtual call site. 
>>>>> -??????????? miss_cg = CallGenerator::for_uncommon_trap(callee, >>>>> reason, >>>>> -??????????????????????? Deoptimization::Action_maybe_recompile); >>>>> +????? bool removed_cgs = false; >>>>> +????? // Look up receivers. >>>>> +????? for (int i = 0; i < morphism; i++) { >>>>> +??????? if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && >>>>> !UsePolymorphicInlining)) { >>>>> +????????? break; >>>>> +??????? } >>>>> +??????? if (receiver_methods[i] == NULL && profile.has_receiver(i)) { >>>>> +????????? receiver_methods[i] = >>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>> + profile.receiver(i)); >>>>> +??????? } >>>>> +??????? if (receiver_methods[i] != NULL) { >>>>> +????????? bool allow_inline; >>>>> +????????? if (speculative_receiver_type != NULL) { >>>>> +??????????? allow_inline = true; >>>>> ??????????? } else { >>>>> -??????????? // Generate virtual call for class check failure path >>>>> -??????????? // in case of polymorphic virtual call site. >>>>> -??????????? miss_cg = CallGenerator::for_virtual_call(callee, >>>>> vtable_index); >>>>> +??????????? allow_inline = 100.*profile.receiver_prob(i) >= >>>>> (float)TypeProfileMinimumReceiverPercent; >>>>> ??????????? } >>>>> -????????? if (miss_cg != NULL) { >>>>> -??????????? if (next_hit_cg != NULL) { >>>>> -????????????? assert(speculative_receiver_type == NULL, "shouldn't >>>>> end up here if we used speculation"); >>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() >>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1), >>>>> site_count, profile.receiver_count(1)); >>>>> -????????????? // We don't need to record dependency on a receiver >>>>> here and below. >>>>> -????????????? // Whenever we inline, the dependency is added by >>>>> Parse::Parse(). >>>>> -????????????? miss_cg = >>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, >>>>> next_hit_cg, PROB_MAX); >>>>> -??????????? } >>>>> -??????????? 
if (miss_cg != NULL) { >>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>> speculative_receiver_type : profile.receiver(0); >>>>> -????????????? trace_type_profile(C, jvms->method(), jvms->depth() >>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count); >>>>> -????????????? float hit_prob = speculative_receiver_type != NULL ? >>>>> 1.0 : profile.receiver_prob(0); >>>>> -????????????? CallGenerator* cg = >>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>> -????????????? if (cg != NULL)? return cg; >>>>> +????????? hit_cgs[i] = this->call_generator(receiver_methods[i], >>>>> +??????????????????????????????? vtable_index, !call_does_dispatch, >>>>> jvms, >>>>> +??????????????????????????????? allow_inline, prof_factor); >>>>> +????????? if (hit_cgs[i] != NULL) { >>>>> +??????????? if (speculative_receiver_type != NULL) { >>>>> +????????????? // Do nothing if it's a speculative type >>>>> +??????????? } else if (bytecode == Bytecodes::_invokeinterface) { >>>>> +????????????? // Do nothing if it's an interface, multiple >>>>> direct-calls are faster than one indirect-call >>>>> +??????????? } else if (!have_major_receiver) { >>>>> +????????????? // Do nothing if there is no major receiver >>>>> +??????????? } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) >>>>> || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) { >>>>> +????????????? // Do nothing if the user allows non-inlined >>>>> polymorphic calls >>>>> +??????????? } else if (!hit_cgs[i]->is_inline()) { >>>>> +????????????? // Skip if we can't inline receiver's method >>>>> +????????????? hit_cgs[i] = NULL; >>>>> +????????????? removed_cgs = true; >>>>> ????????????? } >>>>> ??????????? } >>>>> ????????? } >>>>> ??????? } >>>>> + >>>>> +????? // Generate the fallback path >>>>> +????? Deoptimization::DeoptReason reason = (morphism != 1 >>>>> +??????????????????????????????????????????? ? 
>>>>> Deoptimization::Reason_polymorphic >>>>> +??????????????????????????????????????????? : >>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>> NULL)); >>>>> +????? bool disable_trap = (profile.is_megamorphic() || removed_cgs >>>>> || too_many_traps_or_recompiles(caller, bci, reason)); >>>>> +????? if (log != NULL) { >>>>> +??????? log->elem("call_fallback method='%d' count='%d' >>>>> morphism='%d' trap='%d'", >>>>> +????????????????????? log->identify(callee), site_count, morphism, >>>>> disable_trap ? 0 : 1); >>>>> +????? } >>>>> +????? CallGenerator* miss_cg; >>>>> +????? if (!disable_trap) { >>>>> +??????? // Generate uncommon trap for class check failure path >>>>> +??????? // in case of polymorphic virtual call site. >>>>> +??????? miss_cg = CallGenerator::for_uncommon_trap(callee, reason, >>>>> +??????????????????? Deoptimization::Action_maybe_recompile); >>>>> +????? } else { >>>>> +??????? // Generate virtual call for class check failure path >>>>> +??????? // in case of megamorphic virtual call site. >>>>> +??????? miss_cg = CallGenerator::for_virtual_call(callee, >>>>> vtable_index); >>>>> +????? } >>>>> + >>>>> +????? // Generate the guards >>>>> +????? CallGenerator* cg = NULL; >>>>> +????? if (speculative_receiver_type != NULL) { >>>>> +??????? if (hit_cgs[0] != NULL) { >>>>> +????????? trace_type_profile(C, jvms->method(), jvms->depth() - 1, >>>>> jvms->bci(), receiver_methods[0], speculative_receiver_type, >>>>> site_count, profile.receiver_count(0)); >>>>> +????????? // We don't need to record dependency on a receiver here >>>>> and below. >>>>> +????????? // Whenever we inline, the dependency is added by >>>>> Parse::Parse(). >>>>> +????????? cg = >>>>> CallGenerator::for_predicted_call(speculative_receiver_type, >>>>> miss_cg, hit_cgs[0], PROB_MAX); >>>>> +??????? } >>>>> +????? } else { >>>>> +??????? for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) { >>>>> +????????? 
if (hit_cgs[i] != NULL) { >>>>> +??????????? trace_type_profile(C, jvms->method(), jvms->depth() - >>>>> 1, jvms->bci(), receiver_methods[i], profile.receiver(i), >>>>> site_count, profile.receiver_count(i)); >>>>> +??????????? miss_cg = >>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, >>>>> hit_cgs[i], profile.receiver_prob(i)); >>>>> +????????? } >>>>> +??????? } >>>>> +??????? cg = miss_cg; >>>>> +????? } >>>>> +????? if (cg != NULL)? return cg; >>>>> ????? } >>>>> ????? // If there is only one implementor of this interface then we >>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp >>>>> b/src/hotspot/share/runtime/deoptimization.cpp >>>>> index 11df15e004..2d14b52854 100644 >>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>> @@ -2382,7 +2382,7 @@ const char* >>>>> Deoptimization::_trap_reason_name[] = { >>>>> ??? "class_check", >>>>> ??? "array_check", >>>>> ??? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>> ??? "profile_predicate", >>>>> ??? "unloaded", >>>>> ??? "uninitialized", >>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp >>>>> b/src/hotspot/share/runtime/deoptimization.hpp >>>>> index 1cfff5394e..c1eb998aba 100644 >>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>> ????? Reason_class_check,?????????? // saw unexpected object class >>>>> (@bci) >>>>> ????? Reason_array_check,?????????? // saw unexpected array class >>>>> (aastore @bci) >>>>> ????? Reason_intrinsic,???????????? // saw unexpected operand to >>>>> intrinsic (@bci) >>>>> -??? Reason_bimorphic,???????????? // saw unexpected object class >>>>> in bimorphic inlining (@bci) >>>>> +??? Reason_polymorphic,?????????? 
>>>>> // saw unexpected object class in bimorphic inlining (@bci)
>>>>>  #if INCLUDE_JVMCI
>>>>>    Reason_unreached0             = Reason_null_assert,
>>>>>    Reason_type_checked_inlining  = Reason_intrinsic,
>>>>> -  Reason_optimized_type_check   = Reason_bimorphic,
>>>>> +  Reason_optimized_type_check   = Reason_polymorphic,
>>>>>  #endif
>>>>>    Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>>    declare_constant(Deoptimization::Reason_class_check) \
>>>>>    declare_constant(Deoptimization::Reason_array_check) \
>>>>>    declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> -  declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> +  declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>    declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>    declare_constant(Deoptimization::Reason_unloaded) \
>>>>>    declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with various TypeProfileWidth values. The results are:
>>>>>
>>>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The main thing I observe is that there isn't a linear (or even any apparent) correlation between the number of guards generated (guided by TypeProfileWidth) and the time taken.
>>>>>
>>>>> I am trying to understand why there is a dip for TypeProfileWidth equal to 1 and 8.
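For readers following along, the code shape being benchmarked here — a chain of exact-type guards with a virtual-call fallback, inlining disabled — can be sketched by hand in plain Java. This is an illustrative analogue only: the class names `A1`..`A3` and the `dispatch` helper are made up, and the real transformation is performed by C2 on compiled code, not by source-level type tests.

```java
// Illustrative only: hand-written analogue of the guard-chain shape C2
// emits for a width-2 type profile with PolyGuardDisableInlining, i.e. a
// chain of exact-type guards, each followed by a direct (non-inlined)
// call, ending in a virtual-call fallback.
interface A { int foo(int i); }
class A1 implements A { public int foo(int i) { return i + 1; } }
class A2 implements A { public int foo(int i) { return i + 2; } }
class A3 implements A { public int foo(int i) { return i + 3; } }

public class GuardChainSketch {
    static int dispatch(A recv, int i) {
        // Guard 1: exact-type check, then a direct call.
        if (recv.getClass() == A1.class) return ((A1) recv).foo(i);
        // Guard 2: second profiled receiver.
        if (recv.getClass() == A2.class) return ((A2) recv).foo(i);
        // Fallback: virtual (indirect) call -- or an uncommon trap, when
        // the trap is enabled and the profile was not megamorphic.
        return recv.foo(i);
    }

    public static void main(String[] args) {
        A[] objs = { new A1(), new A2(), new A3() };
        int sum = 0;
        for (int i = 0; i < objs.length; i++) sum += dispatch(objs[i], i);
        System.out.println(sum); // prints 9
    }
}
```

Varying TypeProfileWidth in the sweep above corresponds to varying the number of guards in the chain before the fallback.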
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>>> To: Ludovic Henry ; Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I did a rerun of the following benchmark with various configurations:
>>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call    -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call   -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The Hotspot logs (with generated assembly) are available at:
>>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>>
>>>>> The main takeaway from that experiment is that direct calls w/o inlining are faster than indirect calls for icalls but slower for vcalls, and that inlining is always faster than direct calls.
>>>>>
>>>>> (I fully understand this applies mainly to this microbenchmark, and we need to validate on larger benchmarks. I'm working on that next. However, it clearly shows gains on a pathological case.)
>>>>>
>>>>> Next, I want to figure out at how many guards the direct-call regresses compared to indirect-call in the vcall case, and I want to run larger benchmarks. Any particular ones you would like to see running? I am planning on doing SPECjbb2015 first.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev On Behalf Of Ludovic Henry
>>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>>> To: Vladimir Ivanov ; John Rose ; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> Sorry for the long delay in response, I was at multiple conferences over the past few weeks. I'm back to the office now and fully focused on getting progress on that.
>>>>>
>>>>>> Possible avenues of improvements I can see are:
>>>>>>    - Gather all the types in an unbounded list so we can know which ones are the most frequent.
It is unlikely to help with Java as, in >>>>>>> the general >>>>>>> case, there are only a few types present a call-sites. It could, >>>>>>> however, >>>>>>> be particularly helpful for languages that tend to have many >>>>>>> types at >>>>>>> call-sites, like functional languages, for example. >>>>>> >>>>>> I doubt having unbounded list of receiver types is practical: it's >>>>>> costly to gather, but isn't too useful for compilation. But measuring >>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some >>>>>> numbers. >>>>> >>>>> I agree that it isn't very practical. It can be useful in the case >>>>> where there are >>>>> many types at a call-site, and the first ones end up not being >>>>> frequent enough to >>>>> mandate a guard. This is clearly an edge-case, and I don't think we >>>>> should optimize >>>>> for it. >>>>> >>>>>>> In what we have today, some of the worst-case scenarios are the >>>>>>> following: >>>>>>> ?? - Assuming you have TypeProfileWidth = 2, and at a call-site, >>>>>>> the first and >>>>>>> second types are types A and B, and the other type(s) is(are) not >>>>>>> recorded, >>>>>>> and it increments the `count` value. Even if A and B are used in >>>>>>> the initialization >>>>>>> path (i.e. only a few times) and the other type(s) is(are) used >>>>>>> in the hot >>>>>>> path (i.e. many times), the latter are never considered for >>>>>>> inlining - because >>>>>>> it was never recorded during profiling. >>>>>> >>>>>> Can it be alleviated by (partially) clearing type profile (e.g., >>>>>> periodically free some space by removing elements with lower >>>>>> frequencies >>>>>> and give new types a chance to be profiled? >>>>> >>>>> Doing that reliably relies on the assumption that we know what the >>>>> shape of >>>>> the workload is going to be in future iterations. 
Otherwise, how >>>>> could you >>>>> guarantee that a type that's not currently frequent will not be in >>>>> the future, >>>>> and that the information that you remove now will not be important >>>>> later. This >>>>> is an assumption that, IMO, is worst than missing types which are >>>>> hot later in >>>>> the execution for two reasons: 1. it's no better, and 2. it's a lot >>>>> less intuitive and >>>>> harder to debug/understand than a straightforward "overflow". >>>>> >>>>>>> ?? - Assuming you have TypeProfileWidth = 2, and at a call-site, >>>>>>> you have the >>>>>>> first type A with 49% probability, the second type B with 49% >>>>>>> probability, and >>>>>>> the other types with 2% probability. Even though A and B are the >>>>>>> two hottest >>>>>>> paths, it does not generate guards because none are a major >>>>>>> receiver. >>>>>> >>>>>> Yes. On the other hand, on average it'll cause inlining twice as much >>>>>> code (2 methods vs 1). >>>>> >>>>> It will not necessarily cause twice as much inlining because of >>>>> late-inlining. Like >>>>> you point out later, it will generate a direct-call in case there >>>>> isn't room for more >>>>> inlinable code. >>>>> >>>>>> Also, does it make sense to increase morphism factor even if inlining >>>>>> doesn't happen? >>>>>> >>>>>> ?? if (recv.klass == C1) {? // >>0% >>>>>> ????? ... inlined ... >>>>>> ?? } else if (recv.klass == C2) { // >>0% >>>>>> ????? m2(); // direct call >>>>>> ?? } else { // >0% >>>>>> ????? m(); // virtual call >>>>>> ?? } >>>>>> >>>>>> vs >>>>>> >>>>>> ?? if (recv.klass == C1) {? // >>0% >>>>>> ????? ... inlined ... >>>>>> ?? } else { // >>0% >>>>>> ????? m(); // virtual call >>>>>> ?? } >>>>> >>>>> There is the advantage that modern CPUs are better at predicting >>>>> instruction-branches >>>>> than data-branches. These guards will then allow the CPU to make >>>>> better decisions allowing >>>>> for better superscalar executions, memory prefetching, etc. 
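The profile-overflow worst case described above — initialization-path types claim the profile rows, and the hot-path type only bumps an anonymous counter — can be modeled in a few lines. This is a toy model for illustration, not HotSpot's actual MethodData layout:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a bounded receiver-type profile (NOT HotSpot's MethodData):
// the first `width` distinct types claim rows; every later type is lumped
// into an anonymous overflow count and can never be considered for
// inlining, no matter how hot it becomes.
public class BoundedProfile {
    final int width;
    final Map<String, Integer> rows = new LinkedHashMap<>();
    int overflow;

    BoundedProfile(int width) { this.width = width; }

    void record(String type) {
        Integer n = rows.get(type);
        if (n != null)                rows.put(type, n + 1);
        else if (rows.size() < width) rows.put(type, 1);
        else                          overflow++; // the type itself is lost
    }

    public static void main(String[] args) {
        BoundedProfile p = new BoundedProfile(2); // TypeProfileWidth = 2
        // Initialization path: a few calls on A and B claim both rows.
        for (int i = 0; i < 5; i++) { p.record("A"); p.record("B"); }
        // Hot path: many calls on C are only counted, never attributed.
        for (int i = 0; i < 1000; i++) p.record("C");
        System.out.println(p.rows + " overflow=" + p.overflow); // prints {A=5, B=5} overflow=1000
    }
}
```

The compiler then sees A and B as the only profiled receivers even though C dominates at runtime, which is exactly the scenario where clearing (or not clearing) the profile matters.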
>>>>> >>>>> This, IMO, makes sense for warm calls, especially since the cost is >>>>> a guard + a call, which is >>>>> much lower than a inlined method, but brings benefits over an >>>>> indirect call. >>>>> >>>>>> In other words, how much could we get just by lowering >>>>>> TypeProfileMajorReceiverPercent? >>>>> >>>>> TypeProfileMajorReceiverPercent is only used today when you have a >>>>> megamorphic >>>>> call-site (aka more types than TypeProfileWidth) but still one type >>>>> receiving more than >>>>> N% of the calls. By reducing the value, you would not increase the >>>>> number of guards, >>>>> but the threshold at which you generate the 1st guard in a >>>>> megamorphic case. >>>>> >>>>>>>> ??????? - for N-morphic case what's the negative effect >>>>>>>> (quantitative) of >>>>>>>> the deopt? >>>>>>> We are triggering the uncommon trap in this case iff we observed >>>>>>> a limited >>>>>>> and stable set of types in the early stages of the Tiered >>>>>>> Compilation >>>>>>> pipeline (making us generate N-morphic guards), and we suddenly >>>>>>> observe a >>>>>>> new type. AFAIU, this is precisely what deopt is for. >>>>>> >>>>>> I should have added "... compared to N-polymorhic case". My >>>>>> intuition is >>>>>> the higher morphism factor is the fewer the benefits of deopt >>>>>> (compared >>>>>> to a call) are. It would be very good to validate it with some >>>>>> benchmarks (both micro- and larger ones). >>>>> >>>>> I agree that what you are describing makes sense as well. To reduce >>>>> the cost of deopt >>>>> here, having a TypeProfileMinimumReceiverPercent helps. That is >>>>> because if any type is >>>>> seen less than this specific frequency, then it won't generate a >>>>> guard, leading to an indirect >>>>> call in the fallback case. >>>>> >>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll >>>>>>> share it as soon >>>>>>> as I have something reliably reproducing. >>>>>> >>>>>> Thanks! 
A representative set of microbenchmarks will be very helpful. >>>>> >>>>> It turns out the guard is only generated once, meaning that if we >>>>> ever hit it then we >>>>> generate an indirect call. >>>>> >>>>> We also only generate the trap iff all the guards are hot (inlined) >>>>> or warm (direct call), >>>>> so any of the following case triggers the creation of an indirect >>>>> call over a trap: >>>>> ? - we hit the trap once before >>>>> ? - one or more guards are cold (aka not inlinable even with >>>>> late-inlining) >>>>> >>>>>> It was more about opportunities for future explorations. I don't >>>>>> think >>>>>> we have to act on it right away. >>>>>> >>>>>> As with "deopt vs call", my guess is callee should benefit much more >>>>>> from inlining than the caller it is inlined into (caller sees >>>>>> multiple >>>>>> callee candidates and has to merge the results while each callee >>>>>> observes the full context and can benefit from it). >>>>>> >>>>>> If we can run some sort of static analysis on callee bytecode, >>>>>> what kind >>>>>> of code patterns should we look for to guide inlining decisions? >>>>> >>>>> Any pattern that would benefit from other optimizations (escape >>>>> analysis, >>>>> dead code elimination, constant propagation, etc.) is good, but >>>>> short of >>>>> shadowing statically what all these optimizations do, I can't see >>>>> an easy way >>>>> to do it. >>>>> >>>>> That is where late-inlining, or more advanced dynamic heuristics >>>>> like the one you >>>>> can find in Graal EE, is worthwhile. >>>>> >>>>>> Regaring experiments to try first, here are some ideas I find >>>>>> promising: >>>>>> >>>>>> ???? * measure the cost of additional profiling >>>>>> ???????? -XX:TypeProfileWidth=N without changing compilers >>>>> >>>>> I am running the following jmh microbenchmark >>>>> >>>>> ???? public final static int N = 100_000_000; >>>>> >>>>> ???? @State(Scope.Benchmark) >>>>> ???? 
>>>>>      public static class TypeProfileWidthOverheadBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              for (int i = 0; i < objs.length; ++i) {
>>>>>                  switch (i % 8) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2: objs[i] = new A3(); break;
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  case 4: objs[i] = new A5(); break;
>>>>>                  case 5: objs[i] = new A6(); break;
>>>>>                  case 6: objs[i] = new A7(); break;
>>>>>                  case 7: objs[i] = new A8(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
>>>>>
>>>>> And I am running with the following JVM parameters:
>>>>>
>>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>>>
>>>>> I observe no statistically representative difference in ops/s between TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no significant difference in the resulting analysis using Intel VTune.
>>>>>
>>>>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>>>>
>>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>>       - how much deopt helps compared to a virtual call on fallback path?
>>>>>
>>>>> I have done the following microbenchmark, but I am not sure that it's going to fully answer the question you are raising here.
>>>>>
>>>>>      public final static int N = 100_000_000;
>>>>>
>>>>>      @State(Scope.Benchmark)
>>>>>      public static class PolymorphicDeoptBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              int cutoff1 = (int)(objs.length * .90);
>>>>>              int cutoff2 = (int)(objs.length * .95);
>>>>>              for (int i = 0; i < cutoff1; ++i) {
>>>>>                  switch (i % 2) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff1; i < cutoff2; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A3(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff2; i < objs.length; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0:
>>>>>                  case 1: objs[i] = new A3(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
>>>>>
>>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or -XX:-PolyGuardDisableTrap, which force-enable/disable the trap in the fallback.
>>>>>
>>>>> For that kind of case, a visitor pattern is what I expect to most largely profit/suffer from a deopt or virtual-call in the fallback path. Would you know of a benchmark that heavily relies on this pattern, and that I could readily reuse?
>>>>>
>>>>>>     * inlining vs devirtualization
>>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>>       - measure separately the effects of devirtualization and inlining
>>>>>
>>>>> For that one, I reused the first microbenchmark I mentioned above, and added a PolyGuardDisableInlining flag that controls whether we create a direct-call or inline.
>>>>>
>>>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined) vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>>>>
>>>>> This benchmark hasn't been run in the best possible conditions (on my dev machine, in WSL), but it gives a strong indication that even a direct call has a non-negligible impact, and that inlining leads to better results (again, in this microbenchmark).
>>>>>
>>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything that would be readily available from the Interpreter. Would you have any pointer to a pre-existing feature that required this specific kind of plumbing? I would otherwise find myself in need of making CompilerDirectives available from the Interpreter, and that is something outside of my current expertise (always happy to learn, but I will need some pointers!).
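On the per-method plumbing question: the closest existing mechanism is the compiler directives file (`-XX:CompilerDirectivesFile=<file>`), which is consumed by the JIT compilers rather than the interpreter — which is exactly the limitation raised above, since TypeProfileWidth governs profiling done before C2 runs. As a sketch of what per-method control looks like today (the match pattern and method names below are illustrative, not from the patch), a directives file can already steer inlining decisions:

```
[
  {
    // Illustrative match pattern -- not a real method from the patch.
    match: "org/sample/TypeProfileWidthOverheadBenchmark::run",
    c2: {
      // Per-method inlining control that exists today:
      // force-inline A1::foo, forbid inlining A8::foo.
      inline: [ "+org/sample/A1::foo", "-org/sample/A8::foo" ]
    }
  }
]
```

A per-method TypeProfileWidth would need the interpreter (and tier-1 profiling code) to consult directives at profile-allocation time, which is the plumbing the message above says is missing.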
>>>>> >>>>> Thank you, >>>>> >>>>> -- >>>>> Ludovic >>>>> >>>>> -----Original Message----- >>>>> From: Vladimir Ivanov >>>>> Sent: Thursday, February 20, 2020 9:00 AM >>>>> To: Ludovic Henry ; John Rose >>>>> ; hotspot-compiler-dev at openjdk.java.net >>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>> >>>>> Hi Ludovic, >>>>> >>>>> [...] >>>>> >>>>>> Thanks for this explanation, it makes it a lot clearer what the >>>>>> cases and >>>>>> your concerns are. To rephrase in my own words, what you are >>>>>> interested in >>>>>> is not this change in particular, but more the possibility that >>>>>> this change >>>>>> provides and how to take it the next step, correct? >>>>> >>>>> Yes, it's a good summary. >>>>> >>>>> [...] >>>>> >>>>>>> ??????? - affects profiling strategy: majority of receivers vs >>>>>>> complete >>>>>>> list of receiver types observed; >>>>>> Today, we only use the N first receivers when the number of types >>>>>> does >>>>>> not exceed TypeProfileWidth; otherwise, we use none of them. >>>>>> Possible avenues of improvements I can see are: >>>>>> ??? - Gather all the types in an unbounded list so we can know >>>>>> which ones >>>>>> are the most frequent. It is unlikely to help with Java as, in the >>>>>> general >>>>>> case, there are only a few types present a call-sites. It could, >>>>>> however, >>>>>> be particularly helpful for languages that tend to have many types at >>>>>> call-sites, like functional languages, for example. >>>>> >>>>> I doubt having unbounded list of receiver types is practical: it's >>>>> costly to gather, but isn't too useful for compilation. But measuring >>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some >>>>> numbers. >>>>> >>>>>> ?? - Use the existing types to generate guards for these types we >>>>>> know are >>>>>> common enough. Then use the types which are hot or warm, even in >>>>>> case of a >>>>>> megamorphic call-site. 
It would be a simple iteration of what we have nowadays.
>>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the
>>>>>> following:
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>>>> first and second types are types A and B, and the other type(s) is(are)
>>>>>> not recorded, and it increments the `count` value. Even if A and B are
>>>>>> used in the initialization path (i.e. only a few times) and the other
>>>>>> type(s) is(are) used in the hot path (i.e. many times), the latter are
>>>>>> never considered for inlining - because they were never recorded during
>>>>>> profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>>> periodically free some space by removing elements with lower frequencies
>>>>> and give new types a chance to be profiled)?
>>>>>
>>>>>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have
>>>>>> the first type A with 49% probability, the second type B with 49%
>>>>>> probability, and the other types with 2% probability. Even though A and
>>>>>> B are the two hottest paths, it does not generate guards because neither
>>>>>> is a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>>
>>>>> Also, does it make sense to increase the morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else if (recv.klass == C2) { // >>0%
>>>>>        m2(); // direct call
>>>>>     } else { // >0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> vs
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else { // >>0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
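[Editor's note: the 49%/49%/2% scenario discussed above can be reproduced with a small, self-contained Java sketch. All names below (Op, OpA, OpB, OpC, MajorReceiverDemo) are invented for illustration; with the default TypeProfileMajorReceiverPercent of 90, the call site in `run` has no major receiver, so neither of the two hottest types gets a guard even though together they cover 98% of the calls.]

```java
// Hypothetical workload: one call site whose receiver is OpA ~49% of the
// time, OpB ~49%, and OpC ~2%. No single type is a "major receiver".
interface Op { int apply(int x); }

class OpA implements Op { public int apply(int x) { return x + 1; } }
class OpB implements Op { public int apply(int x) { return x + 2; } }
class OpC implements Op { public int apply(int x) { return x + 3; } }

class MajorReceiverDemo {
    // The interesting call site: ops[i % ops.length].apply(i).
    static long run(Op[] ops, int iters) {
        long acc = 0;
        for (int i = 0; i < iters; i++) {
            acc += ops[i % ops.length].apply(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        // 49 OpA's, 49 OpB's and 2 OpC's -> a 49%/49%/2% receiver split.
        Op[] ops = new Op[100];
        for (int i = 0; i < 100; i++) {
            ops[i] = (i < 49) ? new OpA() : (i < 98) ? new OpB() : new OpC();
        }
        System.out.println(run(ops, 100_000));
    }
}
```

Running this under a profiler or with -XX:+PrintInlining is one way to observe whether the call site ends up virtual or guarded.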
>>>>> >>>>> And it relates to "virtual/interface call" vs "type guard + direct >>>>> call" >>>>> code shapes comparison: how much does devirtualization help? >>>>> >>>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both >>>>> cases are inlined. >>>>> >>>>>>> ??????? - for N-morphic case what's the negative effect >>>>>>> (quantitative) of >>>>>>> the deopt? >>>>>> We are triggering the uncommon trap in this case iff we observed a >>>>>> limited >>>>>> and stable set of types in the early stages of the Tiered Compilation >>>>>> pipeline (making us generate N-morphic guards), and we suddenly >>>>>> observe a >>>>>> new type. AFAIU, this is precisely what deopt is for. >>>>> >>>>> I should have added "... compared to N-polymorhic case". My >>>>> intuition is >>>>> the higher morphism factor is the fewer the benefits of deopt >>>>> (compared >>>>> to a call) are. It would be very good to validate it with some >>>>> benchmarks (both micro- and larger ones). >>>>> >>>>>> I'm writing a JMH benchmark to stress that specific case. I'll >>>>>> share it as soon >>>>>> as I have something reliably reproducing. >>>>> >>>>> Thanks! A representative set of microbenchmarks will be very helpful. >>>>> >>>>>>> ???? * invokevirtual vs invokeinterface call sites >>>>>>> ??????? - different cost models; >>>>>>> ??????? - interfaces are harder to optimize, but opportunities for >>>>>>> strength-reduction from interface to virtual calls exist; >>>>>> ? From the profiling information and the inlining mechanism point >>>>>> of view, >>>>>> that it is an invokevirtual or an invokeinterface doesn't change >>>>>> anything >>>>>> >>>>>> Are you saying that we have more to gain from generating a guard for >>>>>> invokeinterface over invokevirtual because the fall-back of the >>>>>> invokeinterface is much more expensive? >>>>> >>>>> Yes, that's the question: if we see an improvement, how much does >>>>> devirtualization contribute to that? 
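[Editor's note: to make the "virtual/interface call" vs "type guard + direct call" comparison concrete, here is a hand-written Java sketch of the two code shapes. All names are invented, and a real measurement would use a JMH harness; this is a correctness-only illustration. `dispatchVirtual` always goes through the interface call, while `dispatchGuarded` mimics the 1-polymorphic shape: a type guard with the callee "inlined" by hand, and the interface call kept on the fallback path.]

```java
interface ShapeIfc { long area(); }

final class Square implements ShapeIfc {
    final long side;
    Square(long side) { this.side = side; }
    public long area() { return side * side; }
}

final class Circle implements ShapeIfc {
    final long r;
    Circle(long r) { this.r = r; }
    public long area() { return 3 * r * r; } // integer approximation of pi*r^2
}

class DispatchShapes {
    // Megamorphic fallback shape: a plain interface call (invokeinterface).
    static long dispatchVirtual(ShapeIfc s) {
        return s.area();
    }

    // 1-polymorphic shape: type guard + hand-"inlined" body for the major
    // receiver, interface call on the fallback path for minority receivers.
    static long dispatchGuarded(ShapeIfc s) {
        if (s.getClass() == Square.class) {
            Square sq = (Square) s;
            return sq.side * sq.side; // "inlined" Square.area()
        }
        return s.area(); // fallback: virtual call
    }

    public static void main(String[] args) {
        ShapeIfc[] shapes = { new Square(3), new Circle(2), new Square(5) };
        long total = 0;
        for (ShapeIfc s : shapes) total += dispatchGuarded(s);
        System.out.println(total); // same total as dispatchVirtual would give
    }
}
```

Both dispatch routines must agree on every receiver; the question the thread raises is how much of the speedup of the guarded shape comes from avoiding the itable lookup versus from the inlining it enables.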
>>>>> >>>>> (If we add a type-guarded direct call, but there's no inlining >>>>> happening, inline cache effectively strength-reduce a virtual call >>>>> to a >>>>> direct call.) >>>>> >>>>> Considering current implementation of virtual and interface calls >>>>> (vtables vs itables), the cost model is very different. >>>>> >>>>> For vtable calls, it doesn't look too appealing to introduce large >>>>> inline caches for individual receiver types since a call through a >>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* => >>>>> address). >>>>> >>>>> For itable calls it can be a big win in some situations: itable lookup >>>>> iterates over Klass::_secondary_supers array and it can become quite >>>>> costly. For example, some Scala workloads experience significant >>>>> overheads from megamorphic calls. >>>>> >>>>> If we see an improvement on some benchmark, it would be very useful to >>>>> be able to determine (quantitatively) how much does inlining and >>>>> devirtualization contribute. >>>>> >>>>> FTR ErikO has been experimenting with an alternative vtable/itable >>>>> implementation [4] which brings interface calls close to virtual >>>>> calls. >>>>> So, if it turns out that devirtualization (and not inlining) of >>>>> interface calls is what contributes the most, then speeding up >>>>> megamorphic interface calls becomes a more attractive alternative. >>>>> >>>>>>> ???? * inlining heuristics >>>>>>> ??????? - devirtualization vs inlining >>>>>>> ????????? - how much benefit from expanding a call site >>>>>>> (devirtualize more >>>>>>> cases) without inlining? should differ for virtual & interface cases >>>>>> I'm also writing a JMH benchmark for this case, and I'll share it >>>>>> as soon >>>>>> as I have it reliably reproducing the issue you describe. >>>>> >>>>> Also, I think it's important to have a knob to control it (inline vs >>>>> devirtualize). It'll enable experiments with larger benchmarks. >>>>> >>>>>>> ??????? 
- diminishing returns with increase in number of cases >>>>>>> ??????? - expanding a single call site leads to more code, but >>>>>>> frequencies >>>>>>> stay the same => colder code >>>>>>> ??????? - based on profiling info (types + frequencies), dynamically >>>>>>> choose morphism factor on per-call site basis? >>>>>> That is where I propose to have a lower receiver probability at >>>>>> which we'll >>>>>> stop adding more guards. I am experimenting with a global flag >>>>>> with a default >>>>>> value of 10%. >>>>>>> ??????? - what optimization opportunities to look for? it looks >>>>>>> like in >>>>>>> general callees should benefit more than the caller (due to >>>>>>> merges after >>>>>>> the call site) >>>>>> Could you please expand your concern or provide an example. >>>>> >>>>> It was more about opportunities for future explorations. I don't think >>>>> we have to act on it right away. >>>>> >>>>> As with "deopt vs call", my guess is callee should benefit much more >>>>> from inlining than the caller it is inlined into (caller sees multiple >>>>> callee candidates and has to merge the results while each callee >>>>> observes the full context and can benefit from it). >>>>> >>>>> If we can run some sort of static analysis on callee bytecode, what >>>>> kind >>>>> of code patterns should we look for to guide inlining decisions? >>>>> >>>>> >>>>> ? >> What's your take on it? Any other ideas? >>>>> ? > >>>>> ? > We don't know what we don't know. We need first to improve the >>>>> logging and >>>>> ? > debugging output of uncommon traps for polymorphic call-sites. >>>>> Then, we >>>>> ? > need to gather data about the different cases you talked about. >>>>> ? > >>>>> ? > We also need to have some microbenchmarks to validate some of the >>>>> questions >>>>> ? > you are raising, and verify what level of gains we can expect >>>>> from this >>>>> ? > optimization. Further validation will be needed on larger >>>>> benchmarks and >>>>> ? 
> real-world applications as well, and that's where, I think, we need to
>>>>>  > develop logging and debugging for this feature.
>>>>>
>>>>> Yes, sounds good.
>>>>>
>>>>> Regarding experiments to try first, here are some ideas I find promising:
>>>>>
>>>>>     * measure the cost of additional profiling
>>>>>         -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>>     * N-morphic vs N-polymorphic (N>=2):
>>>>>       - how much does deopt help compared to a virtual call on the
>>>>> fallback path?
>>>>>
>>>>>     * inlining vs devirtualization
>>>>>       - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>       - measure separately the effects of devirtualization and inlining
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>>
>>>>> [2]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>>
>>>>> [3]
http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>>
>>>>> [4]
>>>>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Ivanov
>>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>>> To: Ludovic Henry ; John Rose
>>>>>> ; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hi Ludovic,
>>>>>>
>>>>>> I fully agree that it's premature to discuss how the default behavior
>>>>>> should be changed since much more data is needed to be able to proceed
>>>>>> with the decision. But considering the ultimate goal is to actually
>>>>>> improve the relevant heuristics (and effectively change the default
>>>>>> behavior), it's the right time to discuss what kind of experiments are
>>>>>> needed to gather enough data for further analysis.
>>>>>>
>>>>>> Though different shapes do look very similar at first, the shape of the
>>>>>> fallback makes a big difference. That's why the monomorphic and
>>>>>> polymorphic cases are distinct: uncommon traps are effectively exits and
>>>>>> can significantly simplify the CFG, while calls can return and have to
>>>>>> be merged back.
>>>>>>
>>>>>> The polymorphic shape is stable (no deopts/recompiles involved), but
>>>>>> doesn't simplify the CFG around the call site.
>>>>>>
>>>>>> The monomorphic shape gives more optimization opportunities, but deopts
>>>>>> are highly undesirable due to associated costs.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>>     if (recv.klass != C) { deopt(); }
>>>>>>     C.m(recv);
>>>>>>
>>>>>>     // recv.klass == C - exact type
>>>>>>     // return value == C.m(recv)
>>>>>>
>>>>>> vs
>>>>>>
>>>>>>     if (recv.klass == C) {
>>>>>>       C.m(recv);
>>>>>>     } else {
>>>>>>       I.m(recv);
>>>>>>     }
>>>>>>
>>>>>>     // recv.klass <: I - subtype
>>>>>>     // return value is a phi merging C.m() & I.m() where I.m() is
>>>>>> completely opaque.
>>>>>>
>>>>>> The monomorphic shape can degenerate into polymorphic (too many
>>>>>> recompiles), but that's a forced move to stabilize the behavior and
>>>>>> avoid a vicious recompilation cycle (which is *very* expensive).
>>>>>> (Another alternative is to leave the deopt as is - set the deopt action
>>>>>> to "none" - but that's usually a much worse decision.)
>>>>>>
>>>>>> And that's the reason why the monomorphic shape requires a unique
>>>>>> receiver type in the profile while the polymorphic shape works with a
>>>>>> major receiver type and probabilities.
>>>>>>
>>>>>> Considering further steps, IMO for experimental purposes a single knob
>>>>>> won't cut it: there are multiple degrees of freedom which may play an
>>>>>> important role in building an accurate performance model. I'm not yet
>>>>>> convinced it's all about inlining, and narrowing the scope of the
>>>>>> discussion specifically to type profile width doesn't help.
>>>>>>
>>>>>> I'd like to see more knobs introduced before we start conducting
>>>>>> extensive experiments. So, let's discuss what other information we can
>>>>>> benefit from.
>>>>>>
>>>>>> I mentioned some possible options in the previous email. I find the
>>>>>> following aspects important for future discussion:
>>>>>>
>>>>>>     * shape of fallback path
>>>>>>
- what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>>>>       - affects profiling strategy: majority of receivers vs complete
>>>>>> list of receiver types observed;
>>>>>>       - for the N-morphic case what's the negative effect (quantitative)
>>>>>> of the deopt?
>>>>>>
>>>>>>     * invokevirtual vs invokeinterface call sites
>>>>>>       - different cost models;
>>>>>>       - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>>
>>>>>>     * inlining heuristics
>>>>>>       - devirtualization vs inlining
>>>>>>         - how much benefit from expanding a call site (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>>       - diminishing returns with increase in number of cases
>>>>>>       - expanding a single call site leads to more code, but frequencies
>>>>>> stay the same => colder code
>>>>>>       - based on profiling info (types + frequencies), dynamically
>>>>>> choose the morphism factor on a per-call site basis?
>>>>>>       - what optimization opportunities to look for? it looks like in
>>>>>> general callees should benefit more than the caller (due to merges after
>>>>>> the call site)
>>>>>>
>>>>>> What's your take on it? Any other ideas?
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>>>> Hello,
>>>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>>>> First, I want to stress that this patch does not change the default.
>>>>>>> It is still bimorphic guarded inlining by default. This patch,
>>>>>>> however, provides you the ability to configure the JVM to go for
>>>>>>> N-morphic guarded inlining, with N being controlled by the
>>>>>>> -XX:TypeProfileWidth configuration knob.
I understand there are >>>>>>> shortcomings with the specifics of this approach so I'll work on >>>>>>> fixing those. However, I would want this discussion to focus on >>>>>>> this *configurable* feature and not on changing the default. The >>>>>>> latter, I think, should be discussed as part of another, more >>>>>>> extended running discussion, since, as you pointed out, it has >>>>>>> far more reaching consequences that are merely improving a >>>>>>> micro-benchmark. >>>>>>> >>>>>>> Now to answer some of your specific questions. >>>>>>> >>>>>>>> >>>>>>>> I haven't looked through the patch in details, but here are some >>>>>>>> thoughts. >>>>>>>> >>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. >>>>>>>> It seems you try to generalize (b) which becomes: >>>>>>>> >>>>>>>> ????? if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>>> ????? } else if (recv.klass == K2) { >>>>>>> m2(...); // either inline or a direct call >>>>>>>> ????? ... >>>>>>>> ????? } else if (recv.klass == Kn) { >>>>>>> mn(...); // either inline or a direct call >>>>>>>> ????? } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>>> ????? } >>>>>>> >>>>>>> The general shape that exist currently in tip is: >>>>>>> >>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>>> if (recv.klass == K1) { >>>>>>> ???? m1(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>>>> UseBimorphicInlining && !is_cold >>>>>>> else if (recv.klass == K2) { >>>>>>> ???? m2(.); // either inline or a direct call >>>>>>> } >>>>>>> else { >>>>>>> ???? // if (!too_many_traps_or_deopt()) >>>>>>> ???? deopt(); // invalidate + reinterpret >>>>>>> ???? // else >>>>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>>> } >>>>>>> There is no particular distinction between Bimorphic, >>>>>>> Polymorphic, and Megamorphic. 
The latter relates more to the >>>>>>> fallback rather than the guards. What this change brings is more >>>>>>> guards for N-morphic call-sites with N > 2. But it doesn't change >>>>>>> why and how these guards are generated (or at least, that is not >>>>>>> the intention). >>>>>>> The general shape that this change proposes is: >>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0) >>>>>>> if (recv.klass == K1) { >>>>>>> ???? m1(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && >>>>>>> (UseBimorphicInlining || UsePolymorphicInling) >>>>>>> && !is_cold >>>>>>> else if (recv.klass == K2) { >>>>>>> ???? m2(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && >>>>>>> UsePolymorphicInling && !is_cold >>>>>>> else if (recv.klass == K3) { >>>>>>> ???? m3(.); // either inline or a direct call >>>>>>> } >>>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && >>>>>>> UsePolymorphicInling && !is_cold >>>>>>> else if (recv.klass == K4) { >>>>>>> ???? m4(.); // either inline or a direct call >>>>>>> } >>>>>>> else { >>>>>>> ???? // if (!too_many_traps_or_deopt()) >>>>>>> ???? deopt(); // invalidate + reinterpret >>>>>>> ???? // else >>>>>>> ???? invokeinterface A.foo(.); // virtual call with Inline Cache >>>>>>> } >>>>>>> You can observe that the condition to create the guards is no >>>>>>> different; only the total number increases based on >>>>>>> TypeProfileWidth and UsePolymorphicInlining. >>>>>>>> Question #1: what if you generalize polymorphic shape instead >>>>>>>> and allow multiple major receivers? Deoptimizing (and then >>>>>>>> recompiling) look less beneficial the higher morphism is >>>>>>>> (especially considering the inlining on all paths becomes less >>>>>>>> likely as well). So, having a virtual call (which becomes less >>>>>>>> likely due to lower frequency) on the fallback path may be a >>>>>>>> better option. 
>>>>>>> I agree with this statement in the general sense. However, in >>>>>>> practice, it depends on the specifics of each application. That >>>>>>> is why the degree of polymorphism needs to rely on a >>>>>>> configuration knob, and not pre-determined on a set of >>>>>>> benchmarks. I agree with the proposal to have this knob as a >>>>>>> per-method knob, instead of a global knob. >>>>>>> As for the impact of a higher morphism, I expect deoptimizations >>>>>>> to happen less often as more guards are generated, leading to a >>>>>>> lower probability of reaching the fallback path, leading to less >>>>>>> uncommon trap/deoptimizations. Moreover, the fallback is already >>>>>>> going to be a virtual call in case we hit the uncommon trap too >>>>>>> often (using too_many_traps_or_recompiles). >>>>>>>> Question #2: it would be very interesting to understand what >>>>>>>> exactly contributes the most to performance improvements? Is it >>>>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual >>>>>>>> call)? How much come from optimizing interface calls (itable vs >>>>>>>> vtable stubs)? >>>>>>> Devirtualization in itself (direct vs. indirect call) is not the >>>>>>> *primary* source of the gain. The gain comes from the additional >>>>>>> optimizations that are applied by C2 when increasing the >>>>>>> scope/size of the code compiled via inlining. >>>>>>> In the case of warm code that's not inlined as part of >>>>>>> incremental inlining, the call is a direct call rather than an >>>>>>> indirect call. I haven't measured it, but I expect performance to >>>>>>> be positively impacted because of the better ability of modern >>>>>>> CPUs to correctly predict instruction branches (a direct call) >>>>>>> rather than data branches (an indirect call). 
>>>>>>>> Deciding how to spend inlining budget on multiple targets with >>>>>>>> moderate frequency can be hard, so it makes sense to consider >>>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase >>>>>>>> (during incremental inlining). >>>>>>> Incremental inlining is already integrated with the existing >>>>>>> solution. In the case of a hot or warm call, in case of failure >>>>>>> to inline, it generates a direct call. You still have the guards, >>>>>>> reducing the cost of an indirect call, but without the cost of >>>>>>> the inlined code. >>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed >>>>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>>> I'll come back to you with some results. >>>>>>>> Getting answers to those (and similar) questions should give us >>>>>>>> much more insights what is actually happening in practice. >>>>>>>> >>>>>>>> Speaking of the first deliverables, it would be good to >>>>>>>> introduce a new experimental mode to be able to easily conduct >>>>>>>> such experiments with product binaries and I'd like to see the >>>>>>>> patch evolving in that direction. It'll enable us to gather >>>>>>>> important data to guide our decisions about how to enhance the >>>>>>>> heuristics in the product. >>>>>>> This patch does not change the default shape of the generated >>>>>>> code with bimorphic guarded inlining, because the default value >>>>>>> of TypeProfileWidth is 2. If your concern is that >>>>>>> TypeProfileWidth is used for other purposes and that I should add >>>>>>> a dedicated knob to control the maximum morphism of these guards, >>>>>>> then I agree. I am using TypeProfileWidth because it's the >>>>>>> available and more straightforward knob today. >>>>>>> Overall, this change does not propose to go from bimorphic to >>>>>>> N-morphic by default (with N between 0 and 8). 
This change >>>>>>> focuses on using an existing knob (TypeProfileWidth) to open the >>>>>>> possibility for N-morphic guarded inlining. I would want the >>>>>>> discussion to change the default to be part of a separate RFR, to >>>>>>> separate the feature change discussion from the default change >>>>>>> discussion. >>>>>>>> Such optimizations are usually not unqualified wins because of >>>>>>>> highly "non-linear" or "non-local" effects, where a local change >>>>>>>> in one direction might couple to nearby change in a different >>>>>>>> direction, with a net change that's "wrong", due to side effects >>>>>>>> rolling out from the "good" change. (I'm talking about side >>>>>>>> effects in our IR graph shaping heuristics, not memory side >>>>>>>> effects.) >>>>>>>> >>>>>>>> One out of many such "wrong" changes is a local optimization >>>>>>>> which expands code on a medium-hot path, which has the side >>>>>>>> effect of making a containing block of code larger than >>>>>>>> convenient.? Three ways of being "larger than convenient" are a. >>>>>>>> the object code of some containing loop doesn't fit as well in >>>>>>>> the instruction memory, b. the total IR size tips over some >>>>>>>> budgetary limit which causes further IR creation to be throttled >>>>>>>> (or the whole graph to be thrown away!), or c. some loop gains >>>>>>>> additional branch structure that impedes the optimization of the >>>>>>>> loop, where an out of line call would not. >>>>>>>> >>>>>>>> My overall point here is that an eager expansion of IR that is >>>>>>>> locally "better" (we might even say "optimal") with respect to >>>>>>>> the specific path under consideration hurts the optimization of >>>>>>>> nearby paths which are more important. >>>>>>> I generally agree with this statement and explanation. 
Again, it >>>>>>> is not the intention of this patch to change the default number >>>>>>> of guards for polymorphic call-sites, but it is to give users the >>>>>>> ability to optimize the code generation of their JVM to their >>>>>>> application. >>>>>>> Since I am relying on the existing inlining infrastructure, late >>>>>>> inlining and hot/warm/cold call generators allows to have a >>>>>>> "best-of-both-world" approach: it inlines code in the hot guards, >>>>>>> it direct calls or inline (if inlining thresholds permits) the >>>>>>> method in the warm guards, and it doesn't even generate the guard >>>>>>> in the cold guards. The question here is, then how do you define >>>>>>> hot, warm, and cold. As discussed above, I want to explore using >>>>>>> a low-threshold even to try to generate a guard (at least 10% of >>>>>>> calls are to this specific receiver). >>>>>>> On the overhead of adding more guards, I see this change as >>>>>>> beneficial because it removes an arbitrary limit on what code can >>>>>>> be inlined. For example, if you have a call-site with 3 types, >>>>>>> each with a hit probability of 30%, then with a maximum limit of >>>>>>> 2 types (with bimorphic guarded inlining), only the first 2 types >>>>>>> are guarded and inlined. That is despite an apparent gain in >>>>>>> guarding and inlining against the 3 types. >>>>>>> I agree we want to have guardrails to avoid worst-case >>>>>>> degradations. It is my understanding that the existing inlining >>>>>>> infrastructure (with late inlining, for example) provides many >>>>>>> safeguards already, and it is up to this change not to abuse these. >>>>>>>> (It clearly doesn't work to tell an impacted customer, well, you >>>>>>>> may get a 5% loss, but the micro created to test this thing >>>>>>>> shows a 20% gain, and all the functional tests pass.) >>>>>>>> >>>>>>>> This leads me to the following suggestion:? 
Your code is a very >>>>>>>> good POC, and deserves more work, and the next step in that work >>>>>>>> is probably looking for and thinking about performance >>>>>>>> regressions, and figuring out how to throttle this thing. >>>>>>> Here again, I want that feature to be behind a configuration >>>>>>> knob, and then discuss in a future RFR to change the default. >>>>>>>> A specific next step would be to make the throttling of this >>>>>>>> feature be controllable. MorphismLimit should be a global on its >>>>>>>> own.? And it should be configurable through the CompilerOracle >>>>>>>> per method.? (See similar code for similar throttles.)? And it >>>>>>>> should be more sensitive to the hotness of the overall call and >>>>>>>> of the various slices of the call's profile.? (I notice with >>>>>>>> suspicion that the comment "The single majority receiver >>>>>>>> sufficiently outweighs the minority" is missing in the changed >>>>>>>> code.)? And, if the change is as disruptive to heuristics as I >>>>>>>> suspect it *might* be, the call site itself *might* need some >>>>>>>> kind of dynamic feedback which says, after some deopt or >>>>>>>> reprofiling, "take it easy here, try plan B." That last point is >>>>>>>> just speculation, but I threw it in to show the kinds of >>>>>>>> measures we *sometimes* have to take in avoiding "side effects" >>>>>>>> to our locally pleasant optimizations. >>>>>>> I'll add this per-method knob on the CompilerOracle in the next >>>>>>> iteration of this patch. >>>>>>>> But, let me repeat: I'm glad to see this experiment. And very, >>>>>>>> very glad to see all the cool stuff that is coming out of your >>>>>>>> work-group.? Welcome to the adventure! >>>>>>> For future improvements, I will keep focusing on inlining as I >>>>>>> see it as the door opener to many more optimizations in C2. 
I am >>>>>>> still learning at what can be done to reduce the size of the >>>>>>> inlined code by, for example, applying specific optimizations >>>>>>> that simplify the CG (like dead-code elimination or constant >>>>>>> propagation) before inlining the code. As you said, we are not >>>>>>> short of ideas on *how* to improve it, but we have to be very >>>>>>> wary of *what impact* it'll have on real-world applications. >>>>>>> We're working with internal customers to figure that out, and >>>>>>> we'll share them as soon as we are ready with benchmarks for >>>>>>> those use-case patterns. >>>>>>> What I am working on now is: >>>>>>> ??? - Add a per-method flag through CompilerOracle >>>>>>> ??? - Add a threshold on the probability of a receiver to >>>>>>> generate a guard (I am thinking of 10%, i.e., if a receiver is >>>>>>> observed less than 1 in every 10 calls, then don't generate a >>>>>>> guard and use the fallback) >>>>>>> ??? - Check the overhead of increasing TypeProfileWidth on >>>>>>> profiling speed (in the interpreter and level #3 code) >>>>>>> Thank you, and looking forward to the next review (I expect to >>>>>>> post the next iteration of the patch today or tomorrow). >>>>>>> -- >>>>>>> Ludovic >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Vladimir Ivanov >>>>>>> Sent: Thursday, February 6, 2020 1:07 PM >>>>>>> To: Ludovic Henry ; >>>>>>> hotspot-compiler-dev at openjdk.java.net >>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2 >>>>>>> >>>>>>> Very interesting results, Ludovic! 
>>>>>>>
>>>>>>>> The image can be found at
>>>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>>
>>>>>>> Can you elaborate on the experiment itself, please? In particular,
>>>>>>> what does PERCENTILES actually mean?
>>>>>>>
>>>>>>> I haven't looked through the patch in detail, but here are some
>>>>>>> thoughts.
>>>>>>>
>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>>>> seems you try to generalize (b), which becomes:
>>>>>>>
>>>>>>>       if (recv.klass == K1) {
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else if (recv.klass == K2) {
>>>>>>>          m2(...); // either inline or a direct call
>>>>>>>       ...
>>>>>>>       } else if (recv.klass == Kn) {
>>>>>>>          mn(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          deopt(); // invalidate + reinterpret
>>>>>>>       }
>>>>>>>
>>>>>>> Question #1: what if you generalize the polymorphic shape instead and
>>>>>>> allow multiple major receivers? Deoptimizing (and then recompiling)
>>>>>>> looks less beneficial the higher the morphism is (especially
>>>>>>> considering that inlining on all paths becomes less likely as well).
So, having a virtual call >>>>>>> (which becomes less likely due to lower frequency) on the >>>>>>> fallback path >>>>>>> may be a better option. >>>>>>> >>>>>>> >>>>>>> Question #2: it would be very interesting to understand what exactly >>>>>>> contributes the most to the performance improvements. Is it inlining? Or >>>>>>> maybe devirtualization (avoiding the cost of a virtual call)? How much >>>>>>> comes >>>>>>> from optimizing interface calls (itable vs vtable stubs)? >>>>>>> >>>>>>> Deciding how to spend the inlining budget on multiple targets with >>>>>>> moderate >>>>>>> frequency can be hard, so it makes sense to consider expanding >>>>>>> 3/4/mega-morphic call sites in the post-parse phase (during incremental >>>>>>> inlining). >>>>>>> >>>>>>> >>>>>>> Question #3: how much does TypeProfileWidth affect profiling speed >>>>>>> (interpreter and level #3 code) and dynamic footprint? >>>>>>> >>>>>>> >>>>>>> Getting answers to those (and similar) questions should give us much >>>>>>> more insight into what is actually happening in practice. >>>>>>> >>>>>>> Speaking of the first deliverables, it would be good to introduce >>>>>>> a new >>>>>>> experimental mode to be able to easily conduct such experiments with >>>>>>> product binaries, and I'd like to see the patch evolving in that >>>>>>> direction. It'll enable us to gather important data to guide our >>>>>>> decisions about how to enhance the heuristics in the product. >>>>>>> >>>>>>> Best regards, >>>>>>> Vladimir Ivanov >>>>>>> >>>>>>> [1] (a) Monomorphic: >>>>>>> if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>> } >>>>>>> >>>>>>> (b) Bimorphic: >>>>>>> if (recv.klass == K1) { >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else if (recv.klass == K2) { >>>>>>> m2(...); // either inline or a direct call >>>>>>> } else { >>>>>>> deopt(); // invalidate + reinterpret >>>>>>> } >>>>>>> >>>>>>> (c) Polymorphic: >>>>>>> if (recv.klass == K1) { // major receiver (by default, >90%) >>>>>>> m1(...); // either inline or a direct call >>>>>>> } else { >>>>>>> K.m(); // virtual call >>>>>>> } >>>>>>> >>>>>>> (d) Megamorphic: >>>>>>> K.m(); // virtual (K is either concrete or interface class) >>>>>>>> >>>>>>>> -- >>>>>>>> Ludovic >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: hotspot-compiler-dev >>>>>>>> On Behalf Of >>>>>>>> Ludovic Henry >>>>>>>> Sent: Thursday, February 6, 2020 9:18 AM >>>>>>>> To: hotspot-compiler-dev at openjdk.java.net >>>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2 >>>>>>>> >>>>>>>> Hello, >>>>>>>> >>>>>>>> In our ongoing search to improve performance, I've looked at >>>>>>>> inlining and, more specifically, at polymorphic guarded >>>>>>>> inlining. Today in HotSpot, the maximum number of guards for >>>>>>>> types at any call site is two - with bimorphic guarded inlining. >>>>>>>> However, Graal and Zing have observed great results with >>>>>>>> increasing that limit. >>>>>>>> >>>>>>>> You'll find below a patch that makes the number of guards >>>>>>>> for types configurable with the `TypeProfileWidth` global. >>>>>>>> >>>>>>>> Testing: >>>>>>>> Passing tier1 on Linux and Windows, plus other large >>>>>>>> applications (through the Adopt testing scripts) >>>>>>>> >>>>>>>> Benchmarking: >>>>>>>> To get data, we run a benchmark against Apache Pinot and observe >>>>>>>> the following results: >>>>>>>> >>>>>>>> [inline image: Apache Pinot benchmark results chart] >>>>>>>> >>>>>>>> We observe close to 20% improvement on this sample benchmark >>>>>>>> with a morphism (=width) of 3 or 4. We are currently validating >>>>>>>> these numbers on a more extensive set of benchmarks and >>>>>>>> platforms, and I'll share them as soon as we have them.
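The kind of call site the patch targets, a single virtual call that sees more than two receiver classes, can be sketched in plain Java (all names below are illustrative only, not taken from the patch or the benchmark):

```java
// A trimorphic virtual call site: shape.area() sees three receiver classes,
// which is beyond the current bimorphic limit and exactly the case that
// TypeProfileWidth > 2 is meant to cover.
interface Shape { int area(); }

class Square implements Shape {
    final int side;
    Square(int side) { this.side = side; }
    public int area() { return side * side; }        // candidate for guard #1
}

class Rect implements Shape {
    final int w, h;
    Rect(int w, int h) { this.w = w; this.h = h; }
    public int area() { return w * h; }              // candidate for guard #2
}

class Tri implements Shape {
    final int base, height;
    Tri(int base, int height) { this.base = base; this.height = height; }
    public int area() { return base * height / 2; }  // beyond the bimorphic limit
}

public class PolyDemo {
    static int totalArea(Shape[] shapes) {
        int sum = 0;
        for (Shape s : shapes) {
            sum += s.area();  // one trimorphic virtual call site
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Rect(2, 3), new Tri(4, 2) };
        System.out.println(totalArea(shapes)); // 4 + 6 + 4 = 14
    }
}
```

With bimorphic inlining only two of the three receivers can get guarded inline paths; a TypeProfileWidth of 3 would allow the third path to be guarded and inlined as well, with the remaining question being whether the fallback path should deoptimize or make a virtual call.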
>>>>>>>> >>>>>>>> I am happy to provide more information, just let me know if you >>>>>>>> have any question. >>>>>>>> >>>>>>>> Thank you, >>>>>>>> >>>>>>>> -- >>>>>>>> Ludovic >>>>>>>> >>>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> index 73854806ed..845070fbe1 100644 >>>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp >>>>>>>> @@ -38,7 +38,7 @@ private: >>>>>>>> ?????? friend class ciMethod; >>>>>>>> ?????? friend class ciMethodHandle; >>>>>>>> >>>>>>>> -? enum { MorphismLimit = 2 }; // Max call site's morphism we >>>>>>>> care about >>>>>>>> +? enum { MorphismLimit = 8 }; // Max call site's morphism we >>>>>>>> care about >>>>>>>> ?????? int? _limit;??????????????? // number of receivers have >>>>>>>> been determined >>>>>>>> ?????? int? _morphism;???????????? // determined call site's >>>>>>>> morphism >>>>>>>> ?????? int? _count;??????????????? // # times has this call been >>>>>>>> executed >>>>>>>> @@ -47,6 +47,7 @@ private: >>>>>>>> ?????? ciKlass*? _receiver[MorphismLimit + 1];? // receivers >>>>>>>> (exact) >>>>>>>> >>>>>>>> ?????? ciCallProfile() { >>>>>>>> +??? guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit >>>>>>>> can't be smaller than TypeProfileWidth"); >>>>>>>> ???????? _limit = 0; >>>>>>>> ???????? _morphism??? = 0; >>>>>>>> ???????? _count = -1; >>>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> b/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> index d771be8dac..8e4ecc8597 100644 >>>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp >>>>>>>> @@ -496,9 +496,7 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ?????????? // Every profiled call site has a counter. >>>>>>>> ?????????? int count = >>>>>>>> check_overflow(data->as_CounterData()->count(), >>>>>>>> java_code_at_bci(bci)); >>>>>>>> >>>>>>>> -????? 
if (!data->is_ReceiverTypeData()) { >>>>>>>> -??????? result._receiver_count[0] = 0;? // that's a definite zero >>>>>>>> -????? } else { // ReceiverTypeData is a subclass of CounterData >>>>>>>> +????? if (data->is_ReceiverTypeData()) { >>>>>>>> ???????????? ciReceiverTypeData* call = >>>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData(); >>>>>>>> ???????????? // In addition, virtual call sites have receiver >>>>>>>> type information >>>>>>>> ???????????? int receivers_count_total = 0; >>>>>>>> @@ -515,7 +513,7 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ?????????????? // is recorded or an associated counter is >>>>>>>> incremented, but not both. With >>>>>>>> ?????????????? // tiered compilation, however, both can happen >>>>>>>> due to the interpreter and >>>>>>>> ?????????????? // C1 profiling invocations differently. Address >>>>>>>> that inconsistency here. >>>>>>>> -????????? if (morphism == 1 && count > 0) { >>>>>>>> +????????? if (morphism >= 1 && count > 0) { >>>>>>>> ???????????????? epsilon = count; >>>>>>>> ???????????????? count = 0; >>>>>>>> ?????????????? } >>>>>>>> @@ -531,25 +529,26 @@ ciCallProfile >>>>>>>> ciMethod::call_profile_at_bci(int bci) { >>>>>>>> ????????????? // If we extend profiling to record methods, >>>>>>>> ?????????????? // we will set result._method also. >>>>>>>> ???????????? } >>>>>>>> +??????? result._morphism = morphism; >>>>>>>> ???????????? // Determine call site's morphism. >>>>>>>> ???????????? // The call site count is 0 with known morphism >>>>>>>> (only 1 or 2 receivers) >>>>>>>> ???????????? // or < 0 in the case of a type check failure for >>>>>>>> checkcast, aastore, instanceof. >>>>>>>> ???????????? // The call site count is > 0 in the case of a >>>>>>>> polymorphic virtual call. >>>>>>>> -??????? if (morphism > 0 && morphism == result._limit) { >>>>>>>> -?????????? // The morphism <= MorphismLimit. >>>>>>>> -?????????? if ((morphism >>>>>>> -?????????????? 
(morphism == ciCallProfile::MorphismLimit && >>>>>>>> count == 0)) { >>>>>>>> +??????? assert(result._morphism == result._limit, ""); >>>>>>>> #ifdef ASSERT >>>>>>>> +??????? if (result._morphism > 0) { >>>>>>>> +?????????? // The morphism <= TypeProfileWidth. >>>>>>>> +?????????? if ((result._morphism >>>>>>> +?????????????? (result._morphism == TypeProfileWidth && count >>>>>>>> == 0)) { >>>>>>>> ????????????????? if (count > 0) { >>>>>>>> ??????????????????? this->print_short_name(tty); >>>>>>>> ??????????????????? tty->print_cr(" @ bci:%d", bci); >>>>>>>> ??????????????????? this->print_codes(); >>>>>>>> ??????????????????? assert(false, "this call site should not be >>>>>>>> polymorphic"); >>>>>>>> ????????????????? } >>>>>>>> -#endif >>>>>>>> -???????????? result._morphism = morphism; >>>>>>>> ??????????????? } >>>>>>>> ???????????? } >>>>>>>> +#endif >>>>>>>> ???????????? // Make the count consistent if this is a call >>>>>>>> profile. If count is >>>>>>>> ???????????? // zero or less, presume that this is a typecheck >>>>>>>> profile and >>>>>>>> ???????????? // do nothing.? Otherwise, increase count to be the >>>>>>>> sum of all >>>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* >>>>>>>> receiver, int receiver_count) { >>>>>>>> ?????? } >>>>>>>> ?????? _receiver[i] = receiver; >>>>>>>> ?????? _receiver_count[i] = receiver_count; >>>>>>>> -? if (_limit < MorphismLimit) _limit++; >>>>>>>> +? if (_limit < TypeProfileWidth) _limit++; >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> b/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> index d605bdb7bd..7a8dee43e5 100644 >>>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp >>>>>>>> @@ -389,9 +389,16 @@ >>>>>>>> ?????? product(bool, UseBimorphicInlining, >>>>>>>> true,???????????????????????????????? \ >>>>>>>> ?????????????? 
"Profiling based inlining for two >>>>>>>> receivers")???????????????????? \ >>>>>>>> \ >>>>>>>> +? product(bool, UsePolymorphicInlining, >>>>>>>> true,?????????????????????????????? \ >>>>>>>> +????????? "Profiling based inlining for two or more >>>>>>>> receivers")???????????? \ >>>>>>>> + \ >>>>>>>> ?????? product(bool, UseOnlyInlinedBimorphic, >>>>>>>> true,????????????????????????????? \ >>>>>>>> ?????????????? "Don't use BimorphicInlining if can't inline a >>>>>>>> second method")??? \ >>>>>>>> \ >>>>>>>> +? product(bool, UseOnlyInlinedPolymorphic, >>>>>>>> true,??????????????????????????? \ >>>>>>>> +????????? "Don't use PolymorphicInlining if can't inline a >>>>>>>> non-major "????? \ >>>>>>>> +????????? "receiver's >>>>>>>> method")????????????????????????????????????????????? \ >>>>>>>> + \ >>>>>>>> ?????? product(bool, InsertMemBarAfterArraycopy, >>>>>>>> true,?????????????????????????? \ >>>>>>>> ?????????????? "Insert memory barrier after arraycopy >>>>>>>> call")???????????????????? \ >>>>>>>> \ >>>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp >>>>>>>> b/src/hotspot/share/opto/doCall.cpp >>>>>>>> index 44ab387ac8..6f940209ce 100644 >>>>>>>> --- a/src/hotspot/share/opto/doCall.cpp >>>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp >>>>>>>> @@ -83,25 +83,23 @@ CallGenerator* >>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>> >>>>>>>> ?????? // See how many times this site has been invoked. >>>>>>>> ?????? int site_count = profile.count(); >>>>>>>> -? int receiver_count = -1; >>>>>>>> -? if (call_does_dispatch && UseTypeProfile && >>>>>>>> profile.has_receiver(0)) { >>>>>>>> -??? // Receivers in the profile structure are ordered by call >>>>>>>> counts >>>>>>>> -??? // so that the most called (major) receiver is >>>>>>>> profile.receiver(0). >>>>>>>> -??? receiver_count = profile.receiver_count(0); >>>>>>>> -? } >>>>>>>> >>>>>>>> ?????? CompileLog* log = this->log(); >>>>>>>> ?????? if (log != NULL) { >>>>>>>> -??? 
int rid = (receiver_count >= 0)? >>>>>>>> log->identify(profile.receiver(0)): -1; >>>>>>>> -??? int r2id = (rid != -1 && profile.has_receiver(1))? >>>>>>>> log->identify(profile.receiver(1)):-1; >>>>>>>> +??? ResourceMark rm; >>>>>>>> +??? int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth); >>>>>>>> +??? for (int i = 0; i < TypeProfileWidth && >>>>>>>> profile.has_receiver(i); i++) { >>>>>>>> +????? rids[i] = log->identify(profile.receiver(i)); >>>>>>>> +??? } >>>>>>>> ???????? log->begin_elem("call method='%d' count='%d' >>>>>>>> prof_factor='%f'", >>>>>>>> ???????????????????????? log->identify(callee), site_count, >>>>>>>> prof_factor); >>>>>>>> ???????? if (call_does_dispatch)? log->print(" virtual='1'"); >>>>>>>> ???????? if (allow_inline)???? log->print(" inline='1'"); >>>>>>>> -??? if (receiver_count >= 0) { >>>>>>>> -????? log->print(" receiver='%d' receiver_count='%d'", rid, >>>>>>>> receiver_count); >>>>>>>> -?????? if (profile.has_receiver(1)) { >>>>>>>> -??????? log->print(" receiver2='%d' receiver2_count='%d'", >>>>>>>> r2id, profile.receiver_count(1)); >>>>>>>> +??? for (int i = 0; i < TypeProfileWidth && >>>>>>>> profile.has_receiver(i); i++) { >>>>>>>> +????? if (i == 0) { >>>>>>>> +??????? log->print(" receiver='%d' receiver_count='%d'", >>>>>>>> rids[i], profile.receiver_count(i)); >>>>>>>> +????? } else { >>>>>>>> +??????? log->print(" receiver%d='%d' receiver%d_count='%d'", i >>>>>>>> + 1, rids[i], i + 1, profile.receiver_count(i)); >>>>>>>> ?????????? } >>>>>>>> ???????? } >>>>>>>> ???????? if (callee->is_method_handle_intrinsic()) { >>>>>>>> @@ -205,90 +203,96 @@ CallGenerator* >>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool >>>>>>>> ???????? if (call_does_dispatch && site_count > 0 && >>>>>>>> UseTypeProfile) { >>>>>>>> ?????????? // The major receiver's count >= >>>>>>>> TypeProfileMajorReceiverPercent of site_count. >>>>>>>> ?????????? 
bool have_major_receiver = profile.has_receiver(0) && >>>>>>>> (100.*profile.receiver_prob(0) >= >>>>>>>> (float)TypeProfileMajorReceiverPercent); >>>>>>>> -????? ciMethod* receiver_method = NULL; >>>>>>>> >>>>>>>> ?????????? int morphism = profile.morphism(); >>>>>>>> + >>>>>>>> +????? ciMethod** receiver_methods = >>>>>>>> NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism)); >>>>>>>> +????? memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, >>>>>>>> morphism)); >>>>>>>> + >>>>>>>> ?????????? if (speculative_receiver_type != NULL) { >>>>>>>> ???????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>>>>> Deoptimization::Reason_speculate_class_check)) { >>>>>>>> ?????????????? // We have a speculative type, we should be able >>>>>>>> to resolve >>>>>>>> ?????????????? // the call. We do that before looking at the >>>>>>>> profiling at >>>>>>>> -????????? // this invoke because it may lead to bimorphic >>>>>>>> inlining which >>>>>>>> +????????? // this invoke because it may lead to polymorphic >>>>>>>> inlining which >>>>>>>> ?????????????? // a speculative type should help us avoid. >>>>>>>> -????????? receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - speculative_receiver_type); >>>>>>>> -????????? if (receiver_method == NULL) { >>>>>>>> +????????? receiver_methods[0] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + speculative_receiver_type); >>>>>>>> +????????? if (receiver_methods[0] == NULL) { >>>>>>>> ???????????????? speculative_receiver_type = NULL; >>>>>>>> ?????????????? } else { >>>>>>>> ???????????????? morphism = 1; >>>>>>>> ?????????????? } >>>>>>>> ???????????? } else { >>>>>>>> ?????????????? // speculation failed before. Use profiling at >>>>>>>> the call >>>>>>>> -????????? // (could allow bimorphic inlining for instance). >>>>>>>> +????????? // (could allow polymorphic inlining for instance). >>>>>>>> ?????????????? speculative_receiver_type = NULL; >>>>>>>> ???????????? 
} >>>>>>>> ?????????? } >>>>>>>> -????? if (receiver_method == NULL && >>>>>>>> +????? if (receiver_methods[0] == NULL && >>>>>>>> ?????????????? (have_major_receiver || morphism == 1 || >>>>>>>> -?????????? (morphism == 2 && UseBimorphicInlining))) { >>>>>>>> -??????? // receiver_method = profile.method(); >>>>>>>> +?????????? (morphism == 2 && UseBimorphicInlining) || >>>>>>>> +?????????? (morphism >= 2 && UsePolymorphicInlining))) { >>>>>>>> +??????? assert(profile.has_receiver(0), "no receiver at 0"); >>>>>>>> +??????? // receiver_methods[0] = profile.method(); >>>>>>>> ???????????? // Profiles do not suggest methods now.? Look it up >>>>>>>> in the major receiver. >>>>>>>> -??????? receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - profile.receiver(0)); >>>>>>>> +??????? receiver_methods[0] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + profile.receiver(0)); >>>>>>>> ?????????? } >>>>>>>> -????? if (receiver_method != NULL) { >>>>>>>> -??????? // The single majority receiver sufficiently outweighs >>>>>>>> the minority. >>>>>>>> -??????? CallGenerator* hit_cg = >>>>>>>> this->call_generator(receiver_method, >>>>>>>> -????????????? vtable_index, !call_does_dispatch, jvms, >>>>>>>> allow_inline, prof_factor); >>>>>>>> -??????? if (hit_cg != NULL) { >>>>>>>> -????????? // Look up second receiver. >>>>>>>> -????????? CallGenerator* next_hit_cg = NULL; >>>>>>>> -????????? ciMethod* next_receiver_method = NULL; >>>>>>>> -????????? if (morphism == 2 && UseBimorphicInlining) { >>>>>>>> -??????????? next_receiver_method = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> - profile.receiver(1)); >>>>>>>> -??????????? if (next_receiver_method != NULL) { >>>>>>>> -????????????? next_hit_cg = >>>>>>>> this->call_generator(next_receiver_method, >>>>>>>> -????????????????????????????????? vtable_index, >>>>>>>> !call_does_dispatch, jvms, >>>>>>>> -????????????????????????????????? 
allow_inline, prof_factor); >>>>>>>> -????????????? if (next_hit_cg != NULL && >>>>>>>> !next_hit_cg->is_inline() && >>>>>>>> -????????????????? have_major_receiver && >>>>>>>> UseOnlyInlinedBimorphic) { >>>>>>>> -????????????????? // Skip if we can't inline second receiver's >>>>>>>> method >>>>>>>> -????????????????? next_hit_cg = NULL; >>>>>>>> +????? if (receiver_methods[0] != NULL) { >>>>>>>> +??????? CallGenerator** hit_cgs = >>>>>>>> NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism)); >>>>>>>> +??????? memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, >>>>>>>> morphism)); >>>>>>>> + >>>>>>>> +??????? hit_cgs[0] = this->call_generator(receiver_methods[0], >>>>>>>> +??????????????????????????? vtable_index, !call_does_dispatch, >>>>>>>> jvms, >>>>>>>> +??????????????????????????? allow_inline, prof_factor); >>>>>>>> +??????? if (hit_cgs[0] != NULL) { >>>>>>>> +????????? if ((morphism == 2 && UseBimorphicInlining) || >>>>>>>> (morphism >= 2 && UsePolymorphicInlining)) { >>>>>>>> +??????????? for (int i = 1; i < morphism; i++) { >>>>>>>> +????????????? assert(profile.has_receiver(i), "no receiver at >>>>>>>> %d", i); >>>>>>>> +????????????? receiver_methods[i] = >>>>>>>> callee->resolve_invoke(jvms->method()->holder(), >>>>>>>> + profile.receiver(i)); >>>>>>>> +????????????? if (receiver_methods[i] != NULL) { >>>>>>>> +??????????????? hit_cgs[i] = >>>>>>>> this->call_generator(receiver_methods[i], >>>>>>>> +????????????????????????????????????? vtable_index, >>>>>>>> !call_does_dispatch, jvms, >>>>>>>> +????????????????????????????????????? allow_inline, prof_factor); >>>>>>>> +??????????????? if (hit_cgs[i] != NULL && >>>>>>>> !hit_cgs[i]->is_inline() && have_major_receiver && >>>>>>>> +??????????????????? ((morphism == 2 && UseOnlyInlinedBimorphic) >>>>>>>> || (morphism >= 2 && UseOnlyInlinedPolymorphic))) { >>>>>>>> +????????????????? // Skip if we can't inline non-major >>>>>>>> receiver's method >>>>>>>> +????????????????? 
hit_cgs[i] = NULL; >>>>>>>> +??????????????? } >>>>>>>> ?????????????????? } >>>>>>>> ???????????????? } >>>>>>>> ?????????????? } >>>>>>>> ?????????????? CallGenerator* miss_cg; >>>>>>>> -????????? Deoptimization::DeoptReason reason = (morphism == 2 >>>>>>>> -?????????????????????????????????????????????? ? >>>>>>>> Deoptimization::Reason_bimorphic >>>>>>>> +????????? Deoptimization::DeoptReason reason = (morphism >= 2 >>>>>>>> +?????????????????????????????????????????????? ? >>>>>>>> Deoptimization::Reason_polymorphic >>>>>>>> ??????????????????????????????????????????????????? : >>>>>>>> Deoptimization::reason_class_check(speculative_receiver_type != >>>>>>>> NULL)); >>>>>>>> -????????? if ((morphism == 1 || (morphism == 2 && next_hit_cg >>>>>>>> != NULL)) && >>>>>>>> -????????????? !too_many_traps_or_recompiles(caller, bci, reason) >>>>>>>> -???????????? ) { >>>>>>>> +????????? if (!too_many_traps_or_recompiles(caller, bci, >>>>>>>> reason)) { >>>>>>>> ???????????????? // Generate uncommon trap for class check >>>>>>>> failure path >>>>>>>> -??????????? // in case of monomorphic or bimorphic virtual call >>>>>>>> site. >>>>>>>> +??????????? // in case of polymorphic virtual call site. >>>>>>>> ???????????????? miss_cg = >>>>>>>> CallGenerator::for_uncommon_trap(callee, reason, >>>>>>>> >>>>>>>> Deoptimization::Action_maybe_recompile); >>>>>>>> ?????????????? } else { >>>>>>>> ???????????????? // Generate virtual call for class check >>>>>>>> failure path >>>>>>>> -??????????? // in case of polymorphic virtual call site. >>>>>>>> +??????????? // in case of megamorphic virtual call site. >>>>>>>> ???????????????? miss_cg = >>>>>>>> CallGenerator::for_virtual_call(callee, vtable_index); >>>>>>>> ?????????????? } >>>>>>>> -????????? if (miss_cg != NULL) { >>>>>>>> -??????????? if (next_hit_cg != NULL) { >>>>>>>> +????????? for (int i = morphism - 1; i >= 1 && miss_cg != NULL; >>>>>>>> i--) { >>>>>>>> +??????????? 
if (hit_cgs[i] != NULL) { >>>>>>>> ?????????????????? assert(speculative_receiver_type == NULL, >>>>>>>> "shouldn't end up here if we used speculation"); >>>>>>>> -????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), next_receiver_method, >>>>>>>> profile.receiver(1), site_count, profile.receiver_count(1)); >>>>>>>> +????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), receiver_methods[i], >>>>>>>> profile.receiver(i), site_count, profile.receiver_count(i)); >>>>>>>> ?????????????????? // We don't need to record dependency on a >>>>>>>> receiver here and below. >>>>>>>> ?????????????????? // Whenever we inline, the dependency is >>>>>>>> added by Parse::Parse(). >>>>>>>> -????????????? miss_cg = >>>>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, >>>>>>>> next_hit_cg, PROB_MAX); >>>>>>>> -??????????? } >>>>>>>> -??????????? if (miss_cg != NULL) { >>>>>>>> -????????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>>>>> speculative_receiver_type : profile.receiver(0); >>>>>>>> -????????????? trace_type_profile(C, jvms->method(), >>>>>>>> jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, >>>>>>>> receiver_count); >>>>>>>> -????????????? float hit_prob = speculative_receiver_type != >>>>>>>> NULL ? 1.0 : profile.receiver_prob(0); >>>>>>>> -????????????? CallGenerator* cg = >>>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob); >>>>>>>> -????????????? if (cg != NULL)? return cg; >>>>>>>> +????????????? miss_cg = >>>>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, >>>>>>>> hit_cgs[i], PROB_MAX); >>>>>>>> ???????????????? } >>>>>>>> ?????????????? } >>>>>>>> +????????? if (miss_cg != NULL) { >>>>>>>> +??????????? ciKlass* k = speculative_receiver_type != NULL ? >>>>>>>> speculative_receiver_type : profile.receiver(0); >>>>>>>> +??????????? 
trace_type_profile(C, jvms->method(), jvms->depth() >>>>>>>> - 1, jvms->bci(), receiver_methods[0], k, site_count, >>>>>>>> profile.receiver_count(0)); >>>>>>>> +??????????? float hit_prob = speculative_receiver_type != NULL >>>>>>>> ? 1.0 : profile.receiver_prob(0); >>>>>>>> +??????????? CallGenerator* cg = >>>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], >>>>>>>> hit_prob); >>>>>>>> +??????????? if (cg != NULL)? return cg; >>>>>>>> +????????? } >>>>>>>> ???????????? } >>>>>>>> ????????? } >>>>>>>> ???????? } >>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> index 11df15e004..2d14b52854 100644 >>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp >>>>>>>> @@ -2382,7 +2382,7 @@ const char* >>>>>>>> Deoptimization::_trap_reason_name[] = { >>>>>>>> ?????? "class_check", >>>>>>>> ?????? "array_check", >>>>>>>> ?????? "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"), >>>>>>>> -? "bimorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>>> +? "polymorphic" JVMCI_ONLY("_or_optimized_type_check"), >>>>>>>> ?????? "profile_predicate", >>>>>>>> ?????? "unloaded", >>>>>>>> ?????? "uninitialized", >>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> index 1cfff5394e..c1eb998aba 100644 >>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp >>>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic { >>>>>>>> ???????? Reason_class_check,?????????? // saw unexpected object >>>>>>>> class (@bci) >>>>>>>> ???????? Reason_array_check,?????????? // saw unexpected array >>>>>>>> class (aastore @bci) >>>>>>>> ???????? Reason_intrinsic,???????????? // saw unexpected operand >>>>>>>> to intrinsic (@bci) >>>>>>>> -??? Reason_bimorphic,???????????? 
// saw unexpected object >>>>>>>> class in bimorphic inlining (@bci) >>>>>>>> +??? Reason_polymorphic,?????????? // saw unexpected object >>>>>>>> class in bimorphic inlining (@bci) >>>>>>>> >>>>>>>> #if INCLUDE_JVMCI >>>>>>>> ???????? Reason_unreached0???????????? = Reason_null_assert, >>>>>>>> ???????? Reason_type_checked_inlining? = Reason_intrinsic, >>>>>>>> -??? Reason_optimized_type_check?? = Reason_bimorphic, >>>>>>>> +??? Reason_optimized_type_check?? = Reason_polymorphic, >>>>>>>> #endif >>>>>>>> >>>>>>>> ???????? Reason_profile_predicate,???? // compiler generated >>>>>>>> predicate moved from frequent branch in a loop failed >>>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> index 94b544824e..ee761626c4 100644 >>>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp >>>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry>>>>>>> mtClass>? KlassHashtableEntry; >>>>>>>> declare_constant(Deoptimization::Reason_class_check) \ >>>>>>>> declare_constant(Deoptimization::Reason_array_check) \ >>>>>>>> declare_constant(Deoptimization::Reason_intrinsic) \ >>>>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \ >>>>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \ >>>>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \ >>>>>>>> declare_constant(Deoptimization::Reason_unloaded) \ >>>>>>>> declare_constant(Deoptimization::Reason_uninitialized) \ >>>>>>>> From ekaterina.pavlova at oracle.com Tue Apr 7 21:05:55 2020 From: ekaterina.pavlova at oracle.com (Ekaterina Pavlova) Date: Tue, 7 Apr 2020 14:05:55 -0700 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> References: <3b6f8501-5d90-0a14-ed21-5978567d95d2@oracle.com> <7dc065c6-b8c4-3d83-4b5d-788e07d8d6e5@oracle.com> Message-ID: Thanks Vladimir, Running 
tier1-tier4 tests and not getting any regressions is very good. I would also recommend running other tiers as they contain more stress tests as well as jck. Doing it at least once before the integration would be very helpful and would help us avoid late issues. Please let me know if you need any help with this. regards, -katya On 4/7/20 2:39 AM, Vladimir Ivanov wrote: > Hi Katya, > >> what kind of testing has been done to verify these changes? >> Taking into account the changes are quite large and touch shared code, >> running hs compiler and perhaps runtime tiers would be very advisable. > > The changes (and previous versions) were tested in 2 modes: > > * ran through tier1-tier4 with the functionality turned OFF; (also, some previous version went through tier1-tier6 once) > > * unit tests on Vector API were run on different x86 hardware in the following modes: -XX:UseAVX=[3,2,1,0] -XX:UseSSE=[4,3,2]. Arm engineers tested the version in the vector-unstable branch on AArch64 hardware. > > As of now, the only known test failure is compiler/graalunit/HotspotTest.java in org.graalvm.compiler.hotspot.test.CheckGraalIntrinsics, which should be taught about the newly added JVM intrinsics. > > Best regards, > Vladimir Ivanov > >> On 4/3/20 4:12 PM, Vladimir Ivanov wrote: >>> Hi, >>> >>> Following up on review requests of the API [0] and Java implementation [1] for Vector API (JEP 338 [2]), here's a request for review of general HotSpot changes (in shared code) required for supporting the API: >>> >>> >>> http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ >>> >>> (First of all, to set proper expectations: since the JEP is still in Candidate state, the intention is to initiate preliminary round(s) of review to inform the community and gather feedback before sending out final/official RFRs once the JEP is Targeted to a release.)
>>> >>> Vector API (being developed in Project Panama [3]) relies on JVM support to utilize optimal vector hardware instructions at runtime. It interacts with the JVM through intrinsics (declared in jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations support in the C2 JIT-compiler. >>> >>> As Paul wrote earlier: "A vector intrinsic is an internal low-level vector operation. The last argument to the intrinsic is fallback behavior in Java, implementing the scalar operation over the number of elements held by the vector. Thus, if the intrinsic is not supported in C2 for the other arguments then the Java implementation is executed (the Java implementation is always executed when running in the interpreter or for C1)." >>> >>> The rest of the JVM support is about aggressively optimizing vector boxes to minimize (ideally eliminate) the overhead of boxing for vector values. >>> It's a stop-gap solution for the vector box elimination problem until inline classes arrive. Vector classes are value-based and in the longer term will be migrated to inline classes once the support becomes available. >>> >>> The Vector API talk from JVMLS'18 [5] contains a brief overview of the JVM implementation and some details. >>> >>> The complete implementation resides in the vector-unstable branch of the panama/dev repository [6]. >>> >>> Now to the gory details (the patch is split in multiple "sub-webrevs"): >>> >>> =========================================================== >>> >>> (1) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ >>> >>> Ideal vector nodes for new operations introduced by Vector API. >>> >>> (Platform-specific back end support will be posted for review separately.) >>> >>> =========================================================== >>> >>> (2) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ >>> >>> JVM Java interface (VectorSupport) and intrinsic support in C2.
>>> >>> Vector instances are initially represented as VectorBox macro nodes and "unboxing" is represented by VectorUnbox node. It simplifies vector box elimination analysis and the nodes are expanded later right before EA pass. >>> >>> Vectors have 2-level on-heap representation: for the vector value primitive array is used as a backing storage and it is encapsulated in a typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a int[8] instance which is used to store vector value). >>> >>> Unless VectorBox node goes away, it needs to be expanded into an allocation eventually, but it is a pure node and doesn't have any JVM state associated with it. The problem is solved by keeping JVM state separately in a VectorBoxAllocate node associated with VectorBox node and use it during expansion. >>> >>> Also, to simplify vector box elimination, inlining of vector reboxing calls (VectorSupport::maybeRebox) is delayed until the analysis is over. >>> >>> =========================================================== >>> >>> (3) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ >>> >>> Vector box elimination analysis implementation. (Brief overview: slides #36-42 [5].) >>> >>> The main part is devoted to scalarization across safepoints and rematerialization support during deoptimization. In C2-generated code vector operations work with raw vector values which live in registers or spilled on the stack and it allows to avoid boxing/unboxing when a vector value is alive across a safepoint. As with other values, there's just a location of the vector value at the safepoint and vector type information recorded in the relevant nmethod metadata and all the heavy-lifting happens only when rematerialization takes place. >>> >>> The analysis preserves object identity invariants except during aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). 
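The 2-level on-heap representation described above, a typed wrapper over a primitive backing array, can be sketched roughly as follows (this is an illustrative stand-in, not the real Int256Vector or jdk.incubator.vector code):

```java
// Illustrative sketch of a "2-level" vector box: a typed wrapper (level 1)
// holding a primitive array (level 2) as the backing storage for 8 int lanes.
public class IntVec8 {
    private final int[] lanes;   // backing storage, always length 8

    private IntVec8(int[] lanes) { this.lanes = lanes; }

    static IntVec8 broadcast(int v) {
        int[] a = new int[8];
        java.util.Arrays.fill(a, v);
        return new IntVec8(a);   // a fresh box; this is what box elimination removes
    }

    IntVec8 add(IntVec8 other) {
        int[] r = new int[8];
        for (int i = 0; i < 8; i++) {
            r[i] = lanes[i] + other.lanes[i];
        }
        return new IntVec8(r);   // every lane-wise op boxes again unless C2 elides it
    }

    int lane(int i) { return lanes[i]; }

    public static void main(String[] args) {
        IntVec8 v = IntVec8.broadcast(3).add(IntVec8.broadcast(4));
        System.out.println(v.lane(0)); // 7
    }
}
```

Each operation on such a wrapper allocates a new box, which is why the analysis above goes to such lengths to keep raw vector values in registers across safepoints and only rematerialize the wrapper when deoptimization actually needs it.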
>>> >>> (Aggressive reboxing is crucial for cases when vectors "escape": it allocates a fresh instance at every escape point, thus enabling the original instance to go away.) >>> >>> =========================================================== >>> >>> (4) http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ >>> >>> HotSpot changes for the jdk.incubator.vector module. Vector support is marked experimental and turned off by default. JEP 338 proposes the API to be released as an incubator module, so a user has to specify "--add-modules jdk.incubator.vector" on the command line to be able to use it. >>> When the user does that, the JVM automatically enables Vector API support. >>> It improves usability (the user doesn't need to separately "open" the API and enable JVM support) while minimizing the risk of destabilization from new code when the API is not used. >>> >>> >>> That's it! Will be happy to answer any questions. >>> >>> And thanks in advance for any feedback! >>> >>> Best regards, >>> Vladimir Ivanov >>> >>> [0] https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html >>> >>> [1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html >>> >>> [2] https://openjdk.java.net/jeps/338 >>> >>> [3] https://openjdk.java.net/projects/panama/ >>> >>> [4] http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html >>> >>> [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf >>> >>> [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 >>> >>>
$ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable >> From igor.ignatyev at oracle.com Wed Apr 8 01:04:54 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Tue, 7 Apr 2020 18:04:54 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests Message-ID: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > 282 lines changed: 123 ins; 24 del; 135 mod; Hi all, could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in the form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. the patch also - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use shared classes which might use random. 
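As an aside for readers unfamiliar with the helper: the pattern the patch standardizes on can be sketched roughly as below. The property name and message text here are illustrative assumptions; the real helper is jdk.test.lib.Utils.getRandomInstance().

```java
import java.util.Random;

public class ReproducibleRandomSketch {
    // Sketch of a reproducible-random helper: take the seed from a system
    // property when one is given (replaying a failure), otherwise pick a
    // fresh seed, and always print it so the run can be reproduced.
    static Random getRandomInstance() {
        String prop = System.getProperty("test.seed");  // illustrative name
        long seed = (prop != null) ? Long.parseLong(prop)
                                   : new Random().nextLong();
        System.out.println("To re-run test with same seed value please add "
                           + "-Dtest.seed=" + seed);
        return new Random(seed);
    }

    public static void main(String[] args) {
        Random r = getRandomInstance();   // prints the seed for replay
        r.nextInt();                      // use it like a test would
        // The property the whole scheme relies on:
        // equal seeds give equal sequences.
        Random r1 = new Random(42);
        Random r2 = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            if (r1.nextInt() != r2.nextInt()) {
                throw new AssertionError("sequences diverged at step " + i);
            }
        }
    }
}
```

The point of the design is that the seed is always printed, so any failure seen with a "random" run can be replayed deterministically.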
webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in a few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 > 109 lines changed: 41 ins; 24 del; 44 mod; http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 > 84 lines changed: 82 ins; 0 del; 2 mod; http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > 282 lines changed: 123 ins; 24 del; 135 mod; NB the patch depends on 8241707[1], which is currently under review[2]. testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 [1] https://bugs.openjdk.java.net/browse/JDK-8241707 [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html Thanks, -- Igor From tobias.hartmann at oracle.com Wed Apr 8 06:17:46 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 8 Apr 2020 08:17:46 +0200 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> Message-ID: <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> Hi Igor, On 08.04.20 03:04, Igor Ignatyev wrote: > - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; What's the reason to use a fixed seed in the first place? Seems to me that even if the test does not directly use the random value, it doesn't hurt to use a non-fixed seed. In fact, wouldn't using a non-fixed seed increase coverage? Even if the value is not checked, it's still propagated through registers, stack and heap space and might therefore make a difference. > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 Looks good. 
Best regards, Tobias From rwestrel at redhat.com Wed Apr 8 07:32:16 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 09:32:16 +0200 Subject: RFR(S): 8241900: Loop unswitching may cause dependence on null check to be lost In-Reply-To: <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> References: <878sjdc5jl.fsf@redhat.com> <36d43333-81e8-1f79-1a04-06f8e34a2c30@oracle.com> <87zhbpau71.fsf@redhat.com> <6cd91986-952c-1419-47d1-f7d25b955a70@oracle.com> <289f3e63-9603-d90e-8b31-1d02d22d6ae7@oracle.com> Message-ID: <87wo6qbfgf.fsf@redhat.com> Thanks for the review, Vladimir. Roland. From jiefu at tencent.com Wed Apr 8 13:51:34 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Wed, 8 Apr 2020 13:51:34 +0000 Subject: RFR: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs Message-ID: Hi all, JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ Please review this trivial fix. It only adds -XX:+UnlockDiagnosticVMOptions in the test. Thanks a lot. Best regards, Jie From rwestrel at redhat.com Wed Apr 8 13:56:35 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 15:56:35 +0200 Subject: RFR: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs In-Reply-To: References: Message-ID: <87r1wyaxnw.fsf@redhat.com> > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ That looks good to me. Thanks for fixing this. Roland. From jiefu at tencent.com Wed Apr 8 14:03:00 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Wed, 8 Apr 2020 14:03:00 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) Message-ID: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Thanks for your review, Roland. 
Do you think it's trivial to be pushed now? Thanks a lot. Best regards, Jie On 2020/4/8, 9:56 PM, "Roland Westrelin" wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ That looks good to me. Thanks for fixing this. Roland. From igor.ignatyev at oracle.com Wed Apr 8 14:47:15 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 8 Apr 2020 07:47:15 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> Message-ID: <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> > On Apr 7, 2020, at 11:17 PM, Tobias Hartmann wrote: > > Hi Igor, > > On 08.04.20 03:04, Igor Ignatyev wrote: >> - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; > > What's the reason to use a fixed seed in the first place? Seems to me that even if the test does not > directly use the random value, it doesn't hurt to use a non-fixed seed. In fact, wouldn't using a > non-fixed seed increase coverage? Even if the value is not checked, it's still propagated through > registers, stack and heap space and might therefore make a difference. the thing is randomness (even reproducible) in tests comes w/ a price -- you have to be more careful when using such tests to verify fixes, compare results across different runs, etc. so in some cases, the possible gain in code coverage doesn't justify the drawbacks, and frankly I'm not a big fan of using something just b/c it might increase coverage in areas unrelated to the original goals of a test. 
I have to admit though that I had several internal discussions w/ myself; at first I removed almost all fixed seed values, then I was going back and forth weighing pros and cons; at the end I decided to leave it as-is for now and reevaluate later on a test-by-test basis. > >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 > > Looks good. > > Best regards, > Tobias From tobias.hartmann at oracle.com Wed Apr 8 14:56:23 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 8 Apr 2020 16:56:23 +0200 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <8ccc1dfe-5389-0fcc-898e-1d1b1414d907@oracle.com> <2E6F0667-70AF-46DF-8EE5-8FB03C527AC4@oracle.com> Message-ID: <6009359f-bf3e-e37d-6f39-7f8a1c604a2c@oracle.com> Hi Igor, On 08.04.20 16:47, Igor Ignatyev wrote: > the thing is randomness (even reproducible) in tests comes w/ a price -- you have to be more careful when using such tests to verify fixes, compare results across different runs, etc. so in some cases, the possible gain in code coverage doesn't justify the drawbacks, and frankly I'm not a big fan of using something just b/c it might increase coverage in areas unrelated to the original goals of a test. I have to admit though that I had several internal discussions w/ myself; at first I removed almost all fixed seed values, then I was going back and forth weighing pros and cons; at the end I decided to leave it as-is for now and reevaluate later on a test-by-test basis. Okay, fair enough. I agree that this discussion is independent of your fix. 
Best regards, Tobias From rwestrel at redhat.com Wed Apr 8 15:11:35 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Wed, 08 Apr 2020 17:11:35 +0200 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: <87o8s2au6w.fsf@redhat.com> > Do you think it's trivial to be pushed now? Yes I think it is. Roland. From vladimir.kozlov at oracle.com Wed Apr 8 18:10:06 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 8 Apr 2020 11:10:06 -0700 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: Please, also add: * @requires vm.compiler2.enabled because both Stress flags are C2 flags. Thanks, Vladimir On 4/8/20 7:03 AM, jiefu(傅杰) wrote: > Thanks for your review, Roland. > > Do you think it's trivial to be pushed now? > > Thanks a lot. > Best regards, > Jie > > On 2020/4/8, 9:56 PM, "Roland Westrelin" wrote: > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242379 > > Webrev: http://cr.openjdk.java.net/~jiefu/8242379/webrev.00/ > > That looks good to me. Thanks for fixing this. > > Roland. > > > > From vladimir.kozlov at oracle.com Wed Apr 8 18:54:48 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 8 Apr 2020 11:54:48 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> Message-ID: <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> Looks good. 
Thanks, Vladimir On 4/7/20 6:04 PM, Igor Ignatyev wrote: > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >> 282 lines changed: 123 ins; 24 del; 135 mod; > > Hi all, > > could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in a form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. > > the patch also > - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; > - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use share classes which might use random. > > webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 >> 109 lines changed: 41 ins; 24 del; 44 mod; > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 >> 84 lines changed: 82 ins; 0 del; 2 mod; > http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >> 282 lines changed: 123 ins; 24 del; 135 mod; > > NB the patch depends on 8241707[1], which is currently under review[2]. 
> > testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 > JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 > > [1] https://bugs.openjdk.java.net/browse/JDK-8241707 > [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html > > Thanks, > -- Igor > From john.r.rose at oracle.com Wed Apr 8 20:11:38 2020 From: john.r.rose at oracle.com (John Rose) Date: Wed, 8 Apr 2020 13:11:38 -0700 Subject: is it time fully optimize long loops? (JDK-8223051) Message-ID: I see that strip mining [1] is pretty mature now. I think this may open up new options for dealing with an RFE for 64-bit iteration variables [2], specifically using some combination of predication and/or strip mining for strength-reducing 64-bit-tripcount loops into one or more 32-bit-tripcount loops. Because Project Panama works on loops over native addresses, and is attempting to produce code that is competitive with C code, it is necessary that Panama code uses 64-bit iteration variables ("long loops"), but it also expects that such loops get optimized fully, including (but not limited to) iteration range splitting, predication, unswitching (e.g., of type tests), and escape analysis. Some of this stuff works best (or only works) with 32-bit iteration variables (we can call them "short loops", can't we?). To get good performance today, Panama library code sometimes has to perform predication or strip mining manually, in Java code, but this is risky (like any premature optimization) because it makes the intention of the code more obscure to the real optimizer, such as C2. When we get long loops fully supported, Panama's performance model will get more reliable. But for now, Panama is making uncomfortable compromises (e.g., [3]). Getting the whole story working well, especially for explicitly vectorized loops, may require new intrinsics (such as [4]), but I think we can make progress with strip mining or predication alone. Is now a good time to investigate this? -- 
John [1] https://bugs.openjdk.java.net/browse/JDK-8186027 [2] https://bugs.openjdk.java.net/browse/JDK-8223051 [3] https://mail.openjdk.java.net/pipermail/panama-dev/2020-April/008411.html [4] https://bugs.openjdk.java.net/browse/JDK-8221358 From igor.ignatyev at oracle.com Wed Apr 8 22:31:24 2020 From: igor.ignatyev at oracle.com (Igor Ignatyev) Date: Wed, 8 Apr 2020 15:31:24 -0700 Subject: RFR(S/M) : 8242310 : use reproducible random in hotspot compiler tests In-Reply-To: <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> References: <3EDC354E-9CC1-418B-978A-689FB50BE061@oracle.com> <9cdfad14-904d-94ef-156a-eae2f741976c@oracle.com> Message-ID: <8B9A462B-8594-4CEF-9102-813C47772ABE@oracle.com> Vladimir, Tobias, thank you for review! could you please also review 8241707 (on hotspot-dev) which prevents me from pushing this patch? -- Igor > On Apr 8, 2020, at 11:54 AM, Vladimir Kozlov wrote: > > Looks good. > > Thanks, > Vladimir > > On 4/7/20 6:04 PM, Igor Ignatyev wrote: >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >>> 282 lines changed: 123 ins; 24 del; 135 mod; >> Hi all, >> could you please review the patch which marks hotspot compiler tests w/ randomness k/w and uses Utils.getRandomInstance() instead of Random w/ _random_ seeds where possible? To identify tests which should be marked, I've used both static (in a form of analyzing classes which directly or indirectly depend on Random/SecureRandom/ThreadLocalRandom) and dynamic (by instrumenting the said classes to log tests which called their 'next' methods) analyses. I've decided *not* to mark tests which use SecureRandom only via File.createTemp* b/c in all such cases temp files are not used as a source of randomness, but rather just a reliable way to get a new/empty file/dir. 
>> the patch also >> - replaces fixed seed w/ 42 (in the tests which don't really depend on a seed value) as it's most common fixed seed in hotspot test suite; >> - moves Utils.getRandomInstance() calls closer to usage, so 'To re-run test with same seed value please add ... ' won't be printed out by the tests which don't actually use random but use shared classes which might use random. >> webrevs: for the sake of reviewers, I've split the patch into parts, webrev.code.00 has only changes in the code, webrev.kw.00 -- only adds the k/w (and comments in a few places where one might think k/w is needed), and webrev.00 contains all changes (including copyright year updates) >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.code.00 >>> 109 lines changed: 41 ins; 24 del; 44 mod; >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.kw.00 >>> 84 lines changed: 82 ins; 0 del; 2 mod; >> http://cr.openjdk.java.net/~iignatyev//8242310/webrev.00 >>> 282 lines changed: 123 ins; 24 del; 135 mod; >> NB the patch depends on 8241707[1], which is currently under review[2]. >> testing: test/hotspot/jtreg/compiler tests on {linux,windows,macosx}-x64 >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242310 >> [1] https://bugs.openjdk.java.net/browse/JDK-8241707 >> [2] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041300.html >> Thanks, >> -- Igor From jiefu at tencent.com Thu Apr 9 01:23:47 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Thu, 9 Apr 2020 01:23:47 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: <87o8s2au6w.fsf@redhat.com> References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> <87o8s2au6w.fsf@redhat.com> Message-ID: Pushed: http://hg.openjdk.java.net/jdk/jdk/rev/801bd63c32f2 Thanks. On 2020/4/8, 11:11 PM, "Roland Westrelin" wrote: > Do you think it's trivial to be pushed now? Yes I think it is. Roland. 
From jiefu at tencent.com Thu Apr 9 01:23:03 2020 From: jiefu at tencent.com (=?utf-8?B?amllZnUo5YKF5p2wKQ==?=) Date: Thu, 9 Apr 2020 01:23:03 +0000 Subject: 8242379: [TESTBUG] compiler/loopopts/TestLoopUnswitchingLostCastDependency.java fails with release VMs(Internet mail) In-Reply-To: References: <3AEBD8DD-51CB-4C07-9026-52E4E9C7842B@tencent.com> Message-ID: <246370D4-F29E-41AB-A33F-94ED368DFD4B@tencent.com> On 2020/4/9, 2:11 AM, "Vladimir Kozlov" wrote: Please, also add: * @requires vm.compiler2.enabled because both Stress flags are C2 flags. Done. Thanks for your review, Vladimir K. From Yang.Zhang at arm.com Thu Apr 9 06:43:12 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 9 Apr 2020 06:43:12 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: Message-ID: Hi Update the patch a little. Could you please help to review it? http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ Test: tier1. -----Original Message----- From: aarch64-port-dev On Behalf Of Yang Zhang Sent: Friday, April 3, 2020 6:49 PM To: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I Hi, Could you please help to review this patch? In original reduce_add2I, dst may be the same as tmp2, which may get incorrect result. Some reduction operation instruct code formats are also cleaned up. 
JBS: https://bugs.openjdk.java.net/browse/JDK-8241911 Webrev: http://cr.openjdk.java.net/~yzhang/8241911/webrev.00/ Regards Yang From aph at redhat.com Thu Apr 9 09:41:59 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 10:41:59 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: Message-ID: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> On 4/9/20 7:43 AM, Yang Zhang wrote: > Hi > > Update the patch a little. Could you please help to review it? > http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ I've been trying to figure out why this code is so difficult to understand. I think it's because names like tmp1 and src1 are used regardless of what kind of thing tmp1 is. I suggest something like instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ match(Set dst (AddReductionVI i_src v_src)); ins_cost(INSN_COST); effect(TEMP v_tmp, TEMP i_tmp); format %{ "addv $v_tmp, T4S, $v_src\n\t" "umov $i_tmp, $v_tmp, S, 0\n\t" "addw $dst, $i_tmp, $i_src\t# add reduction4I" %} ins_encode %{ __ addv(as_FloatRegister($v_tmp$$reg), __ T4S, as_FloatRegister($v_src$$reg)); __ umov($i_tmp$$Register, as_FloatRegister($v_tmp$$reg), __ S, 0); __ addw($dst$$Register, $i_tmp$$Register, $i_src$$Register); %} ins_pipe(pipe_class_default); %} I think this makes the intent much clearer. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Thu Apr 9 11:21:42 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 9 Apr 2020 11:21:42 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: Hi Andrew >instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? -----Original Message----- From: Andrew Haley Sent: Thursday, April 9, 2020 5:42 PM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I On 4/9/20 7:43 AM, Yang Zhang wrote: > Hi > > Update the patch a little. Could you please help to review it? > http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ I've been trying to figure out why this code is so difficult to understand. I think it's because names like tmp1 and src1 are used regardless of what kind of thing tmp1 is. 
I suggest something like instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ match(Set dst (AddReductionVI i_src v_src)); ins_cost(INSN_COST); effect(TEMP v_tmp, TEMP i_tmp); format %{ "addv $v_tmp, T4S, $v_src\n\t" "umov $i_tmp, $v_tmp, S, 0\n\t" "addw $dst, $i_tmp, $i_src\t# add reduction4I" %} ins_encode %{ __ addv(as_FloatRegister($v_tmp$$reg), __ T4S, as_FloatRegister($v_src$$reg)); __ umov($i_tmp$$Register, as_FloatRegister($v_tmp$$reg), __ S, 0); __ addw($dst$$Register, $i_tmp$$Register, $i_src$$Register); %} ins_pipe(pipe_class_default); %} I think this makes the intent much clearer. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From kuaiwei.kw at alibaba-inc.com Thu Apr 9 11:58:36 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Thu, 09 Apr 2020 19:58:36 +0800 Subject: =?UTF-8?B?UkZSOiBoZWFwYmFzZSByZWdpc3RlciBjYW4gYmUgYWxsb2NhdGVkIGluIGNvbXByZXNzZWQg?= =?UTF-8?B?bW9kZQ==?= Message-ID: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> Hi, I made an enhancement for the aarch64 platform. It's based on the great work of https://bugs.openjdk.java.net/browse/JDK-8233743 and . In compressed oops mode, if the heapbase is zero, the JVM doesn't use the heapbase register to encode/decode. So it can be allocated by the JIT compiler. The webrev is: http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/ The bug link: https://bugs.openjdk.java.net/browse/JDK-8242449 Thanks, Kuai Wei From eric.c.liu at arm.com Thu Apr 9 12:17:08 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 9 Apr 2020 12:17:08 +0000 Subject: RFR(S):8242429:Better implementation for signed extract Message-ID: Hi, This is a small enhancement for the C2 compiler. For Java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". AArch64 has implemented this in back-end match rules, while AMD64 hasn't. 
Indeed, this pattern can be optimized in the mid-end by adding some simple transformations. Besides, "0 - (i >> 31)" could also be optimized to "i >>> 31". This patch adds two conversions: 1. URShiftINode: (i >> 31) >>> 31 ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | | / | | / | +---------+ | | RShiftI | | +---------+ | \ | \ | \ | +----------+ | URShiftI | +----------+ 2. SubINode: 0 - (i >> 31) ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ \ | \ | \ | \ | +---------+ +---------+ | ConI(0) | | RShiftI | +---------+ +---------+ \ | \ | \ | +------+ | SubI | +------+ With this patch, these two graphs above both can be optimized to below: +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | / | / | / +----------+ | URShiftI | +----------+ This patch solved the same issue for long type and also removed the relevant match rules in "aarch64.ad" which become useless now. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. -- Thanks, Eric From aph at redhat.com Thu Apr 9 12:21:22 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 13:21:22 +0100 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: On 4/9/20 12:21 PM, Yang Zhang wrote: > Hi Andrew > >> instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, vecX v_tmp, iRegINoSp i_tmp) %{ > > Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? That's a good point. 
I'll accept http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ as it is, with a separate patch to clarify those reduction operations. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From eric.c.liu at arm.com Thu Apr 9 12:57:32 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 9 Apr 2020 12:57:32 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi, This is a small enhancement for the C2 compiler. For Java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". AArch64 has implemented this in back-end match rules, while AMD64 hasn't. Indeed, this pattern can be optimized in the mid-end by adding some simple transformations. Besides, "0 - (i >> 31)" could also be optimized to "i >>> 31". This patch adds two conversions: 1. URShiftINode: (i >> 31) >>> 31 ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | | / | | / | +---------+ | | RShiftI | | +---------+ | \ | \ | \ | +----------+ | URShiftI | +----------+ 2. SubINode: 0 - (i >> 31) ==> i >>> 31 +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ \ | \ | \ | \ | +---------+ +---------+ | ConI(0) | | RShiftI | +---------+ +---------+ \ | \ | \ | +------+ | SubI | +------+ With this patch, these two graphs above both can be optimized to below: +------+ +----------+ | Parm | | ConI(31) | +------+ +----------+ | / | / | / | / +----------+ | URShiftI | +----------+ This patch solved the same issue for long type and also removed the relevant match rules in "aarch64.ad" which become useless now. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. 
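A reader can sanity-check the two rewrites described above in plain Java; this snippet is illustrative only and not part of the patch:

```java
public class SignExtractCheck {
    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -42, -1, 0, 1, 42, Integer.MAX_VALUE};
        for (int i : samples) {
            // Conversion 1: (i >> 31) >>> 31  ==>  i >>> 31
            assertEq((i >> 31) >>> 31, i >>> 31);
            // Conversion 2: 0 - (i >> 31)  ==>  i >>> 31
            // (i >> 31 is 0 or -1, so negating it yields 0 or 1.)
            assertEq(0 - (i >> 31), i >>> 31);
        }
        // Same identities for the long type, with shift count 63.
        long[] lsamples = {Long.MIN_VALUE, -1L, 0L, 7L, Long.MAX_VALUE};
        for (long l : lsamples) {
            assertEq((l >> 63) >>> 63, l >>> 63);
            assertEq(0L - (l >> 63), l >>> 63);
        }
    }

    static void assertEq(long a, long b) {
        if (a != b) throw new AssertionError(a + " != " + b);
    }
}
```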
-- Thanks, Eric From rwestrel at redhat.com Thu Apr 9 14:28:28 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Thu, 09 Apr 2020 16:28:28 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: References: Message-ID: <87imi8bunn.fsf@redhat.com> > Getting the whole story working well, especially for > explicitly vectorized loops, may require new intrinsics > (such as [4]), but I think we can make progress with strip > mining or predication alone. Is now a good time to > investigate this? I'll give it a shot. Roland. From aph at redhat.com Thu Apr 9 17:00:38 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 9 Apr 2020 18:00:38 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> Message-ID: <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> Hi, On 4/9/20 12:58 PM, Kuai Wei wrote: > I made an enhancement for aarch64 platform. It's based on great work of https://bugs.openjdk.java.net/browse/JDK-8233743 > and . > > In compressed oops mode , if heapbase is zero, jvm don't use heapbase register to encode/decode. So it can be allocated by > JIT compiler. > > The webrev is: > http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/ > > The bug link: > https://bugs.openjdk.java.net/browse/JDK-8242449 That looks safe. I think the only reason we never did something like that before was because no-one felt brave enough, but perhaps we should do it now. MacroAssembler::reinit_heapbase() points to a potential problem, though: we generate some of this code before we know what the heapbase is going to be, so we unconditionally write to rheapbase. I think this only happens in three places: generate_call_stub, interpreter::generate_throw_exception, and interpreter::generate_native_entry, so we should be safe. It's tricky to test this stuff, though. 
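For context on why the register becomes free: compressed-oop decoding computes oop = heap_base + (narrow << shift), so with a zero base the heapbase register never enters the computation. A rough model in plain Java follows; this is an illustrative sketch, not HotSpot's actual code:

```java
public class CompressedOopsSketch {
    // Illustrative model of compressed-oop decoding:
    // oop = heapBase + ((long) narrow << shift).
    static long decode(long heapBase, int shift, int narrow) {
        return heapBase + (Integer.toUnsignedLong(narrow) << shift);
    }

    public static void main(String[] args) {
        int shift = 3;          // 8-byte object alignment
        int narrow = 0x12345;   // arbitrary compressed oop
        // With a non-zero base, the base register takes part in every decode.
        if (decode(0x8_0000_0000L, shift, narrow)
                != 0x8_0000_0000L + (0x12345L << 3)) {
            throw new AssertionError();
        }
        // With a zero base, decoding degenerates to a pure shift; the
        // register that would hold the base is never read, so the JIT can
        // hand it to the register allocator.
        if (decode(0L, shift, narrow) != (0x12345L << 3)) {
            throw new AssertionError();
        }
    }
}
```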
OK for mainline, and let's test it as much as we can. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From vladimir.x.ivanov at oracle.com Thu Apr 9 18:29:18 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 9 Apr 2020 21:29:18 +0300 Subject: [15] RFR (S): 8242289: C2: Support platform-specific node cloning in Matcher In-Reply-To: References: <2645fc7c-76f7-dd3b-bee3-72d0b923d46f@oracle.com> Message-ID: <967a7fb2-931a-e0fc-d8e0-88166d8ffe43@oracle.com> Thanks, Vladimir. Best regards, Vladimir Ivanov On 07.04.2020 20:43, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir > > On 4/7/20 10:29 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242289/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242289 >> >> Introduce a platform-specific entry point (Matcher::pd_clone_node) and >> move platform-specific node cloning during matching. >> >> Matcher processes every node only once unless it is marked as shared. >> It is too restrictive in some cases, so the workaround is to >> explicitly check for particular IR patterns and clone relevant nodes >> during matching phase. >> >> As an example, take a look at ShiftCntV. There are the following match >> rules in aarch64.ad: >> >> match(Set dst (RShiftVB src (RShiftCntV shift))); >> >> By default, RShiftCntV node is matched only once, so when it has >> multiple users, it will be folded only into one of them and for >> the rest the value it produces will be put in a register. To overcome >> that, Matcher is taught to detect such pattern and "clone" RShiftCntV >> input every time it matches RShiftV node. In case of RShiftCntV, it's >> arm32/aarch64-specific and other platforms (x86 in particular) don't >> optimize for it. 
>> >> To avoid polluting shared code (in matcher.cpp) with platform-specific >> portions, I propose to add Matcher::pd_clone_node and place >> platform-specific checks there. >> >> Also, as a cleanup, renamed Matcher::clone_address_expressions() to >> pd_clone_address_expressions since it's a platform-specific method. >> >> Testing: hs-precheckin-comp, hs-tier1, hs-tier2, >> ????????? cross-builds on all affected platforms >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From john.r.rose at oracle.com Thu Apr 9 21:59:40 2020 From: john.r.rose at oracle.com (John Rose) Date: Thu, 9 Apr 2020 14:59:40 -0700 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: <87imi8bunn.fsf@redhat.com> References: <87imi8bunn.fsf@redhat.com> Message-ID: On Apr 9, 2020, at 7:28 AM, Roland Westrelin wrote: > >> Is now a good time to >> investigate this? > > I'll give it a shot. Thanks Roland! From Yang.Zhang at arm.com Fri Apr 10 02:45:45 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 10 Apr 2020 02:45:45 +0000 Subject: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I In-Reply-To: References: <1a9ed6d0-40bb-1dc4-4eff-b55c86627a47@redhat.com> Message-ID: Okay. When the patch is ready, I will send it for review. Regards Yang -----Original Message----- From: Andrew Haley Sent: Thursday, April 9, 2020 8:21 PM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(S): 8241911: AArch64: Fix a potential register clash issue in reduce_add2I On 4/9/20 12:21 PM, Yang Zhang wrote: > Hi Andrew > >> instruct reduce_add4I(iRegINoSp dst, iRegIorL2I i_src, vecX v_src, >> vecX v_tmp, iRegINoSp i_tmp) %{ > > Besides reduce_add4I, other reduction operations (reduce_mul4I, reduce_max4F, etc) also have such issues. How about creating another JBS and patch to fix this issue? That's a good point. 
I'll accept http://cr.openjdk.java.net/~yzhang/8241911/webrev.01/ as it is, with a separate patch to clarify those reduction operations.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From Yang.Zhang at arm.com Fri Apr 10 02:52:45 2020
From: Yang.Zhang at arm.com (Yang Zhang)
Date: Fri, 10 Apr 2020 02:52:45 +0000
Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690
Message-ID:

Hi, Could you please help to review this patch?

JBS: https://bugs.openjdk.java.net/browse/JDK-8242070
Webrev: http://cr.openjdk.java.net/~yzhang/8242070/webrev.00/

In JDK-8238690, it unified the IR shape for vector shifts by a scalar and always used ShiftV src (ShiftCntV shift). When shift is a scalar, the following IR nodes are generated.

    scalar_shift
        |        src
    ShiftCntV
        |       /
        |      /
      ShiftV

But when implementing this on AArch64, there is an issue in the match rule of vector shift right with imm shift for the short type.

    match(Set dst (RShiftVS src (LShiftCntV shift)));

LShiftCntV should be RShiftCntV here.

Test case:

    public static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short)(a[i] >> 2);
        }
    }

IR nodes:

      imm:2
        |       LoadVector
    RShiftCntV
        |        /
        |       /
     RShiftVS

C2 assembly generated:

Before:
    0x0000ffffac563764: orr  w11, wzr, #0x2
    0x0000ffffac563768: dup  v16.16b, w11            -------- vshiftcnt16B
    0x0000ffffac5637a8: ldr  q24, [x18, #16]
    0x0000ffffac5637ac: neg  v25.16b, v16.16b        ------
    0x0000ffffac5637b0: sshl v24.8h, v24.8h, v25.8h  ------ vsra8S
    0x0000ffffac5637b8: str  q24, [x14, #16]

"match(Set dst (RShiftVS src (LShiftCntV shift)));" matching fails. RShiftCntV and RShiftVS are matched separately by vshiftcnt16B and vsra8S.

After:
    0x0000ffffac563808: ldr  q16, [x15, #16]
    0x0000ffffac56380c: sshr v16.8h, v16.8h, #2
    0x0000ffffac563814: str  q16, [x14, #16]

"match(Set dst (RShiftVS src (RShiftCntV shift)));" matching succeeds.
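Whichever match rule fires, the generated code must preserve the Java semantics of the constant arithmetic shift; a minimal self-check of the shiftR pattern above (illustrative only, not part of the patch):

```java
public class ShiftRightCheck {
    // Same pattern as the shiftR test case above: arithmetic shift
    // right by a constant, narrowed back to short.
    static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short) (a[i] >> 2);
        }
    }

    public static void main(String[] args) {
        short[] a = { -32768, -5, -1, 0, 1, 5, 32767 };
        short[] c = new short[a.length];
        shiftR(a, c);
        // >> is an arithmetic shift: the sign bit is replicated,
        // so negative values round toward negative infinity.
        short[] expected = { -8192, -2, -1, 0, 0, 1, 8191 };
        for (int i = 0; i < a.length; i++) {
            if (c[i] != expected[i]) {
                throw new AssertionError("mismatch at " + i + ": " + c[i]);
            }
        }
        System.out.println("ok"); // → ok
    }
}
```

Running this on both a build with the typo and a fixed build should print the same result; only the instruction sequence (and the JMH score below) differs.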
Performance: The JMH test case is attached in the JBS.

Before:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  66.964 ± 0.052  us/op
After:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  56.156 ± 0.053  us/op

Testing: tier1 passes and no new failure.

Regards Yang

From kuaiwei.kw at alibaba-inc.com Fri Apr 10 04:16:30 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Fri, 10 Apr 2020 12:16:30 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com>, <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
Message-ID:

Hi Andrew, Thanks for your review. As you pointed out, some stubs are generated before the universe is fully initialized and they will reset r27 in reinit_heap. My initial thinking is they are not the problem. Interpreters can be safe because they are initialized after the heap. I can change them not to depend on the fully_initialized flag. I will check the call stubs to guarantee they are safe.

Thanks, Kuai Wei

------------------------------------------------------------------
From: Andrew Haley
Send Time: 2020-04-10 (Fri) 01:01
To: Kuai Wei ; hotspot compiler
Subject: Re: RFR: heapbase register can be allocated in compressed mode

Hi, On 4/9/20 12:58 PM, Kuai Wei wrote:
> I made an enhancement for the aarch64 platform. It's based on the great work of https://bugs.openjdk.java.net/browse/JDK-8233743
> and .
>
> In compressed oops mode, if the heap base is zero, the JVM doesn't use the heapbase register to encode/decode, so it can be allocated by
> the JIT compiler.
>
> The webrev is: http://cr.openjdk.java.net/~wzhuo/8242449/webrev.00/
> The bug link: https://bugs.openjdk.java.net/browse/JDK-8242449

That looks safe. I think the only reason we never did something like that before was because no-one felt brave enough, but perhaps we should do it now.
MacroAssembler::reinit_heapbase() points to a potential problem, though: we generate some of this code before we know what the heapbase is going to be, so we unconditionally write to rheapbase. I think this only happens in three places: generate_call_stub, interpreter::generate_throw_exception, and interpreter::generate_native_entry, so we should be safe. It's tricky to test this stuff, though. OK for mainline, and let's test it as much as we can. Thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From rwestrel at redhat.com Fri Apr 10 07:38:38 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 10 Apr 2020 09:38:38 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: References: <87imi8bunn.fsf@redhat.com> Message-ID: <87ftdbbxj5.fsf@redhat.com> Once the long loop is transformed to an int counted loop what are the optimizations that need to trigger reliably? In particular do we need range check elimination? Can you or someone from the panama project shar code samples that I can use to verify the long loop optimizes well? Roland. From HORIE at jp.ibm.com Fri Apr 10 08:47:42 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Fri, 10 Apr 2020 17:47:42 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked JTREG results, which look no problem. BTW, I cannot find further points of improvement in your change. 
Best regards, Michihiro

----- Original message -----
From: "Corey Ashford"
To: Michihiro Horie/Japan/IBM at IBMJP
Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero"
Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9
Date: Fri, Apr 3, 2020 8:07 AM

On 4/2/20 7:27 AM, Michihiro Horie wrote:
> Hi Corey,
>
> I'm not a reviewer, but I can run your benchmark in my local P9 node if
> you share it.
>
> Best regards,
> Michihiro

The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting the code for which it could predetermine the result. Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                        " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                        " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From rwestrel at redhat.com Fri Apr 10 07:38:38 2020
From: rwestrel at redhat.com (Roland Westrelin)
Date: Fri, 10 Apr 2020 09:38:38 +0200
Subject: is it time fully optimize long loops? (JDK-8223051)
In-Reply-To: References: <87imi8bunn.fsf@redhat.com>
Message-ID: <87ftdbbxj5.fsf@redhat.com>

Once the long loop is transformed to an int counted loop, what are the optimizations that need to trigger reliably? In particular, do we need range check elimination? Can you or someone from the panama project share code samples that I can use to verify the long loop optimizes well? Roland.

From aph at redhat.com Fri Apr 10 12:19:01 2020
From: aph at redhat.com (Andrew Haley)
Date: Fri, 10 Apr 2020 13:19:01 +0100
Subject: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>
Message-ID: <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>

On 4/10/20 5:16 AM, Kuai Wei wrote:
> As you pointed out, some stubs are generated before the universe is fully
> initialized and they will reset r27 in reinit_heap. My initial
> thinking is they are not the problem. Interpreters can be safe because
> they are initialized after the heap. I can change them not to depend
> on the fully_initialized flag.

Please don't change that; there's no need. Loading r27 unnecessarily in these places does no harm.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

From vladimir.x.ivanov at oracle.com Fri Apr 10 14:07:08 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Fri, 10 Apr 2020 17:07:08 +0300
Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV
Message-ID: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com>

http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8242491

Asserts on input types for MacroLogicV are too strong. The SuperWord pass can mix vectors of distinct subword types (byte and boolean, or short and char). Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe.

Testing: hs-precheckin-comp, hs-tier1, hs-tier2

Thanks! Best regards, Vladimir Ivanov

From vladimir.x.ivanov at oracle.com Fri Apr 10 14:25:56 2020
From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov)
Date: Fri, 10 Apr 2020 17:25:56 +0300
Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg()
Message-ID:

http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8242492

Matcher::vector_shift_count_ideal_reg() was introduced specifically for x86 to communicate that only the low 32 bits are used by vector shift instructions, so only those bits should be spilled when needed. Unfortunately, it is broken for LShiftCntV/RShiftCntV: the Matcher doesn't capture the overridden ideal_reg value and spills use the bottom type instead. So, it causes a mismatch during RA. Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. Considering how simple the AD instructions for LShiftCntV/RShiftCntV are, RA prefers to rematerialize the value instead (which is a reg-to-reg move).
I propose to simplify the implementation and completely remove Matcher::vector_shift_count_ideal_reg() along with the additional special handling logic for LShiftCntV/RShiftCntV.

Testing: hs-precheckin-comp, hs-tier1, hs-tier2

Thanks! Best regards, Vladimir Ivanov

From john.r.rose at oracle.com Sat Apr 11 05:37:23 2020
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 10 Apr 2020 22:37:23 -0700
Subject: is it time fully optimize long loops? (JDK-8223051)
In-Reply-To: <87d08fbmyn.fsf@redhat.com>
References: <87imi8bunn.fsf@redhat.com> <87ftdbbxj5.fsf@redhat.com> <87d08fbmyn.fsf@redhat.com>
Message-ID:

On Apr 10, 2020, at 4:26 AM, Roland Westrelin wrote:
>
>> Once the long loop is transformed to an int counted loop what are the
>> optimizations that need to trigger reliably? In particular do we need
>> range check elimination? Can you or someone from the panama project share
>> code samples that I can use to verify the long loop optimizes well?
>
> I see now that you mentioned RCE in JDK-8223051.

RCE focuses on comparisons against array lengths but it is more general than that. If long loops are strip mined into short loops, and if the range checks in those short loops are somehow transformed into 32-bit comparisons, they should be amenable to RCE transformations. I hope we don't need to generalize RCE transformations to know about 64-bit comparisons; that seems to be harder.

--
John

From kuaiwei.kw at alibaba-inc.com Mon Apr 13 01:32:45 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Mon, 13 Apr 2020 09:32:45 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>
References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com>, <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com>
Message-ID: <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com>

Ok, I will keep the original change. I cannot push to the tip branch. Can you help me push it? Or do we need another review?

Thanks, Kuai Wei

------------------------------------------------------------------
From: Andrew Haley
Send Time: 2020-04-10 (Fri) 20:19
To: Kuai Wei ; hotspot compiler
Subject: Re: RFR: heapbase register can be allocated in compressed mode

On 4/10/20 5:16 AM, Kuai Wei wrote:
> As you pointed out, some stubs are generated before the universe is fully
> initialized and they will reset r27 in reinit_heap. My initial
> thinking is they are not the problem. Interpreters can be safe because
> they are initialized after the heap. I can change them not to depend
> on the fully_initialized flag.

Please don't change that; there's no need. Loading r27 unnecessarily in these places does no harm.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 13 02:22:40 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 13 Apr 2020 02:22:40 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> , <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> Message-ID: Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From vladimir.x.ivanov at oracle.com Mon Apr 13 08:41:21 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Mon, 13 Apr 2020 11:41:21 +0300 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi Eric, I was confused at first by what "signed extract" means. It should be "sign extract". Overall, the changes look good. One comment: (i >> 31) >>> 31 ==> i >>> 31 The shift count value is irrelevant here, isn't it? So, the transformation can be generalized to: (i >> n) >>> 31 ==> i >>> 31 Best regards, Vladimir Ivanov On 09.04.2020 15:17, Eric Liu wrote: > Hi, > > This is a small enhancement for C2 compiler. > > > For java code "(i >> 31) >>> 31", it can be optimized to "i >>> 31". > AArch64 has implemented this in back-end match rules, while AMD64 > hasn?t. > > Indeed, this pattern can be optimized in mid-end by adding some simple > transformations. Besides, "0 - (i >> 31)" could also be optimized to > "i >>> 31". > > This patch adds two conversions: > > 1. 
URShiftINode: (i >> 31) >>> 31 ==> i >>> 31
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>     |      /       |
>     |     /        |
>     |    /         |
>  +---------+       |
>  | RShiftI |       |
>  +---------+       |
>        \           |
>         \          |
>          \         |
>       +----------+
>       | URShiftI |
>       +----------+
>
> 2. SubINode: 0 - (i >> 31) ==> i >>> 31
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>       \         |
>        \        |
>         \       |
>          \      |
> +---------+   +---------+
> | ConI(0) |   | RShiftI |
> +---------+   +---------+
>       \          |
>        \         |
>         \        |
>        +------+
>        | SubI |
>        +------+
>
> With this patch, these two graphs above can both be optimized to the one below:
>
> +------+   +----------+
> | Parm |   | ConI(31) |
> +------+   +----------+
>     |       /
>     |      /
>     |     /
>     |    /
> +----------+
> | URShiftI |
> +----------+
>
> This patch solved the same issue for the long type and also removed the
> relevant match rules in "aarch64.ad" which become useless now.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8242429
> Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.00/
>
> [Tests]
> Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1.
> No new failure found.
>
> -- Thanks, Eric

From kuaiwei.kw at alibaba-inc.com Mon Apr 13 09:52:33 2020
From: kuaiwei.kw at alibaba-inc.com (Kuai Wei)
Date: Mon, 13 Apr 2020 17:52:33 +0800
Subject: Re: RFR: heapbase register can be allocated in compressed mode
In-Reply-To: References: Message-ID:

Hi Pengfei, Thanks for your help. Kuai Wei

------------------------------------------------------------------
From: Pengfei Li
Send Time: 2020-04-13 (Mon) 10:37
To: Kuai Wei
; Andrew Haley ; hotspot compiler Cc:nd Subject:RE: RFR: heapbase register can be allocated in compressed mode Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From sandhya.viswanathan at intel.com Mon Apr 13 17:02:15 2020 From: sandhya.viswanathan at intel.com (Viswanathan, Sandhya) Date: Mon, 13 Apr 2020 17:02:15 +0000 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: Hi Vladimir, Your change looks good to me. Best Regards, Sandhya -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Ivanov Sent: Friday, April 10, 2020 7:07 AM To: hotspot compiler Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8242491 Asserts on input types for MacroLogicV are too strong. SuperWord pass can mix vectors of distinct subword types (byte and boolean or short and char). Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe. Testing: hs-precheckin-comp,hs-tier1,hs-tier2 Thanks! Best regards, Vladimir Ivanov From xxinliu at amazon.com Mon Apr 13 17:33:54 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Mon, 13 Apr 2020 17:33:54 +0000 Subject: FR[M]: 8151779: Some intrinsic flags could be replaced with one general flag Message-ID: Hi, compiler developers, I attempt to refactor UseXXXIntrinsics for JDK-8151779. 
I think we still need to keep UseXXXIntrinsics options because many applications may be using them. My change provide 2 new features: 1) a shorthand to enable/disable intrinsics. A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. If the tailing symbol is missing, it means enable. Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics 2) provide a set of macro to declare intrinsic options Developers declare once in intrinsics.hpp and macros will take care all other places. Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? I plan to write a gtest to test intrinsics.cpp and finalize the webrev when Ion finalize his overhaul. But here is quick preview of my change. I really appreciate if you can give me some feedback. 
https://cr.openjdk.java.net/~xliu/8151779/00/webrev/ I use -XX:+PrintFlagsFinal to verify my expression work or not. eg. $java -XX:UseIntrinsics=",AESCTR-,CRC32C,,CRC32-,,MathExact," -XX:+PrintFlagsFinal -version |& grep "Use.*Intrinsics" Thanks. --lx From jatin.bhateja at intel.com Mon Apr 13 19:07:00 2020 From: jatin.bhateja at intel.com (Bhateja, Jatin) Date: Mon, 13 Apr 2020 19:07:00 +0000 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: +1 Looks good to me. Regards, Jatin > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Viswanathan, Sandhya > Sent: Monday, April 13, 2020 10:32 PM > To: Vladimir Ivanov ; hotspot compiler > > Subject: RE: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) > failed: mismatch when creating MacroLogicV > > Hi Vladimir, > > Your change looks good to me. > > Best Regards, > Sandhya > > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Vladimir Ivanov > Sent: Friday, April 10, 2020 7:07 AM > To: hotspot compiler > Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: > mismatch when creating MacroLogicV > > http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242491 > > Asserts on input types for MacroLogicV are too strong. > SuperWord pass can mix vectors of distinct subword types (byte and boolean > or short and char). > > Though it's possible to explicitly check for such particular cases, the fix > relaxes the assert even more and only verifies that inputs are of the same > size (in bytes), so bitwise reinterpretation of vector values is safe. > > Testing: hs-precheckin-comp,hs-tier1,hs-tier2 > > Thanks! 
> > Best regards, > Vladimir Ivanov From cjashfor at linux.ibm.com Mon Apr 13 20:42:40 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Mon, 13 Apr 2020 13:42:40 -0700 Subject: FR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: References: Message-ID: <2cfeb040-72a6-7e40-8356-56ee3bda3cdf@linux.ibm.com> On 4/13/20 10:33 AM, Liu, Xin wrote: > Hi, compiler developers, > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > My change provide 2 new features: > 1) a shorthand to enable/disable intrinsics. > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > If the tailing symbol is missing, it means enable. > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > 2) provide a set of macro to declare intrinsic options > Developers declare once in intrinsics.hpp and macros will take care all other places. > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. > Great idea, though to be consistent with the original syntax, I think the +/- should be in front of the name: -XX:UseIntrinsics=-AESCTR,+CRC32C,... > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. 
It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? Some (many?) intrinsic options turn on more than one .ad instruct instrinsic, or library instrinsics at the same time. I think that's why the plural is there. Also, consistently adding the plural allows you to add more capabilities to a flag that initially only had one intrinsic without changing the plurality (and thus backward compatibility). Regards, - Corey From xxinliu at amazon.com Tue Apr 14 07:41:25 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 14 Apr 2020 07:41:25 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> Message-ID: <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Hi, Wei, Your change of aarch64.ad is definitely correct, but I feel that's the only place c2 refers to reg_class heapbase_reg. If it's gone, is that possible we use R27 no matter what UseCompressedOops is? I read JDK-8234794 but I don't understand why that change involves in r27 and CompressedOop. Btw, I think you can just keep the assignment in MacroAssembler::reinit_heapbase() for simplicity. Leaving a comment is better. I think Assignment of rheapbase is harmless. Only c2-generated code will use rheapbase and it's for locals. I still can pass hotspot-tier1 without your change of macroAssembler_aarch64.cpp. Another argument is that your change of reinit_heapbase() makes verify_heapbase() more complex. 
I don't know why it is commented out, but it looks quite easy to fix currently. Thanks, --lx ?On 4/13/20, 2:55 AM, "hotspot-compiler-dev on behalf of Kuai Wei" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Pengfei, Thanks for your help. Kuai Wei ------------------------------------------------------------------ From:Pengfei Li Send Time:2020?4?13?(???) 10:37 To:??(??) ; Andrew Haley ; hotspot compiler Cc:nd Subject:RE: RFR: heapbase register can be allocated in compressed mode Hi Wei, > I can not push to tip branch. Can you help me to push it ? Or do we need > other reivew? Thanks for your enhancement patch. I ran full jtreg in the weekend and found no new failure after this change. We could also help push if there's no other review comments. -- Thanks, Pengfei From Pengfei.Li at arm.com Tue Apr 14 08:38:37 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Tue, 14 Apr 2020 08:38:37 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Message-ID: Hi Xin, > I read JDK-8234794 but I don't understand why that change involves in r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we have to consider if r27 is allocatable if compressed class pointers is on. But after that patch, r27 is for compressed oops only. 
That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From xxinliu at amazon.com Tue Apr 14 09:37:22 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 14 Apr 2020 09:37:22 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> Message-ID: <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> Hi, Pengfei and Kuai, Thanks to point out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it. thanks! --lx ?On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves in r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we have to consider if r27 is allocatable if compressed class pointers is on. But after that patch, r27 is for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. 
-- Thanks, Pengfei From kuaiwei.kw at alibaba-inc.com Tue Apr 14 13:25:01 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Tue, 14 Apr 2020 21:25:01 +0800 Subject: Re: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com>, <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> Message-ID: <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> Hi Xin and Pengfei, Thanks for your comments. I checked the change in reinit_heapbase and decided to revert it, since there is no harm in setting rheapbase. I also made a change in verify_heapbase in case someone wants to enable this check again. The new patch is at http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tier 1 testing without new failures. Thanks, Kuai Wei ------------------------------------------------------------------ From: Liu, Xin Send Time: 2020-04-14 (Tue) 17:37 To: Pengfei Li ; Kuai Wei ; Andrew Haley ; hotspot compiler Cc: nd Subject: Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing that out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it, thanks! --lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix.
It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 was used for both compressed oops and compressed class pointers, so at that time we had to consider whether r27 was allocatable when compressed class pointers were on. But after that patch, r27 is used for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From martin.doerr at sap.com Tue Apr 14 13:26:08 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Tue, 14 Apr 2020 13:26:08 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi Corey, thanks for contributing it. Looks good to me. I'll run it through our testing and let you know about the results. Best regards, Martin From: ppc-aix-port-dev On Behalf Of Michihiro Horie Sent: Friday, 10. April 2020 10:48 To: cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked the JTREG results, which show no problems. BTW, I cannot find any further points of improvement in your change.
Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" > To: Michihiro Horie/Japan/IBM at IBMJP Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero" > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Fri, Apr 3, 2020 8:07 AM On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine. Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                                   " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                                   " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From martin.doerr at sap.com Tue Apr 14 14:07:06 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Tue, 14 Apr 2020 14:07:06 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Message-ID: Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4, which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default, which doesn't make sense to me. PPC64 has an automatic prefetch engine, and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check the performance impact of changing AllocatePrefetchLines + Distance, I'll be glad to receive feedback.
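[Editorial aside on the reverseBytes() thread above] What the Power9 patch accelerates is plain byte reversal, exactly as Long.reverseBytes() specifies it. A portable sketch of that semantics, checked against the JDK method (the intrinsic replacement itself is generated by the JIT, not written like this):

```java
// Byte-by-byte equivalent of Long.reverseBytes(); on Power9 the JIT
// intrinsic can emit a single brd instruction instead of a sequence of
// shifts and masks like this loop.
class ReverseBytesSketch {
    static long reverseLong(long v) {
        long r = 0;
        for (int i = 0; i < 8; i++) {
            r = (r << 8) | (v & 0xFF); // append lowest byte of v
            v >>>= 8;
        }
        return r;
    }

    public static void main(String[] args) {
        long x = 0x1122334455667788L;
        if (reverseLong(x) != Long.reverseBytes(x)) throw new AssertionError();
        System.out.println(Long.toHexString(reverseLong(x))); // 8877665544332211
    }
}
```

Reversing twice is the identity, which is what the benchmarks in this thread rely on for their self-check.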
Best regards, Martin From tom.rodriguez at oracle.com Tue Apr 14 20:44:20 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Tue, 14 Apr 2020 13:44:20 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> Message-ID: <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> Vladimir Kozlov wrote on 4/3/20 5:41 PM: > I think new code in deoptimize.cpp should be JVMCI specific. > > I filed 8242150 for the serviceability test failures in testing. It seems > caused by recent changes. > > It is weird to see SPARC_32 checks in deoptimization.cpp, which we should > not have in new code: > > #ifdef _LP64 >         jlong res = (jlong) *((jlong *) &val); > #else > #ifdef SPARC >       // For SPARC we have to swap high and low words. > > We haven't supported such a configuration in eons. Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. Should I remove those and the logic in my new code? output.cpp appears to have a case as well. > > I don't see where _support_large_access_byte_array_virtualization is > checked. If it is only in Graal then it should be guarded by #if. I'll add the requested ifdefs. tom > > Thanks, > Vladimir > > On 4/3/20 12:37 PM, Tom Rodriguez wrote: >> >> >> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>> Hi Tom, >>> >>> I looked at the testing results and one test fails consistently: >>> >>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >> >> >> Sorry, that was an old mach5 run and I forgot to update with the new >> one. There are some failures but they seem unrelated to me. >> >> tom >> >>> >>> >>> Vladimir K >>> >>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>> >>>> This adds support for deoptimizing with non-byte primitive values >>>> stored on top of a byte array, similarly to the way that a double or >>>> long can be stored on top of 2 int fields. More detail is provided >>>> in the bug report and new unit tests exercise the deoptimization. >>>> mach5 testing is in progress. >>>> >>>> tom From vladimir.kozlov at oracle.com Tue Apr 14 21:07:42 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 14 Apr 2020 14:07:42 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> Message-ID: <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> On 4/14/20 1:44 PM, Tom Rodriguez wrote: > > > Vladimir Kozlov wrote on 4/3/20 5:41 PM: >> I think new code in deoptimize.cpp should be JVMCI specific. >> >> I filed 8242150 for the serviceability test failures in testing. It seems caused by recent changes. >> >> It is weird to see SPARC_32 checks in deoptimization.cpp, which we should not have in new code: >> >> #ifdef _LP64 >>         jlong res = (jlong) *((jlong *) &val); >> #else >> #ifdef SPARC >>       // For SPARC we have to swap high and low words. >> >> We haven't supported such a configuration in eons. > > Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like > http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084.
Should I remove those and > the logic in my new code? output.cpp appears to have a case as well. No, we will remove them soon for JEP 381: Remove the Solaris and SPARC Ports. I don't want you to add a new case. > >> >>> >>> I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it should >>> be guarded by #if. >> >> I'll add the requested ifdefs. > > Good. Thanks, Vladimir > > tom > >> >> Thanks, >> Vladimir >> >> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>> >>> >>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>> Hi Tom, >>>> >>>> I looked at the testing results and one test fails consistently: >>>> >>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>> >>> >>> Sorry, that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem >>> unrelated to me. >>> >>> tom >>> >>>> >>>> >>>> Vladimir K >>>> >>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>> >>>>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to the >>>>> way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report and new >>>>> unit tests exercise the deoptimization. mach5 testing is in progress.
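[Editorial aside on 8231756] The change under review lets deoptimization rematerialize a scalar-replaced byte[] even when the compiler proved that a wider primitive, such as a long, was written across several of its elements. A hedged sketch of the packing involved (ByteBuffer is used purely for illustration; HotSpot's actual code reassembles the values directly in the frame, and the byte order must match the platform's array layout):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Write a long over 8 consecutive byte-array slots and read it back,
// mirroring the reassembly a deoptimizing VM must perform when a
// virtualized byte[] encodes a non-byte primitive.
class ByteArrayVirtualization {
    static void putLong(byte[] a, int off, long v) {
        ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).putLong(off, v);
    }

    static long getLong(byte[] a, int off) {
        return ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong(off);
    }

    static long roundTrip(long v) {
        byte[] a = new byte[8];
        putLong(a, 0, v);
        return getLong(a, 0);
    }

    public static void main(String[] args) {
        if (roundTrip(0x1122334455667788L) != 0x1122334455667788L)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

The SPARC discussion in the thread is about exactly this step: on a 32-bit big-endian target the two halves of the long land in the opposite order, hence the word-swapping code being debated.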
>>>>> >>>>> tom From xxinliu at amazon.com Wed Apr 15 03:16:55 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Wed, 15 Apr 2020 03:16:55 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> Message-ID: <781CB090-0386-4D32-8465-8238E516789B@amazon.com> Hi, Wei, LGTM. Thanks. --lx From: Kuai Wei Reply-To: Kuai Wei Date: Tuesday, April 14, 2020 at 6:26 AM To: "Liu, Xin" , Pengfei Li , Andrew Haley , hotspot compiler Cc: nd Subject: RE: RFR: heapbase register can be allocated in compressed mode Hi Xin and Pengfei, Thanks for your comments. I checked the change in reinit_heapbase and decided to revert it, since there is no harm in setting rheapbase. I also made a change in verify_heapbase in case someone wants to enable this check again. The new patch is at http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tier 1 testing without new failures. Thanks, Kuai Wei ------------------------------------------------------------------ From: Liu, Xin Send Time: 2020-04-14 (Tue) 17:37 To: Pengfei Li ; Kuai Wei ; Andrew Haley ; hotspot compiler Cc: nd Subject: Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing that out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it, thanks!
--lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 was used for both compressed oops and compressed class pointers, so at that time we had to consider whether r27 was allocatable when compressed class pointers were on. But after that patch, r27 is used for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. -- Thanks, Pengfei From martin.doerr at sap.com Wed Apr 15 12:33:16 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Wed, 15 Apr 2020 12:33:16 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com>, <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: Hi again, testing didn't show any new issues. Only the copyright years should get updated before pushing. Is there already a sponsor or do you want me to push it? Best regards, Martin From: Doerr, Martin Sent: Tuesday, 14. April 2020 15:26 To: Michihiro Horie ; cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: RE: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, thanks for contributing it. Looks good to me. I'll run it through our testing and let you know about the results. Best regards, Martin From: ppc-aix-port-dev > On Behalf Of Michihiro Horie Sent: Friday, 10.
April 2020 10:48 To: cjashfor at linux.ibm.com Cc: hotspot-compiler-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Hi Corey, Thank you for sharing your benchmarks. I confirmed your change reduced the elapsed time of the benchmarks by more than 30% on my P9 node. Also, I checked the JTREG results, which show no problems. BTW, I cannot find any further points of improvement in your change. Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" > To: Michihiro Horie/Japan/IBM at IBMJP Cc: hotspot-compiler-dev at openjdk.java.net, ppc-aix-port-dev at openjdk.java.net, "Gustavo Romero" > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Fri, Apr 3, 2020 8:07 AM On 4/2/20 7:27 AM, Michihiro Horie wrote: > Hi Corey, > > I'm not a reviewer, but I can run your benchmark in my local P9 node if > you share it. > > Best regards, > Michihiro The tests are somewhat hokey; I added the shifts to keep the compiler from hoisting code whose result it could predetermine.
Here's the one for Long.reverseBytes():

import java.lang.*;

class ReverseLong {
    public static void main(String args[]) {
        long reversed, re_reversed;
        long accum = 0;
        long orig = 0x1122334455667788L;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Long.reverseBytes(orig);
            re_reversed = Long.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%16x", orig) +
                                   " Re-reversed: " + String.format("%16x", re_reversed));
            }
            accum += orig;
            orig = Long.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Long.toString(accum));
    }
}

And the one for Integer.reverseBytes():

import java.lang.*;

class ReverseInt {
    public static void main(String args[]) {
        int reversed, re_reversed;
        int orig = 0x11223344;
        int accum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000_000; i++) {
            // Try to keep java from figuring out stuff in advance
            reversed = Integer.reverseBytes(orig);
            re_reversed = Integer.reverseBytes(reversed);
            if (re_reversed != orig) {
                System.out.println("Orig: " + String.format("%08x", orig) +
                                   " Re-reversed: " + String.format("%08x", re_reversed));
            }
            accum += orig;
            orig = Integer.rotateRight(orig, 3);
        }
        System.out.println("Elapsed time: " + Long.toString(System.currentTimeMillis() - start));
        System.out.println("accum: " + Integer.toString(accum));
    }
}

From vladimir.kozlov at oracle.com Wed Apr 15 18:12:53 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 11:12:53 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com>
<19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> Message-ID: <73936b07-976d-52aa-6427-339878a571b0@oracle.com> After discussion with Tom offline I agree to keep his SPARC code because we would need to backport this later into 11u. Thanks, Vladimir On 4/14/20 2:07 PM, Vladimir Kozlov wrote: > On 4/14/20 1:44 PM, Tom Rodriguez wrote: >> >> >> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>> I think new code in deoptimize.cpp should be JVMCI specific. >>> >>> I filed 8242150 for serviceability tests failures in testing. It seems caused by recent changes. >>> >>> It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: >>> >>> #ifdef _LP64 >>> ???????? jlong res = (jlong) *((jlong *) &val); >>> #else >>> #ifdef SPARC >>> ?????? // For SPARC we have to swap high and low words. >>> >>> We don't support such configuration for eons. >> >> Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like >> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. ?Should remove those >> and the logic in my new code?? output.cpp appears to have a case as well. > > No, we will remove them soon for JEP: 381: Remove the Solaris and SPARC Ports. > > I don't want you to add new case. > >> >>> >>> I don't see? where _support_large_access_byte_array_virtualization? is checked. If it is only in Graal then it should >>> be guarded by #if. >> >> I'll add the requested ifdefs. > > Good. 
> > Thanks, > Vladimir > >> >> tom >> >>> >>> Thanks, >>> Vladimir >>> >>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>> >>>> >>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>> Hi Tom, >>>>> >>>>> I looked at the testing results and one test fails consistently: >>>>> >>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>> >>>> >>>> Sorry, that was an old mach5 run and I forgot to update with the new >>>> one. There are some failures but they seem unrelated to me. >>>> >>>> tom >>>> >>>>> >>>>> >>>>> Vladimir K >>>>> >>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>> >>>>>> This adds support for deoptimizing with non-byte primitive values >>>>>> stored on top of a byte array, similarly to the way that a double >>>>>> or long can be stored on top of 2 int fields. More detail is >>>>>> provided in the bug report and new unit tests exercise the >>>>>> deoptimization. mach5 testing is in progress. >>>>>> >>>>>> tom From tkachuk.vladyslav at gmail.com Wed Apr 15 22:05:27 2020 From: tkachuk.vladyslav at gmail.com (Vladyslav Tkachuk) Date: Thu, 16 Apr 2020 00:05:27 +0200 Subject: Master Thesis Research Advice. JIT Message-ID: Hello, I am a Master's student at the University of Passau, Germany. My master thesis research is concerned with detecting equivalent mutants in Java. The main research question is whether the Trivial Compiler Equivalency technique can be applied. This means that we acquire the assembly code produced by the Java JIT compiler for the original and mutated sources and then compare them. I have previously contacted Tobias Hartmann, who advised me to write here regarding technical questions. I would like to ask you if there is any solution to a problem I have. Last time Tobias recommended me to use Opto-Assembly to achieve my purpose. It was a good hint and it helped me to get more precise data.
However, after doing some research I noticed that in some cases the C2 compiler unloaded the method code which I expected to find in the assembly. As I found out, this was part of deoptimization, and the method code was meant to be executed by the interpreter. Here is an example of what I mean:

{method}
  - this oop: 0x000000000d2319c8
  - method holder: 'Rational'
  - constants: 0x000000000d230cf8 constant pool [85] {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8
  - access: 0x81000001 public
  - name: 'toString'
  - signature: '()Ljava/lang/String;'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
some setup code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
02c   movq RBP, RDX   # spill
02f   movl RDX, #11   # int
      nop             # 3 bytes pad for loops and calls
037   call,static wrapper for: uncommon_trap(reason='unloaded' action='reinterpret' index='11')
      # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_
      # OopMap{rbp=Oop off=60}
03c   int3            # ShouldNotReachHere
03c

This is a 'toString' method, and as I could see and understand, there is no actual method code, but only a call to the uncommon trap. I would like to know if it is possible to completely disable any deoptimizations and consistently receive the full assembly code. I concede that it is not practical and hurts performance, but that is not a goal in this scope. According to my observations, in most cases the method code is complete, but strangely here it did not work. I have tried to find useful information online; unfortunately, I did not see anything helpful beyond explanations of what deoptimization is and its types. I would be grateful if you could shed some light on the issue. Thanks in advance for any useful information.
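[Editorial aside] A trap with reason='unloaded' is emitted when C2 compiles a method before some class or constant it references has been resolved, so the compiler plants a placeholder that falls back to the interpreter. A common way to get complete Opto-Assembly is therefore to warm the method up before compilation. A minimal sketch, assuming a stand-in Rational class (this is not the thesis code, and the iteration count is an assumed warm-up threshold); run with e.g. -Xbatch and a diagnostic print flag when inspecting the output:

```java
// Warm up toString() so every class/constant it touches is resolved
// before C2 compiles it; this usually removes
// uncommon_trap(reason='unloaded') placeholders from the generated code.
class WarmupSketch {
    static final class Rational {       // hypothetical stand-in class
        final int num, den;
        Rational(int n, int d) { num = n; den = d; }
        @Override public String toString() { return num + "/" + den; }
    }

    static long run(int iters) {
        Rational r = new Rational(1, 3);
        long sink = 0;
        for (int i = 0; i < iters; i++) {
            sink += r.toString().length(); // keeps the call from being dead code
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(20_000)); // 60000
    }
}
```

This does not disable deoptimization outright, but with everything resolved at compile time the method body is compiled in full rather than replaced by a trap call.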
Best regards, Vladyslav Tkachuk From vladimir.kozlov at oracle.com Wed Apr 15 23:29:25 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 16:29:25 -0700 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> Message-ID: <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> Good. Thanks, Vladimir K On 4/10/20 7:07 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242491 > > Asserts on input types for MacroLogicV are too strong. > The SuperWord pass can mix vectors of distinct subword types (byte and boolean, or short and char). > > Though it's possible to explicitly check for such particular cases, the fix relaxes the assert even more and only > verifies that inputs are of the same size (in bytes), so bitwise reinterpretation of vector values is safe. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2 > > Thanks! > > Best regards, > Vladimir Ivanov From vladimir.kozlov at oracle.com Wed Apr 15 23:33:54 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 16:33:54 -0700 Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg() In-Reply-To: References: Message-ID: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> Good. Thanks, Vladimir K On 4/10/20 7:25 AM, Vladimir Ivanov wrote: > http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8242492 > > Matcher::vector_shift_count_ideal_reg() was introduced specifically for x86 to communicate that only the low 32 bits are > used by vector shift instructions, so only those bits should be spilled when needed. > > Unfortunately, it is broken for LShiftCntV/RShiftCntV: the Matcher doesn't capture the overridden ideal_reg value and spills use > the bottom type instead.
So, it causes a mismatch during RA. > > Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. Considering how simple AD instructions for > LShiftCntV/RShiftCntV are, RA prefers to rematerialize the value instead (which is a reg-to-reg move). > > I propose to simplify the implementation and completely remove Matcher::vector_shift_count_ideal_reg() along with > additional special handling logic for LShiftCntV/RShiftCntV. > > Testing: hs-precheckin-comp, hs-tier1, hs-tier2 > > Thanks! > > Best regards, > Vladimir Ivanov From tom.rodriguez at oracle.com Thu Apr 16 00:34:35 2020 From: tom.rodriguez at oracle.com (Tom Rodriguez) Date: Wed, 15 Apr 2020 17:34:35 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <73936b07-976d-52aa-6427-339878a571b0@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> <73936b07-976d-52aa-6427-339878a571b0@oracle.com> Message-ID: <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> I've updated the webrev in place with the new ifdefs in deoptimization.cpp. The mach5 run was clean apart from known failures. tom Vladimir Kozlov wrote on 4/15/20 11:12 AM: > After discussion with Tom offline I agree to keep his SPARC code because > we would need to backport this later into 11u. > > Thanks, > Vladimir > > On 4/14/20 2:07 PM, Vladimir Kozlov wrote: >> On 4/14/20 1:44 PM, Tom Rodriguez wrote: >>> >>> >>> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>>> I think new code in deoptimize.cpp should be JVMCI specific. >>>> >>>> I filed 8242150 for serviceability tests failures in testing. It >>>> seems caused by recent changes. 
>>>> >>>> It is weird to see SPARC_32 checks in deoptimization.cpp which we >>>> should not have in new code: >>>> >>>> #ifdef _LP64 >>>> ???????? jlong res = (jlong) *((jlong *) &val); >>>> #else >>>> #ifdef SPARC >>>> ?????? // For SPARC we have to swap high and low words. >>>> >>>> We don't support such configuration for eons. >>> >>> Currently there are 3 places in deoptimization.cpp that handle sparc >>> 32 bit, like >>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. >>> ?Should remove those and the logic in my new code?? output.cpp >>> appears to have a case as well. >> >> No, we will remove them soon for JEP: 381: Remove the Solaris and >> SPARC Ports. >> >> I don't want you to add new case. >> >>> >>>> >>>> I don't see? where _support_large_access_byte_array_virtualization >>>> is checked. If it is only in Graal then it should be guarded by #if. >>> >>> I'll add the requested ifdefs. >> >> Good. >> >> Thanks, >> Vladimir >> >>> >>> tom >>> >>>> >>>> Thanks, >>>> Vladimir >>>> >>>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>>> >>>>> >>>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>>> Hi Tom, >>>>>> >>>>>> I looked on testing results and one test fails consistently: >>>>>> >>>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>>> >>>>> >>>>> >>>>> Sorry that was an old mach5 run and I forgot to update with the new >>>>> one. ?There are some failures but they seem unrelated to me. >>>>> >>>>> tom >>>>> >>>>>> >>>>>> >>>>>> Vladimir K >>>>>> >>>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>>> >>>>>>> This adds support for deoptimizing with non-byte primitive values >>>>>>> stored on top of a byte array, similarly to the way that a double >>>>>>> or long can be stored on top of 2 int fields.? 
More detail is >>>>>>> provided in the bug report and new unit tests exercise the >>>>>>> deoptimization. mach5 testing is in progress. >>>>>>> >>>>>>> tom From vladimir.kozlov at oracle.com Thu Apr 16 00:40:19 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Wed, 15 Apr 2020 17:40:19 -0700 Subject: RFR(S) 8231756: [JVMCI] need support for deoptimizing virtual byte arrays encoding non-byte primitives In-Reply-To: <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> References: <5eb640c1-fb33-2fea-ef2f-416e19220d97@oracle.com> <68e115c1-2088-c075-8ffc-2c79f489fa81@oracle.com> <19684e18-4d08-c687-1bb0-c869ab707d74@oracle.com> <68c8608f-fb2c-83ed-3b14-39dccb9e80ad@oracle.com> <113b3322-ddd1-9a7a-6247-c70c5bc5f5fc@oracle.com> <5a7bb358-2d9a-5218-95a7-20d1d51bad1c@oracle.com> <73936b07-976d-52aa-6427-339878a571b0@oracle.com> <2426180e-0c29-e8e9-2ffc-f5de005608e5@oracle.com> Message-ID: <6e075a13-cda5-9d2d-5d96-5b2c7c2c7cdd@oracle.com> Good. Thanks, Vladimir On 4/15/20 5:34 PM, Tom Rodriguez wrote: > I've updated the webrev in place with the new ifdefs in deoptimization.cpp.? The mach5 run was clean apart from known > failures. > > tom > > Vladimir Kozlov wrote on 4/15/20 11:12 AM: >> After discussion with Tom offline I agree to keep his SPARC code because we would need to backport this later into 11u. >> >> Thanks, >> Vladimir >> >> On 4/14/20 2:07 PM, Vladimir Kozlov wrote: >>> On 4/14/20 1:44 PM, Tom Rodriguez wrote: >>>> >>>> >>>> Vladimir Kozlov wrote on 4/3/20 5:41 PM: >>>>> I think new code in deoptimize.cpp should be JVMCI specific. >>>>> >>>>> I filed 8242150 for serviceability tests failures in testing. It seems caused by recent changes. >>>>> >>>>> It is weird to see SPARC_32 checks in deoptimization.cpp which we should not have in new code: >>>>> >>>>> #ifdef _LP64 >>>>> ???????? jlong res = (jlong) *((jlong *) &val); >>>>> #else >>>>> #ifdef SPARC >>>>> ?????? // For SPARC we have to swap high and low words. 
>>>>> >>>>> We don't support such configuration for eons. >>>> >>>> Currently there are 3 places in deoptimization.cpp that handle sparc 32 bit, like >>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/runtime/deoptimization.cpp#l1084. Should remove those >>>> and the logic in my new code? output.cpp appears to have a case as well. >>> >>> No, we will remove them soon for JEP: 381: Remove the Solaris and SPARC Ports. >>> >>> I don't want you to add new case. >>> >>>> >>>>> >>>>> I don't see where _support_large_access_byte_array_virtualization is checked. If it is only in Graal then it >>>>> should be guarded by #if. >>>> >>>> I'll add the requested ifdefs. >>> >>> Good. >>> >>> Thanks, >>> Vladimir >>> >>>> >>>> tom >>>> >>>>> >>>>> Thanks, >>>>> Vladimir >>>>> >>>>> On 4/3/20 12:37 PM, Tom Rodriguez wrote: >>>>>> >>>>>> >>>>>> Vladimir Kozlov wrote on 4/3/20 10:31 AM: >>>>>>> Hi Tom, >>>>>>> >>>>>>> I looked on testing results and one test fails consistently: >>>>>>> >>>>>>> compiler/jvmci/jdk.vm.ci.hotspot.test/src/jdk/vm/ci/hotspot/test/VirtualObjectLayoutTest.java >>>>>> >>>>>> >>>>>> >>>>>> Sorry that was an old mach5 run and I forgot to update with the new one. There are some failures but they seem >>>>>> unrelated to me. >>>>>> >>>>>> tom >>>>>> >>>>>>> >>>>>>> >>>>>>> Vladimir K >>>>>>> >>>>>>> On 4/2/20 12:12 PM, Tom Rodriguez wrote: >>>>>>>> http://cr.openjdk.java.net/~never/8231756/webrev >>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8231756 >>>>>>>> >>>>>>>> This adds support for deoptimizing with non-byte primitive values stored on top of a byte array, similarly to >>>>>>>> the way that a double or long can be stored on top of 2 int fields. More detail is provided in the bug report >>>>>>>> and new unit tests exercise the deoptimization. mach5 testing is in progress. 
>>>>>>>> >>>>>>>> tom From cjashfor at linux.ibm.com Thu Apr 16 01:34:46 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Wed, 15 Apr 2020 18:34:46 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> Message-ID: <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Hello Martin, I'm having some trouble with my email server, so I'm having to reply to your earlier post, but I saw your most recent post on the mailing list archive. Thanks for reviewing and testing this patch. I went to look at the copyright dates, and see two date ranges: one for Oracle and its affiliates, and another for SAP. In the files I looked at, the end date wasn't the same between the two. Which one (or both) should I modify? Thanks, - Corey On 4/14/20 6:26 AM, Doerr, Martin wrote: > Hi Corey, > > thanks for contributing it. Looks good to me. I'll run it through our > testing and let you know about the results. > > Best regards, > > Martin > > *From:*ppc-aix-port-dev *On > Behalf Of *Michihiro Horie > *Sent:* Freitag, 10. April 2020 10:48 > *To:* cjashfor at linux.ibm.com > *Cc:* hotspot-compiler-dev at openjdk.java.net; > ppc-aix-port-dev at openjdk.java.net > *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Hi Corey, > > Thank you for sharing your benchmarks. I confirmed your change reduced > the elapsed time of the benchmarks by more than 30% on my P9 node. Also, > I checked JTREG results, which look no problem. > > BTW, I cannot find further points of improvement in your change. 
> > Best regards, > Michihiro > > > ----- Original message ----- > From: "Corey Ashford" > > To: Michihiro Horie/Japan/IBM at IBMJP > Cc: hotspot-compiler-dev at openjdk.java.net > , > ppc-aix-port-dev at openjdk.java.net > , "Gustavo Romero" > > > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > Date: Fri, Apr 3, 2020 8:07 AM > > On 4/2/20 7:27 AM, Michihiro Horie wrote: >> Hi Corey, >> >> I'm not a reviewer, but I can run your benchmark in my local P9 node if >> you share it. >> >> Best regards, >> Michihiro > > The tests are somewhat hokey; I added the shifts to keep the compiler > from hoisting the code that it could predetermine the result. > > Here's the one for Long.reverseBytes(): > > import java.lang.*; > > class ReverseLong > { >     public static void main(String args[]) >     { >         long reversed, re_reversed; > long accum = 0; > long orig = 0x1122334455667788L; > long start = System.currentTimeMillis(); > for (int i = 0; i < 1_000_000_000; i++) { > // Try to keep java from figuring out stuff in advance > reversed = Long.reverseBytes(orig); > re_reversed = Long.reverseBytes(reversed); > if (re_reversed != orig) { >         System.out.println("Orig: " + String.format("%16x", orig) + > " Re-reversed: " + String.format("%16x", re_reversed)); > } > accum += orig; > orig = Long.rotateRight(orig, 3); > } > System.out.println("Elapsed time: " + > Long.toString(System.currentTimeMillis() - start)); > System.out.println("accum: " + Long.toString(accum)); >     } > } > > > And the one for Integer.reverseBytes(): > > import java.lang.*; > > class ReverseInt > { >     public static void main(String args[]) >     { >         
int reversed, re_reversed; > int orig = 0x11223344; > int accum = 0; > long start = System.currentTimeMillis(); > for (int i = 0; i < 1_000_000_000; i++) { > // Try to keep java from figuring out stuff in advance > reversed = Integer.reverseBytes(orig); > re_reversed = Integer.reverseBytes(reversed); > if (re_reversed != orig) { >         System.out.println("Orig: " + String.format("%08x", orig) + > " Re-reversed: " + String.format("%08x", re_reversed)); > } > accum += orig; > orig = Integer.rotateRight(orig, 3); > } > System.out.println("Elapsed time: " + > Long.toString(System.currentTimeMillis() - start)); > System.out.println("accum: " + Integer.toString(accum)); >     } > } > From eric.c.liu at arm.com Thu Apr 16 04:13:32 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 16 Apr 2020 04:13:32 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, Thanks for your review. > One comment: > >   (i >> 31) >>> 31 ==> i >>> 31 > > The shift count value is irrelevant here, isn't it? > > So, the transformation can be generalized to: > >   (i >> n) >>> 31 ==> i >>> 31 Yes. This match rule exactly could be more general. JBS: https://bugs.openjdk.java.net/browse/JDK-8242429 Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ [Tests] Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. No new failure found. JMH: A simple JMH case [1] on AArch64 and AMD64 machines. For AArch64, one platform has no obvious improvement, but on others the performance gain is 7.3%~32.7%. For AMD64, one platform has no obvious improvement, but on others the performance gain is 13.7%~32.4%. A simple test case [2] has checked the correctness for some corner cases. 
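[Editorial aside: the identity discussed in this thread can be sanity-checked with a small standalone Java program, shown below. This is illustrative only and not part of the webrev. For any shift count n in 0..31, `(i >> n) >>> 31` and `i >>> 31` produce the same value, because the arithmetic right shift preserves the sign bit that the final unsigned shift extracts.]

```java
// Standalone check of the transformation (i >> n) >>> 31 ==> i >>> 31.
// Arithmetic shift keeps the sign bit, so the trailing unsigned shift by 31
// extracts the same bit for every n in [0, 31].
public class SignExtractCheck {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, -42, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int i : samples) {
            for (int n = 0; n < 32; n++) {
                int before = (i >> n) >>> 31; // shape matched by the rule
                int after  = i >>> 31;        // simplified shape
                if (before != after) {
                    throw new AssertionError("mismatch for i=" + i + ", n=" + n);
                }
            }
        }
        System.out.println("ok");
    }
}
```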
[1] https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java [2] https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest.java Thanks, Eric From martin.doerr at sap.com Thu Apr 16 08:08:24 2020 From: martin.doerr at sap.com (Doerr, Martin) Date: Thu, 16 Apr 2020 08:08:24 +0000 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: Hi Corey, please use 2020 for both, the Oracle and the SAP copyright. Usually, both should be the same, but some people forget to update one of them. Best regards, Martin > -----Original Message----- > From: Corey Ashford > Sent: Donnerstag, 16. April 2020 03:35 > To: Doerr, Martin > Cc: Michihiro Horie ; hotspot-compiler- > dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Hello Martin, > > I'm having some trouble with my email server, so I'm having to reply to > your earlier post, but I saw your most recent post on the mailing list > archive. > > Thanks for reviewing and testing this patch. I went to look at the > copyright dates, and see two date ranges: one for Oracle and its > affiliates, and another for SAP. In the files I looked at, the end date > wasn't the same between the two. Which one (or both) should I modify? > > Thanks, > > - Corey > > On 4/14/20 6:26 AM, Doerr, Martin wrote: > > Hi Corey, > > > > thanks for contributing it. Looks good to me. I'll run it through our > > testing and let you know about the results. > > > > Best regards, > > > > Martin > > > > *From:*ppc-aix-port-dev > *On > > Behalf Of *Michihiro Horie > > *Sent:* Freitag, 10. 
April 2020 10:48 > > *To:* cjashfor at linux.ibm.com > > *Cc:* hotspot-compiler-dev at openjdk.java.net; > > ppc-aix-port-dev at openjdk.java.net > > *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of > > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > > > Hi Corey, > > > > Thank you for sharing your benchmarks. I confirmed your change reduced > > the elapsed time of the benchmarks by more than 30% on my P9 node. > Also, > > I checked JTREG results, which look no problem. > > > > BTW, I cannot find further points of improvement in your change. > > > > Best regards, > > Michihiro > > > > > > ----- Original message ----- > > From: "Corey Ashford" > > > > To: Michihiro Horie/Japan/IBM at IBMJP > > Cc: hotspot-compiler-dev at openjdk.java.net > > , > > ppc-aix-port-dev at openjdk.java.net > > , "Gustavo Romero" > > > > > Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of > > Long.reverseBytes() and Integer.reverseBytes() on Power9 > > Date: Fri, Apr 3, 2020 8:07 AM > > > > On 4/2/20 7:27 AM, Michihiro Horie wrote: > >> Hi Corey, > >> > >> I'm not a reviewer, but I can run your benchmark in my local P9 node if > >> you share it. > >> > >> Best regards, > >> Michihiro > > > > The tests are somewhat hokey; I added the shifts to keep the compiler > > from hoisting the code that it could predetermine the result. > > > > Here's the one for Long.reverseBytes(): > > > > import java.lang.*; > > > > class ReverseLong > > { > >     public static void main(String args[]) > >     { > >         long reversed, re_reversed; > > long accum = 0; > > long orig = 0x1122334455667788L; > > long start = System.currentTimeMillis(); > > for (int i = 0; i < 1_000_000_000; i++) { > > // Try to keep java from figuring out stuff in advance > > reversed = Long.reverseBytes(orig); > > re_reversed = Long.reverseBytes(reversed); > > if (re_reversed != orig) { > >         
System.out.println("Orig: " + String.format("%16x", orig) + > > " Re-reversed: " + String.format("%16x", re_reversed)); > > } > > accum += orig; > > orig = Long.rotateRight(orig, 3); > > } > > System.out.println("Elapsed time: " + > > Long.toString(System.currentTimeMillis() - start)); > > System.out.println("accum: " + Long.toString(accum)); > >     } > > } > > > > > > And the one for Integer.reverseBytes(): > > > > import java.lang.*; > > > > class ReverseInt > > { > >     public static void main(String args[]) > >     { > >         int reversed, re_reversed; > > int orig = 0x11223344; > > int accum = 0; > > long start = System.currentTimeMillis(); > > for (int i = 0; i < 1_000_000_000; i++) { > > // Try to keep java from figuring out stuff in advance > > reversed = Integer.reverseBytes(orig); > > re_reversed = Integer.reverseBytes(reversed); > > if (re_reversed != orig) { > >         System.out.println("Orig: " + String.format("%08x", orig) + > > " Re-reversed: " + String.format("%08x", re_reversed)); > > } > > accum += orig; > > orig = Integer.rotateRight(orig, 3); > > } > > System.out.println("Elapsed time: " + > > Long.toString(System.currentTimeMillis() - start)); > > System.out.println("accum: " + Integer.toString(accum)); > >     } > > } > > From Yang.Zhang at arm.com Thu Apr 16 08:58:15 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 16 Apr 2020 08:58:15 +0000 Subject: RFR(XS): 8242796: Fix client build failure Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR compiler phase/inlining events. C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. With this patch, x86 client build succeeds. But AArch64 client build still fails, which is caused by [1]. 
I have filed [2] for AArch64 client build failure and will submit another patch for that. [1] https://bugs.openjdk.java.net/browse/JDK-8241665 [2] https://bugs.openjdk.java.net/browse/JDK-8242905 Regards Yang From richard.reingruber at sap.com Thu Apr 16 09:57:22 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Thu, 16 Apr 2020 09:57:22 +0000 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java Message-ID: Hi, please review this trivial patch that adds a comma to the copyright header of the test ContinuousCallSiteTargetChange.java Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 The test still succeeds with the patch. The license check fails without and succeeds with the patch. sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java Thanks, Richard. From vladimir.x.ivanov at oracle.com Thu Apr 16 10:08:59 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:08:59 +0300 Subject: [15] RFR (S): 8242491: C2: assert(v2->bottom_type() == vt) failed: mismatch when creating MacroLogicV In-Reply-To: <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> References: <0e456498-e4d6-7147-b7db-2e2c3a539430@oracle.com> <41a3a12c-2361-fef8-bc81-9012b75a1c9e@oracle.com> Message-ID: Thanks for the reviews, Vladimir, Sandhya, and Jatin. Best regards, Vladimir Ivanov On 16.04.2020 02:29, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir K > > On 4/10/20 7:07 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242491/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242491 >> >> Asserts on input types for MacroLogicV are too strong. >> SuperWord pass can mix vectors of distinct subword types (byte and >> boolean or short and char). 
>> >> Though it's possible to explicitly check for such particular cases, >> the fix relaxes the assert even more and only verifies that inputs are >> of the same size (in bytes), so bitwise reinterpretation of vector >> values is safe. >> >> Testing: hs-precheckin-comp,hs-tier1,hs-tier2 >> >> Thanks! >> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu Apr 16 10:09:56 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:09:56 +0300 Subject: [15] RFR (S): 8242492: C2: Remove Matcher::vector_shift_count_ideal_reg() In-Reply-To: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> References: <8466f935-5ace-bb02-9258-44541582c00d@oracle.com> Message-ID: Thanks for the review, Vladimir. Best regards, Vladimir Ivanov On 16.04.2020 02:33, Vladimir Kozlov wrote: > Good. > > Thanks, > Vladimir K > > On 4/10/20 7:25 AM, Vladimir Ivanov wrote: >> http://cr.openjdk.java.net/~vlivanov/8242492/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8242492 >> >> Matcher::vector_shift_count_ideal_reg() was introduced specifically >> for x86 to communicate that only low 32 bits are used by vector shift >> instructions, so only those bits should be spilled when needed. >> >> Unfortunately, it is broken for LShiftCntV/RShiftCntV: Matcher doesn't >> capture overridden ideal_reg value and spills use bottom type instead. >> So, it causes a mismatch during RA. >> >> Fortunately, LShiftCntV/RShiftCntV are never spilled on x86. >> Considering how simple AD instructions for LShiftCntV/RShiftCntV are, >> RA prefers to rematerialize the value instead (which is a reg-to-reg >> move). >> >> I propose to simplify the implementation and completely remove >> Matcher::vector_shift_count_ideal_reg() along with additional special >> handling logic for LShiftCntV/RShiftCntV. >> >> Testing: hs-precheckin-comp, hs-tier1, hs-tier2 >> >> Thanks! 
>> >> Best regards, >> Vladimir Ivanov From vladimir.x.ivanov at oracle.com Thu Apr 16 10:28:38 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 13:28:38 +0300 Subject: Master Thesis Research Advice. JIT In-Reply-To: References: Message-ID: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> Hi Vladyslav, C2 has a number of aggressive optimizations which heavily rely on profiling data. It leads to numerous uncommon traps in the generated code. You can disable some of such optimizations, but there's no way to completely eliminate uncommon traps in the generated code: they are a core piece of the design. Have you tried switching to C1 instead? C1 doesn't rely on profiling data that much and use code patching techniques in place of uncommon traps. So, the generated code usually has complete coverage of the compiled method. Best regards, Vladimir Ivanov On 16.04.2020 01:05, Vladyslav Tkachuk wrote: > Hello, > > I am a Master's student at the University of Passau, Germany. > My master thesis research is concerned with detecting equivalent mutants in > Java. > The main research question is to use the Trivial Compiler Equivalency > technique. This means that we acquire Assembly code produced by Java JIT > compiler for initial and mutated source and then compare them. > > I have previously contacted Tobias Hartmann, who advised me to write here > regarding technical questions. I would like to ask you if there is any > solution to a problem I have. > > Last time Tobias recommended me to use Opto-Assembly to achieve my purpose. > It was a good hint and it helped me to get more precise data. > However, after doing some research I noticed that in some cases C2 compiler > unloaded the method code which I expected to find in assembly. As I found > out this was a part of deoptimization and the method code was meant to be > executed by the interpreter. 
> Here is an example of what I mean: > > {method} > - this oop: 0x000000000d2319c8 > - method holder: 'Rational' > - constants: 0x000000000d230cf8 constant pool [85] > {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8 > - access: 0x81000001 public > - name: 'toString' > - signature: '()Ljava/lang/String;' > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > some setup code > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > 02c movq RBP, RDX # spill > 02f movl RDX, #11 # int > nop # 3 bytes pad for loops and calls > *037 call,static wrapper for: uncommon_trap(reason='unloaded' > action='reinterpret' index='11')* > * # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ > L[5]=_ L[6]=_ L[7]=_* > * # OopMap{rbp=Oop off=60}* > 03c int3 # ShouldNotReachHere > 03c > > > This is a 'toString' method and as I could see and understand, there is no > actual method code, but only a call to it. > > I would like to know if it is possible to completely disable any > deoptimizations and consistently receive the full asm code? I consent that > it is not practical and hurts performance, but it is not a goal in this > scope. According to my observations, in most cases the method code is full, > but strangely here it did not work. I have tried to google any useful info, > unfortunately, I did not see anything helpful, despite the explanations > about what deoptimization is and its types. > > I would be grateful if you could shed some light on the issue. > Thanks in advance for any useful information. > > Best regards, > Vladyslav Tkachuk > From vladimir.x.ivanov at oracle.com Thu Apr 16 12:28:46 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 15:28:46 +0300 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: > Webrev:?http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ Looks good. Have you tested it through submit repo? 
Best regards, Vladimir Ivanov > [Tests] > Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. > No new failure found. > > JMH: A simple JMH case [1] on AArch64 and AMD64 machines. > > For AArch64, one platform has no obvious improvement, but on others the > performance gain is 7.3%~32.7%. > > For AMD64, one platform has no obvious improvement, but on others the > performance gain is 13.7%~32.4%. > > A simple test case [2] has checked the correctness for some corner > cases. > > [1] https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java > [2] https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest.java > > > Thanks, > Eric > From vladimir.x.ivanov at oracle.com Thu Apr 16 12:32:52 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 15:32:52 +0300 Subject: RFR (XXL): 8223347: Integration of Vector API (Incubator): General HotSpot changes In-Reply-To: References: Message-ID: <25a564a1-7f40-6988-060f-86b06e02ad21@oracle.com> Hi, Any more reviews, please? Especially, compiler and runtime-related changes. Thanks in advance! Best regards, Vladimir Ivanov On 04.04.2020 02:12, Vladimir Ivanov wrote: > Hi, > > Following up on review requests of API [0] and Java implementation [1] > for Vector API (JEP 338 [2]), here's a request for review of general > HotSpot changes (in shared code) required for supporting the API: > > > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/all.00-03/ > > > (First of all, to set proper expectations: since the JEP is still in > Candidate state, the intention is to initiate preliminary round(s) of > review to inform the community and gather feedback before sending out > final/official RFRs once the JEP is Targeted to a release.) > > Vector API (being developed in Project Panama [3]) relies on JVM support > to utilize optimal vector hardware instructions at runtime. 
It interacts > with JVM through intrinsics (declared in > jdk.internal.vm.vector.VectorSupport [4]) which expose vector operations > support in C2 JIT-compiler. > > As Paul wrote earlier: "A vector intrinsic is an internal low-level > vector operation. The last argument to the intrinsic is fall back > behavior in Java, implementing the scalar operation over the number of > elements held by the vector. Thus, if the intrinsic is not supported in > C2 for the other arguments then the Java implementation is executed (the > Java implementation is always executed when running in the interpreter > or for C1)." > > The rest of JVM support is about aggressively optimizing vector boxes to > minimize (ideally eliminate) the overhead of boxing for vector values. > It's a stop-gap solution for the vector box elimination problem until > inline classes arrive. Vector classes are value-based and in the longer > term will be migrated to inline classes once the support becomes available. > > Vector API talk from JVMLS'18 [5] contains brief overview of JVM > implementation and some details. > > Complete implementation resides in vector-unstable branch of panama/dev > repository [6]. > > Now to gory details (the patch is split in multiple "sub-webrevs"): > > =========================================================== > > (1) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/00.backend.shared/ > > > Ideal vector nodes for new operations introduced by Vector API. > > (Platform-specific back end support will be posted for review separately). > > =========================================================== > > (2) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/ > > > JVM Java interface (VectorSupport) and intrinsic support in C2. > > Vector instances are initially represented as VectorBox macro nodes and > "unboxing" is represented by VectorUnbox node. 
It simplifies vector box > elimination analysis and the nodes are expanded later right before EA pass. > > Vectors have 2-level on-heap representation: for the vector value > primitive array is used as a backing storage and it is encapsulated in a > typed wrapper (e.g., Int256Vector - vector of 8 ints - contains a int[8] > instance which is used to store vector value). > > Unless VectorBox node goes away, it needs to be expanded into an > allocation eventually, but it is a pure node and doesn't have any JVM > state associated with it. The problem is solved by keeping JVM state > separately in a VectorBoxAllocate node associated with VectorBox node > and use it during expansion. > > Also, to simplify vector box elimination, inlining of vector reboxing > calls (VectorSupport::maybeRebox) is delayed until the analysis is over. > > =========================================================== > > (3) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/02.vbox_elimination/ > > > Vector box elimination analysis implementation. (Brief overview: slides > #36-42 [5].) > > The main part is devoted to scalarization across safepoints and > rematerialization support during deoptimization. In C2-generated code > vector operations work with raw vector values which live in registers or > spilled on the stack and it allows to avoid boxing/unboxing when a > vector value is alive across a safepoint. As with other values, there's > just a location of the vector value at the safepoint and vector type > information recorded in the relevant nmethod metadata and all the > heavy-lifting happens only when rematerialization takes place. > > The analysis preserves object identity invariants except during > aggressive reboxing (guarded by -XX:+EnableAggressiveReboxing). > > (Aggressive reboxing is crucial for cases when vectors "escape": it > allocates a fresh instance at every escape point thus enabling original > instance to go away.) 
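[Editorial aside: the "intrinsic with a Java fallback" pattern described earlier in this message can be sketched in plain Java. The sketch below uses illustrative names, not the actual jdk.internal.vm.vector.VectorSupport signatures (those are in the webrev [4]); it shows only the shape: the last argument is the scalar fallback, which is what the interpreter and C1 execute when C2 does not intrinsify the call.]

```java
import java.util.function.IntBinaryOperator;

// Schematic of an intrinsic candidate whose last argument is the Java fallback.
// Names and signatures are illustrative; they are not the real VectorSupport API.
public class FallbackSketch {
    // C2 would replace calls to this method with vector IR when it supports
    // the (opcode, operand) combination; otherwise the fallback below runs.
    static int binaryOp(int opcode, int a, int b, IntBinaryOperator defaultImpl) {
        return defaultImpl.applyAsInt(a, b); // scalar fallback path
    }

    public static void main(String[] args) {
        // Hypothetical ADD opcode = 0; the lambda is the scalar implementation.
        int sum = binaryOp(0, 2, 3, (x, y) -> x + y);
        System.out.println(sum); // prints 5
    }
}
```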
> > =========================================================== > > (4) > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/03.module.hotspot/ > > > HotSpot changes for jdk.incubator.vector module. Vector support is > marked experimental and turned off by default. JEP 338 proposes the API > to be released as an incubator module, so a user has to specify > "--add-modules jdk.incubator.vector" on the command line to be able to > use it. > When user does that, JVM automatically enables Vector API support. > It improves usability (user doesn't need to separately "open" the API > and enable JVM support) while minimizing risks of destabilization from > new code when the API is not used. > > > That's it! Will be happy to answer any questions. > > And thanks in advance for any feedback! > > Best regards, > Vladimir Ivanov > > [0] > https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-March/065345.html > > > [1] > https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-April/041228.html > > [2] https://openjdk.java.net/jeps/338 > > [3] https://openjdk.java.net/projects/panama/ > > [4] > http://cr.openjdk.java.net/~vlivanov/panama/vector/jep338/hotspot.shared/webrev.00/01.intrinsics/src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java.html > > > [5] http://cr.openjdk.java.net/~vlivanov/talks/2018_JVMLS_VectorAPI.pdf > > [6] http://hg.openjdk.java.net/panama/dev/shortlog/92bbd44386e9 > >     $ hg clone http://hg.openjdk.java.net/panama/dev/ -b vector-unstable From jamsheed.c.m at oracle.com Thu Apr 16 13:12:49 2020 From: jamsheed.c.m at oracle.com (Jamsheed C M) Date: Thu, 16 Apr 2020 18:42:49 +0530 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" Message-ID: Hi all, As part of the enhancement requirement from truffle use case [1] OopMapValue was extended by 2 bits, this change will be automatically handled in c1 here [2]. 
There was a day one code[3] that handled this case before [2] covering more cases than Oop cases. But it seems this extension is not really useful for C1 java use case. So the earlier bailout is preserved with change in the comments. [4] Request for review JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ Best regards, Jamsheed [1] https://bugs.openjdk.java.net/browse/JDK-8231586 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 [4] http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html From vladimir.x.ivanov at oracle.com Thu Apr 16 13:29:55 2020 From: vladimir.x.ivanov at oracle.com (Vladimir Ivanov) Date: Thu, 16 Apr 2020 16:29:55 +0300 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" In-Reply-To: References: Message-ID: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> Looks good and trivial. Best regards, Vladimir Ivanov On 16.04.2020 16:12, Jamsheed C M wrote: > Hi all, > > As part of the enhancement requirement from truffle use case [1] > OopMapValue was extended by 2 bits,? this change will be automatically > handled in c1 here [2]. > > There was a day one code[3] that handled this case before [2] covering > more cases than Oop cases. But it seems this extension is not really > useful for C1 java use case. > > So the earlier bailout is preserved with change in the comments. 
[4] > > Request for review > > JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 > > webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ > > Best regards, > > Jamsheed > > [1] https://bugs.openjdk.java.net/browse/JDK-8231586 > > [2] > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 > > > [3] > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 > > > [4] > http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html > > From jamsheed.c.m at oracle.com Thu Apr 16 13:52:51 2020 From: jamsheed.c.m at oracle.com (Jamsheed C M) Date: Thu, 16 Apr 2020 19:22:51 +0530 Subject: RFR: 8237949: CTW: C1 compilation fails with "too many stack slots used" In-Reply-To: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> References: <3ec95b34-f40f-689e-5e97-369ad42b949a@oracle.com> Message-ID: <25345e8a-0c14-95e4-91af-41427a408f85@oracle.com> Hi Vladimir Ivanov, Thank you for the review Best regards, Jamsheed On 16/04/2020 18:59, Vladimir Ivanov wrote: > Looks good and trivial. > > Best regards, > Vladimir Ivanov > > On 16.04.2020 16:12, Jamsheed C M wrote: >> Hi all, >> >> As part of the enhancement requirement from truffle use case [1] >> OopMapValue was extended by 2 bits,? this change will be >> automatically handled in c1 here [2]. >> >> There was a day one code[3] that handled this case before [2] >> covering more cases than Oop cases. But it seems this extension is >> not really useful for C1 java use case. >> >> So the earlier bailout is preserved with change in the comments. 
[4] >> >> Request for review >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8237949 >> >> webrev: http://cr.openjdk.java.net/~jcm/8237949/webrev.00/ >> >> Best regards, >> >> Jamsheed >> >> [1] https://bugs.openjdk.java.net/browse/JDK-8231586 >> >> [2] >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.hpp#L341 >> >> >> [3] >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LinearScan.cpp#L246 >> >> >> [4] >> http://cr.openjdk.java.net/~jcm/8237949/webrev.00/src/hotspot/share/c1/c1_LinearScan.cpp.udiff.html >> >> From vladimir.kozlov at oracle.com Thu Apr 16 21:27:10 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 16 Apr 2020 14:27:10 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. I think you need to put whole method under checks: #if INCLUDE_JFR && COMPILER2_OR_JVMCI // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. Thanks, Vladimir On 4/16/20 1:58 AM, Yang Zhang wrote: > Hi, > > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 > Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ > > This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR > compiler phase/inlining events. > C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. > > With this patch, x86 client build succeeds. But AArch64 client build > still fails, which is caused by [1]. I have filed [2] for AArch64 > client build failure and will summit another patch for that. 
> > [1] https://bugs.openjdk.java.net/browse/JDK-8241665 > [2] https://bugs.openjdk.java.net/browse/JDK-8242905 > > Regards > Yang > From vladimir.kozlov at oracle.com Thu Apr 16 21:28:26 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 16 Apr 2020 14:28:26 -0700 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java In-Reply-To: References: Message-ID: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> Good and trivial. Thanks, Vladimir K On 4/16/20 2:57 AM, Reingruber, Richard wrote: > Hi, > > please review this trivial patch that adds a comma to the copyright header of the test > ContinuousCallSiteTargetChange.java > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 > > The test still succeeds with the patch. The license check fails without and succeeds with the patch. > > sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java > > Thanks, > Richard. > From Yang.Zhang at arm.com Fri Apr 17 06:34:20 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 06:34:20 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ This patch is a followup patch of previous discussion. https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/008740.html To make the intent clear, the scalar parameter name is changed to isrc, fsrc or dsrc based on its data type. The vector parameter name is changed to vsrc. And so does temp register. 
Testing: tier1 Regards Yang From eric.c.liu at arm.com Fri Apr 17 06:39:53 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Fri, 17 Apr 2020 06:39:53 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, Thanks for your review. Ningsheng will help me to submit it. Thanks, Eric From xxinliu at amazon.com Fri Apr 17 06:58:35 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 17 Apr 2020 06:58:35 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag Message-ID: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Hi, Corey and Vladimir, I recently went through vmSymbols.hpp/cpp. I think I understand your comments. Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. Even though I feel I understand the intrinsics mechanism of HotSpot better, I still need a clarification of JDK-8151779. There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). If there isn't any option, they are all available for the compilers. That makes sense because intrinsics are always beneficial. But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. Currently, the JDK provides developers 2 ways to control intrinsics. 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. 2. DisableIntrinsic="a,b,c" By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. But even putting the above 2 approaches together, we still can't precisely control an arbitrary intrinsic. If we want to enable an intrinsic which is under control of InlineMathNatives but keep the others disabled, it's impossible now. [please correct me if I am wrong here]. I think that's the motivation JDK-8151779 tried to address.
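To make the comparison concrete, a comma-separated enable/disable list of vmIntrinsics::IDs could be parsed roughly as follows. This is a toy sketch only: ControlLists, parse_control_list and is_intrinsic_enabled are invented names for illustration, not HotSpot code. The one rule taken from this thread is that a disable entry prevails when an id appears on both sides.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Parsed form of a list such as "+_dabs,-_fabs,-_getClass".
struct ControlLists {
  std::set<std::string> enabled;
  std::set<std::string> disabled;
};

ControlLists parse_control_list(const std::string& value) {
  ControlLists lists;
  std::istringstream in(value);
  std::string item;
  while (std::getline(in, item, ',')) {
    if (item.empty()) continue;
    if (item[0] == '-') {
      lists.disabled.insert(item.substr(1));
    } else if (item[0] == '+') {
      lists.enabled.insert(item.substr(1));
    } else {
      lists.enabled.insert(item);  // no prefix means enable
    }
  }
  return lists;
}

// The disable list prevails if an id is present in both lists.
bool is_intrinsic_enabled(const ControlLists& lists, const std::string& id,
                          bool enabled_by_default) {
  if (lists.disabled.count(id)) return false;
  if (lists.enabled.count(id)) return true;
  return enabled_by_default;
}
```

With this shape, one flag can both enable an intrinsic that a coarse-grained option left off and disable a buggy one, which is exactly the gap described above.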
If we provide a new option EnableIntrinsic and give it the lowest priority, then we can precisely control any intrinsic. Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. What do you think? Thanks, --lx On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. On 4/13/20 10:33 AM, Liu, Xin wrote: > Hi, compiler developers, > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > My change provide 2 new features: > 1) a shorthand to enable/disable intrinsics. > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > If the tailing symbol is missing, it means enable. > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > 2) provide a set of macro to declare intrinsic options > Developers declare once in intrinsics.hpp and macros will take care all other places. > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal.
> Great idea, though to be consistent with the original syntax, I think the +/- should be in front of the name: -XX:UseIntrinsics=-AESCTR,+CRC32C,... > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? Some (many?) intrinsic options turn on more than one .ad instruct instrinsic, or library instrinsics at the same time. I think that's why the plural is there. Also, consistently adding the plural allows you to add more capabilities to a flag that initially only had one intrinsic without changing the plurality (and thus backward compatibility). Regards, - Corey From Yang.Zhang at arm.com Fri Apr 17 08:37:16 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 08:37:16 +0000 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Vladimir I update the patch according to your comment. http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ These checks are needed. #if INCLUDE_JFR && COMPILER2_OR_JVMCI #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. 
Regards Yang -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov Sent: Friday, April 17, 2020 5:27 AM To: hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(XS): 8242796: Fix client build failure Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. I think you need to put whole method under checks: #if INCLUDE_JFR && COMPILER2_OR_JVMCI // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. Thanks, Vladimir On 4/16/20 1:58 AM, Yang Zhang wrote: > Hi, > > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 > Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ > > This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR > compiler phase/inlining events. > C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. > > With this patch, x86 client build succeeds. But AArch64 client build > still fails, which is caused by [1]. I have filed [2] for AArch64 > client build failure and will summit another patch for that. > > [1] https://bugs.openjdk.java.net/browse/JDK-8241665 > [2] https://bugs.openjdk.java.net/browse/JDK-8242905 > > Regards > Yang > From aph at redhat.com Fri Apr 17 08:42:10 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 17 Apr 2020 09:42:10 +0100 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: On 4/17/20 7:34 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 > Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ > > This patch is a followup patch of previous discussion. 
> https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/008740.html > > To make the intent clear, the scalar parameter name is changed to isrc, fsrc or dsrc based on > its data type. The vector parameter name is changed to vsrc. And so does temp register. Thanks, that's much nicer. I haven't been able to check every substitution, though. I'm not quite sure about how to do that. Is all this stuff covered by our test cases? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Fri Apr 17 09:13:11 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 09:13:11 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: Hi Andrew Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. In this directory, there are also some test cases about reduction operations, which is added in [1]. https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtreg/compiler/loopopts/superword [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Regards Yang -----Original Message----- From: Andrew Haley Sent: Friday, April 17, 2020 4:42 PM To: Yang Zhang ; aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear On 4/17/20 7:34 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8242482 > Webrev: http://cr.openjdk.java.net/~yzhang/8242482/webrev.00/ > > This patch is a followup patch of previous discussion. > https://mail.openjdk.java.net/pipermail/aarch64-port-dev/2020-April/00 > 8740.html > > To make the intent clear, the scalar parameter name is changed to > isrc, fsrc or dsrc based on its data type. 
The vector parameter name is changed to vsrc. And so does temp register. Thanks, that's much nicer. I haven't been able to check every substitution, though. I'm not quite sure about how to do that. Is all this stuff covered by our test cases? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Yang.Zhang at arm.com Fri Apr 17 09:14:24 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 17 Apr 2020 09:14:24 +0000 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: Hi Andrew Ping it again. Could you please help to review this? Regards Yang -----Original Message----- From: aarch64-port-dev On Behalf Of Yang Zhang Sent: Friday, April 10, 2020 10:53 AM To: aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242070 Webrev: http://cr.openjdk.java.net/~yzhang/8242070/webrev.00/ In JDK-8238690, it unified IR shape for vector shifts by scalar and always used ShiftV src (ShiftCntV shift) When shift is scalar, the following IR nodes are generated. scalar_shift | src ShiftCntV | / | / ShiftV But when implementing this on AArch64, there is an issue in match rule of vector shift right with imm shift for short type. match(Set dst (RShiftVS src (LShiftCntV shift))); LShiftCntV should be RShiftCntV here. 
Test case:

    public static void shiftR(short[] a, short[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (short)(a[i] >> 2);
        }
    }

IR nodes:

       imm:2
         |        LoadVector
    RShiftCntV        |
         |           /
         |          /
     RShiftVS

C2 assembly generated:

Before:

    0x0000ffffac563764: orr w11, wzr, #0x2
    0x0000ffffac563768: dup v16.16b, w11              -------- vshiftcnt16B
    0x0000ffffac5637a8: ldr q24, [x18, #16]
    0x0000ffffac5637ac: neg v25.16b, v16.16b          ------
    0x0000ffffac5637b0: sshl v24.8h, v24.8h, v25.8h   ------ vsra8S
    0x0000ffffac5637b8: str q24, [x14, #16]

"match(Set dst (RShiftVS src (LShiftCntV shift)));" matching fails. RShiftCntV and RShiftVS are matched separately by vshiftcnt16B and vsra8S.

After:

    0x0000ffffac563808: ldr q16, [x15, #16]
    0x0000ffffac56380c: sshr v16.8h, v16.8h, #2
    0x0000ffffac563814: str q16, [x14, #16]

"match(Set dst (RShiftVS src (RShiftCntV shift)));" matching succeeds.

Performance: the JMH test case is attached in JBS.

    Before:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  66.964 ± 0.052  us/op

    After:
    Benchmark               Mode  Cnt   Score   Error  Units
    TestVect.testVectShift  avgt   10  56.156 ± 0.053  us/op

Testing: tier1 Pass and no new failure. Regards Yang From richard.reingruber at sap.com Fri Apr 17 14:55:01 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 17 Apr 2020 14:55:01 +0000 Subject: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java In-Reply-To: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> References: <9b95031f-668b-449e-b779-b59980364c24@oracle.com> Message-ID: Thank you, Vladimir. Richard. -----Original Message----- From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov Sent: Donnerstag, 16. April 2020 23:28 To: hotspot-compiler-dev at openjdk.java.net Subject: Re: [15] RFR(T) 8242793: Incorrect copyright header in ContinuousCallSiteTargetChange.java Good and trivial.
Thanks, Vladimir K On 4/16/20 2:57 AM, Reingruber, Richard wrote: > Hi, > > please review this trivial patch that adds a comma to the copyright header of the test > ContinuousCallSiteTargetChange.java > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8242793/webrev.0/ > Bug: https://bugs.openjdk.java.net/browse/JDK-8242793 > > The test still succeeds with the patch. The license check fails without and succeeds with the patch. > > sh make/scripts/lic_check.sh -gpl test/hotspot/jtreg/compiler/jsr292/ContinuousCallSiteTargetChange.java > > Thanks, > Richard. > From rwestrel at redhat.com Fri Apr 17 15:51:13 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 17 Apr 2020 17:51:13 +0200 Subject: RFR(XS): 8242502: UnexpectedDeoptimizationTest.java failed "assert(phase->type(obj)->isa_oopptr()) failed: only for oop input" Message-ID: <878siu9klq.fsf@redhat.com> https://bugs.openjdk.java.net/browse/JDK-8242502 http://cr.openjdk.java.net/~roland/8242502/webrev.00/ I wasn't able to reproduce that failure (neither by running the test or with the replay file) but I suspect the assert fails because it encounters a unexpected top node. Roland. From vladimir.kozlov at oracle.com Fri Apr 17 19:07:17 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 12:07:17 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Yang On 4/17/20 1:37 AM, Yang Zhang wrote: > Hi Vladimir > > I update the patch according to your comment. > http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ > > These checks are needed. > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. Yes, I agree that additional #ifdef COMPILER2 is needed. 
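The nested guard being agreed on here can be illustrated with a small compilable sketch. The macro names below are toy stand-ins (the real ones are INCLUDE_JFR, COMPILER2_OR_JVMCI and COMPILER2), and the function body is invented; it only shows why the outer #if keeps the symbol out of a client build entirely while the inner #ifdef fences the C2-only step.

```cpp
#include <cassert>

// Toy stand-ins for the build-feature macros; a real build would get these
// values from the configured JVM features, not hard-coded defines.
#define TOY_INCLUDE_JFR 1
#define TOY_COMPILER2   1
#define TOY_JVMCI       0

#if TOY_INCLUDE_JFR && (TOY_COMPILER2 || TOY_JVMCI)
// The whole method exists only when JFR plus at least one of C2/JVMCI is
// built, so a C1-only (client) build never references it at all.
int register_phasetype_serializer() {
  int registrations = 1;   // base registration, shared by C2 and JVMCI
#if TOY_COMPILER2
  registrations += 1;      // extra step that would touch C2-only code
#endif // TOY_COMPILER2
  return registrations;
}
#endif // TOY_INCLUDE_JFR && (TOY_COMPILER2 || TOY_JVMCI)
```

Flipping TOY_COMPILER2 to 0 and TOY_JVMCI to 1 keeps the function but drops the C2-only step, which mirrors the --with-jvm-features=-compiler2 configuration discussed in this thread.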
The only comment I have is that you could maybe include the compiler_c2 check under that #ifdef, leaving the #endif at the same place:

    + #ifdef COMPILER2
      } else if (compiler_type == compiler_c2) {
        first_registration = false;
    + #endif // COMPILER2
      }

Thanks, Vladimir > > Regards > Yang > > -----Original Message----- > From: hotspot-compiler-dev On Behalf Of Vladimir Kozlov > Sent: Friday, April 17, 2020 5:27 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. > I think you need to put whole method under checks: > > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. > > Thanks, > Vladimir > > On 4/16/20 1:58 AM, Yang Zhang wrote: >> Hi, >> >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >> >> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >> compiler phase/inlining events. >> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >> >> With this patch, x86 client build succeeds. But AArch64 client build >> still fails, which is caused by [1]. I have filed [2] for AArch64 >> client build failure and will summit another patch for that.
>> >> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >> >> Regards >> Yang >> From vladimir.kozlov at oracle.com Fri Apr 17 23:58:10 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 16:58:10 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement Message-ID: https://bugs.openjdk.java.net/browse/JDK-8242357 CHECK macros can't be used on a return statement - they expand to include code after the return [1] and so have no effect. Fix:

    src/hotspot/share/jvmci/jvmciEnv.hpp
    @@ -262,7 +262,8 @@
       char* as_utf8_string(JVMCIObject str, char* buf, int buflen);

       JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) {
    -    return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject()));
    +    JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject()));
    +    return s;
       }

I tried to find similar cases but it was the only one. Clang -Wunreachable-code-aggressive does not catch this case. Tested hs-tier1,hs-tier3-graal Thanks, Vladimir [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From xxinliu at amazon.com Sat Apr 18 00:36:43 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Sat, 18 Apr 2020 00:36:43 +0000 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: References: Message-ID: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> LGTM. I used to backport a similar change (exceptions.hpp) to jdk8u. I also used a regex to scan the whole source code; I think it's the only place in hotspot. Thanks, --lx On 4/17/20, 5:02 PM, "hotspot-compiler-dev on behalf of Vladimir Kozlov" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
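The pitfall behind the JVMCI_CHECK_ fix discussed in this thread, a CHECK-style macro expanding into dead code when used on a return statement, can be reproduced with a toy macro. Everything below is an invented imitation for illustration; it is not the real HotSpot/JVMCI TRAPS machinery.

```cpp
#include <cassert>

bool g_pending = false;  // stands in for the thread's pending-exception flag

// TOY_CHECK_(v) supplies the "thread" argument and appends an exception test
// after the enclosing statement, mimicking how CHECK-style macros expand.
#define TOY_CHECK_(v) g_pending); if (g_pending) return (v); (void)(0

int failing_call(bool /*thread*/) {
  g_pending = true;  // simulate the callee raising an exception
  return 42;
}

// Wrong: the expanded "if (g_pending) return (-1);" lands after the return
// statement, is unreachable, and the pending exception is never checked here.
int wrong_use() {
  return failing_call(TOY_CHECK_(-1));
}

// Right (the shape of the fix): bind the result first so the expanded check
// actually runs before this function returns.
int right_use() {
  int r = failing_call(TOY_CHECK_(-1));
  return r;
}
```

wrong_use() returns 42 and silently leaves the exception pending, while right_use() notices it and returns the default value, which is the same difference the create_string patch makes.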
https://bugs.openjdk.java.net/browse/JDK-8242357 CHECK macros can't be used on a return statement - they expand to include code after the return [2] and so have no affect. Fix: src/hotspot/share/jvmci/jvmciEnv.hpp @@ -262,7 +262,8 @@ char* as_utf8_string(JVMCIObject str, char* buf, int buflen); JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); + return s; } I tried to find similar cases but it was the only one. Clang -Wunreachable-code-aggressive does not catch this case. Tested hs-tier1,hs-tier3-graal Thanks, Vladimir [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From vladimir.kozlov at oracle.com Sat Apr 18 00:43:01 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 17:43:01 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> References: <395A687F-8883-4210-BA6E-AE83B32D76E9@amazon.com> Message-ID: <58abe636-27d9-b027-b8b1-8f7ed862d7bc@oracle.com> Thank you, Xin Vladimir K On 4/17/20 5:36 PM, Liu, Xin wrote: > LGTM. I used to backport a similar change (exceptions.hpp) to jdk8u. > I also use regex to scan the whole source code, I think it?s the only place in hotspot. > > Thanks, > --lx > > ?On 4/17/20, 5:02 PM, "hotspot-compiler-dev on behalf of Vladimir Kozlov" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > https://bugs.openjdk.java.net/browse/JDK-8242357 > > CHECK macros can't be used on a return statement - they expand to include code after the return [2] and so have no affect. 
> > Fix: > > src/hotspot/share/jvmci/jvmciEnv.hpp > @@ -262,7 +262,8 @@ > char* as_utf8_string(JVMCIObject str, char* buf, int buflen); > > JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { > - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + return s; > } > > I tried to find similar cases but it was the only one. > Clang -Wunreachable-code-aggressive does not catch this case. > > Tested hs-tier1,hs-tier3-graal > > Thanks, > Vladimir > > [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 > From vladimir.kozlov at oracle.com Sat Apr 18 01:44:55 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 17 Apr 2020 18:44:55 -0700 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Message-ID: I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace existing DisableIntrinsic. Note, we can start deprecating Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently go through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know intrinsics mechanism of hotspot better, I still need a clarification of JDK- 8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no any option, they are all available for compilers. That makes sense because intrinsics are always beneficial. 
> But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. It seems we are trying to put these 2 different ways into one flag, which may be a mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsic ids. Maybe we can have a 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases.
> If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > ?On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > > > My change provide 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. > > If the tailing symbol is missing, it means enable. > > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > > > 2) provide a set of macro to declare intrinsic options > > Developers declare once in intrinsics.hpp and macros will take care all other places. > > Here are example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ion Lam is overhauling jvm options. I am thinking how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. 
> > If we do that after VM_Version::initialize, some intrinsics may cause JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is same as -XX:+UseXXXIntrinsics, from user's perspective, it's not straightforward when JVM overrides what users specify implicitly. It's dilemma here, stable jvm or fidelity of cmdline. What do you think? > > > > Another problem is naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this name convention? > > Some (many?) intrinsic options turn on more than one .ad instruct > instrinsic, or library instrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). > > Regards, > > - Corey > > From xxinliu at amazon.com Sat Apr 18 02:19:11 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Sat, 18 Apr 2020 02:19:11 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> Message-ID: <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> Hi, Vladimir, Thanks for the clarification. Oh, yes, it's theoretically possible, but it's tedious. I was wrong on that point. I think I got your point. ControlIntrinsics will make developer's life easier. I will implement it. Thanks, --lx On 4/17/20, 6:46 PM, "Vladimir Kozlov" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace existing DisableIntrinsic. Note, we can start deprecating Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently go through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know intrinsics mechanism of hotspot better, I still need a clarification of JDK- 8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no any option, they are all available for compilers. That makes sense because intrinsics are always beneficial. > But there're reasons we need to disable a subset of them. A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. Eg. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. We seems are trying to put these 2 different ways into one flag which may be mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is a similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsics ids. May be we can have 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases. 
> If we want to enable an intrinsic which is under control of InlineMathNatives but keep others disable, it's impossible now. [please correct if I am wrong here]. You can disable all other from 321 intrinsics with DisableIntrinsic flag which is very tedious I agree. > I think that the motivation JDK-8151779 tried to solve. The idea is that instead of flags we use to control particular intrinsics depending on CPU we will use vmIntrinsics::IDs or other tables as you showed in your changes. It will require changes in vm_version_ codes. > > If we provide a new option EnableIntrinsic and put it least priority, then we can precisely control any intrinsic. > Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." > > "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. > If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option. Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep UseXXXIntrinsics options because many applications may be using them. > > > > My change provide 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional tailing symbol + or '-' denotes enabling or disabling. 
> > If the trailing symbol is missing, it means enable. > > Eg. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:+UseMathExactIntrinsics > > > > 2) provide a set of macros to declare intrinsic options > > Developers declare once in intrinsics.hpp and the macros will take care of all other places. > > Here is an example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ioi Lam is overhauling jvm options. I am thinking about how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > > If we do that after VM_Version::initialize, some intrinsics may cause a JVM crash. Eg. +UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is the same as -XX:+UseXXXIntrinsics, from the user's perspective, it's not straightforward when the JVM implicitly overrides what users specify. It's a dilemma: a stable jvm or fidelity to the cmdline. What do you think? > > > > Another problem is the naming convention. Almost all intrinsics options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this naming convention? > > Some (many?) intrinsic options turn on more than one .ad instruct > intrinsic, or library intrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). 
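[Editorial note: the shorthand expansion Xin describes, using the leading +/- placement Corey suggests, could be sketched roughly as follows. The class and method names are invented; this only illustrates the idea, not the actual argument-processing code in HotSpot.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: expand a list like "-AESCTR,+CRC32C,MathExact" into individual
// UseXXXIntrinsics flag settings. A leading '+' (or no sign at all)
// enables, a leading '-' disables. Names are invented for illustration.
public class UseIntrinsicsParser {
    // Returns a map from flag name (e.g. "UseCRC32CIntrinsics") to value.
    public static Map<String, Boolean> expand(String list) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String item : list.split(",")) {
            item = item.trim();
            if (item.isEmpty()) continue;
            boolean enable = true;                    // missing sign means enable
            if (item.charAt(0) == '+' || item.charAt(0) == '-') {
                enable = item.charAt(0) == '+';
                item = item.substring(1);
            }
            flags.put("Use" + item + "Intrinsics", enable);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Expands to -UseAESCTRIntrinsics, +UseCRC32CIntrinsics,
        // -UseCRC32Intrinsics, +UseMathExactIntrinsics.
        System.out.println(expand("-AESCTR,+CRC32C,-CRC32,MathExact"));
    }
}
```

As the thread notes, any real implementation would have to run this expansion before VM_Version::initialize so that platform-specific code can still veto unsupported intrinsics.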
> > Regards, > > - Corey > > From david.holmes at oracle.com Sat Apr 18 13:34:11 2020 From: david.holmes at oracle.com (David Holmes) Date: Sat, 18 Apr 2020 23:34:11 +1000 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: References: Message-ID: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> Looks good! Thanks, David On 18/04/2020 9:58 am, Vladimir Kozlov wrote: > https://bugs.openjdk.java.net/browse/JDK-8242357 > > CHECK macros can't be used on a return statement - they expand to > include code after the return [1] and so have no effect. > > Fix: > > src/hotspot/share/jvmci/jvmciEnv.hpp > @@ -262,7 +262,8 @@ > char* as_utf8_string(JVMCIObject str, char* buf, int buflen); > > JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { > - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); > + JVMCIObject s = create_string(str->as_C_string(), > JVMCI_CHECK_(JVMCIObject())); > + return s; > } > > I tried to find similar cases but it was the only one. > Clang -Wunreachable-code-aggressive does not catch this case. > > Tested hs-tier1,hs-tier3-graal > > Thanks, > Vladimir > > [1] > http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 > From vladimir.kozlov at oracle.com Sat Apr 18 14:41:19 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Sat, 18 Apr 2020 07:41:19 -0700 Subject: [15] RFR(T) 8242357: [JVMCI] Incorrect use of JVMCI_CHECK_ on return statement In-Reply-To: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> References: <4b477c0f-22c8-5271-fb1a-96e1ad8c5cba@oracle.com> Message-ID: <1f8bf985-1290-088c-1982-1d058076cbcb@oracle.com> Thank you, David Vladimir On 4/18/20 6:34 AM, David Holmes wrote: > Looks good! 
> > Thanks, > David > > On 18/04/2020 9:58 am, Vladimir Kozlov wrote: >> https://bugs.openjdk.java.net/browse/JDK-8242357 >> >> CHECK macros can't be used on a return statement - they expand to include code after the return [1] and so have no >> effect. >> >> Fix: >> >> src/hotspot/share/jvmci/jvmciEnv.hpp >> @@ -262,7 +262,8 @@ >> char* as_utf8_string(JVMCIObject str, char* buf, int buflen); >> >> JVMCIObject create_string(Symbol* str, JVMCI_TRAPS) { >> - return create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); >> + JVMCIObject s = create_string(str->as_C_string(), JVMCI_CHECK_(JVMCIObject())); >> + return s; >> } >> >> I tried to find similar cases but it was the only one. >> Clang -Wunreachable-code-aggressive does not catch this case. >> >> Tested hs-tier1,hs-tier3-graal >> >> Thanks, >> Vladimir >> >> [1] http://hg.openjdk.java.net/jdk/jdk/file/90882ba9f488/src/hotspot/share/jvmci/jvmciExceptions.hpp#l48 From tkachuk.vladyslav at gmail.com Sun Apr 19 19:56:57 2020 From: tkachuk.vladyslav at gmail.com (Vladyslav Tkachuk) Date: Sun, 19 Apr 2020 21:56:57 +0200 Subject: Master Thesis Research Advice. JIT In-Reply-To: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> References: <9765f74c-bfd5-19da-a343-6efccde73195@oracle.com> Message-ID: Hello Vladimir, Thank you for your reply. I have considered all compiler levels from C1 and C2, but the main problem was that the code produced by them has too many aspects that make it hard to analyze. The point of my task is Trivial Compiler Equivalence, meaning that I literally compare the Asm code for a source class and mutants line by line and I expect that the same Java code produces the same Asm code. However, the code produced by C1 contains many addresses which vary every time the code is run. That is why I switched to Opto-Asm which has much less "variability". Best regards, Vladyslav Tkachuk On Thu, 16 Apr 2020 at 
12:26, Vladimir Ivanov wrote: > Hi Vladyslav, > > C2 has a number of aggressive optimizations which heavily rely on > profiling data. It leads to numerous uncommon traps in the generated > code. You can disable some such optimizations, but there's no way to > completely eliminate uncommon traps in the generated code: they are a > core piece of the design. > > Have you tried switching to C1 instead? C1 doesn't rely on profiling > data that much and uses code patching techniques in place of uncommon > traps. So, the generated code usually has complete coverage of the > compiled method. > > Best regards, > Vladimir Ivanov > > On 16.04.2020 01:05, Vladyslav Tkachuk wrote: > > Hello, > > > > I am a Master's student at the University of Passau, Germany. > > My master thesis research is concerned with detecting equivalent mutants > in > > Java. > > The main research question is to use the Trivial Compiler Equivalency > > technique. This means that we acquire the Assembly code produced by the Java JIT > > compiler for the initial and mutated source and then compare them. > > > > I have previously contacted Tobias Hartmann, who advised me to write here > > regarding technical questions. I would like to ask you if there is any > > solution to a problem I have. > > > > Last time Tobias recommended me to use Opto-Assembly to achieve my > purpose. > > It was a good hint and it helped me to get more precise data. > > However, after doing some research I noticed that in some cases the C2 > compiler > > unloaded the method code which I expected to find in the assembly. As I found > > out, this was a part of deoptimization and the method code was meant to be > > executed by the interpreter. 
> > Here is an example of what I mean: > > > > {method} > > - this oop: 0x000000000d2319c8 > > - method holder: 'Rational' > > - constants: 0x000000000d230cf8 constant pool [85] > > {0x000000000d230d00} for 'Rational' cache=0x000000000d231cd8 > > - access: 0x81000001 public > > - name: 'toString' > > - signature: '()Ljava/lang/String;' > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > some setup code > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > 02c movq RBP, RDX # spill > > 02f movl RDX, #11 # int > > nop # 3 bytes pad for loops and calls > > *037 call,static wrapper for: uncommon_trap(reason='unloaded' > > action='reinterpret' index='11')* > > * # Rational::toString @ bci:0 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ > L[4]=_ > > L[5]=_ L[6]=_ L[7]=_* > > * # OopMap{rbp=Oop off=60}* > > 03c int3 # ShouldNotReachHere > > 03c > > > > > > This is a 'toString' method and as I could see and understand, there is > no > > actual method code, but only a call to it. > > > > I would like to know if it is possible to completely disable any > > deoptimizations and consistently receive the full asm code? I concede > that > > it is not practical and hurts performance, but it is not a goal in this > > scope. According to my observations, in most cases the method code is full, > but strangely here it did not work. I have tried to google any useful info, > unfortunately, I did not see anything helpful, beyond the explanations > > about what deoptimization is and its types. > > > > I would be grateful if you could shed some light on the issue. > > Thanks in advance for any useful information. 
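[Editorial note: a reason='unloaded' trap like the one in the dump above is typically emitted when a class or constant-pool entry the method references had not yet been resolved at the time C2 compiled it, so the compiler plants a trap instead of real code. The usual remedy is to execute the path in the interpreter first so everything is resolved before compilation. The sketch below illustrates that idea with an invented stand-in for Rational::toString; the iteration count and the flags in the comment are illustrative assumptions, not a guaranteed way to suppress all deoptimization.]

```java
// Warm up a method so the constant-pool entries it touches are resolved
// before the JIT compiles it, avoiding reason='unloaded' traps.
// Could be run with e.g.: java -XX:-TieredCompilation Warmup
public class Warmup {
    // Invented stand-in for the Rational::toString from the thread.
    static String describe(long num, long den) {
        return num + "/" + den;
    }

    public static void main(String[] args) {
        String last = "";
        // Interpreted executions resolve everything describe() needs, so a
        // later C2 compilation sees all referenced classes as loaded.
        for (int i = 0; i < 20_000; i++) {
            last = describe(i, i + 1);
        }
        System.out.println(last);
    }
}
```

With the profile saturated this way, the compiled version of describe() should contain the full method body rather than a trap call, which is usually enough for line-by-line asm comparison.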
> > > > Best regards, > > Vladyslav Tkachuk > > > From kuaiwei.kw at alibaba-inc.com Mon Apr 20 02:19:20 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Mon, 20 Apr 2020 10:19:20 +0800 Subject: =?UTF-8?B?UmU6IFJGUjogaGVhcGJhc2UgcmVnaXN0ZXIgY2FuIGJlIGFsbG9jYXRlZCBpbiBjb21wcmVz?= =?UTF-8?B?c2VkIG1vZGU=?= In-Reply-To: <781CB090-0386-4D32-8465-8238E516789B@amazon.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com>, <781CB090-0386-4D32-8465-8238E516789B@amazon.com> Message-ID: <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Thanks for all feedback. I think this patch has enough review and can be merged. Hi Pengfei, I need help to push it. Could you help to merge it? Thanks, Kuai Wei ------------------------------------------------------------------ From:Liu, Xin Send Time:2020?4?15?(???) 11:17 To:??(??) ; Pengfei Li ; Andrew Haley ; hotspot compiler Cc:nd Subject:Re: RFR: heapbase register can be allocated in compressed mode Hi, Wei, LGTM. Thanks. --lx From: Kuai Wei Reply-To: Kuai Wei Date: Tuesday, April 14, 2020 at 6:26 AM To: "Liu, Xin" , Pengfei Li , Andrew Haley , hotspot compiler Cc: nd Subject: RE: RFR: heapbase register can be allocated in compressed mode Hi Xin and Pengfei, Thanks for your comments. I checked change in reinit_heapbase and decide to revert it since it's no harm to set rheapbase. I also made change in verify_heapbase in case someone want to enable this check again. The new patch is in http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It has passed tiered 1 test without new failure. 
Thanks, Kuai Wei ------------------------------------------------------------------ From:Liu, Xin Send Time: Tue, Apr 14, 2020 17:37 To: Kuai Wei ; Pengfei Li ; Andrew Haley ; hotspot compiler Cc:nd Subject:Re: RFR: heapbase register can be allocated in compressed mode Hi, Pengfei and Kuai, Thanks for pointing it out. Aarch64.ad does use MacroAssembler::encode_heap_oop, which refers to rheapbase. That's why we can't use rheapbase as a GP register in C2. Got it. thanks! --lx On 4/14/20, 1:39 AM, "Pengfei Li" wrote: Hi Xin, > I read JDK-8234794 but I don't understand why that change involves r27 > and CompressedOop. JDK-8234794 is the metaspace reservation fix. It also simplifies the encoding/decoding of compressed class pointers. Before that patch, r27 is used for both compressed oops and compressed class pointers. At that time we had to consider whether r27 is allocatable when compressed class pointers is on. But after that patch, r27 is for compressed oops only. That's why I could simplify my JDK-8233743 patch after JDK-8234794 was merged. 
-- Thanks, Pengfei From Pengfei.Li at arm.com Mon Apr 20 04:32:00 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 04:32:00 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com>, <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: Hi Wei, > Thanks for all feedback. I think this patch has enough review and can be merged. > > Hi Pengfei, > > I need help to push it. Could you help to merge it? I'm not a reviewer, and not sure whether your updated webrev.01 [1] still requires an official reviewer to confirm. Maybe Andrew Haley or other AArch64 reviewers can help? [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ -- Thanks, Pengfei From Yang.Zhang at arm.com Mon Apr 20 06:30:47 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Mon, 20 Apr 2020 06:30:47 +0000 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Hi Vladimir Thanks for your comment. I update the patch. http://cr.openjdk.java.net/~yzhang/8242796/webrev.02/ Regards Yang -----Original Message----- From: Vladimir Kozlov Sent: Saturday, April 18, 2020 3:07 AM To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR(XS): 8242796: Fix client build failure Hi Yang On 4/17/20 1:37 AM, Yang Zhang wrote: > Hi Vladimir > > I update the patch according to your comment. > http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ > > These checks are needed. 
> #if INCLUDE_JFR && COMPILER2_OR_JVMCI > #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. Yes, I agree that additional #ifdef COMPILER2 is needed. The only comment I have is to may be include compiler_c2 check under that #ifdef and leaving #endif at the same place: + #ifdef COMPILER2 } else if (compiler_type == compiler_c2) { first_registration = false; + #endif // COMPILER2 } Thanks, Vladimir > > Regards > Yang > > -----Original Message----- > From: hotspot-compiler-dev > On Behalf Of Vladimir > Kozlov > Sent: Friday, April 17, 2020 5:27 AM > To: hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. > I think you need to put whole method under checks: > > #if INCLUDE_JFR && COMPILER2_OR_JVMCI > // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. > > Thanks, > Vladimir > > On 4/16/20 1:58 AM, Yang Zhang wrote: >> Hi, >> >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >> >> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >> compiler phase/inlining events. >> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >> >> With this patch, x86 client build succeeds. But AArch64 client build >> still fails, which is caused by [1]. I have filed [2] for AArch64 >> client build failure and will summit another patch for that. 
>> >> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >> >> Regards >> Yang >> From aph at redhat.com Mon Apr 20 08:48:50 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 09:48:50 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: On 4/20/20 5:32 AM, Pengfei Li wrote: > Maybe Andrew Haley or other AArch64 reviewers can help? > > [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ It's fine. At some point in the future maybe we can get round to taking out all references to rheapbase, but it'll require careful thinking about JVMCI and Graal-precompiled code. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 20 09:54:40 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 09:54:40 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: > It's fine. At some point in the future maybe we can get round to taking out all > references to rheapbase, but it'll require careful thinking about JVMCI and > Graal-precompiled code. Thanks Andrew. 
Pushed here http://hg.openjdk.java.net/jdk/jdk/rev/aedc9bf21743 -- Thanks, Pengfei From aph at redhat.com Mon Apr 20 10:01:10 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:01:10 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> Message-ID: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> On 4/20/20 9:48 AM, Andrew Haley wrote: > On 4/20/20 5:32 AM, Pengfei Li wrote: >> Maybe Andrew Haley or other AArch64 reviewers can help? >> >> [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ > It's fine. Sorry, no it isn't fine. Please get rid of this hunk: --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:52.009758661 +0800 +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:51.785764043 +0800 @@ -2185,6 +2185,10 @@ #if 0 assert (UseCompressedOops || UseCompressedClassPointers, "should be compressed"); assert (Universe::heap() != NULL, "java heap should be initialized"); + if (!UseCompressedOops || Universe::ptr_base() == NULL) { + // rheapbase is allocated as general register + return; + } if (CheckCompressedOops) { Label ok; push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From Pengfei.Li at arm.com Mon Apr 20 10:10:05 2020 From: Pengfei.Li at arm.com (Pengfei Li) Date: Mon, 20 Apr 2020 10:10:05 +0000 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> Message-ID: Hi Andrew, > Sorry, no it isn't fine. Please get rid of this hunk: > > --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020- > 04-14 21:18:52.009758661 +0800 > +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020- > 04-14 21:18:51.785764043 +0800 > @@ -2185,6 +2185,10 @@ > #if 0 > assert (UseCompressedOops || UseCompressedClassPointers, "should be > compressed"); > assert (Universe::heap() != NULL, "java heap should be initialized"); > + if (!UseCompressedOops || Universe::ptr_base() == NULL) { > + // rheapbase is allocated as general register > + return; > + } > if (CheckCompressedOops) { > Label ok; > push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 Oh. It's already pushed just now. According to the process, we may need Wei to create another JBS to backout that part? 
-- Thanks, Pengfei From aph at redhat.com Mon Apr 20 10:23:41 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:23:41 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> On 4/17/20 10:14 AM, Yang Zhang wrote: > Ping it again. Could you please help to review this? I'm running it, and I get no vector code generated. How did you test it? -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 10:36:19 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 11:36:19 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> References: <3b47599e-6b2f-06a9-6ea4-057795850065@redhat.com> Message-ID: On 4/20/20 11:23 AM, Andrew Haley wrote: > On 4/17/20 10:14 AM, Yang Zhang wrote: >> Ping it again. Could you please help to review this? > > I'm running it, and I get no vector code generated. How did you test it? Sorry, my mistake. I'm testing it now. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From kuaiwei.kw at alibaba-inc.com Mon Apr 20 11:12:55 2020 From: kuaiwei.kw at alibaba-inc.com (Kuai Wei) Date: Mon, 20 Apr 2020 19:12:55 +0800 Subject: =?UTF-8?B?UmU6IFJGUjogaGVhcGJhc2UgcmVnaXN0ZXIgY2FuIGJlIGFsbG9jYXRlZCBpbiBjb21wcmVz?= =?UTF-8?B?c2VkIG1vZGU=?= In-Reply-To: <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <9f991f61-2d59-ca87-d68e-7b8c257d9be4@redhat.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> , <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.com> Message-ID: <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> Hi Andrew, Could you tell more detail about it? I can start a new patch for it if it break anything. Kuai Wei ------------------------------------------------------------------ From:Andrew Haley Send Time:2020?4?20?(???) 18:01 To:Pengfei Li ; ??(??) ; "Liu, Xin" ; hotspot compiler Cc:nd ; aarch64-port-dev at openjdk.java.net Subject:Re: RFR: heapbase register can be allocated in compressed mode On 4/20/20 9:48 AM, Andrew Haley wrote: > On 4/20/20 5:32 AM, Pengfei Li wrote: >> Maybe Andrew Haley or other AArch64 reviewers can help? >> >> [1] http://cr.openjdk.java.net/~wzhuo/8242449/webrev.01/ > It's fine. Sorry, no it isn't fine. 
Please get rid of this hunk: --- old/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:52.009758661 +0800 +++ new/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp 2020-04-14 21:18:51.785764043 +0800 @@ -2185,6 +2185,10 @@ #if 0 assert (UseCompressedOops || UseCompressedClassPointers, "should be compressed"); assert (Universe::heap() != NULL, "java heap should be initialized"); + if (!UseCompressedOops || Universe::ptr_base() == NULL) { + // rheapbase is allocated as general register + return; + } if (CheckCompressedOops) { Label ok; push(1 << rscratch1->encoding(), sp); // cmpptr trashes rscratch1 -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 11:50:33 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 12:50:33 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242070: AArch64: Fix a typo introduced by JDK-8238690 In-Reply-To: References: Message-ID: On 4/17/20 10:14 AM, Yang Zhang wrote: > > Ping it again. Could you please help to review this? Before: Benchmark Mode Cnt Score Error Units TestVect.testVectShift avgt 5 141.027 ± 0.117 us/op 0.41% 0x0000ffffa8c5fc40: sbfiz x15, x11, #1, #32 0x0000ffffa8c5fc44: add x16, x18, x15 ;*saload {reexecute=0 rethrow=0 return_oop=0} ; - org.sample.TestVect::testVectShift at 16 (line 31) 0x0000ffffa8c5fc48: ldr q16, [x16, #16] 0.51% 0x0000ffffa8c5fc4c: neg v17.16b, v18.16b 0x0000ffffa8c5fc50: sshl v16.8h, v16.8h, v17.8h 0x0000ffffa8c5fc54: add x15, x17, x15 After: Benchmark Mode Cnt Score Error Units TestVect.testVectShift avgt 5 143.021 ± 
0.506 us/op 0.46% 0x0000ffff78c61f00: sbfiz x13, x15, #1, #32 0x0000ffff78c61f04: add x14, x17, x13 ;*saload {reexecute=0 rethrow=0 return_oop=0} ; - org.sample.TestVect::testVectShift at 16 (line 31) 0x0000ffff78c61f08: ldr q16, [x14, #16] 0.36% 0x0000ffff78c61f0c: sshr v16.8h, v16.8h, #2 0x0000ffff78c61f10: add x13, x16, x13 So, at least on this thing it makes no difference. I'll grant you it's less code, so OK. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 12:14:29 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 13:14:29 +0100 Subject: RFR: heapbase register can be allocated in compressed mode In-Reply-To: <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> References: <613724a7-1dd1-448d-aaaa-dbbe0d0beca4.kuaiwei.kw@alibaba-inc.com> <7bd76285-b58d-5359-85ed-4430288a675e@redhat.com> <0c6fdf72-3c83-4563-8d13-45e83ee70310.kuaiwei.kw@alibaba-inc.com> <8E4A835E-3853-40BA-B44F-DD0A4ECC0308@amazon.com> <78D18021-2129-485A-8407-A37D385D0DE6@amazon.com> <229d2a57-8fd0-4826-889d-cca833ca19f3.kuaiwei.kw@alibaba-inc.com> <781CB090-0386-4D32-8465-8238E516789B@amazon.com> <77fd9246-b951-47b9-9743-11aa3fd851bd.kuaiwei.kw@alibaba-inc.com> <84c21683-eaba-5598-6a1d-c58abdb39014@redhat.co m> <74ad538f-3247-4b31-832f-b3cb1bd9f41a.kuaiwei.kw@alibaba-inc.com> Message-ID: On 4/20/20 12:12 PM, Kuai Wei wrote: > Could you tell more detail about it? I can start a new patch for it > if it break anything. Well, it's ifdef'd out at the moment, so by definition it can't break anything. But there may be issues with Graal whereby we really do need to check rheapbase, but it's OK for now. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From maurizio.cimadamore at oracle.com Mon Apr 20 14:59:49 2020 From: maurizio.cimadamore at oracle.com (Maurizio Cimadamore) Date: Mon, 20 Apr 2020 15:59:49 +0100 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: References: Message-ID: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Hi David, did you mean to write to hotspot compiler (CCed) ? Maurizio On 20/04/2020 15:38, David Lloyd wrote: > Am I correct in understanding that there are no compiler intrinsics > for Long.divideUnsigned/remainderUnsigned? > > The implementation seems pretty expensive for an operation that is, if > I understand correctly, a single instruction on many CPU > architectures. But maybe these methods are not very frequently used? > (My clue was a comment in the source referencing an algorithm from > Hacker's Delight that could be used - if such an algorithm exists, but > wasn't implemented, presumably demand is low?) From david.lloyd at redhat.com Mon Apr 20 15:07:56 2020 From: david.lloyd at redhat.com (David Lloyd) Date: Mon, 20 Apr 2020 10:07:56 -0500 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Message-ID: Yes, I did, sorry about that. On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore wrote: > > Hi David, > did you mean to write to hotspot compiler (CCed) ? > > Maurizio > > On 20/04/2020 15:38, David Lloyd wrote: > > Am I correct in understanding that there are no compiler intrinsics > > for Long.divideUnsigned/remainderUnsigned? > > > > The implementation seems pretty expensive for an operation that is, if > > I understand correctly, a single instruction on many CPU > > architectures. But maybe these methods are not very frequently used? 
> > (My clue was a comment in the source referencing an algorithm from > > Hacker's Delight that could be used - if such an algorithm exists, but > > wasn't implemented, presumably demand is low?) > -- - DML From tobias.hartmann at oracle.com Mon Apr 20 15:52:27 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Mon, 20 Apr 2020 17:52:27 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Message-ID: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8242108 http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is the additional control dependency on a CastII/LL which restricts optimizations due to _carry_dependency being set (which was necessary because we can not represent non-null integers/long values in C2's type system). While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but in a more elegant way, avoiding an impact on performance. I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms vs. 5048 ops/ms on my machine). 
Thanks, Tobias [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html From joe.darcy at oracle.com Mon Apr 20 17:40:52 2020 From: joe.darcy at oracle.com (Joe Darcy) Date: Mon, 20 Apr 2020 10:40:52 -0700 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> Message-ID: <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> The divideUnsigned methods in question are not marked with the @HotSpotIntrinsicCandidate annotation so it doesn't look like there are currently intrinsics. Cheers, -Joe On 4/20/2020 8:07 AM, David Lloyd wrote: > Yes, I did, sorry about that. > > On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore > wrote: >> Hi David, >> did you mean to write to hotspot compiler (CCed) ? >> >> Maurizio >> >> On 20/04/2020 15:38, David Lloyd wrote: >>> Am I correct in understanding that there are no compiler intrinsics >>> for Long.divideUnsigned/remainderUnsigned? >>> >>> The implementation seems pretty expensive for an operation that is, if >>> I understand correctly, a single instruction on many CPU >>> architectures. But maybe these methods are not very frequently used? >>> (My clue was a comment in the source referencing an algorithm from >>> Hacker's Delight that could be used - if such an algorithm exists, but >>> wasn't implemented, presumably demand is low?) > From vladimir.kozlov at oracle.com Mon Apr 20 19:32:16 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 20 Apr 2020 12:32:16 -0700 Subject: RFR(XS): 8242796: Fix client build failure In-Reply-To: References: Message-ID: Looks good. Thanks, Vladimir On 4/19/20 11:30 PM, Yang Zhang wrote: > Hi Vladimir > > Thanks for your comment. I update the patch. 
> http://cr.openjdk.java.net/~yzhang/8242796/webrev.02/ > > Regards > Yang > > -----Original Message----- > From: Vladimir Kozlov > Sent: Saturday, April 18, 2020 3:07 AM > To: Yang Zhang ; hotspot-compiler-dev at openjdk.java.net > Subject: Re: RFR(XS): 8242796: Fix client build failure > > Hi Yang > > On 4/17/20 1:37 AM, Yang Zhang wrote: >> Hi Vladimir >> >> I update the patch according to your comment. >> http://cr.openjdk.java.net/~yzhang/8242796/webrev.01/ >> >> These checks are needed. >> #if INCLUDE_JFR && COMPILER2_OR_JVMCI >> #if COMPILER2 ----------- without it, configuring --with-jvm-features=-compiler2 fails to build. The error is the same as before. > > Yes, I agree that additional #ifdef COMPILER2 is needed. > The only comment I have is to may be include compiler_c2 check under that #ifdef and leaving #endif at the same place: > > + #ifdef COMPILER2 > } else if (compiler_type == compiler_c2) { > > first_registration = false; > + #endif // COMPILER2 > } > > Thanks, > Vladimir > >> >> Regards >> Yang >> >> -----Original Message----- >> From: hotspot-compiler-dev >> On Behalf Of Vladimir >> Kozlov >> Sent: Friday, April 17, 2020 5:27 AM >> To: hotspot-compiler-dev at openjdk.java.net >> Subject: Re: RFR(XS): 8242796: Fix client build failure >> >> Method register_jfr_phasetype_serializer() is only called when C2 or JVMCI code is included in JVM build. >> I think you need to put whole method under checks: >> >> #if INCLUDE_JFR && COMPILER2_OR_JVMCI >> // It appends new compiler phase names to growable array phase_names(a new CompilerPhaseType mapping // in compiler/compilerEvent.cpp) and registers it with its serializer. >> >> Thanks, >> Vladimir >> >> On 4/16/20 1:58 AM, Yang Zhang wrote: >>> Hi, >>> >>> Could you please help to review this patch? 
>>> >>> JBS: https://bugs.openjdk.java.net/browse/JDK-8242796 >>> Webrev: http://cr.openjdk.java.net/~yzhang/8242796/webrev.00/ >>> >>> This build failure is introduced by JDK-8193210 [JVMCI/Graal] add JFR >>> compiler phase/inlining events. >>> C2 only code is used for JFR. To fix this issue, I use COMPILER2 macro. >>> >>> With this patch, x86 client build succeeds. But AArch64 client build >>> still fails, which is caused by [1]. I have filed [2] for AArch64 >>> client build failure and will summit another patch for that. >>> >>> [1] https://bugs.openjdk.java.net/browse/JDK-8241665 >>> [2] https://bugs.openjdk.java.net/browse/JDK-8242905 >>> >>> Regards >>> Yang >>> From vladimir.kozlov at oracle.com Mon Apr 20 19:55:39 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Mon, 20 Apr 2020 12:55:39 -0700 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> Hi Tobias, aarch64.ad has more changes than just undo 8229496. Otherwise it is good. Does it affect performance of our standard benchmarks? Thanks, Vladimir K On 4/20/20 8:52 AM, Tobias Hartmann wrote: > Hi, > > please review the following patch: > https://bugs.openjdk.java.net/browse/JDK-8242108 > http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ > > The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is > the additional control dependency on a CastII/LL which restricts optimizations due to > _carry_dependency being set (which was necessary because we can not represent non-null integers/long > values in C2's type system). > > While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but > in a more elegant way, avoiding an impact on performance. 
> > I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and > also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms > vs. 5048 ops/ms on my machine). > > Thanks, > Tobias > > [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html > [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html > From cjashfor at linux.ibm.com Mon Apr 20 20:39:33 2020 From: cjashfor at linux.ibm.com (Corey Ashford) Date: Mon, 20 Apr 2020 13:39:33 -0700 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: References: <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com> Hi Martin, Sorry for the delay on getting the copyright changes in (I work half time). Here's the revised patch, with all copyright dates set to 2020: https://bugs.openjdk.java.net/browse/JDK-8241874 http://cr.openjdk.java.net/~gromero/8241874/v2/ Thanks for your consideration, - Corey On 4/16/20 1:08 AM, Doerr, Martin wrote: > Hi Corey, > > please use 2020 for both, the Oracle and the SAP copyright. > Usually, both should be the same, but some people forget to update one of them. > > Best regards, > Martin > > >> -----Original Message----- >> From: Corey Ashford >> Sent: Donnerstag, 16. April 2020 03:35 >> To: Doerr, Martin >> Cc: Michihiro Horie ; hotspot-compiler- >> dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net >> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >> Long.reverseBytes() and Integer.reverseBytes() on Power9 >> >> Hello Martin, >> >> I'm having some trouble with my email server, so I'm having to reply to >> your earlier post, but I saw your most recent post on the mailing list >> archive. 
>> >> Thanks for reviewing and testing this patch. I went to look at the >> copyright dates, and see two date ranges: one for Oracle and its >> affiliates, and another for SAP. In the files I looked at, the end date >> wasn't the same between the two. Which one (or both) should I modify? >> >> Thanks, >> >> - Corey >> >> On 4/14/20 6:26 AM, Doerr, Martin wrote: >>> Hi Corey, >>> >>> thanks for contributing it. Looks good to me. I'll run it through our >>> testing and let you know about the results. >>> >>> Best regards, >>> >>> Martin >>> >>> *From:*ppc-aix-port-dev >> *On >>> Behalf Of *Michihiro Horie >>> *Sent:* Friday, 10 April 2020 10:48 >>> *To:* cjashfor at linux.ibm.com >>> *Cc:* hotspot-compiler-dev at openjdk.java.net; >>> ppc-aix-port-dev at openjdk.java.net >>> *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> >>> Hi Corey, >>> >>> Thank you for sharing your benchmarks. I confirmed your change reduced >>> the elapsed time of the benchmarks by more than 30% on my P9 node. >> Also, >>> I checked JTREG results, which look fine. >>> >>> BTW, I cannot find further points of improvement in your change. >>> >>> Best regards, >>> Michihiro >>> >>> >>> ----- Original message ----- >>> From: "Corey Ashford" >> > >>> To: Michihiro Horie/Japan/IBM at IBMJP >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> , >>> ppc-aix-port-dev at openjdk.java.net >>> , "Gustavo Romero" >>> > >>> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> Date: Fri, Apr 3, 2020 8:07 AM >>> >>> On 4/2/20 7:27 AM, Michihiro Horie wrote: >>>> Hi Corey, >>>> >>>> I'm not a reviewer, but I can run your benchmark in my local P9 node if >>>> you share it. >>>> >>>> Best regards, >>>> Michihiro >>> >>> The tests are somewhat hokey; I added the shifts to keep the compiler >>> from hoisting code whose result it could predetermine.
>>> Here's the one for Long.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseLong >>> { >>> public static void main(String args[]) >>> { >>> long reversed, re_reversed; >>> long accum = 0; >>> long orig = 0x1122334455667788L; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Long.reverseBytes(orig); >>> re_reversed = Long.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%16x", orig) + >>> " Re-reversed: " + String.format("%16x", re_reversed)); >>> } >>> accum += orig; >>> orig = Long.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Long.toString(accum)); >>> } >>> } >>> >>> >>> And the one for Integer.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseInt >>> { >>> public static void main(String args[]) >>> { >>> int reversed, re_reversed; >>> int orig = 0x11223344; >>> int accum = 0; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Integer.reverseBytes(orig); >>> re_reversed = Integer.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%08x", orig) + >>> " Re-reversed: " + String.format("%08x", re_reversed)); >>> } >>> accum += orig; >>> orig = Integer.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Integer.toString(accum)); >>> } >>> }
>>> > From eric.c.liu at arm.com Tue Apr 21 03:20:44 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 03:20:44 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, There's one failure, but I don't know whether it's caused by my patch. Unfortunately I don't have a detailed report. Could you help to check the result? http://hg.openjdk.java.net/jdk/submit/rev/01cbc15277b8 Thanks, Eric From eric.c.liu at arm.com Tue Apr 21 04:12:33 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 04:12:33 +0000 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: Hi Tobias, I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN, making it unable to recognize the same 'CmpNode' as before. E.g. for java code: ``` public int foo(int a, int b) { int r = a / b; r = r / b; // no need zero-check r = r / b; // no need zero-check return r; } ``` The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. I think backing out the original fix for 8229496 would solve this problem. One comment: The test case [1] tries to detect the wrong dependency order, but I assume it is unable to find the same issue on AArch64 due to the different behavior of div/mod compared with AMD64.
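To make the AArch64/AMD64 point above concrete: regardless of whether the hardware traps on division by zero (x86 raises SIGFPE via idiv) or quietly produces 0 (AArch64 sdiv), the Java-level semantics are fixed — integer division by zero must throw ArithmeticException on every platform. A small plain-Java illustration (class and method names are mine):

```java
class DivZeroSemantics {
    // Java requires ArithmeticException for integer division by zero on
    // every platform. On x86 the JVM can rely on the idiv trap (SIGFPE);
    // on AArch64, where sdiv yields 0 without trapping, the JIT has to
    // emit an explicit zero check to produce the same behavior.
    static int divOrSentinel(int a, int b) {
        try {
            return a / b;
        } catch (ArithmeticException e) {
            return Integer.MIN_VALUE; // sentinel: the zero check fired
        }
    }

    public static void main(String[] args) {
        System.out.println(divOrSentinel(10, 2)); // 5
        System.out.println(divOrSentinel(10, 0)); // sentinel value
    }
}
```

This is also why letting a div node float above its zero check is dangerous on both architectures, even though only one of them traps in hardware.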
[1] http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java Thanks, Eric -----Original Message----- From: hotspot-compiler-dev On Behalf Of Tobias Hartmann Sent: Monday, April 20, 2020 11:52 PM To: hotspot compiler Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Hi, please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8242108 http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ The fix for 8229496 [1] triggers a performance regression with NumberFormat.format(). The problem is the additional control dependency on a CastII/LL which restricts optimizations due to _carry_dependency being set (which was necessary because we can not represent non-null integers/long values in C2's type system). While investigating, I've noticed that Roland's fix for 8241900 [2] fixes the exact same problem but in a more elegant way, avoiding an impact on performance. I'm therefore proposing to back out the original fix for 8229496, leaving the regression test in and also adding a microbenchmark. I've verified that this solves the performance regression (4547 ops/ms vs. 5048 ops/ms on my machine). Thanks, Tobias [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034865.html [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-April/037778.html From eric.c.liu at arm.com Tue Apr 21 05:03:00 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 05:03:00 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: , Message-ID: Hi Vladimir, The test report I received: [Mach5] mach5-one-yzhang-JDK-8242429-1-20200420-1153-10322515: [FAILED] 1 Failed tier1-debug-open_test_hotspot_jtreg_tier1_serviceability-macosx-x64-debug-64 TimeoutException in EXECUTION. 
Thanks, Eric -----Original Message----- From: Eric Liu Sent: Tuesday, April 21, 2020 11:21 AM To: Eric Liu ; Vladimir Ivanov ; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: RE: RFR(S):8242429:Better implementation for signed extract Hi Vladimir, There's one failure, but I don't know whether it's cause by my patch. Unfortunately I don't have detailed report. Could you help to check the result? http://hg.openjdk.java.net/jdk/submit/rev/01cbc15277b8 Thanks, Eric From tobias.hartmann at oracle.com Tue Apr 21 06:29:51 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 08:29:51 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <9b641dc6-4a25-f26f-9dc1-822d616f0e75@oracle.com> Message-ID: <19e8467e-4abc-5f26-bf49-cec6b5aa29e6@oracle.com> Hi Vladimir, On 20.04.20 21:55, Vladimir Kozlov wrote: > aarch64.ad has more changes than just undo 8229496. Oops, not sure how that happened. I've updated the webrev in-place. > Otherwise it is good. Thanks for the review! > Does it affect performance of our standard benchmarks? No, I've checked performance already with the fix for 8229496 and there was no measurable difference. Thanks, Tobias From tobias.hartmann at oracle.com Tue Apr 21 06:44:12 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 08:44:12 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Hi Eric, thanks for looking at this! On 21.04.20 06:12, Eric Liu wrote: > I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN that make it unable to recognize the same 'CmpNode' as before. > > E.g. 
for java code: > ``` > public int foo(int a, int b) { > int r = a / b; > r = r / b; // no need zero-check > r = r / b; // no need zero-check > return r; > } > ``` > > The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. Right and there are also some other problems (for example, CastLL does not implement the Value optimizations that CastII has). > I think backing out the original fix for 8229496 would solve this problem. Yes, I've verified that. > One comment: > > The test case [1] try to detect the wrong dependency order but I assume it unable to find the same issue in AArch64 due to the different behavior of div/mod compared with AMD64. > > [1] http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java I'm not familiar with the div/mod implementation on AArch64 but the underlying issue, which is a div/mod node floating above the null-check, is platform independent. Thanks, Tobias From rwestrel at redhat.com Tue Apr 21 07:21:31 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Tue, 21 Apr 2020 09:21:31 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> Message-ID: <875zdt9udg.fsf@redhat.com> > http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ Looks good to me. Roland. From tobias.hartmann at oracle.com Tue Apr 21 07:34:51 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 09:34:51 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <875zdt9udg.fsf@redhat.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <875zdt9udg.fsf@redhat.com> Message-ID: Hi Roland, thanks for the review! 
Best regards, Tobias On 21.04.20 09:21, Roland Westrelin wrote: > >> http://cr.openjdk.java.net/~thartmann/8242108/webrev.00/ > > Looks good to me. > > Roland. > From aph at redhat.com Tue Apr 21 09:23:33 2020 From: aph at redhat.com (Andrew Haley) Date: Tue, 21 Apr 2020 10:23:33 +0100 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: On 4/17/20 10:13 AM, Yang Zhang wrote: > Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. > > In this directory, there are also some test cases about reduction operations, which is added in [1]. > https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtreg/compiler/loopopts/superword > > [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Sounds good. Thanks! -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From eric.c.liu at arm.com Tue Apr 21 10:14:05 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Tue, 21 Apr 2020 10:14:05 +0000 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Message-ID: Hi Tobias, > I'm not familiar with the div/mod implementation on AArch64 but the > underlying issue, which is a div/mod node floating above the null-check, > is platform independent. Yes, it's platform independent. As you said, this test case intends to detect whether a div/mod node floats above the null-check. But on AArch64, division by zero would not throw any exception, while AMD64 would generate a SIGFPE. I'm not sure whether this test case should only be used for AMD64.
Thanks, Eric -----Original Message----- From: Tobias Hartmann Sent: Tuesday, April 21, 2020 2:44 PM To: Eric Liu ; hotspot compiler Cc: nd Subject: Re: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 Hi Eric, thanks for looking at this! On 21.04.20 06:12, Eric Liu wrote: > I'm not sure whether you noticed that with https://bugs.openjdk.java.net/browse/JDK-8229496, the 'CastII/CastLL' would mislead GVN that make it unable to recognize the same 'CmpNode' as before. > > E.g. for java code: > ``` > public int foo(int a, int b) { > int r = a / b; > r = r / b; // no need zero-check > r = r / b; // no need zero-check > return r; > } > ``` > > The zero-check for 'b' could not be removed as before if 'b' is boxed with CastII. Right and there are also some other problems (for example, CastLL does not implement the Value optimizations that CastII has). > I think backing out the original fix for 8229496 would solve this problem. Yes, I've verified that. > One comment: > > The test case [1] try to detect the wrong dependency order but I assume it unable to find the same issue in AArch64 due to the different behavior of div/mod compared with AMD64. > > [1] > http://cr.openjdk.java.net/~thartmann/8229496/webrev.00/raw_files/new/ > test/hotspot/jtreg/compiler/loopopts/TestDivZeroCheckControl.java I'm not familiar with the div/mod implementation on AArch64 but the underlying issue, which is a div/mod node floating above the null-check, is platform independent. 
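The "div/mod node floating above the null-check" hazard discussed above can be reduced to a plain-Java exception-ordering example (hypothetical names; the real jtreg regression test is TestDivZeroCheckControl): if a compiler hoisted the division above the null check, the wrong exception would surface, violating Java semantics on any platform.

```java
class CheckOrdering {
    static int[] data = null; // intentionally null

    // The array access precedes the division in program order, so the
    // NullPointerException must be thrown first. If a compiler floated
    // the division above the access, observe(10, 0) would instead throw
    // ArithmeticException -- the bug pattern the test guards against.
    static String observe(int x, int y) {
        try {
            int len = data.length; // must throw NPE: data is null
            return "len=" + (x / y + len);
        } catch (NullPointerException e) {
            return "NPE";
        } catch (ArithmeticException e) {
            return "ArithmeticException";
        }
    }

    public static void main(String[] args) {
        System.out.println(observe(10, 0)); // prints NPE
    }
}
```

Since the required ordering is part of the language specification, the check applies equally whether the backend detects division by zero via a hardware trap or an explicit compare.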
Thanks, Tobias From tobias.hartmann at oracle.com Tue Apr 21 12:09:53 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 14:09:53 +0200 Subject: [15] RFR(M): 8242108: Performance regression after fix for JDK-8229496 In-Reply-To: References: <38c260cd-9d1f-7676-7f45-3bac926b11c0@oracle.com> <7febb066-5b53-902f-0bf1-ee2f76d36748@oracle.com> Message-ID: <405e0fc6-f4ef-e660-7c69-6f90d589c60d@oracle.com> Hi Eric, On 21.04.20 12:14, Eric Liu wrote: > Yes, it's platform independent. > > As you said, this test case intends to detect whether a div/mod node floats > above the null-check. But on AArch64, division by zero would not throw any > exception, while AMD64 would generate a SIGFPE. Okay, thanks for the details. > I'm not sure whether this test case should only be used for AMD64. Right, but I think in any case it doesn't hurt to execute it on AARCH64 as well. Best regards, Tobias From tobias.hartmann at oracle.com Tue Apr 21 12:12:23 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 14:12:23 +0200 Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> Message-ID: <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> That's correct, these methods are currently not intrinsified by the JITs. Best regards, Tobias On 20.04.20 19:40, Joe Darcy wrote: > The divideUnsigned methods in question are not marked with the @HotSpotIntrinsicCandidate annotation > so it doesn't look like there are currently intrinsics. > > Cheers, > > -Joe > > On 4/20/2020 8:07 AM, David Lloyd wrote: >> Yes, I did, sorry about that. >> >> On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore >> wrote: >>> Hi David, >>> did you mean to write to hotspot compiler (CCed) ?
>>> >>> Maurizio >>> >>> On 20/04/2020 15:38, David Lloyd wrote: >>>> Am I correct in understanding that there are no compiler intrinsics >>>> for Long.divideUnsigned/remainderUnsigned? >>>> >>>> The implementation seems pretty expensive for an operation that is, if >>>> I understand correctly, a single instruction on many CPU >>>> architectures. But maybe these methods are not very frequently used? >>>> (My clue was a comment in the source referencing an algorithm from >>>> Hacker's Delight that could be used - if such an algorithm exists, but >>>> wasn't implemented, presumably demand is low?) >> From HORIE at jp.ibm.com Tue Apr 21 13:21:32 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Tue, 21 Apr 2020 22:21:32 +0900 Subject: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 In-Reply-To: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com> References: <13786032-d4e9-9682-5cd7-698ceb4f8c00@linux.ibm.com>, <67fa8056-a8ed-cdfc-1e5a-d36b49c4af18@linux.ibm.com> <0079874e-7bc2-5ff4-f004-337c718ec6df@linux.ibm.com> <1964be00-8926-7a70-d23a-2f7e85eb4ef3@linux.ibm.com> Message-ID: Hi Corey, Martin, I confirmed the latest webrev fixes the copyright years properly, so the change looks ready to be pushed. I will push the change by tomorrow. Best regards, Michihiro ----- Original message ----- From: "Corey Ashford" To: "Doerr, Martin" Cc: Michihiro Horie/Japan/IBM at IBMJP, "hotspot-compiler-dev at openjdk.java.net" , "ppc-aix-port-dev at openjdk.java.net" Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of Long.reverseBytes() and Integer.reverseBytes() on Power9 Date: Tue, Apr 21, 2020 5:39 AM Hi Martin, Sorry for the delay on getting the copyright changes in (I work half time).
Here's the revised patch, with all copyright dates set to 2020: https://bugs.openjdk.java.net/browse/JDK-8241874 http://cr.openjdk.java.net/~gromero/8241874/v2/ Thanks for your consideration, - Corey On 4/16/20 1:08 AM, Doerr, Martin wrote: > Hi Corey, > > please use 2020 for both, the Oracle and the SAP copyright. > Usually, both should be the same, but some people forget to update one of them. > > Best regards, > Martin > > >> -----Original Message----- >> From: Corey Ashford >> Sent: Donnerstag, 16. April 2020 03:35 >> To: Doerr, Martin >> Cc: Michihiro Horie ; hotspot-compiler- >> dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net >> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >> Long.reverseBytes() and Integer.reverseBytes() on Power9 >> >> Hello Martin, >> >> I'm having some trouble with my email server, so I'm having to reply to >> your earlier post, but I saw your most recent post on the mailing list >> archive. >> >> Thanks for reviewing and testing this patch. I went to look at the >> copyright dates, and see two date ranges: one for Oracle and its >> affiliates, and another for SAP. In the files I looked at, the end date >> wasn't the same between the two. Which one (or both) should I modify? >> >> Thanks, >> >> - Corey >> >> On 4/14/20 6:26 AM, Doerr, Martin wrote: >>> Hi Corey, >>> >>> thanks for contributing it. Looks good to me. I?ll run it through our >>> testing and let you know about the results. >>> >>> Best regards, >>> >>> Martin >>> >>> *From:*ppc-aix-port-dev >> *On >>> Behalf Of *Michihiro Horie >>> *Sent:* Freitag, 10. April 2020 10:48 >>> *To:* cjashfor at linux.ibm.com >>> *Cc:* hotspot-compiler-dev at openjdk.java.net; >>> ppc-aix-port-dev at openjdk.java.net >>> *Subject:* Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> >>> Hi Corey, >>> >>> Thank you for sharing your benchmarks. 
I confirmed your change reduced >>> the elapsed time of the benchmarks by more than 30% on my P9 node. >> Also, >>> I checked JTREG results, which look no problem. >>> >>> BTW, I cannot find further points of improvement in your change. >>> >>> Best regards, >>> Michihiro >>> >>> >>> ----- Original message ----- >>> From: "Corey Ashford" >> > >>> To: Michihiro Horie/Japan/IBM at IBMJP >>> Cc: hotspot-compiler-dev at openjdk.java.net >>> , >>> ppc-aix-port-dev at openjdk.java.net >>> , "Gustavo Romero" >>> > >>> Subject: Re: RFR[S]:8241874 [PPC64] Improve performance of >>> Long.reverseBytes() and Integer.reverseBytes() on Power9 >>> Date: Fri, Apr 3, 2020 8:07 AM >>> >>> On 4/2/20 7:27 AM, Michihiro Horie wrote: >>>> Hi Corey, >>>> >>>> I?m not a reviewer, but I can run your benchmark in my local P9 node if >>>> you share it. >>>> >>>> Best regards, >>>> Michihiro >>> >>> The tests are somewhat hokey; I added the shifts to keep the compiler >>> from hoisting the code that it could predetermine the result. 
>>> >>> Here's the one for Long.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseLong >>> { >>> public static void main(String args[]) >>> { >>> long reversed, re_reversed; >>> long accum = 0; >>> long orig = 0x1122334455667788L; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Long.reverseBytes(orig); >>> re_reversed = Long.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%16x", orig) + >>> " Re-reversed: " + String.format("%16x", re_reversed)); >>> } >>> accum += orig; >>> orig = Long.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Long.toString(accum)); >>> } >>> } >>> >>> >>> And the one for Integer.reverseBytes(): >>> >>> import java.lang.*; >>> >>> class ReverseInt >>> { >>> public static void main(String args[]) >>> { >>> int reversed, re_reversed; >>> int orig = 0x11223344; >>> int accum = 0; >>> long start = System.currentTimeMillis(); >>> for (int i = 0; i < 1_000_000_000; i++) { >>> // Try to keep java from figuring out stuff in advance >>> reversed = Integer.reverseBytes(orig); >>> re_reversed = Integer.reverseBytes(reversed); >>> if (re_reversed != orig) { >>> System.out.println("Orig: " + String.format("%08x", orig) + >>> " Re-reversed: " + String.format("%08x", re_reversed)); >>> } >>> accum += orig; >>> orig = Integer.rotateRight(orig, 3); >>> } >>> System.out.println("Elapsed time: " + >>> Long.toString(System.currentTimeMillis() - start)); >>> System.out.println("accum: " + Integer.toString(accum)); >>> } >>> } >>> > From rahul.v.raghavan at oracle.com Tue Apr 21 13:26:52 2020 From: rahul.v.raghavan at oracle.com (Rahul Raghavan) Date: Tue, 21 Apr 2020 18:56:52 +0530 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should 
refer XTEST Message-ID: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> Hi, Please review the following very trivial fix for a typo in the man page. http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ https://bugs.openjdk.java.net/browse/JDK-8241986 Thanks, Rahul From tobias.hartmann at oracle.com Tue Apr 21 13:40:17 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Tue, 21 Apr 2020 15:40:17 +0200 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should refer XTEST In-Reply-To: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> References: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> Message-ID: <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> Hi Rahul, looks good to me. Best regards, Tobias On 21.04.20 15:26, Rahul Raghavan wrote: > Hi, > > Please review the following very trivial fix for a typo in the man page. > > http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ > https://bugs.openjdk.java.net/browse/JDK-8241986 > > Thanks, > Rahul From HORIE at jp.ibm.com Tue Apr 21 14:57:30 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Tue, 21 Apr 2020 23:57:30 +0900 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: Message-ID: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing the same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option.
It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default, which doesn't make sense to me. PPC64 has an automatic prefetch engine, and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check the performance impact of changing AllocatePrefetchLines + Distance, I'll be glad to receive feedback. Best regards, Martin From vladimir.kozlov at oracle.com Tue Apr 21 18:41:32 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 21 Apr 2020 11:41:32 -0700 Subject: [15]RFR: 8241986: java man page erroneously refers to XEND when it should refer XTEST In-Reply-To: <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> References: <6b7d90c1-1180-d643-826f-2c40baae2957@oracle.com> <09451a53-152f-4222-059b-90e3a3b6b7ce@oracle.com> Message-ID: +1 Vladimir On 4/21/20 6:40 AM, Tobias Hartmann wrote: > Hi Rahul, > > looks good to me. > > Best regards, > Tobias > > On 21.04.20 15:26, Rahul Raghavan wrote: >> Hi, >> >> Please review the following very trivial fix for a typo in the man page.
>> >> http://cr.openjdk.java.net/~rraghavan/8241986/webrev.00/ >> https://bugs.openjdk.java.net/browse/JDK-8241986 >> >> Thanks, >> Rahul From Yang.Zhang at arm.com Wed Apr 22 04:23:51 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Wed, 22 Apr 2020 04:23:51 +0000 Subject: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear In-Reply-To: References: Message-ID: Hi Andrew Thanks for your review. I will ask Pengfei to help push it. Regards Yang -----Original Message----- From: Andrew Haley Sent: Tuesday, April 21, 2020 5:24 PM To: Yang Zhang ; aarch64-port-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: [aarch64-port-dev ] RFR(M): 8242482: AArch64: Change parameter names of reduction operations to make code clear On 4/17/20 10:13 AM, Yang Zhang wrote: > Besides tier1, I also test these operations in Vector API test, which can cover all the reduction operations. > > In this directory, there are also some test cases about reduction operations, which is added in [1]. > https://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/test/hotspot/jtr > eg/compiler/loopopts/superword > > [1] https://bugs.openjdk.java.net/browse/JDK-8240248 Sounds good. Thanks! -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From forax at univ-mlv.fr Wed Apr 22 14:03:05 2020 From: forax at univ-mlv.fr (Remi Forax) Date: Wed, 22 Apr 2020 16:03:05 +0200 (CEST) Subject: Intrinsics for divideUnsigned/remainderUnsigned In-Reply-To: <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> References: <84dcf12f-7a75-0a8b-b590-3ebb122c7ec3@oracle.com> <2104801b-5942-0982-ade7-1e6f418cfd06@oracle.com> <0cd41102-c950-13dc-1959-0893ef1237dc@oracle.com> Message-ID: <39292789.1313445.1587564185591.JavaMail.zimbra@u-pem.fr> And don't forget compareUnsigned ! 
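For readers following the thread, a minimal self-contained sketch of why the unsigned comparison matters for raw 64-bit bit patterns such as NaN-boxed values (the example values are invented for illustration; `Long.compareUnsigned` is the standard JDK API):

```java
// Why Long.compareUnsigned matters for raw 64-bit bit patterns:
// a signed compare misorders any pair that straddles the sign bit.
public class UnsignedCompareDemo {
    public static void main(String[] args) {
        long small = 1L;
        long big = 0x8000_0000_0000_0000L; // Long.MIN_VALUE when read as signed

        // Signed order: big < small, because big is negative as a long.
        if (Long.compare(big, small) >= 0) throw new AssertionError();

        // Unsigned order: big > small, the bit-pattern order tag checks need.
        if (Long.compareUnsigned(big, small) <= 0) throw new AssertionError();

        // Portable fallback without the library method: flip the sign bit
        // of both operands, then compare signed.
        if (Long.compare(big ^ Long.MIN_VALUE, small ^ Long.MIN_VALUE) <= 0)
            throw new AssertionError();

        System.out.println("ok");
    }
}
```

The sign-bit flip is the classic pre-Java-8 idiom; `Long.compareUnsigned` expresses it in one call, which is why intrinsifying it would pay off on hot comparison paths.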
I believe you cannot have an efficient implementation of Mozilla SpiderMonkey's representation (NaN boxing [1]) without it. regards, Rémi [1] https://brionv.com/log/2018/05/17/javascript-engine-internals-nan-boxing/ ----- Original Message ----- > From: "Tobias Hartmann" > To: "joe darcy" , "David Lloyd" , "Maurizio Cimadamore" > > Cc: "hotspot compiler" > Sent: Tuesday, April 21, 2020 14:12:23 > Subject: Re: Intrinsics for divideUnsigned/remainderUnsigned > That's correct, these methods are currently not intrinsified by the JITs. > > Best regards, > Tobias > > On 20.04.20 19:40, Joe Darcy wrote: >> The divideUnsigned methods in question are not marked with the >> @HotSpotIntrinsicCandidate annotation >> so it doesn't look like there are currently intrinsics. >> >> Cheers, >> >> -Joe >> >> On 4/20/2020 8:07 AM, David Lloyd wrote: >>> Yes, I did, sorry about that. >>> >>> On Mon, Apr 20, 2020 at 10:02 AM Maurizio Cimadamore >>> wrote: >>>> Hi David, >>>> did you mean to write to hotspot compiler (CCed) ? >>>> >>>> Maurizio >>>> >>>> On 20/04/2020 15:38, David Lloyd wrote: >>>>> Am I correct in understanding that there are no compiler intrinsics >>>>> for Long.divideUnsigned/remainderUnsigned? >>>>> >>>>> The implementation seems pretty expensive for an operation that is, if >>>>> I understand correctly, a single instruction on many CPU >>>>> architectures. But maybe these methods are not very frequently used? >>>>> (My clue was a comment in the source referencing an algorithm from >>>>> Hacker's Delight that could be used - if such an algorithm exists, but >>>>> wasn't implemented, presumably demand is low?) From lutz.schmidt at sap.com Wed Apr 22 18:01:44 2020 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Wed, 22 Apr 2020 18:01:44 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: Message-ID: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> Hi Martin, your change looks good to me. 
I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. 
Best regards, Martin From Yang.Zhang at arm.com Thu Apr 23 02:39:26 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Thu, 23 Apr 2020 02:39:26 +0000 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ This issue is introduced by [1]. In this commit, pop_CPU_state(restore_vectors) and leave() are included under COMPILER2_OR_JVMCI check in AArch64 restore_live_registers[2]. But restore_live_registers is used in generate_resolve_blob[3] which might be called from c1. In x86 restore_live_registers, pop_CPU_state() and pop(rbp) are always done [4]. To fix this issue, pop_CPU_state(restore_vectors) and leave() are also moved outside of COMPILER2_OR_JVMCI check in AArch64 restore_live_registers. Testing on AArch64 platform: tier1 test with server build server build with configuring --with-jvm-features=-compiler2 client build and ran HelloWorld [1] https://bugs.openjdk.java.net/browse/JDK-8241665 [2] https://hg.openjdk.java.net/jdk/jdk/rev/53568400fec3#l1.23 [3] http://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#l2850 [4] http://hg.openjdk.java.net/jdk/jdk/file/55c4283a7606/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#l378 From eric.c.liu at arm.com Thu Apr 23 03:57:46 2020 From: eric.c.liu at arm.com (Eric Liu) Date: Thu, 23 Apr 2020 03:57:46 +0000 Subject: RFR(S):8242429:Better implementation for signed extract In-Reply-To: References: Message-ID: Hi Vladimir, Today we retriggered the job and it's passed all test cases. 
The details are as below: Job: mach5-one-njian-JDK-8242429-2-20200423-0236-10421472 BuildId: 2020-04-23-0235070.ningsheng.jian.source No failed tests Tasks Summary NOTHING_TO_RUN: 0 UNABLE_TO_RUN: 0 KILLED: 0 NA: 0 HARNESS_ERROR: 0 FAILED: 0 EXECUTED_WITH_FAILURE: 0 PASSED: 84 I'm wondering whether it's necessary to check it again by another reviewer. Thanks, Eric -----Original Message----- From: Vladimir Ivanov Sent: Thursday, April 16, 2020 8:29 PM To: Eric Liu ; hotspot-compiler-dev at openjdk.java.net Cc: nd Subject: Re: RFR(S):8242429:Better implementation for signed extract > Webrev: http://cr.openjdk.java.net/~yzhang/ericliu/8242429/webrev.01/ Looks good. Have you tested it through submit repo? Best regards, Vladimir Ivanov > [Tests] > Jtreg hotspot::hotspot_all_no_apps, jdk::jdk_core and langtools::tier1. > No new failure found. > > JMH: A simple JMH case [1] on AArch64 and AMD64 machines. > > For AArch64, one platform has no obvious improvement, but on others > the performance gain is 7.3%~32.7%. > > For AMD64, one platform has no obvious improvement, but on others the > performance gain is 13.7%~32.4%. > > A simple test case [2] has checked the correctness for some corner > cases. > > [1] > https://bugs.openjdk.java.net/secure/attachment/87712/IdealNegate.java > [2] > https://bugs.openjdk.java.net/secure/attachment/87713/SignExtractTest. > java > > > Thanks, > Eric > From aph at redhat.com Thu Apr 23 08:44:15 2020 From: aph at redhat.com (Andrew Haley) Date: Thu, 23 Apr 2020 09:44:15 +0100 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed In-Reply-To: References: Message-ID: On 4/23/20 3:39 AM, Yang Zhang wrote: > Could you please help to review this patch? > > JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 > Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ Ok, thanks. Does anyone in the real world use AArch64 client builds? I'm wondering if we'd be better off without that option. 
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From jorn.vernee at oracle.com Thu Apr 23 12:52:38 2020 From: jorn.vernee at oracle.com (Jorn Vernee) Date: Thu, 23 Apr 2020 14:52:38 +0200 Subject: is it time fully optimize long loops? (JDK-8223051) In-Reply-To: <87ftdbbxj5.fsf@redhat.com> References: <87imi8bunn.fsf@redhat.com> <87ftdbbxj5.fsf@redhat.com> Message-ID: Hi Roland, Sorry, I'm just now seeing this. I was using the following test to diagnose C2 loop predication:

public class Main {
    static final int SIZE = 1_000_000;

    final long bound_long;
    final int bound_int;

    public Main() {
        this.bound_long = SIZE;
        this.bound_int = SIZE;
    }

    public static void main(String[] args) {
        System.out.println(ProcessHandle.current().pid());
        run();
    }

    public static void run() {
        Main m = new Main();
        System.out.println("=========================================================================");
        for (int i = 0; i < 20_000; i++) {
            m.invoke();
        }
    }

    public int invoke() {
        int sum = 0;
        var bound = this.bound_int;
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }
}

Together with explicitly disabling the inlining of the 'invoke' method. Switching between `var bound = this.bound_int` and `var bound = this.bound_long` you should see that the bound check in the `if` is being eliminated in the int case, but not in the long case. After some debugging the switch point between the 2 cases seems to be in 'IdealLoopTree::iteration_split_impl' when initializing `should_rce` [1], but ultimately this call seems to bottom out in 'PhaseIdealLoop::is_scaled_iv_plus_offset' in loopTransform.cpp, which is checking the nodes involved for integer opcodes explicitly [2]. 
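For contrast, here is a sketch of the long-typed variant of the loop above together with the kind of narrowing workaround this thread alludes to (class and method names are invented for illustration; the cast is only valid when the bound is known to fit in an int):

```java
public class LongBoundDemo {
    static final int SIZE = 1_000_000;

    // Hypothetical field mirroring the bound_long field in the test above.
    final long boundLong = SIZE;

    // Long-typed bound: with range-check elimination matching only int
    // opcodes, the check inside the loop would not be optimized away.
    int invokeLong() {
        int sum = 0;
        long bound = this.boundLong;
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }

    // Workaround sketch: narrow the bound to int up front (safe only when
    // the bound fits), so the in-loop comparison is an int comparison again.
    int invokeNarrowed() {
        int sum = 0;
        int bound = (int) Math.min(this.boundLong, Integer.MAX_VALUE);
        for (int i = 0; i < SIZE; i++) {
            if (i >= bound) throw new IllegalStateException();
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        LongBoundDemo d = new LongBoundDemo();
        // Both variants compute the same sum; only the generated code differs.
        System.out.println(d.invokeLong() == d.invokeNarrowed()); // prints true
    }
}
```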
In the Panama code we are currently working around this by assuming the operands of the calculation fit into `int` in some cases, and then explicitly casting them to ints, which then enables the optimization [3]. But, as John says, this is not ideal. HTH, Jorn [1] : https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L3308 [2] : https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L2402 [3] : https://github.com/openjdk/panama-foreign/blob/c8fc03351277f318f86d333f7fff1338fe17a247/src/java.base/share/classes/jdk/internal/access/foreign/MemoryAddressProxy.java#L50-L94 On 10/04/2020 09:38, Roland Westrelin wrote: > Once the long loop is transformed to an int counted loop what are the > optimizations that need to trigger reliably? In particular do we need > range check elimination? Can you or someone from the panama project share > code samples that I can use to verify the long loop optimizes well? > > Roland. > From aleksei.voitylov at bell-sw.com Thu Apr 23 13:12:16 2020 From: aleksei.voitylov at bell-sw.com (Aleksei Voitylov) Date: Thu, 23 Apr 2020 16:12:16 +0300 Subject: [aarch64-port-dev ] RFR(XS): 8242905: AArch64: Client build failed In-Reply-To: References: Message-ID: <7b98219a-e45b-f0e8-9008-0c7a712c06f4@bell-sw.com> Yes, in the embedded space. On 23/04/2020 11:44, Andrew Haley wrote: > On 4/23/20 3:39 AM, Yang Zhang wrote: >> Could you please help to review this patch? >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8242905 >> Webrev: http://cr.openjdk.java.net/~yzhang/8242905/webrev.00/ > Ok, thanks. > > Does anyone in the real world use AArch64 client builds? I'm wondering if > we'd be better off without that option. 
> From dean.long at oracle.com Thu Apr 23 23:48:06 2020 From: dean.long at oracle.com (Dean Long) Date: Thu, 23 Apr 2020 16:48:06 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class Message-ID: https://bugs.openjdk.java.net/browse/JDK-8219607 http://cr.openjdk.java.net/~dlong/8219607/webrev/ This change adds support for the Class.isHidden() intrinsic to Graal. thanks, dl From vladimir.kozlov at oracle.com Fri Apr 24 00:57:04 2020 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 23 Apr 2020 17:57:04 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class In-Reply-To: References: Message-ID: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> Hi Dean, Changes looks good. I see that compiler/graalunit/HotspotTest.java failed in tier1 (and tier3-graal). I assume it is 8243381. Thanks, Vladimir K On 4/23/20 4:48 PM, Dean Long wrote: > https://bugs.openjdk.java.net/browse/JDK-8219607 > http://cr.openjdk.java.net/~dlong/8219607/webrev/ > > This change adds support for the Class.isHidden() intrinsic to Graal. > > thanks, > > dl From dean.long at oracle.com Fri Apr 24 02:20:44 2020 From: dean.long at oracle.com (Dean Long) Date: Thu, 23 Apr 2020 19:20:44 -0700 Subject: RFR(S) 8219607: Add support in Graal and AOT for hidden class In-Reply-To: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> References: <36d29a0e-e396-da7a-4945-ad9afb709b14@oracle.com> Message-ID: On 4/23/20 5:57 PM, Vladimir Kozlov wrote: > Hi Dean, > > Changes looks good. Thanks Vladimir. > > I see that compiler/graalunit/HotspotTest.java failed in tier1 (and > tier3-graal). I assume it is 8243381. Yes, I accidentally removed that sub-test from the problem list during testing, so it added some "noise" to the test results. 
dl > > Thanks, > Vladimir K > > On 4/23/20 4:48 PM, Dean Long wrote: >> https://bugs.openjdk.java.net/browse/JDK-8219607 >> http://cr.openjdk.java.net/~dlong/8219607/webrev/ >> >> This change adds support for the Class.isHidden() intrinsic to Graal. >> >> thanks, >> >> dl From HORIE at jp.ibm.com Fri Apr 24 05:40:00 2020 From: HORIE at jp.ibm.com (Michihiro Horie) Date: Fri, 24 Apr 2020 14:40:00 +0900 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> References: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com>, Message-ID: Hi Martin, Lutz, I have not seen big differences in SPECjbb2015 scores both on P8 and P9. Best regards, Michihiro ----- Original message ----- From: "Schmidt, Lutz" To: Michihiro Horie , "Doerr, Martin" Cc: "ppc-aix-port-dev at openjdk.java.net" , "hotspot-compiler-dev at openjdk.java.net" Subject: [EXTERNAL] Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Thu, Apr 23, 2020 3:01 AM Hi Martin, your change looks good to me. I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. 
Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. 
Best regards, Martin From Yang.Zhang at arm.com Fri Apr 24 06:01:28 2020 From: Yang.Zhang at arm.com (Yang Zhang) Date: Fri, 24 Apr 2020 06:01:28 +0000 Subject: [aarch64-port-dev ] RFR(S): 8243240: AArch64: Add support for MulVB Message-ID: Hi, Could you please help to review this patch? JBS: https://bugs.openjdk.java.net/browse/JDK-8243240 Webrev: http://cr.openjdk.java.net/~yzhang/8243240/webrev.00/ In this patch, the missing MulVB support for AArch64 is added. Testing: tier1 Test case: public static void mulvb(byte[] a, byte[] b, byte[] c) { for (int i = 0; i < a.length; i++) { c[i] = (byte)(a[i] * b[i]); } } Assembly generated by C2: 0x0000ffffacafdbac: ldr q17, [x15, #16] 0x0000ffffacafdbb0: ldr q16, [x14, #16] 0x0000ffffacafdbb4: mul v16.16b, v16.16b, v17.16b 0x0000ffffacafdbbc: str q16, [x11, #16] Performance: JMH test case is attached in JBS. Before: Benchmark (size) Mode Cnt Score Error Units TestVect.testVectMulVB 1024 avgt 5 0.952 0.005 us/op After: Benchmark (size) Mode Cnt Score Error Units TestVect.testVectMulVB 1024 avgt 5 0.110 0.001 us/op Regards Yang From rwestrel at redhat.com Fri Apr 24 08:14:15 2020 From: rwestrel at redhat.com (Roland Westrelin) Date: Fri, 24 Apr 2020 10:14:15 +0200 Subject: RFR(S): 8239569: PublicMethodsTest.java failed due to NPE in java.base/java.nio.file.FileSystems.getFileSystem(FileSystems.java:230) Message-ID: <87zhb18fmw.fsf@redhat.com> https://bugs.openjdk.java.net/browse/JDK-8239569 http://cr.openjdk.java.net/~roland/8239569/webrev.00/ The bug occurs when reading from a constant array after a loop is fully unrolled. Reading an element in the loop has the shape: (LoadB (AddP base (AddP base base index) ..) ..) A load from the same element is also out of the loop: (LoadUB (AddP base (AddP base base index) ..) ..) The AddPs are shared between the LoadB in the loop and the LoadUB out of the loop. After full unrolling the load out of the loop becomes: (LoadUB (Phi (AddP base (AddP base base index1) ..) 
(AddP base (AddP base base index2) ..) ..) ..) The AddPs are then pushed through the Phi and that's where the bug is. - index1 is 0 and so the type of (AddP base base index1) is a constant array pointer with no offset. - that type is met with the type of the base of the second AddP instead of the type of the address of the second AddP. The result is a constant array pointer. The resulting Phi for the address input is created as a Phi of type constant array with no offset instead of constant array with offset. As a result, the Phi constant folds and the offset is lost. Roland. From richard.reingruber at sap.com Fri Apr 24 08:18:31 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 08:18:31 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Patricio, Vladimir, and Serguei, now that direct handshakes are available, I've updated the patch to make use of them. In addition I have done some clean-up changes I missed in the first webrev. Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake into the vm operation VM_SetFramePop [1] Kindly review again: Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a direct handshake: JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 Testing: * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 Thanks, Richard. [1] An assertion in Handshake::execute_direct() fails, if called by the VMThread, because it is not a JavaThread. 
-----Original Message----- From: hotspot-dev On Behalf Of Reingruber, Richard Sent: Freitag, 14. Februar 2020 19:47 To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Patricio, > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > > > Alternatively I think you could do something similar to what we do in > > > Deoptimization::deoptimize_all_marked(): > > > > > > EnterInterpOnlyModeClosure hs; > > > if (SafepointSynchronize::is_at_safepoint()) { > > > hs.do_thread(state->get_thread()); > > > } else { > > > Handshake::execute(&hs, state->get_thread()); > > > } > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > HandshakeClosure() constructor) > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that. Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. Thanks for taking care of this and creating the RFE. > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > always called in a nested operation or just sometimes. 
> > > > At least one execution path without vm operation exists: > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > JvmtiEventControllerPrivate::recompute_enabled() : void > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > > encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) Well, I think that's enough encouragement :) I'll wait for 8239084 and try then again. (no urgency and all) Thanks, Richard. -----Original Message----- From: Patricio Chilano Sent: Freitag, 14. Februar 2020 15:54 To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, On 2/14/20 9:58 AM, Reingruber, Richard wrote: > Hi Patricio, > > thanks for having a look. > > > I'm only commenting on the handshake changes. > > I see that operation VM_EnterInterpOnlyMode can be called inside > > operation VM_SetFramePop which also allows nested operations. 
Here is a > > comment in VM_SetFramePop definition: > > > > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is > > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. > > > > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we > > could have a handshake inside a safepoint operation. The issue I see > > there is that at the end of the handshake the polling page of the target > > thread could be disarmed. So if the target thread happens to be in a > > blocked state just transiently and wakes up then it will not stop for > > the ongoing safepoint. Maybe I can file an RFE to assert that the > > polling page is armed at the beginning of disarm_safepoint(). > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > Alternatively I think you could do something similar to what we do in > > Deoptimization::deoptimize_all_marked(): > > > > EnterInterpOnlyModeClosure hs; > > if (SafepointSynchronize::is_at_safepoint()) { > > hs.do_thread(state->get_thread()); > > } else { > > Handshake::execute(&hs, state->get_thread()); > > } > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > HandshakeClosure() constructor) > > Maybe this could be used also in the Handshake::execute() methods as general solution? Right, we could also do that. Avoiding to clear the polling page in HandshakeState::clear_handshake() should be enough to fix this issue and execute a handshake inside a safepoint, but adding that "if" statement in Handshake::execute() sounds good to avoid all the extra code that we go through when executing a handshake. I filed 8239084 to make that change. 
> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > always called in a nested operation or just sometimes. > > At least one execution path without vm operation exists: > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > JvmtiEventControllerPrivate::recompute_enabled() : void > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > encouraged to do it with a handshake :) Ah! I think you can still do it with a handshake with the Deoptimization::deoptimize_all_marked() like solution. I can change the if-else statement with just the Handshake::execute() call in 8239084. But up to you. : ) Thanks, Patricio > Thanks again, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Donnerstag, 13. Februar 2020 18:47 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I'm only commenting on the handshake changes. > I see that operation VM_EnterInterpOnlyMode can be called inside > operation VM_SetFramePop which also allows nested operations. 
Here is a > comment in VM_SetFramePop definition: > > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. > > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we > could have a handshake inside a safepoint operation. The issue I see > there is that at the end of the handshake the polling page of the target > thread could be disarmed. So if the target thread happens to be in a > blocked state just transiently and wakes up then it will not stop for > the ongoing safepoint. Maybe I can file an RFE to assert that the > polling page is armed at the beginning of disarm_safepoint(). > > I think one option could be to remove > SafepointMechanism::disarm_if_needed() in > HandshakeState::clear_handshake() and let each JavaThread disarm itself > for the handshake case. > > Alternatively I think you could do something similar to what we do in > Deoptimization::deoptimize_all_marked(): > > EnterInterpOnlyModeClosure hs; > if (SafepointSynchronize::is_at_safepoint()) { >     hs.do_thread(state->get_thread()); > } else { >     Handshake::execute(&hs, state->get_thread()); > } > (you could pass "EnterInterpOnlyModeClosure" directly to the > HandshakeClosure() constructor) > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > always called in a nested operation or just sometimes. > > Thanks, > Patricio > > On 2/12/20 7:23 AM, Reingruber, Richard wrote: >> // Repost including hotspot runtime and gc lists. >> // Dean Long suggested to do so, because the enhancement replaces a vm operation >> // with a handshake. 
>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >> >> Hi, >> >> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >> >> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >> >> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> Thanks, Richard. >> >> See also my question if anyone knows a reason for making the compiled methods not_entrant: >> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html From tobias.hartmann at oracle.com Fri Apr 24 08:24:08 2020 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 24 Apr 2020 10:24:08 +0200 Subject: RFR(S): 8239569: PublicMethodsTest.java failed due to NPE in java.base/java.nio.file.FileSystems.getFileSystem(FileSystems.java:230) In-Reply-To: <87zhb18fmw.fsf@redhat.com> References: <87zhb18fmw.fsf@redhat.com> Message-ID: Hi Roland, Ouh, good catch! Looks good. Best regards, Tobias On 24.04.20 10:14, Roland Westrelin wrote: > > https://bugs.openjdk.java.net/browse/JDK-8239569 > http://cr.openjdk.java.net/~roland/8239569/webrev.00/ > > The bug occurs when reading from a constant array after a loop is fully > unrolled. Reading an element in the loop has the shape: > (LoadB (AddP base (AddP base base index) ..) ..) > A load from the same element is also out of the loop: > (LoadUB (AddP base (AddP base base index) ..) ..) > The AddPs are shared between the LoadB in the loop and the LoadUB out of > the loop. 
> > After full unrolling the load out of the loop becomes: > (LoadUB (Phi (AddP base (AddP base base index1) ..) (AddP base (AddP base base index2) ..) ..) ..) > > The AddPs are then pushed through the Phi and that's where the bug > is. > > - index1 is 0 and so the type of (AddP base base index1) is a constant > array pointer with no offset. > > - that type is met with the type of the base of the second AddP instead > of the type of the address of the second AddP. The result is a > constant array pointer. > > The resulting Phi for the address input is created as a Phi of type > constant array with no offset instead of constant array with offset. As > a result, the Phi constant folds and the offset is lost. > > Roland. > From xxinliu at amazon.com Fri Apr 24 08:33:40 2020 From: xxinliu at amazon.com (Liu, Xin) Date: Fri, 24 Apr 2020 08:33:40 +0000 Subject: RFR[M]: 8151779: Some intrinsic flags could be replaced with one general flag In-Reply-To: <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> References: <19CD3956-4DC6-4908-8626-27D48A9AB4A4@amazon.com> <0EDAAC88-E5D9-424F-A19E-5E20C689C2F3@amazon.com> Message-ID: <801D878C-CAE5-4EBE-8AFE-4E35346CD5BD@amazon.com> Hi, May I get a review of this new revision? JBS: https://bugs.openjdk.java.net/browse/JDK-8151779 webrev: https://cr.openjdk.java.net/~xliu/8151779/01/webrev/ I introduce a new option -XX:ControlIntrinsic=+_id1,-id2... The id is vmIntrinsics::ID. As per prior discussion, ControlIntrinsic is expected to replace DisableIntrinsic. I keep DisableIntrinsic in this revision. DisableIntrinsic prevails when an intrinsic appears on both lists. I use an array of tribool to mark whether each intrinsic is enabled or not. In this way, hotspot can avoid expensive string querying among intrinsics. A Tribool value has 3 states: Default, true, or false. If developers don't explicitly set an intrinsic, it will be available unless it is disabled by the corresponding UseXXXIntrinsics.
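The tribool scheme described above can be sketched as follows. This is an illustrative model only; the names `Tribool` and `intrinsic_available` are made up here and are not the actual HotSpot implementation:

```cpp
#include <cassert>

// Three-state flag: Default (not explicitly controlled), or an explicit
// true/false set via -XX:ControlIntrinsic=+.../-...
// Hypothetical sketch, not HotSpot code.
class Tribool {
  enum State { DEFAULT, SET_TRUE, SET_FALSE };
  State _state;
public:
  Tribool() : _state(DEFAULT) {}
  void set(bool v) { _state = v ? SET_TRUE : SET_FALSE; }
  bool is_default() const { return _state == DEFAULT; }
  bool value() const { assert(_state != DEFAULT); return _state == SET_TRUE; }
};

// If the developer did not control the intrinsic explicitly, fall back
// to the coarse-grained UseXXXIntrinsics-style group switch.
inline bool intrinsic_available(const Tribool& ctrl, bool group_switch) {
  return ctrl.is_default() ? group_switch : ctrl.value();
}
```

With this shape an explicit setting always wins over the group switch, which matches the fine-over-coarse priority described in the mail, and a flat array of such values indexed by vmIntrinsics::ID avoids string lookups at query time.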
A traditional Boolean value can't express both fine- and coarse-grained control, i.e. we only go through those auxiliary UseXXXIntrinsics options if developers don't control a specific intrinsic. I also add support for ControlIntrinsic to CompilerDirectives. Test: I reuse the jtreg tests of DisableIntrinsic and add more @run annotations to verify ControlIntrinsic. I passed the hotspot:tier1 tests and all tests on x86_64/linux. Thanks, --lx On 4/17/20, 7:22 PM, "hotspot-compiler-dev on behalf of Liu, Xin" wrote: Hi, Vladimir, Thanks for the clarification. Oh, yes, it's theoretically possible, but it's tedious. I was wrong on that point. I think I got your point. ControlIntrinsics will make developers' lives easier. I will implement it. Thanks, --lx On 4/17/20, 6:46 PM, "Vladimir Kozlov" wrote: I withdraw my suggestion about EnableIntrinsic from JDK-8151779 because ControlIntrinsics will provide such functionality and will replace the existing DisableIntrinsic. Note, we can start deprecating the Use*Intrinsic flags (and DisableIntrinsic) later in other changes. You don't need to do everything at once. What we need now is a mechanism to replace them. On 4/16/20 11:58 PM, Liu, Xin wrote: > Hi, Corey and Vladimir, > > I recently went through vmSymbols.hpp/cpp. I think I understand your comments. > Each UseXXXIntrinsics does control a bunch of intrinsics (plural). Thanks for the hint. > > Even though I feel I know the intrinsics mechanism of hotspot better, I still need a clarification of JDK-8151779. > > There're 321 intrinsics (https://chriswhocodes.com/hotspot_intrinsics_jdk15.html). > If there's no option at all, they are all available for compilers. That makes sense because intrinsics are always beneficial. > But there're reasons we need to disable a subset of them.
A specific architecture may miss efficient instructions or fixed functions. Or simply because an intrinsic is buggy. > > Currently, the JDK provides developers 2 ways to control intrinsics. > 1. Some diagnostic options. E.g. InlineMathNatives, UseBase64Intrinsics. > Developers can use one option to disable a group of intrinsics. That is to say, it's a coarse-grained approach. > > 2. DisableIntrinsic="a,b,c" > By passing a string list of vmIntrinsics::IDs, it's capable of disabling any specified intrinsic. > > But even putting the above 2 approaches together, we still can't precisely control any intrinsic. Yes, you are right. We seem to be trying to put these 2 different ways into one flag, which may be a mistake. -XX:ControlIntrinsic=-_updateBytesCRC32C,-_updateDirectByteBufferCRC32C is similar to -XX:-UseCRC32CIntrinsics but it requires more detailed knowledge about intrinsic ids. Maybe we can have a 2nd flag, as you suggested -XX:UseIntrinsics=-AESCTR,+CRC32C, for such cases. > If we want to enable an intrinsic which is under control of InlineMathNatives but keep the others disabled, it's impossible now. [please correct me if I am wrong here]. You can disable all the others of the 321 intrinsics with the DisableIntrinsic flag, which is very tedious, I agree. > I think that is the motivation JDK-8151779 tried to address. The idea is that instead of the flags we use to control particular intrinsics depending on the CPU, we will use vmIntrinsics::IDs or other tables as you showed in your changes. It will require changes in the vm_version_ codes. > > If we provide a new option EnableIntrinsic and give it least priority, then we can precisely control any intrinsic. > Quote Vladimir Kozlov "DisableIntrinsic list prevails if an intrinsic is specified on both EnableIntrinsic and DisableIntrinsic." > > "-XX:ControlIntrinsic=+_dabs,-_fabs,-_getClass" looks more elegant, but it will confuse developers with DisableIntrinsic. > If we decide to deprecate DisableIntrinsic, I think ControlIntrinsic may be a better option.
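The comma-separated, +/- prefixed list syntax being debated above could be parsed along these lines. `parse_control_list` and its return type are invented for illustration; this is not the proposed HotSpot parsing code:

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse e.g. "+_dabs,-_fabs,_getClass" into name -> enabled.
// A missing prefix means enable; the last occurrence of a name wins.
// Illustrative sketch only, not HotSpot option parsing.
std::map<std::string, bool> parse_control_list(const std::string& list) {
  std::map<std::string, bool> result;
  std::istringstream in(list);
  std::string item;
  while (std::getline(in, item, ',')) {  // split on commas
    if (item.empty()) continue;
    bool enable = true;
    if (item[0] == '+' || item[0] == '-') {
      enable = (item[0] == '+');
      item.erase(0, 1);  // strip the prefix
    }
    result[item] = enable;
  }
  return result;
}
```

A rule like "DisableIntrinsic prevails" would then just be a second pass that forces entries named on the DisableIntrinsic list back to false after this parse.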
Now I prefer to provide EnableIntrinsic for simplicity and symmetry. I prefer to have one ControlIntrinsic flag and deprecate DisableIntrinsic. I don't think it is confusing. Thanks, Vladimir > What do you think? > > Thanks, > --lx > > > On 4/13/20, 1:47 PM, "hotspot-compiler-dev on behalf of Corey Ashford" wrote: > > > > On 4/13/20 10:33 AM, Liu, Xin wrote: > > Hi, compiler developers, > > I attempt to refactor UseXXXIntrinsics for JDK-8151779. I think we still need to keep the UseXXXIntrinsics options because many applications may be using them. > > > > My change provides 2 new features: > > 1) a shorthand to enable/disable intrinsics. > > A comma-separated string. Each one is an intrinsic. An optional trailing symbol '+' or '-' denotes enabling or disabling. > > If the trailing symbol is missing, it means enable. > > E.g. -XX:UseIntrinsics="AESCTR-,CRC32C+,CRC32-,MathExact" > > This jvm option will expand to multiple options -XX:-UseAESCTRIntrinsics, -XX:+UseCRC32CIntrinsics, -XX:-UseCRC32Intrinsics, -XX:UseMathExactIntrinsics > > > > 2) provide a set of macros to declare intrinsic options > > Developers declare once in intrinsics.hpp and the macros will take care of all other places. > > Here is an example: https://cr.openjdk.java.net/~xliu/8151779/00/webrev/src/hotspot/share/compiler/intrinsics.hpp.html > > Ioi Lam is overhauling jvm options. I am thinking about how to be consistent with his proposal. > > > > Great idea, though to be consistent with the original syntax, I think > the +/- should be in front of the name: > > -XX:UseIntrinsics=-AESCTR,+CRC32C,... > > > > I handle UseIntrinsics before VM_Version::initialize. It means that platform-specific initialization still has a chance to correct those options. > > If we do that after VM_Version::initialize, some intrinsics may cause a JVM crash. E.g.
+UseBase64Intrinsics on x86_64 Linux. > > Even though this behavior is the same as -XX:+UseXXXIntrinsics, from the user's perspective, it's not straightforward when the JVM implicitly overrides what users specify. It's a dilemma here: a stable JVM or fidelity to the cmdline. What do you think? > > > > Another problem is the naming convention. Almost all intrinsic options use UseXXXIntrinsics. One exception is UseVectorizedMismatchIntrinsic. > > Personally, I think it should be "UseXXXIntrinsic" because one option is for one intrinsic, right? Is it possible to change this naming convention? > > Some (many?) intrinsic options turn on more than one .ad instruct intrinsic, or library intrinsics at the same time. I think that's why > the plural is there. Also, consistently adding the plural allows you to > add more capabilities to a flag that initially only had one intrinsic > without changing the plurality (and thus backward compatibility). > > Regards, > > - Corey > > From aph at redhat.com Fri Apr 24 09:31:59 2020 From: aph at redhat.com (Andrew Haley) Date: Fri, 24 Apr 2020 10:31:59 +0100 Subject: [aarch64-port-dev ] RFR(S): 8243240: AArch64: Add support for MulVB In-Reply-To: References: Message-ID: <893f6983-7e3c-adc0-ecf4-48e57312c456@redhat.com> On 4/24/20 7:01 AM, Yang Zhang wrote: > JBS: https://bugs.openjdk.java.net/browse/JDK-8243240 > Webrev: http://cr.openjdk.java.net/~yzhang/8243240/webrev.00/ OK, thanks. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From suenaga at oss.nttdata.com Fri Apr 24 11:34:22 2020 From: suenaga at oss.nttdata.com (Yasumasa Suenaga) Date: Fri, 24 Apr 2020 20:34:22 +0900 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Richard, I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. Does it help you? I think it lets you remove the workaround. (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Thanks, Yasumasa On 2020/04/24 17:18, Reingruber, Richard wrote: > Hi Patricio, Vladimir, and Serguei, > > now that direct handshakes are available, I've updated the patch to make use of them. > > In addition I have done some clean-up changes I missed in the first webrev. > > Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake > into the vm operation VM_SetFramePop [1] > > Kindly review again: > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ > Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ > > I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a > direct handshake: > > JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 > > Testing: > > * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. > > * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 > > Thanks, > Richard. > > [1] An assertion in Handshake::execute_direct() fails if called by the VMThread, because it is not a JavaThread. > > -----Original Message----- > From: hotspot-dev On Behalf Of Reingruber, Richard > Sent: Friday, 14
February 2020 19:47 > To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Patricio, > > > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? > > > > > > > Alternatively I think you could do something similar to what we do in > > > > Deoptimization::deoptimize_all_marked(): > > > > > > > > EnterInterpOnlyModeClosure hs; > > > > if (SafepointSynchronize::is_at_safepoint()) { > > > > hs.do_thread(state->get_thread()); > > > > } else { > > > > Handshake::execute(&hs, state->get_thread()); > > > > } > > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > > HandshakeClosure() constructor) > > > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > > Right, we could also do that. Avoiding to clear the polling page in > > HandshakeState::clear_handshake() should be enough to fix this issue and > > execute a handshake inside a safepoint, but adding that "if" statement > > in Handshake::execute() sounds good to avoid all the extra code that we > > go through when executing a handshake. I filed 8239084 to make that change. > > Thanks for taking care of this and creating the RFE. > > > > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > > always called in a nested operation or just sometimes.
> > > > > > At least one execution path without vm operation exists: > > > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > > JvmtiEventControllerPrivate::recompute_enabled() : void > > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > > handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further > > > encouraged to do it with a handshake :) > > Ah! I think you can still do it with a handshake with the > > Deoptimization::deoptimize_all_marked() like solution. I can change the > > if-else statement with just the Handshake::execute() call in 8239084. > > But up to you. : ) > > Well, I think that's enough encouragement :) > I'll wait for 8239084 and try then again. > (no urgency and all) > > Thanks, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Friday, 14 February 2020 15:54 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > On 2/14/20 9:58 AM, Reingruber, Richard wrote: >> Hi Patricio, >> >> thanks for having a look. >> >> > I'm only commenting on the handshake changes.
>> > I see that operation VM_EnterInterpOnlyMode can be called inside >> > operation VM_SetFramePop which also allows nested operations. Here is a >> > comment in VM_SetFramePop definition: >> > >> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> > >> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> > could have a handshake inside a safepoint operation. The issue I see >> > there is that at the end of the handshake the polling page of the target >> > thread could be disarmed. So if the target thread happens to be in a >> > blocked state just transiently and wakes up then it will not stop for >> > the ongoing safepoint. Maybe I can file an RFE to assert that the >> > polling page is armed at the beginning of disarm_safepoint(). >> >> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? >> >> > Alternatively I think you could do something similar to what we do in >> > Deoptimization::deoptimize_all_marked(): >> > >> > EnterInterpOnlyModeClosure hs; >> > if (SafepointSynchronize::is_at_safepoint()) { >> > hs.do_thread(state->get_thread()); >> > } else { >> > Handshake::execute(&hs, state->get_thread()); >> > } >> > (you could pass "EnterInterpOnlyModeClosure" directly to the >> > HandshakeClosure() constructor) >> >> Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that.
Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. > >> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> > always called in a nested operation or just sometimes. >> >> At least one execution path without vm operation exists: >> >> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> JvmtiEventControllerPrivate::recompute_enabled() : void >> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> >> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further >> encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) > > Thanks, > Patricio >> Thanks again, >> Richard. >> >> -----Original Message----- >> From: Patricio Chilano >> Sent: Thursday, 13
February 2020 18:47 >> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Richard, >> >> I'm only commenting on the handshake changes. >> I see that operation VM_EnterInterpOnlyMode can be called inside >> operation VM_SetFramePop which also allows nested operations. Here is a >> comment in VM_SetFramePop definition: >> >> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> >> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> could have a handshake inside a safepoint operation. The issue I see >> there is that at the end of the handshake the polling page of the target >> thread could be disarmed. So if the target thread happens to be in a >> blocked state just transiently and wakes up then it will not stop for >> the ongoing safepoint. Maybe I can file an RFE to assert that the >> polling page is armed at the beginning of disarm_safepoint(). >> >> I think one option could be to remove >> SafepointMechanism::disarm_if_needed() in >> HandshakeState::clear_handshake() and let each JavaThread disarm itself >> for the handshake case. >> >> Alternatively I think you could do something similar to what we do in >> Deoptimization::deoptimize_all_marked(): >> >> EnterInterpOnlyModeClosure hs; >> if (SafepointSynchronize::is_at_safepoint()) { >> hs.do_thread(state->get_thread()); >> } else { >> Handshake::execute(&hs, state->get_thread()); >> } >> (you could pass "EnterInterpOnlyModeClosure"
directly to the >> HandshakeClosure() constructor) >> >> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> always called in a nested operation or just sometimes. >> >> Thanks, >> Patricio >> >> On 2/12/20 7:23 AM, Reingruber, Richard wrote: >>> // Repost including hotspot runtime and gc lists. >>> // Dean Long suggested to do so, because the enhancement replaces a vm operation >>> // with a handshake. >>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>> >>> Hi, >>> >>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>> >>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>> >>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>> >>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>> >>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>> >>> Thanks, Richard. >>> >>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html > From christian.hagedorn at oracle.com Fri Apr 24 14:37:39 2020 From: christian.hagedorn at oracle.com (Christian Hagedorn) Date: Fri, 24 Apr 2020 16:37:39 +0200 Subject: [15] RFR(S): 8230402: Allocation of compile task fails with assert: "Leaking compilation tasks?" Message-ID: <27dd5ff1-9f91-d8c1-ecee-a77e6ecdb558@oracle.com> Hi Please review the following patch: https://bugs.openjdk.java.net/browse/JDK-8230402 http://cr.openjdk.java.net/~chagedorn/8230402/webrev.00/ This assert was hit very intermittently in an internal test until jdk-14+19.
The test was changed afterwards and the assert was not observed to fail anymore. However, the problem of having too many tasks in the queue is still present (i.e. the compile queue is growing too quickly and the compiler(s) are too slow to catch up). This assert can easily be hit by creating many class loaders which load many methods which are immediately compiled by setting a low compilation threshold as used in runA() in the testcase. Therefore, I suggest tackling this problem with a general solution to drop half of the compilation tasks in CompileQueue::add() when a queue size of 10000 is reached and none of the other conditions of this assert hold (no Whitebox or JVMCI compiler). For tiered compilation, the tasks with the lowest method weight() or which are unloaded are removed from the queue (without altering the order of the remaining tasks in the queue). Without tiered compilation (i.e. SimpleCompPolicy), the tasks from the tail of the queue are removed. An additional verification in debug builds should ensure that there are no duplicated tasks. I assume that part of the reason for the original assert was to detect such duplicates. Thank you! Best regards, Christian From richard.reingruber at sap.com Fri Apr 24 14:44:29 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 14:44:29 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: Hi Yasumasa, > I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. > Does it help you? I think it lets you remove the workaround. I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1].
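Christian's tiered-compilation pruning in JDK-8230402 above (remove the lowest-weight tasks while keeping the surviving queue order intact) could be sketched like this. `Task`, its `weight` field, and `prune_low_weight_half` are simplified stand-ins for illustration, not the actual CompileTask/CompileQueue code:

```cpp
#include <algorithm>
#include <vector>

struct Task { int id; int weight; };  // stand-in for a CompileTask

// Keep the higher-weight half of the queue, preserving the relative
// order of the survivors. Illustrative sketch only.
void prune_low_weight_half(std::vector<Task>& queue) {
  const size_t keep = queue.size() / 2;
  std::vector<size_t> idx(queue.size());
  for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
  // Rank positions by weight, highest first; a stable sort keeps
  // earlier tasks ahead of equal-weight later ones.
  std::stable_sort(idx.begin(), idx.end(),
                   [&](size_t a, size_t b) { return queue[a].weight > queue[b].weight; });
  std::vector<bool> survive(queue.size(), false);
  for (size_t r = 0; r < keep; r++) survive[idx[r]] = true;
  std::vector<Task> kept;
  for (size_t i = 0; i < queue.size(); i++) {
    if (survive[i]) kept.push_back(queue[i]);  // original order preserved
  }
  queue.swap(kept);
}
```

Selecting survivors by rank and then compacting in one ordered pass is what keeps the remaining tasks in their original queue order, as the proposal requires.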
So you would have to change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Also my first impression was that it won't be that easy from a synchronization point of view to replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear to me how this has to be handled. So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Would be interesting to see how you handled the issues above :) Thanks, Richard. [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 -----Original Message----- From: Yasumasa Suenaga Sent: Friday, 24 April 2020 13:34 To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, I will send a review request to replace VM_SetFramePop with a handshake early next week in JDK-8242427. Does it help you? I think it lets you remove the workaround. (The patch is available, but I want to see the result of PIT this weekend to check whether JDK-8242425 works fine.) Thanks, Yasumasa On 2020/04/24 17:18, Reingruber, Richard wrote: > Hi Patricio, Vladimir, and Serguei, > > now that direct handshakes are available, I've updated the patch to make use of them.
> > In addition I have done some clean-up changes I missed in the first webrev. > > Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake > into the vm operation VM_SetFramePop [1] > > Kindly review again: > > Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ > Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ > > I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a > direct handshake: > > JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 > > Testing: > > * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. > > * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 > > Thanks, > Richard. > > [1] An assertion in Handshake::execute_direct() fails if called by the VMThread, because it is not a JavaThread. > > -----Original Message----- > From: hotspot-dev On Behalf Of Reingruber, Richard > Sent: Friday, 14 February 2020 19:47 > To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Patricio, > > > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a > > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the > > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
> > > > > > > Alternatively I think you could do something similar to what we do in > > > > Deoptimization::deoptimize_all_marked(): > > > > > > > > EnterInterpOnlyModeClosure hs; > > > > if (SafepointSynchronize::is_at_safepoint()) { > > > > hs.do_thread(state->get_thread()); > > > > } else { > > > > Handshake::execute(&hs, state->get_thread()); > > > > } > > > > (you could pass "EnterInterpOnlyModeClosure" directly to the > > > > HandshakeClosure() constructor) > > > > > > Maybe this could be used also in the Handshake::execute() methods as general solution? > > Right, we could also do that. Avoiding to clear the polling page in > > HandshakeState::clear_handshake() should be enough to fix this issue and > > execute a handshake inside a safepoint, but adding that "if" statement > > in Handshake::execute() sounds good to avoid all the extra code that we > > go through when executing a handshake. I filed 8239084 to make that change. > > Thanks for taking care of this and creating the RFE. > > > > > > > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is > > > > always called in a nested operation or just sometimes. > > > > > > At least one execution path without vm operation exists: > > > > > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void > > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong > > > JvmtiEventControllerPrivate::recompute_enabled() : void > > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) > > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void > > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError > > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError > > > > > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a > > > handshake, but to avoid making the compiled methods on stack not_entrant....
unless I'm further > > > encouraged to do it with a handshake :) > > Ah! I think you can still do it with a handshake with the > > Deoptimization::deoptimize_all_marked() like solution. I can change the > > if-else statement with just the Handshake::execute() call in 8239084. > > But up to you. : ) > > Well, I think that's enough encouragement :) > I'll wait for 8239084 and try then again. > (no urgency and all) > > Thanks, > Richard. > > -----Original Message----- > From: Patricio Chilano > Sent: Friday, 14 February 2020 15:54 > To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > On 2/14/20 9:58 AM, Reingruber, Richard wrote: >> Hi Patricio, >> >> thanks for having a look. >> >> > I'm only commenting on the handshake changes. >> > I see that operation VM_EnterInterpOnlyMode can be called inside >> > operation VM_SetFramePop which also allows nested operations. Here is a >> > comment in VM_SetFramePop definition: >> > >> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> > >> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> > could have a handshake inside a safepoint operation. The issue I see >> > there is that at the end of the handshake the polling page of the target >> > thread could be disarmed. So if the target thread happens to be in a >> > blocked state just transiently and wakes up then it will not stop for >> > the ongoing safepoint. Maybe I can file an RFE to assert that the >> > polling page is armed at the beginning of disarm_safepoint().
>> >> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? >> >> > Alternatively I think you could do something similar to what we do in >> > Deoptimization::deoptimize_all_marked(): >> > >> > EnterInterpOnlyModeClosure hs; >> > if (SafepointSynchronize::is_at_safepoint()) { >> > hs.do_thread(state->get_thread()); >> > } else { >> > Handshake::execute(&hs, state->get_thread()); >> > } >> > (you could pass "EnterInterpOnlyModeClosure" directly to the >> > HandshakeClosure() constructor) >> >> Maybe this could be used also in the Handshake::execute() methods as general solution? > Right, we could also do that. Avoiding to clear the polling page in > HandshakeState::clear_handshake() should be enough to fix this issue and > execute a handshake inside a safepoint, but adding that "if" statement > in Handshake::execute() sounds good to avoid all the extra code that we > go through when executing a handshake. I filed 8239084 to make that change. > >> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> > always called in a nested operation or just sometimes.
>> >> At least one execution path without vm operation exists: >> >> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> JvmtiEventControllerPrivate::recompute_enabled() : void >> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> >> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further >> encouraged to do it with a handshake :) > Ah! I think you can still do it with a handshake with the > Deoptimization::deoptimize_all_marked() like solution. I can change the > if-else statement with just the Handshake::execute() call in 8239084. > But up to you. : ) > > Thanks, > Patricio >> Thanks again, >> Richard. >> >> -----Original Message----- >> From: Patricio Chilano >> Sent: Donnerstag, 13. Februar 2020 18:47 >> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Richard, >> >> I'm only commenting on the handshake changes. >> I see that operation VM_EnterInterpOnlyMode can be called inside >> operation VM_SetFramePop which also allows nested operations.
Here is a >> comment in VM_SetFramePop definition: >> >> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is >> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled. >> >> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we >> could have a handshake inside a safepoint operation. The issue I see >> there is that at the end of the handshake the polling page of the target >> thread could be disarmed. So if the target thread happens to be in a >> blocked state just transiently and wakes up then it will not stop for >> the ongoing safepoint. Maybe I can file an RFE to assert that the >> polling page is armed at the beginning of disarm_safepoint(). >> >> I think one option could be to remove >> SafepointMechanism::disarm_if_needed() in >> HandshakeState::clear_handshake() and let each JavaThread disarm itself >> for the handshake case. >> >> Alternatively I think you could do something similar to what we do in >> Deoptimization::deoptimize_all_marked(): >> >> EnterInterpOnlyModeClosure hs; >> if (SafepointSynchronize::is_at_safepoint()) { >> hs.do_thread(state->get_thread()); >> } else { >> Handshake::execute(&hs, state->get_thread()); >> } >> (you could pass 'EnterInterpOnlyModeClosure' directly to the >> HandshakeClosure() constructor) >> >> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is >> always called in a nested operation or just sometimes. >> >> Thanks, >> Patricio >> >> On 2/12/20 7:23 AM, Reingruber, Richard wrote: >>> // Repost including hotspot runtime and gc lists. >>> // Dean Long suggested to do so, because the enhancement replaces a vm operation >>> // with a handshake.
>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>> >>> Hi, >>> >>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>> >>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>> >>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>> >>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>> >>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>> >>> Thanks, Richard. >>> >>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html > From lutz.schmidt at sap.com Fri Apr 24 14:51:01 2020 From: lutz.schmidt at sap.com (Schmidt, Lutz) Date: Fri, 24 Apr 2020 14:51:01 +0000 Subject: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range In-Reply-To: References: <0737AF50-4DED-4680-8629-47140DD2A7A6@sap.com> Message-ID: Hi Martin, SAP-internal testing revealed no problems related to this patch. As Michihiro did not find performance issues, the patch is good to go from my perspective. Regards, Lutz From: Michihiro Horie on behalf of Michihiro Horie Date: Friday, 24. April 2020 at 07:40 To: Lutz Schmidt Cc: "hotspot-compiler-dev at openjdk.java.net" , "Doerr, Martin (martin.doerr at sap.com)" , "ppc-aix-port-dev at openjdk.java.net" Subject: Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Hi Martin, Lutz, I have not seen big differences in SPECjbb2015 scores both on P8 and P9. 
Best regards, Michihiro ----- Original message ----- From: "Schmidt, Lutz" To: Michihiro Horie , "Doerr, Martin" Cc: "ppc-aix-port-dev at openjdk.java.net" , "hotspot-compiler-dev at openjdk.java.net" Subject: [EXTERNAL] Re: RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Thu, Apr 23, 2020 3:01 AM Hi Martin, your change looks good to me. I noticed you didn't find a chance to put it in the patch queue for our internal testing. I did that now, but it's too late for tonight. We'll have to wait until Friday morning (GMT+2) to really see what I expect: no issues. Thanks for cleaning up this old stuff. Regards, Lutz On 21.04.20, 16:57, "hotspot-compiler-dev on behalf of Michihiro Horie" wrote: Hi Martin, I started measuring SPECjbb2015 to see the performance impact on P9. Also, I'm preparing same measurement on P8. Best regards, Michihiro ----- Original message ----- From: "Doerr, Martin" To: "'hotspot-compiler-dev at openjdk.java.net'" Cc: Michihiro Horie , "cjashfor at linux.ibm.com" , "ppc-aix-port-dev at openjdk.java.net" , Gustavo Romero , "joserz at linux.ibm.com" Subject: [EXTERNAL] RFR(XS): 8151030: PPC64: AllocatePrefetchStyle=4 is out of range Date: Tue, Apr 14, 2020 11:07 PM Hi, I'd like to resolve a very old PPC64 issue: https://bugs.openjdk.java.net/browse/JDK-8151030 There's code for AllocatePrefetchStyle=4 which is not an accepted option. It was used for a special experimental prefetch mode using dcbz instructions to combine prefetching and zeroing in the TLABs. However, this code was never contributed and there are no plans to work on it. So I'd like to simply remove this small part of it. In addition to that, AllocatePrefetchLines is currently set to 3 by default which doesn't make sense to me. PPC64 has an automatic prefetch
engine and executing several prefetch instructions for succeeding cache lines doesn't seem to be beneficial at all. So I'm setting it to 1 by default. I couldn't observe regressions on Power7, Power8 and Power9. Webrev: http://cr.openjdk.java.net/~mdoerr/8151030_ppc_prefetch/webrev.00/ Please review. If somebody from IBM would like to check performance impact of changing the AllocatePrefetchLines + Distance, I'll be glad to receive feedback. Best regards, Martin From suenaga at oss.nttdata.com Fri Apr 24 15:23:06 2020 From: suenaga at oss.nttdata.com (Yasumasa Suenaga) Date: Sat, 25 Apr 2020 00:23:06 +0900 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> Message-ID: <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> Hi Richard, On 2020/04/24 23:44, Reingruber, Richard wrote: > Hi Yasumasa, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. >> Does it help you? I think it gives you to remove workaround. > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Thanks for your information. I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. I will modify and will test it after yours. > Also my first impression was that it won't be that easy from a synchronization point of view to > replace VM_SetFramePop with a direct handshake. E.g.
VM_SetFramePop::doit() indirectly calls > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > to me, how this has to be handled. I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Thanks, Yasumasa > So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > >> (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > Would be interesting to see how you handled the issues above :) > > Thanks, Richard. > > [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 > > -----Original Message----- > From: Yasumasa Suenaga > Sent: Freitag, 24. April 2020 13:34 > To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > Does it help you? I think it gives you to remove workaround. > > (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > > Thanks, > > Yasumasa > > > On 2020/04/24 17:18, Reingruber, Richard wrote: >> Hi Patricio, Vladimir, and Serguei, >> >> now that direct handshakes are available, I've updated the patch to make use of them. 
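As background for the patch under review here: the core idea of JDK-8238585, deoptimizing only the compiled frames on the target thread's own stack instead of making their methods not_entrant (which would penalize every thread calling them), can be sketched as follows. The types are illustrative stand-ins, not HotSpot's real frame or Deoptimization API:

```cpp
#include <cassert>
#include <vector>

// Toy model of the stack walk performed per target thread by the
// enter-interp-only-mode closure: mark only the compiled frames of
// this one thread for deoptimization; interpreted frames and other
// threads' use of the same nmethods are unaffected.
enum class FrameKind { interpreted, compiled };

struct Frame {
  FrameKind kind;
  bool marked_for_deopt = false;
};

struct JavaThread {
  std::vector<Frame> stack;  // innermost frame first
};

// What the handshake closure's do_thread() effectively does: walk the
// stack and deoptimize the compiled frames found there.
void deoptimize_compiled_frames(JavaThread* thread) {
  for (Frame& f : thread->stack) {
    if (f.kind == FrameKind::compiled) {
      // The real code calls into HotSpot's deoptimization machinery here.
      f.marked_for_deopt = true;
    }
  }
}
```

The point of the sketch is the scope of the change: the work is per-thread and per-frame, which is also why a per-thread handshake is a natural fit for it.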
>> >> In addition I have done some clean-up changes I missed in the first webrev. >> >> Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake >> into the vm operation VM_SetFramePop [1] >> >> Kindly review again: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ >> Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ >> >> I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a >> direct handshake: >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> Testing: >> >> * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 >> >> Thanks, >> Richard. >> >> [1] An assertion in Handshake::execute_direct() fails, if called by the VMThread, because it is not a JavaThread. >> >> -----Original Message----- >> From: hotspot-dev On Behalf Of Reingruber, Richard >> Sent: Freitag, 14. Februar 2020 19:47 >> To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Patricio, >> >> > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
>>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html >>>> >>>> Hi, >>>> >>>> could I please get reviews for this small enhancement in hotspot's jvmti implementation: >>>> >>>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/ >>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585 >>>> >>>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to >>>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack. >>>> >>>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations. >>>> >>>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >>>> >>>> Thanks, Richard. >>>> >>>> See also my question if anyone knows a reason for making the compiled methods not_entrant: >>>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html >> From richard.reingruber at sap.com Fri Apr 24 16:08:57 2020 From: richard.reingruber at sap.com (Reingruber, Richard) Date: Fri, 24 Apr 2020 16:08:57 +0000 Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant In-Reply-To: <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com> Message-ID: Hi Yasumasa, Patricio, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > >> Does it help you? I think it gives you to remove workaround. > > > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. 
So you would have to > > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. > Thanks for your information. > I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. > I will modify and will test it after yours. Thanks :) > > Also my first impression was that it won't be that easy from a synchronization point of view to > > replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls > > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > > to me, how this has to be handled. > I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Yes. To me it is unclear what synchronization is necessary, if it is called during a handshake. And also I'm unsure if a thread should do safepoint checks while executing a handshake. @Patricio, coming back to my question [1]: In the example you gave in your answer [2]: the java thread would execute a vm operation during a direct handshake operation, while the VMThread is actually in the middle of a VM_HandshakeAllThreads operation, waiting to handshake the same handshakee: why can't the VMThread just proceed? The handshakee would be safepoint safe, wouldn't it? Thanks, Richard. [1] https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14301677&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14301677 [2] https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14301763&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14301763 -----Original Message----- From: Yasumasa Suenaga Sent: Freitag, 24. 
April 2020 17:23 To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant Hi Richard, On 2020/04/24 23:44, Reingruber, Richard wrote: > Hi Yasumasa, > >> I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. >> Does it help you? I think it gives you to remove workaround. > > I think it would not help that much. Note that when replacing VM_SetFramePop with a direct handshake > you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to > change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes. Thanks for your information. I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop. I will modify and will test it after yours. > Also my first impression was that it won't be that easy from a synchronization point of view to > replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls > JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where > JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear > to me, how this has to be handled. I think JvmtiEventController::set_frame_pop() should hold JvmtiThreadState_lock because it affects other JVMTI operation especially FramePop event. Thanks, Yasumasa > So it appears to me that it would be easier to push JDK-8242427 after this (JDK-8238585). > >> (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) 
> > Would be interesting to see how you handled the issues above :) > > Thanks, Richard. > > [1] See question in comment https://bugs.openjdk.java.net/browse/JDK-8230594?focusedCommentId=14302030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14302030 > > -----Original Message----- > From: Yasumasa Suenaga > Sent: Freitag, 24. April 2020 13:34 > To: Reingruber, Richard ; Patricio Chilano ; serguei.spitsyn at oracle.com; Vladimir Ivanov ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net > Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant > > Hi Richard, > > I will send review request to replace VM_SetFramePop to handshake in early next week in JDK-8242427. > Does it help you? I think it gives you to remove workaround. > > (The patch is available, but I want to see the result of PIT in this weekend whether JDK-8242425 works fine.) > > > Thanks, > > Yasumasa > > > On 2020/04/24 17:18, Reingruber, Richard wrote: >> Hi Patricio, Vladimir, and Serguei, >> >> now that direct handshakes are available, I've updated the patch to make use of them. >> >> In addition I have done some clean-up changes I missed in the first webrev. 
>> >> Finally I have implemented the workaround suggested by Patricio to avoid nesting the handshake >> into the vm operation VM_SetFramePop [1] >> >> Kindly review again: >> >> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1/ >> Webrev(delta): http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.1.inc/ >> >> I updated the JBS item explaining why the vm operation VM_EnterInterpOnlyMode can be replaced with a >> direct handshake: >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8238585 >> >> Testing: >> >> * JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms. >> >> * Submit-repo: mach5-one-rrich-JDK-8238585-20200423-1436-10441737 >> >> Thanks, >> Richard. >> >> [1] An assertion in Handshake::execute_direct() fails, if called be VMThread, because it is no JavaThread. >> >> -----Original Message----- >> From: hotspot-dev On Behalf Of Reingruber, Richard >> Sent: Freitag, 14. Februar 2020 19:47 >> To: Patricio Chilano ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net >> Subject: RE: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant >> >> Hi Patricio, >> >> > > I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a >> > > handshake cannot be nested in a vm operation. Maybe it should be asserted in the >> > > Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation? 
>> > > >> > > > Alternatively I think you could do something similar to what we do in >> > > > Deoptimization::deoptimize_all_marked(): >> > > > >> > > > EnterInterpOnlyModeClosure hs; >> > > > if (SafepointSynchronize::is_at_safepoint()) { >> > > > hs.do_thread(state->get_thread()); >> > > > } else { >> > > > Handshake::execute(&hs, state->get_thread()); >> > > > } >> > > > (you could pass ?EnterInterpOnlyModeClosure? directly to the >> > > > HandshakeClosure() constructor) >> > > >> > > Maybe this could be used also in the Handshake::execute() methods as general solution? >> > Right, we could also do that. Avoiding to clear the polling page in >> > HandshakeState::clear_handshake() should be enough to fix this issue and >> > execute a handshake inside a safepoint, but adding that "if" statement >> > in Hanshake::execute() sounds good to avoid all the extra code that we >> > go through when executing a handshake. I filed 8239084 to make that change. >> >> Thanks for taking care of this and creating the RFE. >> >> > >> > > > I don?t know JVMTI code so I?m not sure if VM_EnterInterpOnlyMode is >> > > > always called in a nested operation or just sometimes. >> > > >> > > At least one execution path without vm operation exists: >> > > >> > > JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void >> > > JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong >> > > JvmtiEventControllerPrivate::recompute_enabled() : void >> > > JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches) >> > > JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void >> > > JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError >> > > jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError >> > > >> > > I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a >> > > handshake, but to avoid making the compiled methods on stack not_entrant.... 
unless I'm further
>> > > encouraged to do it with a handshake :)
>> > Ah! I think you can still do it with a handshake with the
>> > Deoptimization::deoptimize_all_marked() like solution. I can change the
>> > if-else statement with just the Handshake::execute() call in 8239084.
>> > But up to you. : )
>>
>> Well, I think that's enough encouragement :)
>> I'll wait for 8239084 and try then again.
>> (no urgency and all)
>>
>> Thanks,
>> Richard.
>>
>> -----Original Message-----
>> From: Patricio Chilano
>> Sent: Freitag, 14. Februar 2020 15:54
>> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net
>> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
>>
>> Hi Richard,
>>
>> On 2/14/20 9:58 AM, Reingruber, Richard wrote:
>>> Hi Patricio,
>>>
>>> thanks for having a look.
>>>
>>> > I'm only commenting on the handshake changes.
>>> > I see that operation VM_EnterInterpOnlyMode can be called inside
>>> > operation VM_SetFramePop which also allows nested operations. Here is a
>>> > comment in VM_SetFramePop definition:
>>> >
>>> > // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is
>>> > // called from the JvmtiEventControllerPrivate::recompute_thread_enabled.
>>> >
>>> > So if we change VM_EnterInterpOnlyMode to be a handshake, then now we
>>> > could have a handshake inside a safepoint operation. The issue I see
>>> > there is that at the end of the handshake the polling page of the target
>>> > thread could be disarmed. So if the target thread happens to be in a
>>> > blocked state just transiently and wakes up then it will not stop for
>>> > the ongoing safepoint. Maybe I can file an RFE to assert that the
>>> > polling page is armed at the beginning of disarm_safepoint().
>>>
>>> I'm really glad you noticed the problematic nesting. This seems to be a general issue: currently a
>>> handshake cannot be nested in a vm operation. Maybe it should be asserted in the
>>> Handshake::execute() methods that they are not called by the vm thread evaluating a vm operation?
>>>
>>> > Alternatively I think you could do something similar to what we do in
>>> > Deoptimization::deoptimize_all_marked():
>>> >
>>> > EnterInterpOnlyModeClosure hs;
>>> > if (SafepointSynchronize::is_at_safepoint()) {
>>> > hs.do_thread(state->get_thread());
>>> > } else {
>>> > Handshake::execute(&hs, state->get_thread());
>>> > }
>>> > (you could pass "EnterInterpOnlyModeClosure" directly to the
>>> > HandshakeClosure() constructor)
>>>
>>> Maybe this could be used also in the Handshake::execute() methods as a general solution?
>> Right, we could also do that. Avoiding clearing the polling page in
>> HandshakeState::clear_handshake() should be enough to fix this issue and
>> execute a handshake inside a safepoint, but adding that "if" statement
>> in Handshake::execute() sounds good to avoid all the extra code that we
>> go through when executing a handshake. I filed 8239084 to make that change.
>>
>>> > I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is
>>> > always called in a nested operation or just sometimes.
>>>
>>> At least one execution path without vm operation exists:
>>>
>>> JvmtiEventControllerPrivate::enter_interp_only_mode(JvmtiThreadState *) : void
>>> JvmtiEventControllerPrivate::recompute_thread_enabled(JvmtiThreadState *) : jlong
>>> JvmtiEventControllerPrivate::recompute_enabled() : void
>>> JvmtiEventControllerPrivate::change_field_watch(jvmtiEvent, bool) : void (2 matches)
>>> JvmtiEventController::change_field_watch(jvmtiEvent, bool) : void
>>> JvmtiEnv::SetFieldAccessWatch(fieldDescriptor *) : jvmtiError
>>> jvmti_SetFieldAccessWatch(jvmtiEnv *, jclass, jfieldID) : jvmtiError
>>>
>>> I tend to revert back to VM_EnterInterpOnlyMode as it wasn't my main intent to replace it with a
>>> handshake, but to avoid making the compiled methods on stack not_entrant.... unless I'm further
>>> encouraged to do it with a handshake :)
>> Ah! I think you can still do it with a handshake with the
>> Deoptimization::deoptimize_all_marked() like solution. I can change the
>> if-else statement with just the Handshake::execute() call in 8239084.
>> But up to you. : )
>>
>> Thanks,
>> Patricio
>>> Thanks again,
>>> Richard.
>>>
>>> -----Original Message-----
>>> From: Patricio Chilano
>>> Sent: Donnerstag, 13. Februar 2020 18:47
>>> To: Reingruber, Richard ; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; hotspot-gc-dev at openjdk.java.net
>>> Subject: Re: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
>>>
>>> Hi Richard,
>>>
>>> I'm only commenting on the handshake changes.
>>> I see that operation VM_EnterInterpOnlyMode can be called inside
>>> operation VM_SetFramePop which also allows nested operations.
Here is a
>>> comment in VM_SetFramePop definition:
>>>
>>> // Nested operation must be allowed for the VM_EnterInterpOnlyMode that is
>>> // called from the JvmtiEventControllerPrivate::recompute_thread_enabled.
>>>
>>> So if we change VM_EnterInterpOnlyMode to be a handshake, then now we
>>> could have a handshake inside a safepoint operation. The issue I see
>>> there is that at the end of the handshake the polling page of the target
>>> thread could be disarmed. So if the target thread happens to be in a
>>> blocked state just transiently and wakes up then it will not stop for
>>> the ongoing safepoint. Maybe I can file an RFE to assert that the
>>> polling page is armed at the beginning of disarm_safepoint().
>>>
>>> I think one option could be to remove
>>> SafepointMechanism::disarm_if_needed() in
>>> HandshakeState::clear_handshake() and let each JavaThread disarm itself
>>> for the handshake case.
>>>
>>> Alternatively I think you could do something similar to what we do in
>>> Deoptimization::deoptimize_all_marked():
>>>
>>>   EnterInterpOnlyModeClosure hs;
>>>   if (SafepointSynchronize::is_at_safepoint()) {
>>>     hs.do_thread(state->get_thread());
>>>   } else {
>>>     Handshake::execute(&hs, state->get_thread());
>>>   }
>>> (you could pass "EnterInterpOnlyModeClosure" directly to the
>>> HandshakeClosure() constructor)
>>>
>>> I don't know JVMTI code so I'm not sure if VM_EnterInterpOnlyMode is
>>> always called in a nested operation or just sometimes.
>>>
>>> Thanks,
>>> Patricio
>>>
>>> On 2/12/20 7:23 AM, Reingruber, Richard wrote:
>>>> // Repost including hotspot runtime and gc lists.
>>>> // Dean Long suggested to do so, because the enhancement replaces a vm operation
>>>> // with a handshake.
>>>> // Original thread: http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-February/030359.html
>>>>
>>>> Hi,
>>>>
>>>> could I please get reviews for this small enhancement in hotspot's jvmti implementation:
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8238585/webrev.0/
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8238585
>>>>
>>>> The change avoids making all compiled methods on stack not_entrant when switching a java thread to
>>>> interpreter only execution for jvmti purposes. It is sufficient to deoptimize the compiled frames on stack.
>>>>
>>>> Additionally a handshake is used instead of a vm operation to walk the stack and do the deoptimizations.
>>>>
>>>> Testing: JCK and JTREG tests, also in Xcomp mode with fastdebug and release builds on all platforms.
>>>>
>>>> Thanks, Richard.
>>>>
>>>> See also my question if anyone knows a reason for making the compiled methods not_entrant:
>>>> http://mail.openjdk.java.net/pipermail/serviceability-dev/2020-January/030339.html
>>

From patricio.chilano.mateo at oracle.com  Fri Apr 24 17:13:43 2020
From: patricio.chilano.mateo at oracle.com (Patricio Chilano)
Date: Fri, 24 Apr 2020 14:13:43 -0300
Subject: RFR(S) 8238585: Use handshake for JvmtiEventControllerPrivate::enter_interp_only_mode() and don't make compiled methods on stack not_entrant
In-Reply-To:
References: <3c59b9f9-ec38-18c9-8f24-e1186a08a04a@oracle.com> <410eed04-e2ef-0f4f-1c56-19e6734a10f6@oracle.com> <81d7caa8-4244-85f3-4d4e-78117fe5e25b@oss.nttdata.com>
Message-ID: <11c78b30-de04-544d-3a10-811ebf663bf2@oracle.com>

Hi Richard,

Just jumping into your last question for now. : )

On 4/24/20 1:08 PM, Reingruber, Richard wrote:
> Hi Yasumasa, Patricio,
>
>>>> I will send review request to replace VM_SetFramePop with a handshake in early next week in JDK-8242427.
>>>> Does it help you? I think it allows you to remove the workaround.
>>> I think it would not help that much.
>>> Note that when replacing VM_SetFramePop with a direct handshake
>>> you could not just execute VM_EnterInterpOnlyMode as a nested vm operation [1]. So you would have to
>>> change/replace VM_EnterInterpOnlyMode and I would have to adapt to these changes.
>> Thanks for your information.
>> I tested my patch with both vmTestbase/nsk/jvmti/PopFrame and vmTestbase/nsk/jvmti/NotifyFramePop.
>> I will modify and test it after yours.
> Thanks :)
>
>>> Also my first impression was that it won't be that easy from a synchronization point of view to
>>> replace VM_SetFramePop with a direct handshake. E.g. VM_SetFramePop::doit() indirectly calls
>>> JvmtiEventController::set_frame_pop(JvmtiEnvThreadState *ets, JvmtiFramePop fpop) where
>>> JvmtiThreadState_lock is acquired with safepoint check, if not at safepoint. It's not directly clear
>>> to me how this has to be handled.
>> I thin