Performance issue with Nashorn and C2's global code motion

Mon Sep 14 08:39:17 UTC 2015

Hi,

this thread will continue with the subject:
RFE 8136445: Performance issue with Nashorn and C2's global code motion

Best regards,
Martin

-----Original Message-----
From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Doerr, Martin
Sent: Friday, 11 September 2015 15:59
To: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net
Subject: RE: Performance issue with Nashorn and C2's global code motion

Hi Vladimir,

thanks for your quick response.

I've uploaded a small subset of the Octane benchmark with a simple launcher here:
http://cr.openjdk.java.net/~mdoerr/miniOctane

It can be run with the following command line:
openjdk_8/bin/java -agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n OctaneLauncher

It contains only the "EarleyBoyer" benchmark, which is good for stressing Node_Backward_Iterator::next().
In total, the DUIterator_Fast may iterate over several billion edges, consuming the vast majority of the CPU time.

Even without the parameter that enables can_access_local_variables, Node_Backward_Iterator::next() consumes a noticeable (but not dominant) amount of CPU time.

Best regards,
Martin

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Thursday, 10 September 2015 22:16
To: Doerr, Martin; hotspot-compiler-dev at openjdk.java.net
Subject: Re: Performance issue with Nashorn and C2's global code motion

Hi Martin,

This is the first time I have heard about a performance problem in this method.
That code has not been changed for a very long time, since we never thought we would need to optimize it.

   // '_stack' is emulating a real _stack.  The 'visit-all-users' loop has been
   // made stateless, so I do not need to record the index 'i' on my _stack.
   // Instead I visit all users each time, scanning for unvisited users.

Maybe we can optimize it by not going through all users each time. We should file an RFE for this.
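
To illustrate the idea, here is a minimal standalone sketch. It is not the HotSpot code: the Node struct and both function names are hypothetical, and it ignores everything else the real iterator does. It only contrasts a stateless loop that rescans all users whenever a node is on top of the stack with a variant that records the index 'i' on the stack and resumes from there.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy node: 'users' holds the indices of the nodes that use it.
struct Node {
    std::vector<int> users;
};

// Stateless variant: every time a node is on top of the stack, scan its users
// from the beginning until an unvisited one is found.
// Appends the post-order to 'order' and returns the number of edges examined.
std::size_t postorder_rescan(const std::vector<Node>& g, int root,
                             std::vector<int>& order) {
    assert(root >= 0 && static_cast<std::size_t>(root) < g.size());
    std::vector<bool> visited(g.size(), false);
    std::vector<int> stack;
    stack.push_back(root);
    visited[root] = true;
    std::size_t edges_examined = 0;
    while (!stack.empty()) {
        const int n = stack.back();
        int next = -1;
        for (int u : g[n].users) {
            ++edges_examined;
            if (!visited[u]) { next = u; break; }
        }
        if (next >= 0) {
            visited[next] = true;
            stack.push_back(next);
        } else {
            order.push_back(n);  // all users done: emit n in post-order
            stack.pop_back();
        }
    }
    return edges_examined;
}

// Indexed variant: each stack entry remembers how far its user list has been
// scanned, so every edge is examined exactly once.
std::size_t postorder_indexed(const std::vector<Node>& g, int root,
                              std::vector<int>& order) {
    assert(root >= 0 && static_cast<std::size_t>(root) < g.size());
    std::vector<bool> visited(g.size(), false);
    std::vector<std::pair<int, std::size_t>> stack;  // (node, next user index)
    stack.emplace_back(root, std::size_t{0});
    visited[root] = true;
    std::size_t edges_examined = 0;
    while (!stack.empty()) {
        const int n = stack.back().first;
        std::size_t& i = stack.back().second;  // not used after push/pop below
        int next = -1;
        while (i < g[n].users.size()) {
            const int u = g[n].users[i++];
            ++edges_examined;
            if (!visited[u]) { next = u; break; }
        }
        if (next >= 0) {
            visited[next] = true;
            stack.emplace_back(next, std::size_t{0});
        } else {
            order.push_back(n);
            stack.pop_back();
        }
    }
    return edges_examined;
}
```

In this sketch both variants produce the same order. For a node with k leaf users, the rescanning variant examines k(k+1)/2 + k edges while the indexed one examines k, which is why nodes with many users would hurt so much.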

We would need to know how to reproduce it.

Thanks,
Vladimir

On 9/10/15 5:17 AM, Doerr, Martin wrote:
> Hi,
>
> we were running the Octane benchmark and noticed a very significant performance drop with JVMTI.
>
> VTune measurements showed that the JVM spent the majority of its CPU time in Node_Backward_Iterator::next
> during PhaseCFG::schedule_late when JvmtiExport::_can_access_local_variables is on
>
> (see http://cr.openjdk.java.net/~mdoerr/OctaneVTune.jpg).
>
> We were using OpenJDK 8 with and without the following option:
>
> -agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=n
>
> This option activates the JVMTI capability can_access_local_variables, which prevents C2 from killing dead locals,
> leading to a higher number of edges in the graph.
>
> If we don't use this option, PhaseCFG::schedule_late no longer accounts for a significant share of the CPU time.
>
> Have you noticed this before? Is this of interest to you?
>
> For us, this is a significant issue, as we have can_access_local_variables on by default.
>
> As a solution, we could think of limiting the node iterations in schedule_late and generating a quicker, less
> optimized schedule in extreme cases.
>
> Best regards,
>
> Martin
>