Discussion: 8172978: Remove Interpreter TOS optimization
Doerr, Martin
martin.doerr at sap.com
Tue Feb 28 15:00:12 UTC 2017
Hi,
I've run jvm98 with -Xint on several PPC64 machines and on a recent s390x machine.
Surprisingly, disabling the TOS optimization does not hurt on older hardware (Power 5 and 6).
It seems some sub-benchmarks don't suffer at all, or even benefit.
But it really hurts on recent Power 8.
Measured performance change on AIX 7.1 on Power 8, -XX:+EnableTosCache vs. -XX:-EnableTosCache:

  Compress:  38989 vs. 47686  (-22%)
  Jess:      11256 vs. 11849  (-5%)
  Raytrace:  17647 vs. 18300  (-4%)
  Db:        22713 vs. 25181  (-11%)
  Javac:     14554 vs. 15130  (-4%)
These sub-benchmarks are relatively stable; the results are reproducible.
We lose about 3% on recent s390x hardware (z13).
I initially liked the idea of removing the optimization, but these numbers speak against doing it.
Best regards,
Martin
-----Original Message-----
From: Doerr, Martin
Sent: Freitag, 24. Februar 2017 17:41
To: 'Max Ockner' <max.ockner at oracle.com>; hotspot-dev at openjdk.java.net
Subject: RE: Discussion: 8172978: Remove Interpreter TOS optimization
Hi Max,
thank you very much for sharing your results and for sending the patch.
I guess it covers the most relevant cases, but not all of them. I think it would be better to modify dispatch_next instead of dispatch_epilog on x86.
(dispatch_next is also used by generate_return_entry_for and generate_deopt_entry_for.)
On s390, I'm using dispatch_next with:

  if (!EnableTosCache) {
    push(state);
    state = vtos;
  }
  dispatch_base(state, Interpreter::dispatch_table(state));
I also added an assertion to dispatch_base in order to make sure I'm hitting all dispatch usages:

  assert(EnableTosCache || state == vtos, "sanity");
Unfortunately, the SPECjvm98 results with -Xint seem to drop significantly with -XX:-EnableTosCache on both PPC64 and s390.
But we need to perform more measurements to get more reliable results.
Best regards,
Martin
-----Original Message-----
From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Max Ockner
Sent: Donnerstag, 23. Februar 2017 22:21
To: hotspot-dev at openjdk.java.net
Subject: Re: Discussion: 8172978: Remove Interpreter TOS optimization
Hi Volker,
I have attached the patch that I have been testing.
Thanks,
Max
On 2/20/2017 5:45 AM, Volker Simonis wrote:
> Hi,
>
> besides the fact that this of course means some work for us :) I
> currently don't see any problems for our porting platforms (ppc64 and
> s390x).
>
> Are there any webrevs available, so we can see how big they are and
> maybe do some own benchmarking?
>
> Thanks,
> Volker
>
>
> On Sun, Feb 19, 2017 at 11:11 PM, <coleen.phillimore at oracle.com> wrote:
>>
>> On 2/18/17 11:14 AM, coleen.phillimore at oracle.com wrote:
>>> When Max gets back from the long weekend, he'll post the platforms in your
>>> bug.
>>>
>>> It's amazing that for -Xint there's no significant difference. I've seen
>>> -Xint performance of 15% slower cause a 2% slowdown with server but that was
>>> before tiered compilation.
>>
>> I should clarify this. I've seen this slowdown for *different* interpreter
>> optimizations, which *can* affect server performance. I was measuring
>> specjvm98 on linux x64. If there's no significant difference for this TOS
>> optimization, there is no chance of a degradation in overall performance.
>>
>> Coleen
>>
>>> The reason for this query was to see what developers for the other
>>> platform ports think, since this change would affect all of the platforms.
>>>
>>> Thanks,
>>> Coleen
>>>
>>> On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:
>>>> If Claes is happy with the perf testing, then I'm happy. :-)
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 2/18/17 3:46 AM, Claes Redestad wrote:
>>>>> Hi,
>>>>>
>>>>> I've seen Max has run plenty of tests on our internal performance
>>>>> infrastructure and everything I've seen there seems to corroborate the
>>>>> idea that this removal is OK from a performance point of view, the
>>>>> footprint improvements are small but significant and any negative
>>>>> performance impact on throughput benchmarks is at noise levels even
>>>>> with -Xint (it appears many benchmarks time out with this setting
>>>>> both before and after, though; Max, let's discuss offline how to
>>>>> deal with that :-))
>>>>>
>>>>> I expect this will be tested more thoroughly once adapted to all
>>>>> platforms (which I assume is the intent?), but see no concern from
>>>>> a performance testing point of view: Do it!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> /Claes
>>>>>
>>>>> On 2017-02-16 16:40, Daniel D. Daugherty wrote:
>>>>>> Hi Max,
>>>>>>
>>>>>> Added a note to your bug. Interesting idea, but I think your data is
>>>>>> a bit incomplete at the moment.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>> On 2/15/17 3:18 PM, Max Ockner wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We have filed a bug to remove the interpreter stack caching
>>>>>>> optimization for jdk10. Ideally we can make this change *early*
>>>>>>> during the jdk10 development cycle. See below for justification:
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8172978
>>>>>>>
>>>>>>> Stack caching has been around for a long time and is intended to
>>>>>>> replace some of the load/store (pop/push) operations with
>>>>>>> corresponding register operations. The need for this optimization
>>>>>>> arose before caching could adequately lessen the burden of memory
>>>>>>> access. We have reevaluated the JVM stack caching optimization and
>>>>>>> have found that it has a high memory footprint and is very costly to
>>>>>>> maintain, but does not provide significant measurable or theoretical
>>>>>>> benefit for us when used with modern hardware.
>>>>>>>
>>>>>>> Minimal Theoretical Benefit.
>>>>>>> Because modern hardware does not slap us with the same cost for
>>>>>>> accessing memory as it once did, the benefit of replacing memory
>>>>>>> access with register access is far less dramatic now than it once was.
>>>>>>> Additionally, the interpreter runs for a relatively short time before
>>>>>>> relevant code sections are compiled. When the VM starts running
>>>>>>> compiled code instead of interpreted code, performance should begin to
>>>>>>> move asymptotically towards that of compiled code, diluting any
>>>>>>> performance penalties from the interpreter to small performance
>>>>>>> variations.
>>>>>>>
>>>>>>> No Measurable Benefit.
>>>>>>> Please see the results files attached in the bug page. This change
>>>>>>> was adapted for x86 and sparc, and interpreter performance was
>>>>>>> measured with SPECjvm98 (run with -Xint). No significant decrease in
>>>>>>> performance was observed.
>>>>>>>
>>>>>>> Memory footprint and code complexity.
>>>>>>> Stack caching in the JVM is implemented by switching the instruction
>>>>>>> look-up table depending on the tos (top-of-stack) state. At any moment
>>>>>>> there is an active table consisting of one dispatch table for each
>>>>>>> of the 10 tos states. When we enter a safepoint, we copy all 10
>>>>>>> safepoint dispatch tables into the active table. The additional entry
>>>>>>> code makes this copy less efficient and makes any work in the
>>>>>>> interpreter harder to debug.
>>>>>>>
>>>>>>> If we remove this optimization, we will:
>>>>>>> - decrease memory usage in the interpreter,
>>>>>>> - eliminate wasteful memory transactions during safepoints,
>>>>>>> - decrease code complexity (a lot).
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> Thanks,
>>>>>>> Max
>>>>>>>