Discussion: 8172978: Remove Interpreter TOS optimization
Ioi Lam
ioi.lam at oracle.com
Sat Feb 18 17:55:16 UTC 2017
I think it's worthwhile to hand-craft a micro benchmark with lots of stack operations and see if there's any performance difference in -Xint mode. Also, run it on a wide range of architectures, such as 10-year-old x86 vs. the latest x86, ARM, etc.
That will give us more insight than the results from large, complicated benchmarks.
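
Something along these lines, for example (just an untested sketch; the class name and the exact mix of operations are placeholders), run with -Xint on each platform before and after the change:

    public class TosStackBench {
        // Chains many small expressions so the interpreter spends most of its
        // time pushing and popping operands on the expression stack.
        static int kernel(int x) {
            int a = x + 1;
            int b = a * 3 - x;
            int c = (a ^ b) + (x & a);
            int d = (b - c) * (a + x);
            return (a + b) * (c + d) - (a * d) + (b ^ c);
        }

        public static void main(String[] args) {
            int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 20_000_000;
            long start = System.nanoTime();
            int sink = 0;
            for (int i = 0; i < iterations; i++) {
                sink += kernel(i);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Print the accumulated result so the loop cannot be treated as dead code.
            System.out.println("result=" + sink + " time=" + elapsedMs + "ms");
        }
    }

Comparing wall-clock times across several runs of "java -Xint TosStackBench" before and after the change should show whether the removal costs anything in pure interpreted mode.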
Ioi
> On Feb 18, 2017, at 8:14 AM, coleen.phillimore at oracle.com wrote:
>
> When Max gets back from the long weekend, he'll post the platforms in your bug.
>
> It's amazing that for -Xint there's no significant difference. I've seen a 15% slowdown in -Xint performance cause a 2% slowdown with the server compiler, but that was before tiered compilation.
>
> The reason for this query was to see what developers for the other platform ports think, since this change would affect all of the platforms.
>
> Thanks,
> Coleen
>
>> On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:
>> If Claes is happy with the perf testing, then I'm happy. :-)
>>
>> Dan
>>
>>
>>> On 2/18/17 3:46 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> I've seen that Max has run plenty of tests on our internal performance
>>> infrastructure, and everything I've seen there corroborates the idea
>>> that this removal is OK from a performance point of view: the
>>> footprint improvements are small but significant, and any negative
>>> performance impact on throughput benchmarks is at noise level, even
>>> with -Xint (it appears many benchmarks time out with this setting
>>> both before and after, though; Max, let's discuss offline how to
>>> deal with that :-))
>>>
>>> I expect this will be tested more thoroughly once adapted to all
>>> platforms (which I assume is the intent?), but see no concern from
>>> a performance testing point of view: Do it!
>>>
>>> Thanks!
>>>
>>> /Claes
>>>
>>>> On 2017-02-16 16:40, Daniel D. Daugherty wrote:
>>>> Hi Max,
>>>>
>>>> Added a note to your bug. Interesting idea, but I think your data is
>>>> a bit incomplete at the moment.
>>>>
>>>> Dan
>>>>
>>>>
>>>>> On 2/15/17 3:18 PM, Max Ockner wrote:
>>>>> Hello all,
>>>>>
>>>>> We have filed a bug to remove the interpreter stack caching
>>>>> optimization for jdk10. Ideally we can make this change *early*
>>>>> during the jdk10 development cycle. See below for justification:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8172978
>>>>>
>>>>> Stack caching has been around for a long time and is intended to
>>>>> replace some of the load/store (pop/push) operations with
>>>>> corresponding register operations. The need for this optimization
>>>>> arose before hardware caches could adequately lessen the burden of
>>>>> memory access. We have reevaluated the JVM stack caching optimization
>>>>> and have found that it has a high memory footprint and is very costly
>>>>> to maintain, but does not provide a significant measurable or
>>>>> theoretical benefit for us on modern hardware.
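>>>>>
>>>>> As a rough illustration (a hedged sketch, not HotSpot source; the
>>>>> class name and the register choice are just examples), consider:
>>>>>
>>>>>     class StackUse {                  // hypothetical example class
>>>>>         int sum(int a, int b) {
>>>>>             return a + b;             // bytecode: iload_1; iload_2; iadd; ireturn
>>>>>         }
>>>>>     }
>>>>>
>>>>> Without TOS caching, each bytecode pushes and pops its operands through
>>>>> the in-memory expression stack. With TOS caching, the value left on top
>>>>> of the stack by iload_2 stays in a register (e.g. rax on x86_64), and
>>>>> iadd is dispatched through its itos entry point, which expects the
>>>>> operand there, which can save roughly one store/load pair per bytecode
>>>>> transition.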
>>>>>
>>>>> Minimal Theoretical Benefit.
>>>>> Because modern hardware does not penalize memory access as heavily as
>>>>> it once did, the benefit of replacing memory accesses with register
>>>>> accesses is far less dramatic now than it once was.
>>>>> Additionally, the interpreter runs for a relatively short time before
>>>>> relevant code sections are compiled. When the VM starts running
>>>>> compiled code instead of interpreted code, performance should begin to
>>>>> move asymptotically towards that of compiled code, diluting any
>>>>> performance penalties from the interpreter to small performance
>>>>> variations.
>>>>>
>>>>> No Measurable Benefit.
>>>>> Please see the results files attached in the bug page. This change
>>>>> was adapted for x86 and SPARC, and interpreter performance was
>>>>> measured with SPECjvm98 (run with -Xint). No significant decrease in
>>>>> performance was observed.
>>>>>
>>>>> Memory footprint and code complexity.
>>>>> Stack caching in the JVM is implemented by switching the instruction
>>>>> look-up table depending on the tos (top-of-stack) state. At any moment
>>>>> there is an active table consisting of one dispatch table for each
>>>>> of the 10 tos states. When we enter a safepoint, we copy all 10
>>>>> safepoint dispatch tables into the active table. The additional entry
>>>>> code makes this copy less efficient and makes any work in the
>>>>> interpreter harder to debug.
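>>>>>
>>>>> As a rough model of that copy (purely illustrative Java, not the
>>>>> actual C++ in the template interpreter; the table width, field names,
>>>>> and class name are assumptions):
>>>>>
>>>>>     // Illustrative model: one dispatch table (bytecode -> entry point)
>>>>>     // per tos state, all of which must be swapped at a safepoint.
>>>>>     class DispatchModel {
>>>>>         static final int BYTECODES  = 256;  // assumed table width
>>>>>         static final int TOS_STATES = 10;   // btos, ztos, ctos, stos, itos,
>>>>>                                             // ltos, ftos, dtos, atos, vtos
>>>>>
>>>>>         static final long[][] active    = new long[TOS_STATES][BYTECODES];
>>>>>         static final long[][] normal    = new long[TOS_STATES][BYTECODES];
>>>>>         static final long[][] safepoint = new long[TOS_STATES][BYTECODES];
>>>>>
>>>>>         // Entering a safepoint copies all 10 tables into the active
>>>>>         // table (and the normal tables are copied back afterwards).
>>>>>         static void switchTo(long[][] source) {
>>>>>             for (int state = 0; state < TOS_STATES; state++) {
>>>>>                 System.arraycopy(source[state], 0, active[state], 0, BYTECODES);
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>> With the optimization removed, TOS_STATES collapses to 1 and the
>>>>> safepoint transition touches roughly a tenth of the memory.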
>>>>>
>>>>> If we remove this optimization, we will:
>>>>> - decrease memory usage in the interpreter,
>>>>> - eliminate wasteful memory transactions during safepoints,
>>>>> - decrease code complexity (a lot).
>>>>>
>>>>> Please let me know what you think.
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>
>>
>