Discussion: 8172978: Remove Interpreter TOS optimization
Ioi Lam
ioi.lam at oracle.com
Sat Feb 18 17:55:16 UTC 2017
I think it's worthwhile to hand-craft a micro benchmark with lots of stack operations and see if there's any performance difference in -Xint mode. Also, run it on a wide range of architectures, such as 10-year-old x86 vs. the latest x86, ARM, etc.
That will give us more insight than the results from large, complicated benchmarks.
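
Something along these lines, for example (just an untested sketch; the class name and the exact mix of operations are placeholders), run with -Xint on each platform before and after the change:

    public class TosStackBench {
        // Chains many small expressions so the interpreter spends most of its
        // time pushing and popping operands on the expression stack.
        static int kernel(int x) {
            int a = x + 1;
            int b = a * 3 - x;
            int c = (a ^ b) + (x & a);
            int d = (b - c) * (a + x);
            return (a + b) * (c + d) - (a * d) + (b ^ c);
        }

        public static void main(String[] args) {
            int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 20_000_000;
            long start = System.nanoTime();
            int sink = 0;
            for (int i = 0; i < iterations; i++) {
                sink += kernel(i);
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Print the accumulated result so the loop cannot be treated as dead code.
            System.out.println("result=" + sink + " time=" + elapsedMs + "ms");
        }
    }

Comparing wall-clock times across several runs of "java -Xint TosStackBench" before and after the change should show whether the removal costs anything in pure interpreted mode.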
Ioi
> On Feb 18, 2017, at 8:14 AM, coleen.phillimore at oracle.com wrote:
>
> When Max gets back from the long weekend, he'll post the platforms in your bug.
>
> It's amazing that for -Xint there's no significant difference. I've seen a 15% slowdown in -Xint performance cause a 2% slowdown with the server compiler, but that was before tiered compilation.
>
> The reason for this query was to see what developers for the other platform ports think, since this change would affect all of the platforms.
>
> Thanks,
> Coleen
>
>> On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:
>> If Claes is happy with the perf testing, then I'm happy. :-)
>>
>> Dan
>>
>>
>>> On 2/18/17 3:46 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> I've seen that Max has run plenty of tests on our internal performance
>>> infrastructure, and everything I've seen there corroborates the idea
>>> that this removal is OK from a performance point of view: the
>>> footprint improvements are small but significant, and any negative
>>> performance impact on throughput benchmarks is at noise level, even
>>> with -Xint (it appears many benchmarks time out with this setting
>>> both before and after, though; Max, let's discuss offline how to
>>> deal with that :-))
>>>
>>> I expect this will be tested more thoroughly once adapted to all
>>> platforms (which I assume is the intent?), but see no concern from
>>> a performance testing point of view: Do it!
>>>
>>> Thanks!
>>>
>>> /Claes
>>>
>>>> On 2017-02-16 16:40, Daniel D. Daugherty wrote:
>>>> Hi Max,
>>>>
>>>> Added a note to your bug. Interesting idea, but I think your data is
>>>> a bit incomplete at the moment.
>>>>
>>>> Dan
>>>>
>>>>
>>>>> On 2/15/17 3:18 PM, Max Ockner wrote:
>>>>> Hello all,
>>>>>
>>>>> We have filed a bug to remove the interpreter stack caching
>>>>> optimization for jdk10. Ideally we can make this change *early*
>>>>> during the jdk10 development cycle. See below for justification:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8172978
>>>>>
>>>>> Stack caching has been around for a long time and is intended to
>>>>> replace some of the load/store (pop/push) operations with
>>>>> corresponding register operations. The need for this optimization
>>>>> arose before hardware caches could adequately lessen the burden of
>>>>> memory access. We have reevaluated the JVM stack caching optimization
>>>>> and have found that it has a high memory footprint and is very costly
>>>>> to maintain, but does not provide a significant measurable or
>>>>> theoretical benefit for us on modern hardware.
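>>>>>
>>>>> As a rough illustration (a hedged sketch, not HotSpot source; the
>>>>> class name and the register choice are just examples), consider:
>>>>>
>>>>>     class StackUse {                  // hypothetical example class
>>>>>         int sum(int a, int b) {
>>>>>             return a + b;             // bytecode: iload_1; iload_2; iadd; ireturn
>>>>>         }
>>>>>     }
>>>>>
>>>>> Without TOS caching, each bytecode pushes and pops its operands through
>>>>> the in-memory expression stack. With TOS caching, the value left on top
>>>>> of the stack by iload_2 stays in a register (e.g. rax on x86_64), and
>>>>> iadd is dispatched through its itos entry point, which expects the
>>>>> operand there, which can save roughly one store/load pair per bytecode
>>>>> transition.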
>>>>>
>>>>> Minimal Theoretical Benefit.
>>>>> Because modern hardware does not penalize memory access as heavily as
>>>>> it once did, the benefit of replacing memory accesses with register
>>>>> accesses is far less dramatic now than it once was.
>>>>> Additionally, the interpreter runs for a relatively short time before
>>>>> relevant code sections are compiled. When the VM starts running
>>>>> compiled code instead of interpreted code, performance should begin to
>>>>> move asymptotically towards that of compiled code, diluting any
>>>>> performance penalties from the interpreter to small performance
>>>>> variations.
>>>>>
>>>>> No Measurable Benefit.
>>>>> Please see the results files attached in the bug page. This change
>>>>> was adapted for x86 and SPARC, and interpreter performance was
>>>>> measured with SPECjvm98 (run with -Xint). No significant decrease in
>>>>> performance was observed.
>>>>>
>>>>> Memory footprint and code complexity.
>>>>> Stack caching in the JVM is implemented by switching the instruction
>>>>> look-up table depending on the tos (top-of-stack) state. At any moment
>>>>> there is an active table consisting of one dispatch table for each
>>>>> of the 10 tos states. When we enter a safepoint, we copy all 10
>>>>> safepoint dispatch tables into the active table. The additional entry
>>>>> code makes this copy less efficient and makes any work in the
>>>>> interpreter harder to debug.
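>>>>>
>>>>> As a rough model of that copy (purely illustrative Java, not the
>>>>> actual C++ in the template interpreter; the table width, field names,
>>>>> and class name are assumptions):
>>>>>
>>>>>     // Illustrative model: one dispatch table (bytecode -> entry point)
>>>>>     // per tos state, all of which must be swapped at a safepoint.
>>>>>     class DispatchModel {
>>>>>         static final int BYTECODES  = 256;  // assumed table width
>>>>>         static final int TOS_STATES = 10;   // btos, ztos, ctos, stos, itos,
>>>>>                                             // ltos, ftos, dtos, atos, vtos
>>>>>
>>>>>         static final long[][] active    = new long[TOS_STATES][BYTECODES];
>>>>>         static final long[][] normal    = new long[TOS_STATES][BYTECODES];
>>>>>         static final long[][] safepoint = new long[TOS_STATES][BYTECODES];
>>>>>
>>>>>         // Entering a safepoint copies all 10 tables into the active
>>>>>         // table (and the normal tables are copied back afterwards).
>>>>>         static void switchTo(long[][] source) {
>>>>>             for (int state = 0; state < TOS_STATES; state++) {
>>>>>                 System.arraycopy(source[state], 0, active[state], 0, BYTECODES);
>>>>>             }
>>>>>         }
>>>>>     }
>>>>>
>>>>> With the optimization removed, TOS_STATES collapses to 1 and the
>>>>> safepoint transition touches roughly a tenth of the memory.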
>>>>>
>>>>> If we remove this optimization, we will:
>>>>> - decrease memory usage in the interpreter,
>>>>> - eliminate wasteful memory transactions during safepoints,
>>>>> - decrease code complexity (a lot).
>>>>>
>>>>> Please let me know what you think.
>>>>> Thanks,
>>>>> Max
>>>>>
>>>>
>>
>