Discussion: 8172978: Remove Interpreter TOS optimization
Max Ockner
max.ockner at oracle.com
Thu Feb 23 21:21:23 UTC 2017
Hi Volker,
I have attached the patch that I have been testing.
Thanks,
Max
On 2/20/2017 5:45 AM, Volker Simonis wrote:
> Hi,
>
> besides the fact that this of course means some work for us :) I
> currently don't see any problems for our porting platforms (ppc64 and
> s390x).
>
> Are there any webrevs available, so we can see how big they are and
> maybe do some own benchmarking?
>
> Thanks,
> Volker
>
>
> On Sun, Feb 19, 2017 at 11:11 PM, <coleen.phillimore at oracle.com> wrote:
>>
>> On 2/18/17 11:14 AM, coleen.phillimore at oracle.com wrote:
>>> When Max gets back from the long weekend, he'll post the platforms in your
>>> bug.
>>>
>>> It's amazing that for -Xint there's no significant difference. I've seen a
>>> 15% slowdown under -Xint cause a 2% slowdown with the server compiler, but
>>> that was before tiered compilation.
>>
>> I should clarify this. I've seen this kind of slowdown for *different*
>> interpreter optimizations, which *can* affect server performance. I was
>> measuring SPECjvm98 on Linux x64. If there's no significant difference for
>> this TOS optimization, there is no chance of a degradation in overall
>> performance.
>>
>> Coleen
>>
>>> The reason for this query was to see what developers for the other
>>> platform ports think, since this change would affect all of the platforms.
>>>
>>> Thanks,
>>> Coleen
>>>
>>> On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:
>>>> If Claes is happy with the perf testing, then I'm happy. :-)
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 2/18/17 3:46 AM, Claes Redestad wrote:
>>>>> Hi,
>>>>>
>>>>> I've seen that Max has run plenty of tests on our internal performance
>>>>> infrastructure, and everything I've seen there corroborates the idea
>>>>> that this removal is OK from a performance point of view. The footprint
>>>>> improvements are small but significant, and any negative impact on
>>>>> throughput benchmarks is at noise levels even with -Xint (it appears
>>>>> many benchmarks time out with this setting both before and after,
>>>>> though; Max, let's discuss offline how to deal with that :-))
>>>>>
>>>>> I expect this will be tested more thoroughly once adapted to all
>>>>> platforms (which I assume is the intent?), but see no concern from
>>>>> a performance testing point of view: Do it!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> /Claes
>>>>>
>>>>> On 2017-02-16 16:40, Daniel D. Daugherty wrote:
>>>>>> Hi Max,
>>>>>>
>>>>>> Added a note to your bug. Interesting idea, but I think your data is
>>>>>> a bit incomplete at the moment.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>> On 2/15/17 3:18 PM, Max Ockner wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We have filed a bug to remove the interpreter stack caching
>>>>>>> optimization for jdk10. Ideally we can make this change *early*
>>>>>>> during the jdk10 development cycle. See below for justification:
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8172978
>>>>>>>
>>>>>>> Stack caching has been around for a long time and is intended to
>>>>>>> replace some of the interpreter's stack load/store (pop/push)
>>>>>>> operations with register operations. The need for this optimization
>>>>>>> arose before hardware caches could adequately lessen the cost of
>>>>>>> memory access. We have reevaluated the JVM stack caching optimization
>>>>>>> and found that it has a high memory footprint and is very costly to
>>>>>>> maintain, while providing no significant measurable or theoretical
>>>>>>> benefit on modern hardware.
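>>>>>>>
>>>>>>> To illustrate the idea, here is a minimal C++ sketch (purely
>>>>>>> hypothetical names, not the actual template interpreter code) of an
>>>>>>> iadd handler with and without the top-of-stack value cached in a
>>>>>>> register:
>>>>>>>
>>>>>>> // Toy sketch only -- illustrates the concept, nothing more.
>>>>>>> // Without TOS caching: both operands and the result go through the
>>>>>>> // in-memory expression stack.
>>>>>>> static void iadd_no_tos_cache(int*& sp) {
>>>>>>>   int b = *--sp;   // memory load
>>>>>>>   int a = *--sp;   // memory load
>>>>>>>   *sp++ = a + b;   // memory store
>>>>>>> }
>>>>>>>
>>>>>>> // With TOS caching: the top-of-stack value lives in 'tos' (think of
>>>>>>> // it as a dedicated register), saving one load and one store here.
>>>>>>> static void iadd_tos_cache(int*& sp, int& tos) {
>>>>>>>   int a = *--sp;   // memory load (second operand only)
>>>>>>>   tos = a + tos;   // result stays cached in the "register"
>>>>>>> }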
>>>>>>>
>>>>>>> Minimal Theoretical Benefit.
>>>>>>> Because memory access on modern hardware is far cheaper than it once
>>>>>>> was, the benefit of replacing memory accesses with register accesses
>>>>>>> is now much less dramatic. Additionally, the interpreter runs for a
>>>>>>> relatively short time before the relevant code sections are compiled.
>>>>>>> Once the VM is running compiled code instead of interpreted code,
>>>>>>> performance moves asymptotically towards that of compiled code,
>>>>>>> diluting any performance penalty from the interpreter down to small
>>>>>>> variations.
>>>>>>>
>>>>>>> No Measurable Benefit.
>>>>>>> Please see the results files attached to the bug. This change was
>>>>>>> adapted for x86 and SPARC, and interpreter performance was measured
>>>>>>> with SPECjvm98 (run with -Xint). No significant decrease in
>>>>>>> performance was observed.
>>>>>>>
>>>>>>> Memory footprint and code complexity.
>>>>>>> Stack caching in the JVM is implemented by switching the instruction
>>>>>>> look-up table depending on the tos (top-of-stack) state. At any moment
>>>>>>> there is an active table consisting of one dispatch table for each
>>>>>>> of the 10 tos states. When we enter a safepoint, we copy all 10
>>>>>>> safepoint dispatch tables into the active table. The additional entry
>>>>>>> code makes this copy less efficient and makes any work in the
>>>>>>> interpreter harder to debug.
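>>>>>>>
>>>>>>> Roughly, the data involved looks like the following sketch (names and
>>>>>>> sizes are illustrative approximations of what templateInterpreter.hpp
>>>>>>> declares, not the actual code):
>>>>>>>
>>>>>>> typedef unsigned char* address;
>>>>>>>
>>>>>>> const int number_of_states    = 10;   // one per TOS state (btos ... vtos)
>>>>>>> const int number_of_bytecodes = 256;
>>>>>>>
>>>>>>> struct DispatchTable {
>>>>>>>   // One row of bytecode entry points per TOS state.
>>>>>>>   address table[number_of_states][number_of_bytecodes];
>>>>>>> };
>>>>>>>
>>>>>>> static DispatchTable active_table;   // consulted on every dispatch
>>>>>>> static DispatchTable normal_table;   // normal entry points
>>>>>>> static DispatchTable safept_table;   // safepoint entry points
>>>>>>>
>>>>>>> // Entering a safepoint overwrites all 10 x 256 entries of the active
>>>>>>> // table; leaving it copies the normal table back.
>>>>>>> static void notice_safepoints() { active_table = safept_table; }
>>>>>>> static void ignore_safepoints() { active_table = normal_table; }
>>>>>>>
>>>>>>> With only the vtos dispatch table left, both the tables and the
>>>>>>> safepoint copies would shrink by roughly a factor of 10.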
>>>>>>>
>>>>>>> If we remove this optimization, we will:
>>>>>>> - decrease memory usage in the interpreter,
>>>>>>> - eliminate wasteful memory transactions during safepoints,
>>>>>>> - decrease code complexity (a lot).
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> Thanks,
>>>>>>> Max
>>>>>>>
-------------- next part --------------
diff --git a/src/cpu/sparc/vm/interp_masm_sparc.cpp b/src/cpu/sparc/vm/interp_masm_sparc.cpp
--- a/src/cpu/sparc/vm/interp_masm_sparc.cpp
+++ b/src/cpu/sparc/vm/interp_masm_sparc.cpp
@@ -90,9 +90,16 @@
 #else
   ldub( Lbcp, bcp_incr, Lbyte_code);                // load next bytecode
   // dispatch table to use
+
+  if (!EnableTosCache) {
+    AddressLiteral tbl(Interpreter::dispatch_table(vtos));
+    sll(Lbyte_code, LogBytesPerWord, Lbyte_code);   // multiply by wordSize
+    set(tbl, G3_scratch);                           // compute addr of table
+  } else {
   AddressLiteral tbl(Interpreter::dispatch_table(state));
   sll(Lbyte_code, LogBytesPerWord, Lbyte_code);     // multiply by wordSize
   set(tbl, G3_scratch);                             // compute addr of table
+  }
   ld_ptr(G3_scratch, Lbyte_code, IdispatchAddress); // get entry addr
 #endif
 }
@@ -105,6 +112,10 @@
   assert_not_delayed();
   verify_FPU(1, state);
   interp_verify_oop(Otos_i, state, __FILE__, __LINE__);
+
+  if (!EnableTosCache) {
+    push(state);
+  }
   jmp( IdispatchAddress, 0 );
   if (bcp_incr != 0)  delayed()->inc(Lbcp, bcp_incr);
   else                delayed()->nop();
diff --git a/src/cpu/x86/vm/interp_masm_x86.cpp b/src/cpu/x86/vm/interp_masm_x86.cpp
--- a/src/cpu/x86/vm/interp_masm_x86.cpp
+++ b/src/cpu/x86/vm/interp_masm_x86.cpp
@@ -796,7 +796,25 @@
 }
 
 void InterpreterMacroAssembler::dispatch_epilog(TosState state, int step) {
+  if (EnableTosCache) {
   dispatch_next(state, step);
+  } else {
+    switch (state) {
+      case btos:
+      case ctos:
+      case stos:
+        ShouldNotReachHere(); // btos/ctos/stos should use itos.
+        break;
+      case atos: push(atos); break;
+      case itos: push(itos); break;
+      case ltos: push(ltos); break;
+      case ftos: push(ftos); break;
+      case dtos: push(dtos); break;
+      case vtos: break;
+      default  : ShouldNotReachHere(); break;
+    }
+    dispatch_next(vtos, step);
+  }
 }
 
 void InterpreterMacroAssembler::dispatch_base(TosState state,
diff --git a/src/share/vm/runtime/globals.hpp b/src/share/vm/runtime/globals.hpp
--- a/src/share/vm/runtime/globals.hpp
+++ b/src/share/vm/runtime/globals.hpp
@@ -2675,6 +2675,8 @@
"Produce histogram of IC misses") \
\
/* interpreter */ \
+ product(bool, EnableTosCache, true, \
+ "Enable Top of Stack Cache optimization") \
product_pd(bool, RewriteBytecodes, \
"Allow rewriting of bytecodes (bytecodes are not immutable)") \
\
@@ -2776,7 +2778,7 @@
"Test only") \
\
/* compilation */ \
- product(bool, UseCompiler, true, \
+ product(bool, UseCompiler, false, \
"Use Just-In-Time compilation") \
\
develop(bool, TraceCompilationPolicy, false, \