Discussion: 8172978: Remove Interpreter TOS optimization
Max Ockner
max.ockner at oracle.com
Thu Feb 23 21:21:23 UTC 2017
Hi Volker,
I have attached the patch that I have been testing.
Thanks,
Max
On 2/20/2017 5:45 AM, Volker Simonis wrote:
> Hi,
>
> besides the fact that this of course means some work for us :) I
> currently don't see any problems for our porting platforms (ppc64 and
> s390x).
>
> Are there any webrevs available, so we can see how big they are and
> maybe do some own benchmarking?
>
> Thanks,
> Volker
>
>
> On Sun, Feb 19, 2017 at 11:11 PM, <coleen.phillimore at oracle.com> wrote:
>>
>> On 2/18/17 11:14 AM, coleen.phillimore at oracle.com wrote:
>>> When Max gets back from the long weekend, he'll post the platforms in your
>>> bug.
>>>
>>> It's amazing that for -Xint there's no significant difference. I've seen a
>>> 15% slowdown under -Xint cause a 2% slowdown with the server compiler, but
>>> that was before tiered compilation.
>>
>> I should clarify this. I've seen this kind of slowdown for *different*
>> interpreter optimizations, which *can* affect server performance. I was
>> measuring SPECjvm98 on Linux x64. If there's no significant difference for
>> this TOS optimization, there is no chance of a degradation in overall
>> performance.
>>
>> Coleen
>>
>>> The reason for this query was to see what developers for the other
>>> platform ports think, since this change would affect all of the platforms.
>>>
>>> Thanks,
>>> Coleen
>>>
>>> On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:
>>>> If Claes is happy with the perf testing, then I'm happy. :-)
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 2/18/17 3:46 AM, Claes Redestad wrote:
>>>>> Hi,
>>>>>
>>>>> I've seen that Max has run plenty of tests on our internal performance
>>>>> infrastructure, and everything I've seen there corroborates the idea
>>>>> that this removal is OK from a performance point of view. The footprint
>>>>> improvements are small but significant, and any negative impact on
>>>>> throughput benchmarks is at noise levels even with -Xint (it appears
>>>>> many benchmarks time out with this setting both before and after,
>>>>> though; Max, let's discuss offline how to deal with that :-))
>>>>>
>>>>> I expect this will be tested more thoroughly once adapted to all
>>>>> platforms (which I assume is the intent?), but see no concern from
>>>>> a performance testing point of view: Do it!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> /Claes
>>>>>
>>>>> On 2017-02-16 16:40, Daniel D. Daugherty wrote:
>>>>>> Hi Max,
>>>>>>
>>>>>> Added a note to your bug. Interesting idea, but I think your data is
>>>>>> a bit incomplete at the moment.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>> On 2/15/17 3:18 PM, Max Ockner wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> We have filed a bug to remove the interpreter stack caching
>>>>>>> optimization for jdk10. Ideally we can make this change *early*
>>>>>>> during the jdk10 development cycle. See below for justification:
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8172978
>>>>>>>
>>>>>>> Stack caching has been around for a long time and is intended to
>>>>>>> replace some of the interpreter's stack load/store (pop/push)
>>>>>>> operations with register operations. The need for this optimization
>>>>>>> arose before hardware caches could adequately lessen the cost of
>>>>>>> memory access. We have reevaluated the JVM stack caching optimization
>>>>>>> and found that it has a high memory footprint and is very costly to
>>>>>>> maintain, while providing no significant measurable or theoretical
>>>>>>> benefit on modern hardware.
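>>>>>>>
>>>>>>> To illustrate the idea, here is a minimal C++ sketch (purely
>>>>>>> hypothetical names, not the actual template interpreter code) of an
>>>>>>> iadd handler with and without the top-of-stack value cached in a
>>>>>>> register:
>>>>>>>
>>>>>>> // Toy sketch only -- illustrates the concept, nothing more.
>>>>>>> // Without TOS caching: both operands and the result go through the
>>>>>>> // in-memory expression stack.
>>>>>>> static void iadd_no_tos_cache(int*& sp) {
>>>>>>>   int b = *--sp;   // memory load
>>>>>>>   int a = *--sp;   // memory load
>>>>>>>   *sp++ = a + b;   // memory store
>>>>>>> }
>>>>>>>
>>>>>>> // With TOS caching: the top-of-stack value lives in 'tos' (think of
>>>>>>> // it as a dedicated register), saving one load and one store here.
>>>>>>> static void iadd_tos_cache(int*& sp, int& tos) {
>>>>>>>   int a = *--sp;   // memory load (second operand only)
>>>>>>>   tos = a + tos;   // result stays cached in the "register"
>>>>>>> }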
>>>>>>>
>>>>>>> Minimal Theoretical Benefit.
>>>>>>> Because memory access on modern hardware is far cheaper than it once
>>>>>>> was, the benefit of replacing memory accesses with register accesses
>>>>>>> is now much less dramatic. Additionally, the interpreter runs for a
>>>>>>> relatively short time before the relevant code sections are compiled.
>>>>>>> Once the VM is running compiled code instead of interpreted code,
>>>>>>> performance moves asymptotically towards that of compiled code,
>>>>>>> diluting any performance penalty from the interpreter down to small
>>>>>>> variations.
>>>>>>>
>>>>>>> No Measurable Benefit.
>>>>>>> Please see the results files attached to the bug. This change was
>>>>>>> adapted for x86 and SPARC, and interpreter performance was measured
>>>>>>> with SPECjvm98 (run with -Xint). No significant decrease in
>>>>>>> performance was observed.
>>>>>>>
>>>>>>> Memory footprint and code complexity.
>>>>>>> Stack caching in the JVM is implemented by switching the instruction
>>>>>>> look-up table depending on the tos (top-of-stack) state. At any moment
>>>>>>> there is an active table consisting of one dispatch table for each
>>>>>>> of the 10 tos states. When we enter a safepoint, we copy all 10
>>>>>>> safepoint dispatch tables into the active table. The additional entry
>>>>>>> code makes this copy less efficient and makes any work in the
>>>>>>> interpreter harder to debug.
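>>>>>>>
>>>>>>> Roughly, the data involved looks like the following sketch (names and
>>>>>>> sizes are illustrative approximations of what templateInterpreter.hpp
>>>>>>> declares, not the actual code):
>>>>>>>
>>>>>>> typedef unsigned char* address;
>>>>>>>
>>>>>>> const int number_of_states    = 10;   // one per TOS state (btos ... vtos)
>>>>>>> const int number_of_bytecodes = 256;
>>>>>>>
>>>>>>> struct DispatchTable {
>>>>>>>   // One row of bytecode entry points per TOS state.
>>>>>>>   address table[number_of_states][number_of_bytecodes];
>>>>>>> };
>>>>>>>
>>>>>>> static DispatchTable active_table;   // consulted on every dispatch
>>>>>>> static DispatchTable normal_table;   // normal entry points
>>>>>>> static DispatchTable safept_table;   // safepoint entry points
>>>>>>>
>>>>>>> // Entering a safepoint overwrites all 10 x 256 entries of the active
>>>>>>> // table; leaving it copies the normal table back.
>>>>>>> static void notice_safepoints() { active_table = safept_table; }
>>>>>>> static void ignore_safepoints() { active_table = normal_table; }
>>>>>>>
>>>>>>> With only the vtos dispatch table left, both the tables and the
>>>>>>> safepoint copies would shrink by roughly a factor of 10.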
>>>>>>>
>>>>>>> If we remove this optimization, we will:
>>>>>>> - decrease memory usage in the interpreter,
>>>>>>> - eliminate wasteful memory transactions during safepoints,
>>>>>>> - decrease code complexity (a lot).
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> Thanks,
>>>>>>> Max
>>>>>>>
-------------- next part --------------
diff --git a/src/cpu/sparc/vm/interp_masm_sparc.cpp b/src/cpu/sparc/vm/interp_masm_sparc.cpp
--- a/src/cpu/sparc/vm/interp_masm_sparc.cpp
+++ b/src/cpu/sparc/vm/interp_masm_sparc.cpp
@@ -90,9 +90,16 @@
 #else
   ldub( Lbcp, bcp_incr, Lbyte_code);                // load next bytecode
   // dispatch table to use
+
+  if (!EnableTosCache) {
+    AddressLiteral tbl(Interpreter::dispatch_table(vtos));
+    sll(Lbyte_code, LogBytesPerWord, Lbyte_code);   // multiply by wordSize
+    set(tbl, G3_scratch);                           // compute addr of table
+  } else {
   AddressLiteral tbl(Interpreter::dispatch_table(state));
   sll(Lbyte_code, LogBytesPerWord, Lbyte_code);     // multiply by wordSize
   set(tbl, G3_scratch);                             // compute addr of table
+  }
   ld_ptr(G3_scratch, Lbyte_code, IdispatchAddress); // get entry addr
 #endif
 }
@@ -105,6 +112,10 @@
   assert_not_delayed();
   verify_FPU(1, state);
   interp_verify_oop(Otos_i, state, __FILE__, __LINE__);
+
+  if (!EnableTosCache) {
+    push(state);
+  }
   jmp( IdispatchAddress, 0 );
   if (bcp_incr != 0)  delayed()->inc(Lbcp, bcp_incr);
   else                delayed()->nop();
diff --git a/src/cpu/x86/vm/interp_masm_x86.cpp b/src/cpu/x86/vm/interp_masm_x86.cpp
--- a/src/cpu/x86/vm/interp_masm_x86.cpp
+++ b/src/cpu/x86/vm/interp_masm_x86.cpp
@@ -796,7 +796,25 @@
 }
 
 void InterpreterMacroAssembler::dispatch_epilog(TosState state, int step) {
+  if (EnableTosCache) {
   dispatch_next(state, step);
+  } else {
+    switch (state) {
+      case btos:
+      case ctos:
+      case stos:
+        ShouldNotReachHere(); // btos/ctos/stos should use itos.
+        break;
+      case atos: push(atos); break;
+      case itos: push(itos); break;
+      case ltos: push(ltos); break;
+      case ftos: push(ftos); break;
+      case dtos: push(dtos); break;
+      case vtos: break;
+      default  : ShouldNotReachHere(); break;
+    }
+    dispatch_next(vtos, step);
+  }
 }
 
 void InterpreterMacroAssembler::dispatch_base(TosState state,
diff --git a/src/share/vm/runtime/globals.hpp b/src/share/vm/runtime/globals.hpp
--- a/src/share/vm/runtime/globals.hpp
+++ b/src/share/vm/runtime/globals.hpp
@@ -2675,6 +2675,8 @@
"Produce histogram of IC misses") \
\
/* interpreter */ \
+ product(bool, EnableTosCache, true, \
+ "Enable Top of Stack Cache optimization") \
product_pd(bool, RewriteBytecodes, \
"Allow rewriting of bytecodes (bytecodes are not immutable)") \
\
@@ -2776,7 +2778,7 @@
"Test only") \
\
/* compilation */ \
- product(bool, UseCompiler, true, \
+ product(bool, UseCompiler, false, \
"Use Just-In-Time compilation") \
\
develop(bool, TraceCompilationPolicy, false, \