RemoveNeverExecutedCode

Yudi Zheng yudi.zheng at usi.ch
Sat Aug 23 00:48:46 UTC 2014


I think the idea is that a method containing a hot loop is also considered valuable (to be profiled/compiled).

AdvancedThresholdPolicy::should_create_mdo also depends on the backedge_counter:

bool AdvancedThresholdPolicy::should_create_mdo(Method* method, CompLevel cur_level) {
  if (cur_level == CompLevel_none &&
      CompileBroker::queue_size(CompLevel_full_optimization) <=
      Tier3DelayOn * compiler_count(CompLevel_full_optimization)) {
    int i = method->invocation_count();
    int b = method->backedge_count();
    double k = Tier0ProfilingStartPercentage / 100.0;
    return call_predicate_helper<CompLevel_none>(i, b, k) || loop_predicate_helper<CompLevel_none>(i, b, k);
  }
  return false;
}


template<CompLevel level>
bool SimpleThresholdPolicy::call_predicate_helper(int i, int b, double scale) {
  switch(level) {
  case CompLevel_none:
  case CompLevel_limited_profile:
    return (i > Tier3InvocationThreshold * scale) ||
           (i > Tier3MinInvocationThreshold * scale && i + b > Tier3CompileThreshold * scale);
  case CompLevel_full_profile:
    return (i > Tier4InvocationThreshold * scale) ||
           (i > Tier4MinInvocationThreshold * scale && i + b > Tier4CompileThreshold * scale);
  }
  return true;
}

template<CompLevel level>
bool SimpleThresholdPolicy::loop_predicate_helper(int i, int b, double scale) {
  switch(level) {
  case CompLevel_none:
  case CompLevel_limited_profile:
    return b > Tier3BackEdgeThreshold * scale;
  case CompLevel_full_profile:
    return b > Tier4BackEdgeThreshold * scale;
  }
  return true;
}
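
To get a feel for the numbers, here is a rough sketch of the CompLevel_none case (the constants are the JDK 8 defaults as I remember them, e.g. Tier0ProfilingStartPercentage = 200 so k = 2.0; please double-check globals.hpp / advancedThresholdPolicy.hpp rather than trusting them):

// Sketch only: mirrors the CompLevel_none branch of the two helpers above.
// All constants are assumed defaults, not values read from the running VM.
class MdoPredicateSketch {
  static final double K = 200 / 100.0;              // Tier0ProfilingStartPercentage / 100

  static boolean callPredicate(int i, int b) {
    return (i > 200 * K)                            // Tier3InvocationThreshold
        || (i > 100 * K && i + b > 2000 * K);       // Tier3MinInvocationThreshold / Tier3CompileThreshold
  }

  static boolean loopPredicate(int i, int b) {
    return b > 60000 * K;                           // Tier3BackEdgeThreshold
  }

  public static void main(String[] args) {
    // A method with one hot loop: few invocations, many backedges -> MDO anyway.
    System.out.println(callPredicate(1, 150_000) || loopPredicate(1, 150_000));   // true
    // A cold method: neither predicate fires, so no MDO yet.
    System.out.println(callPredicate(50, 1_000) || loopPredicate(50, 1_000));     // false
  }
}

So with those defaults a method that is almost never called can still get an MDO purely from its backedge count, once its loop has run roughly 120000 iterations.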

I do agree the policy could be quite different depending on the application code.
For GPUs I would expect loop-heavy code, so backedge counters should probably be weighted more heavily than invocation counters.

Yudi


On 23 Aug 2014, at 01:57, Tom Rodriguez <tom.rodriguez at oracle.com> wrote:


On Aug 22, 2014, at 3:44 PM, Yudi Zheng <yudi.zheng at usi.ch> wrote:

The backedge_counter may also contribute to triggering the creation of the MDO.

I was misled by that as well.  That code is only used in non-tiered.  The code block above this one is the “if (TieredCompilation)” part, which basically just ends up calling back into the runtime every time the invocation count crosses the (1 << Tier0InvokeNotifyFreqLog) - 1 mask, and the compilation policy then decides when it wants to create the MDO.
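
In sketch form it is roughly this (Tier0InvokeNotifyFreqLog = 7 is an assumed default here, and the real interpreter code also has to account for the counter's status-bit encoding, which this ignores):

// Rough sketch of the tiered notification path, not actual VM code.
class TieredNotifySketch {
  static final int TIER0_INVOKE_NOTIFY_FREQ_LOG = 7;               // assumed default
  static final int MASK = (1 << TIER0_INVOKE_NOTIFY_FREQ_LOG) - 1;

  // Conceptually runs on every interpreted invocation of a method.
  static void onInvocation(int invocationCount) {
    if ((invocationCount & MASK) == 0) {
      // Only here does the interpreter call into the runtime; the compilation
      // policy (should_create_mdo and friends) then decides whether to
      // allocate the MDO or enqueue a compile.
      notifyPolicy(invocationCount);
    }
  }

  static void notifyPolicy(int count) {
    System.out.println("policy notified at invocation " + count);
  }

  public static void main(String[] args) {
    for (int i = 1; i <= 300; i++) {
      onInvocation(i);   // with the assumed default, prints at 128 and 256
    }
  }
}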

tom


   // Update standard invocation counters
   __ movl(rcx, invocation_counter);
   __ incrementl(rcx, InvocationCounter::count_increment);
   __ movl(invocation_counter, rcx); // save invocation count

   __ movl(rax, backedge_counter);   // load backedge counter
   __ andl(rax, InvocationCounter::count_mask_value); // mask out the status bits

   __ addl(rcx, rax);                // add both counters

   // profile_method is non-null only for interpreted method so
   // profile_method != NULL == !native_call

   if (ProfileInterpreter && profile_method != NULL) {
     // Test to see if we should create a method data oop
     __ cmp32(rcx, ExternalAddress((address)&InvocationCounter::InterpreterProfileLimit));
     __ jcc(Assembler::less, *profile_method_continue);

     // if no method data exists, go to profile_method
     __ test_method_data_pointer(rax, *profile_method);
   }

In pseudo code:

if (invocation_counter + backedge_counter >= InterpreterProfileLimit) {
 create MDO
}

The backedge_counter increases whenever there is a backward branch, e.g. at the bottom of a loop.
For details, you could refer to cpu/x86/vm/templateTable_x86_64.cpp#TemplateTable::branch.
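
To make that concrete (an illustrative example, not code from the benchmark):

class BackedgeExample {
  // One call to sum(n) bumps the invocation counter once, while the jump back
  // to the loop condition is a backward branch in the bytecode, so the
  // backedge counter is bumped roughly n times.
  static long sum(int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
      s += i;
    }
    return s;
  }

  public static void main(String[] args) {
    System.out.println(sum(100_000));   // few invocations, many backedges
  }
}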

Yudi

On 22 Aug 2014, at 22:37, Tom Rodriguez <tom.rodriguez at oracle.com> wrote:

The tiered policy is more complicated and is driven by AdvancedThresholdPolicy::should_create_mdo instead of being a fixed percentage in the interpreter.  You’d have to step through that code to understand what the actual thresholds really are, plus there are other wrinkles like creating the MDO for the caller under some conditions.

tom

On Aug 22, 2014, at 1:01 PM, Deneau, Tom <tom.deneau at amd.com> wrote:

Thanks, Yudi.

Although if CompileThreshold is 10000, I still don't see how this would explain my case, where the profiled execution count was 512 less than the actual execution count.

-- Tom

-----Original Message-----
From: Yudi Zheng [mailto:yudi.zheng at usi.ch]
Sent: Friday, August 22, 2014 2:05 PM
To: Tom Rodriguez
Cc: Deneau, Tom; graal-dev at openjdk.java.net
Subject: Re: RemoveNeverExecutedCode

I had the same problem before...

The threshold is defined in share/vm/interpreter/invocationCounter.cpp#InvocationCounter::reinitialize

InterpreterProfileLimit = ((CompileThreshold * InterpreterProfilePercentage) / 100) << number_of_noncount_bits;

where InterpreterProfilePercentage is 33 by default.
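
With the -server default CompileThreshold of 10000 that works out to 3300 calls plus backedges before the interpreter asks for an MDO; the shift only moves the limit into the counter's encoding. As a sketch of the arithmetic (number_of_noncount_bits = 3 is my assumption from invocationCounter.hpp, so double-check it):

class ProfileLimitSketch {
  public static void main(String[] args) {
    int compileThreshold = 10000;              // -server default
    int interpreterProfilePercentage = 33;     // default, as noted above
    int numberOfNoncountBits = 3;              // assumption, see invocationCounter.hpp

    int rawLimit = compileThreshold * interpreterProfilePercentage / 100;
    System.out.println(rawLimit);                          // 3300 calls + backedges
    System.out.println(rawLimit << numberOfNoncountBits);  // the encoded InterpreterProfileLimit
  }
}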

On X86_64, it is used in cpu/x86/vm/templateInterpreter_x86_64.cpp#InterpreterGenerator::generate_counter_incr

 if (ProfileInterpreter && profile_method != NULL) {
   // Test to see if we should create a method data oop
   __ cmp32(rcx, ExternalAddress((address)&InvocationCounter::InterpreterProfileLimit));
   __ jcc(Assembler::less, *profile_method_continue);

   // if no method data exists, go to profile_method
   __ test_method_data_pointer(rax, *profile_method);
 }


Initially, the invocation counter is stored in the MethodCounters*.
If profiling is enabled, the interpreter creates a MethodData once the threshold is met.

class Method : public Metadata {
 friend class VMStructs;
 private:
  ConstMethod*      _constMethod;                // Method read-only data.
  MethodData*       _method_data;
  MethodCounters*   _method_counters;
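
Conceptually the flow is something like this (a sketch only, not VM code); note that per-bytecode counts are only recorded once the MethodData exists, which is why the earliest executions of a bytecode never show up in the profile:

// Conceptual sketch of the two-stage counting, not VM code.
class MethodProfileSketch {
  int invocationCount;     // lives in MethodCounters from the first execution
  int backedgeCount;       // ditto
  Object methodData;       // stands in for MethodData*; null until the limit is hit

  void onCounterBump(int interpreterProfileLimit) {
    if (methodData == null
        && invocationCount + backedgeCount >= interpreterProfileLimit) {
      methodData = new Object();   // per-bytecode profiling starts from here on
    }
  }
}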



Yudi


On 22 Aug 2014, at 19:44, Tom Rodriguez <tom.rodriguez at oracle.com> wrote:


On Aug 22, 2014, at 9:10 AM, Deneau, Tom <tom.deneau at amd.com> wrote:

Doug --

Yes, for the bytecode in question this tells me that the branchProbability is 1.0, but I was trying to understand why.

For instance, with 10000 elements in the array, the first 10 should all not take the branch, but I see...

executionCount at 13: 9488; branchProbability at 13: 1.000000; exceptionSeen at 13: FALSE;

It looks like maybe the profiling only kicks in after the first 512 times a bytecode is executed (since the executionCount above is 9488)?   Also, if I move the location of the differing elements to past 512, I see that the branch gets recorded. Is there some logic like that in the HotSpot profiling interpreter?

Profiling only kicks in after a certain number of invocations.  It's usually 33% of the first tier compile threshold, though I think it works differently in tiered.  I'm having trouble finding the exact logic.  Anyway, it's a profile, so there's no guarantee about what's in it.  For unreachable code it will eventually come to a proper stable state about what has been reached, which is all we normally care about.

For GPU you might want a more conservative notion of never executed, possibly taking into account the maturity of the branch itself; or maybe trusting those counts is just a bad idea for GPU.  Is that what you're trying to figure out?

tom


-- Tom



-----Original Message-----
From: Doug Simon [mailto:doug.simon at oracle.com]
Sent: Friday, August 22, 2014 7:39 AM
To: Deneau, Tom
Cc: Tom Rodriguez; graal-dev at openjdk.java.net
Subject: Re: RemoveNeverExecutedCode

You should get all the info you need with ResolvedJavaMethod.getProfilingInfo().toString().
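
Something along these lines (a sketch; metaAccess is whichever MetaAccessProvider you already have from the runtime, and the package names may differ depending on your Graal version):

import java.lang.reflect.Method;

import com.oracle.graal.api.meta.MetaAccessProvider;
import com.oracle.graal.api.meta.ResolvedJavaMethod;

class DumpProfile {
  // Prints the profile Graal sees for the given reflective method.
  static void dump(MetaAccessProvider metaAccess, Method reflectMethod) {
    ResolvedJavaMethod method = metaAccess.lookupJavaMethod(reflectMethod);
    System.out.println(method.getProfilingInfo().toString());
  }
}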

On Aug 22, 2014, at 2:34 PM, Deneau, Tom <tom.deneau at amd.com> wrote:

I don't know the details of how the profile is collected but...
In my case, with 2000000 elements and 1 differing, the branchTakenProbability in the profile shows up as 1.0.
For other combinations of elements and differs I saw this:
Elems      Differs    BTP
1000000    1000       < 1.0
1000000      10       1.0
 100000      10       1.0
  10000      10       1.0
   1000      10       < 1.0

-- Tom


-----Original Message-----
From: Tom Rodriguez [mailto:tom.rodriguez at oracle.com]
Sent: Thursday, August 21, 2014 7:12 PM
To: Deneau, Tom
Cc: graal-dev at openjdk.java.net
Subject: Re: RemoveNeverExecutedCode


On Aug 21, 2014, at 4:39 PM, Deneau, Tom <tom.deneau at amd.com> wrote:

I am trying to benchmark an HSAIL kernel that implements the following lambda:

n -> {
    if (a1[n] != a2[n]) {
        isEqual = false;
    }
}

where a1 and a2 are arrays of doubles.  So it is basically doing an Arrays.equals.

I set up the test to have 2,000,000 elements in the arrays, with one element that does not match.
Before I compile for HSAIL, I enable some profiling by running the above lambda on the CPU in non-parallel mode (IntStream.range().forEach()), so the lambda gets executed 2,000,000 times (and takes the false branch once).
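
In sketch form the setup is roughly this (the names and the position of the differing element are illustrative, not my actual benchmark code):

import java.util.stream.IntStream;

class EqualsKernelWarmup {
  static boolean isEqual = true;

  public static void main(String[] args) {
    double[] a1 = new double[2_000_000];
    double[] a2 = new double[2_000_000];
    a2[1_500_000] = 1.0;                  // exactly one element differs

    // Sequential CPU run to warm up the profile before compiling for HSAIL:
    // the lambda body runs 2,000,000 times and takes the isEqual = false path once.
    IntStream.range(0, a1.length).forEach(n -> {
      if (a1[n] != a2[n]) {
        isEqual = false;
      }
    });
    System.out.println(isEqual);          // false
  }
}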

But even though one of the elements does not match, when I compile through Graal with the default -G:+RemoveNeverExecutedCode, I see that the isEqual = false path has been considered "not executed" and has been removed.

Is taking a branch once out of 2,000,000 times considered "not executed"?
Or is there some other flaw here?

Remember this means never executed according to the profile, so if it doesn't show up in the profile then it didn't happen.  Did the profile include the taken branch?

tom


Of course I can work around this particular case by forcing -G:-RemoveNeverExecutedCode, but I'd like to understand what is actually happening.
This is running in -server mode.

-- Tom










