RFR: 8293491: Avoid unexpected deoptimization in loop exit due to incorrect branch profiling

Thu Sep 8 15:06:44 UTC 2022

On Thu, 8 Sep 2022 10:55:13 GMT, Tobias Hartmann <thartmann at openjdk.org> wrote:

>> Hi all,
>> 
>> Please review this patch which fixes the unexpected deoptimizations in loop exit due to incorrect branch profiling.
>> 
>> # Background
>> 
>> While analyzing our big data Apps, we observed unexpected deoptimizations in loop exit due to incorrect branch profiling.
>> 
>> Here is a reproducer.
>> 
>> public class UnexpectedLoopExitDeopt {
>>     public static final int N = 20000000;
>> 
>>     public static int d1[] = new int[N];
>>     public static int d2[] = new int[N];
>> 
>>     public static void main(String[] args) {
>>         System.out.println(test(d1));
>>         System.out.println(test(d2));
>>     }
>> 
>>     public static int test(int[] a) {
>>         int sum = 0;
>>         for(int i = 0; i < a.length; i++) {
>>             sum += a[i];
>>         }
>>         return sum;
>>     }
>> }
>> 
>> 
>> The following is the compilation sequence.
>> 
>>      77    1       3       java.lang.Object::<init> (1 bytes)
>>      83    2       3       java.lang.String::isLatin1 (19 bytes)
>>      84    6       3       jdk.internal.util.Preconditions::checkIndex (18 bytes)
>>      84    3       3       java.lang.String::charAt (25 bytes)
>>      85    4       3       java.lang.StringLatin1::charAt (15 bytes)
>>      86    7       3       java.lang.String::coder (15 bytes)
>>      86    8       3       java.lang.String::hashCode (60 bytes)
>>      87    5       3       java.lang.String::checkIndex (10 bytes)
>>      87    9       3       java.lang.String::length (11 bytes)
>>      93   10     n 0       java.lang.invoke.MethodHandle::linkToStatic(LLLLLLL)L (native)   (static)
>>      96   11     n 0       java.lang.invoke.MethodHandle::linkToSpecial(LLLL)L (native)   (static)
>>      96   12     n 0       java.lang.Object::hashCode (native)   
>>      97   13     n 0       java.lang.invoke.MethodHandle::invokeBasic(LLLLLL)L (native)   
>>      98   14       3       java.util.Objects::requireNonNull (14 bytes)
>>      98   15     n 0       java.lang.invoke.MethodHandle::linkToSpecial(LLLLLLLL)L (native)   (static)
>>      98   16       1       java.lang.Enum::ordinal (5 bytes)
>>     101   17     n 0       java.lang.invoke.MethodHandle::linkToSpecial(LLLL)V (native)   (static)
>>     102   18     n 0       java.lang.invoke.MethodHandle::invokeBasic(LL)L (native)   
>>     212   19 %     3       UnexpectedLoopExitDeopt::test @ 4 (24 bytes)
>>     213   20 %     4       UnexpectedLoopExitDeopt::test @ 4 (24 bytes)
>>     221   19 %     3       UnexpectedLoopExitDeopt::test @ 4 (24 bytes)   made not entrant
>>     221   21       4       UnexpectedLoopExitDeopt::test (24 bytes)
>>     230   20 %     4       UnexpectedLoopExitDeopt::test @ 4 (24 bytes)   made not entrant     <--- Unexpected deopt
>> 0
>>     242   21       4       UnexpectedLoopExitDeopt::test (24 bytes)   made not entrant         <--- Unexpected deopt
>> 0
>> 
>> 
>> The last two deopts (made not entrant) happened in the loop exit which are unexpected.
>> 
>> 
>> # Reason
>> 
>> The unexpected deopts were caused by the incorrect branch profiling count (0 taken count for loop predicate).
>> 
>> Here is the profiling data for `UnexpectedLoopExitDeopt::test`.
>> We can see that for `if_icmpge` @ bci=7, the count for `not taken` is 264957, while 0 for `taken`.
>> The profile count for zero taken is obvious incorrect since the loop will finally exit (when `i >= a.length`).
>> So the taken count should be at least 1 for `if_icmpge` @ bci=7.
>> 
>> 0 iconst_0
>> 1 istore_1
>> 2 iconst_0
>> 3 istore_2
>> 
>> 4 iload_2
>> 5 fast_aload_0
>> 6 arraylength
>> 7 if_icmpge 22
>>   0   bci: 7    BranchData          taken(0) displacement(56)
>>                                     not taken(264957)
>> 
>> 10 iload_1
>> 11 fast_aload_0
>> 12 iload_2
>> 13 iaload
>> 14 iadd
>> 15 istore_1
>> 16 iinc #2 1
>> 19 goto 4
>>   32  bci: 19   JumpData            taken(266667) displacement(-32)
>> 
>> 22 iload_1
>> 23 ireturn
>> 
>> 
>> # Fix
>> 
>> The main idea is to detect if the branch taken target is a loop exit.
>> If so, set the taken count to be at least 1.
>> This is fine because most loops should be finite and would execute the loop exit code at lease once.
>> For infinite loops like `while (true) {...}`, the patch won't change the original behaviour since there is no loop exit.
>> 
>> # Testing
>> 
>> tier1~3 on Linux/x64, no regression
>> 
>> Thanks.
>> Best regards,
>> Jie
>
> Do these deoptimizations really affect performance of your program or did you just spot them when looking at the logs?
> 
> Such surprising deopts are actually expected with optimistic, profile guided optimizations and happen in many other scenarios as well. They are usually harmless. Also, the profile information is not necessarily incorrect but might just be outdated because we stop profiling once we reach C2. Marking all loop exits as taken seems hacky and might have unexpected side effects.
> 
> Also, wouldn't C2 still insert a `Deoptimization::Reason_unreached` or `Deoptimization::Reason_unstable_if` trap for subsequent instructions after the loop exit for which profiling also suggests that they were never executed?

Thanks @TobiHartmann for your review and valuable comments.

> Do these deoptimizations really affect performance of your program or did you just spot them when looking at the logs?
> 

I didn't see performance drop due to this issue in real programs.
The loop exit deopts were discovered while we were trying to optimize some big data processing patterns.

> Such surprising deopts are actually expected with optimistic, profile guided optimizations and happen in many other scenarios as well. They are usually harmless. Also, the profile information is not necessarily incorrect but might just be outdated because we stop profiling once we reach C2. Marking all loop exits as taken seems hacky and might have unexpected side effects.
> 

Well, deoptimization leads to heavy runtime overhead involving control transfering, re-interpreting/profiling and re-compiling, which really wastes cpu cycles and memory.  
So if some kinds of deopts could be avoided, why not do it?

Do you aggree that for most common cases in real programs, the loop exit would be executed at least once?
If so, it seems unreasonable to replace the loop exit with an unstable_if uncommon trap during compilation, right?

> Also, wouldn't C2 still insert a `Deoptimization::Reason_unreached` or `Deoptimization::Reason_unstable_if` trap for subsequent instructions after the loop exit for which profiling also suggests that they were never executed?

It's impossible to avoid all deopts with optimistic, profile guided compilations.
So this patch only aims at eliminating unreasonable deopts in loop exit block as many as possible, not all unexpected deopts.
And there seems no good reasons to disable them except for the loop exit deopts.

-------------

PR: https://git.openjdk.org/jdk/pull/10200