RFR(M): 8054478 C2: Incorrectly compiled char[] array access crashes JVM

Tue Nov 4 20:03:25 UTC 2014

> Sorry, it took so much time. I was confused by number of files you changed. In general it seems good to me.

Thanks Vladimir for looking at this.

> Do you know why GCM scheduled the load into previous block? Usually loads are placed near their uses in:
> 
> 0c7   B13: #    B17 B14 <- B11  Freq: 0.768468
> 0c7     movzwl  R8, [RCX + #24 + R10 << #1]     # ushort/char << SEGV
> 0cd     addl    RDI, #-2        # int
> 0d0     movl    R9, #-2 # int
> 0d6     testl   R10, R10
> 0d9     jle,s   B17  P=0.000001 C=-1.000000
> 0d9
> 0db   B14: #    B21 B15 <- B13  Freq: 0.768467
> 0db     testl   R8, R8
> 0de     je,s   B21  P=0.100000 C=-1.000000

There are 2 uses for the value that is loaded. I don’t have the same block numbering/register allocation when I run it so it looks like this for me:

04f   B2: #     B23 B3 <- B15  Freq: 5848.51
04f     movzwl  R10, [RDX + #16 + R8 << #1]     # ushort/char
055     addl    RCX, #-2        # int
058     testl   R8, R8
05b     jle     B23  P=0.000001 C=-1.000000
05b
061   B3: #     B22 B4 <- B2  Freq: 5848.5
061     testl   R10, R10
064     je     B22  P=0.000001 C=-1.000000

...

149   B23: #    B5 B24 <- B2  Freq: 0.00592617
149     cmpl    R8, #-1
14d     jle     B5  P=0.100000 C=-1.000000
14d
153   B24: #    B22 B25 <- B23  Freq: 0.00533355
153     testl   R10, R10
156     je,s   B22  P=0.000001 C=-1.000000

B2 (in my case) is the LCA of both uses (B3 & B24).

> castnode.hpp:
> 
> Fields names: _required --> _carry_dependency.

Better naming indeed.

> I am not convinced that you need _inexact. Based on your code _inexact can be converted from 'true' to 'false' only when phase->type(cmp->in(2))->is_int()->is_con() and all conditions at the beginning of CastIINode::Ideal() are true. But if input's type is narrowed later we can still improve CastII's type.

What do you mean by later? That the input becomes a constant and then the constant changes?

> Do you think we will get worse code if we don't do Identity for _carry_dependency CastII nodes? If _carry_dependency is set we can't replace it with an other node, I think, (unless all inputs are matching - hash() and cmp()).

To be safe I should define hash() and cmp() for CastIINode to account for _carry_dependency (and _inexact if we keep it), right?

Do I think we should have Identity skip _carry_dependency CastII nodes entirely? I think we might miss some optimization opportunities if we do. For instance something like this maybe:

for (int i = j; i < 100; i++) {
  value += some_array[i];
  // some test that make the backbranch dead in the main loop
}

value += some_array[j+1];

Assuming a 1 iteration pre loop, the load after the loop should common with the load in the main loop but wouldn’t with the CastII because they wouldn't have the same address input.

Anyway I couldn’t make that in an actual test case.

Also, I think we might use the modified CastII elsewhere in the future. Currently, we only add CastII during parsing when we know we compare against a constant. The modified CastII could be added even if we don’t compare against a constant and if the constant is discovered later, the CastII would help optimization. If we sprinkle CastII nodes in the graph, we would want the useless ones to go away.

> castnode.cpp
> 
> The type change belongs to Value() not Ideal(). Especially if you don't need to set _inexact field and create new CastII node for that.

That makes sense.

> Use fatal with message which includes BoolTest::mask name instead of ShouldNotReachHere().

Ok.

> loopTransform.cpp
> 
> insert_castii_before_loop() --> cast_incr_before_loop()
> You missed registration of new CastII nodes - register_new_node().

Ok. Thanks for spotting that.

> opaquenode.cpp
> 
> What test you are talking about: "The pre loop is guarded by a test on an opaque node which is later removed"? I did not get first part of the code. You are putting on worklist a Phi from *previous* (pre-loop) loop. I would understand if you do that for the following (guarded main-, post-) loop, and that is already taking care by putting CastII on worklist.

Once the range of values for the pre loop is know, we can optimize the test that guards the main loop. That range of values is only known once the opaque node for the pre loop is removed. 

> There is loop_limit_check predicate before a counted loop which has opaqued limit. It . Those opaque nodes are removed by cleanup_loop_predicates(). Why Phi is not put on worklist at that time?

So that code:

Node *Opaque1Node::Identity(PhaseTransform *phase) {
  return phase->C->major_progress() ? this : in(1);
}

is not used to remove the opaque nodes?
I thought I observed they were removed there.

Roland.

> 
> Thanks,
> Vladimir
> 
> On 10/6/14 2:49 AM, Roland Westrelin wrote:
>> 
>> Anyone?
>> 
>> Roland.
>> 
>> On Sep 17, 2014, at 7:03 PM, Roland Westrelin <roland.westrelin at oracle.com> wrote:
>> 
>>> http://cr.openjdk.java.net/~roland/8054478/webrev.00/
>>> 
>>> The IR after parsing is this in pseudo code:
>>> 
>>> i_main = 0;
>>> while(true) {
>>> if (i_main >= 1000000) return;
>>> i_main++;
>>> char[] array = new char[1];
>>> if (pattern1 != null) {
>>>   i = 0;
>>>   while(true) {
>>>     if (i >= 0) {
>>>       do {
>>>         if (pattern0 == null) uncommon_trap;
>>>         if (pattern0.length u< i) uncommon_trap;
>>>         if (pattern0[i] != 0) {
>>>           if (pattern1.length u< i) uncommon_trap;
>>>           if (pattern1[i] != 0) {
>>>             goto out1;
>>>           }
>>>         }
>>>         i—;
>>>         pos—;
>>>         if (pos != -1) {
>>>           goto out2;
>>>         }
>>>       } while (i >= 0);
>>>       goto out1;
>>>     }
>>>     out2:
>>>     c = array[pos];
>>>   }
>>> }
>>> out1:
>>> }
>>> 
>>> The do {} while loop is a CountedLoop. The null check & range check are moved out of the loop. Then a pre and a post loops are created and the body is unrolled. The pattern0[i] LoadNode nodes in the unrolled body have a control that is set to the range check before the pre loop. In the unrolled loop, because of the if (pos != -1) test in the first copy of the loop, the compiler finds that in the second copy of the loop body pos is != -1 and so the loop is exited before reaching the end of the unrolled loop. The back branch of the unrolled loop is dead. The compiler optimizes the CountedLoopNode of the unrolled loop out because it doesn’t have a back branch anymore, PhiNodes are removed and the LoadNode for pattern0[i] in the unroll loop body is now independent of the input control of the CountedLoop. Its control is still set to the range check before the pre loop. In the generated code, the pre loop is followed by the pattern1[i] access which is incorrect because it happens before 
> the if (i >= 0) that dominated the unrolled loop before it was removed.
>>> 
>>> The graph is good as long as the CountedLoop is not removed because the LoadNode depends on a PhiNode of the CountedLoop. But as soon as the CountedLoop is removed and the PhiNode is disconnected, the LoadNode no longer depends on the check that guarded the CountedLoop.
>>> 
>>> Vladimir suggested (in a private conversation) that I use a CastII (control dependent on the if that guards the main loop with as input the input value to the induction variable in the loop) as a way to guarantee correct dependencies. There are 2 problems with this:
>>> 
>>> 1) the CastII nodes are removed after CCP. I added a flag to the CastII nodes so we can mark CastII nodes that can’t be removed and should be left as they are post-CCP.
>>> 2) we don’t know the range of values that are possible on the if branch that enters the main loop right away (it can be adjusted as loop opts are applied). We may actually never know it if it depends on non constants. If we use a type for a range that is too wide in the CastII the CastII may be optimized out incorrectly. I added another flag to the CastII to mark nodes that cannot be optimized. In CastII:Ideal I added code that tries to figure out a better type for the CastII so we may be able to optimize the CastII as the compiler optimizes the IR.
>>> 
>>> Another independent issue here is that the main loop is actually never executed because the loop of the test is a single iteration loop so the compiler should be able to optimize out the main loop entirely but it doesn’t. PhiNode::Value() has this code:
>>> 
>>> CountedLoopNode *l = r->is_CountedLoop() ? r->as_CountedLoop() : NULL;
>>> if( l && l->can_be_counted_loop(phase) &&
>>>     ((const Node*)l->phi() == this) ) { // Trip counted loop!
>>> 
>>> ...
>>>       if( lo->_hi < hi->_lo )     // Reversed endpoints are well defined :-(
>>>         return TypeInt::make(lo->_lo,hi->_hi,3);
>>>     }
>>>   }
>>> }
>>> 
>>> that returns an accurate value for the Phi node of a CountedLoop. In this case we want the pre loop’s Phi to return an accurate type that would then make the guard of the main loop optimize. The pre loop is guarded by a test on an opaque node which is later removed. If PhiNode::Value() is called before the Opaque1 node is removed then, the limit type is int and it cannot return anything accurate. If it’s called after, then limit type is a constant and PhiNode::Value() also returns a constant for the pre-loop. Then the main loop is optimized out. To fix this: I added code to Opaque1Node::Ideal so it reenqueues the Phi when the opaque node is removed.
>>> 
>>> Roland.
>>