Possible problem with optimization of Memory Barriers in C2 compiler
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Feb 11 11:05:29 PST 2013
Martin,
Another issue here is the aliasing of membars. The reason we have 3 membars
in this part of the code is that they are aliased to different memory slices
(see the memory Proj nodes below), but post_store_load_barrier() treats them
as wide (C2 bottom type) barriers, which is correct since in hardware they
are wide. The membars are generated in the Parse::do_put_xxx() method, and I
need to ask around why we don't use one fat membar there.
Thanks,
Vladimir
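
For reference, a hedged reconstruction of the Java fields behind the alias
indices in the dump below; the field names and offsets come from the Proj
output, while the class layout and the start_trigger modifier are
assumptions:

  class TestRAW {
      int start_trigger;     // @TestRAW+16, slice idx=5 (modifier unknown)
      static class TestData {
          volatile int a;    // @TestRAW$TestData+12, slice idx=10
          volatile int b;    // @TestRAW$TestData+16, slice idx=11
      }
  }

The memory Projs of the three MemBarVolatile nodes below sit on the slices
idx=10, Bot and idx=5 respectively, rather than on one common slice.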
328 StoreI === 324 325 327 58 [[ 329 329 332 335 ]]
@TestRAW$TestData+12 *, name=a, idx=10; Volatile! Memory:
@TestRAW$TestData:NotNull+12 *, name=a, idx=10; !jvms: TestRAW::actor1 @
bci:2 TestRAW$runner1::run @ bci:40
329 MemBarVolatile === 324 1 328 1 1 328 [[ 330 331 ]] !jvms:
TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
331 Proj === 329 [[ 326 ]] #2 Memory: @TestRAW$TestData+12 *,
name=a, idx=10; !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
332 MemBarVolatile === 330 1 325 1 1 328 [[ 333 334 ]] !jvms:
TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
334 Proj === 332 [[ 326 335 339 ]] #2 Memory: @BotPTR *+bot,
idx=Bot; !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
335 MemBarVolatile === 333 1 334 1 1 328 [[ 336 337 ]] !jvms:
TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
337 Proj === 335 [[ 326 ]] #2 Memory: @TestRAW+16 *,
name=start_trigger, idx=5; !jvms: TestRAW::actor1 @ bci:2
TestRAW$runner1::run @ bci:40
339 LoadI === 315 334 338 [[ 351 340 358 ]]
@TestRAW$TestData+16 *, name=b, idx=11; Volatile! #int !jvms:
TestRAW::actor1 @ bci:6 TestRAW$runner1::run @ bci:40
On 2/11/13 2:03 AM, Doerr, Martin wrote:
> Hi Vladimir,
>
> thanks for filing the bug.
> Following the memory edges as well will indeed fix the problem.
>
> However, we could end up emitting the MemBarVolatile twice.
> 332 would become a necessary MemBar because of the Proj (334) and LoadI (339),
> which fixes the problem. But the MemBarVolatile 335 is still there, and
> the current implementation would not recognize it as unnecessary.
>
> Thanks,
> Martin
>
>
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Saturday, February 9, 2013 04:19
> To: Doerr, Martin
> Cc: hotspot-compiler-dev at openjdk.java.net
> Subject: Re: Possible problem with optimization of Memory Barriers in C2 compiler
>
> Never mind my previous answer (end of week :) ).
>
> You are right, and I think the problem is the following wrong assumption
> in post_store_load_barrier():
>
>   if (x->is_MemBar()) {
>     // We must retain this membar if there is an upcoming volatile
>     // load, which will be preceded by acquire membar.
>     if (xop == Op_MemBarAcquire)
>       return false;
>
> A MemBarAcquire follows a volatile load; it does not precede it. So we
> need to fix post_store_load_barrier() to follow not only control edges
> but also memory edges when looking for volatile loads.
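>
> To illustrate (a minimal sketch; VolatileAccess is a made-up class, and
> the comments mark where the parser places the barriers for volatile
> accesses, as far as I can tell):
>
>   class VolatileAccess {
>     volatile int v;
>     int read() {
>       int x = v;    // LoadI
>                     // MemBarAcquire is emitted *after* the load
>       return x;
>     }
>     void write(int y) {
>                     // MemBarRelease is emitted before the store
>       v = y;        // StoreI
>                     // MemBarVolatile (StoreLoad) is emitted after it
>     }
>   }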
>
> I filed bug:
>
> 8007898: Incorrect optimization of Memory Barriers in
> Matcher::post_store_load_barrier()
>
> Thanks,
> Vladimir
>
> On 2/8/13 6:42 PM, Vladimir Kozlov wrote:
>> Martin,
>>
>> Matcher::post_store_load_barrier() does not remove mach nodes from the
>> graph. The Matcher should still map your ideal membar nodes to mach
>> nodes, something like the following:
>>
>> n0 storeI
>> n1 unnecessary_membar_volatile
>> n2 unnecessary_membar_volatile
>> n3 membar_volatile
>> n4 loadI
>>
>> with all corresponding edges. It is only during code emission that no
>> assembler is generated for an unnecessary_membar_volatile. But until
>> then, the nodes stay in the graph through gcm, lcm, ra and the scheduler.
>>
>> There is a comment about that:
>>
>> // We retain the Node to act as a compiler ordering barrier.
>> bool Matcher::post_store_load_barrier(const Node *vmb) {
>>
>> Regards,
>> Vladimir
>>
>>
>> On 2/8/13 5:14 AM, Doerr, Martin wrote:
>>> Hi,
>>>
>>> we contributed a change fixing memory ordering in taskqueue (8006971).
>>>
>>> In the course of the discussion Alexey Shipilev pointed us to the
>>> concurrency torture tests.
>>>
>>> While running these tests we found one test for which, in our eyes,
>>> Matcher::post_store_load_barrier() generates wrong IR. Nevertheless,
>>> the test passes with OpenJDK because lcm establishes a proper ordering.
>>> We would like to know whether this happens by accident, or whether
>>> this is designed into lcm (and OptoScheduling).
>>>
>>> I have written a self-contained test program similar to "DekkerTest"
>>> which tests volatile read after write.
>>>
>>> class TestData {
>>>     volatile int a;
>>>     volatile int b;
>>> }
>>>
>>> int actor1(TestData t) {
>>>     t.a = 1;
>>>     return t.b;
>>> }
>>>
>>> (Whole program under http://cr.openjdk.java.net/~goetz/c2_membar_issue/)
>>>
>>> Below you can see a part of the Ideal graph generated for
>>> TestRAW$runner1::run with inlined actor1.
>>>
>>> Membars 329 and 332 order Store 328 and Load 339.
>>> Matcher::post_store_load_barrier() turns 329 and 332 into
>>> unnecessary_MemBar_volatile, so that the Store and the Load are no
>>> longer ordered by them.
>>>
>>> Lcm places Load 339 behind Membar 335, which results in correct code.
>>> Unfortunately, there is no edge that forces Load 339 behind Membar 335.
>>> In other words, the ideal graph allows the schedule
>>> 328 StoreI - 339 LoadI - 335 MemBarVolatile, and the outcome of lcm
>>> seems to be accidental.
>>>
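>>> For comparison, a hedged sketch of the mirrored actor that a
>>> Dekker-style read-after-write test presumably uses (actor2 is not
>>> shown in this mail, so its exact code is an assumption):
>>>
>>> int actor2(TestData t) {
>>>     t.b = 1;
>>>     return t.a;
>>> }
>>>
>>> If the schedule 328 StoreI - 339 LoadI - 335 MemBarVolatile is ever
>>> taken, both actors can return 0, an outcome volatile semantics forbid.
>>>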
>>> 328 StoreI === 324 325 327 58 [[ 329 329 332 335 ]]
>>> @TestRAW$TestData+12 *, name=a, idx=10; Volatile! Memory:
>>> @TestRAW$TestData:NotNull+12 *, name=a, idx=10; !jvms: TestRAW::actor1 @
>>> bci:2 TestRAW$runner1::run @ bci:40
>>>
>>> 329 MemBarVolatile === 324 1 328 1 1 328 [[ 330 331 ]]
>>> !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
>>>
>>> 330 Proj === 329 [[ 332 ]] #0 !jvms: TestRAW::actor1 @ bci:2
>>> TestRAW$runner1::run @ bci:40
>>>
>>> 332 MemBarVolatile === 330 1 325 1 1 328 [[ 333 334 ]]
>>> !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
>>>
>>> 333 Proj === 332 [[ 335 ]] #0 !jvms: TestRAW::actor1 @ bci:2
>>> TestRAW$runner1::run @ bci:40
>>>
>>> 334 Proj === 332 [[ 326 335 339 ]] #2 Memory: @BotPTR
>>> *+bot, idx=Bot; !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @
>>> bci:40
>>>
>>> 335 MemBarVolatile === 333 1 334 1 1 328 [[ 336 337 ]]
>>> !jvms: TestRAW::actor1 @ bci:2 TestRAW$runner1::run @ bci:40
>>>
>>> 339 LoadI === 315 334 338 [[ 351 340 358 ]]
>>> @TestRAW$TestData+16 *, name=b, idx=11; Volatile! #int !jvms:
>>> TestRAW::actor1 @ bci:6 TestRAW$runner1::run @ bci:40
>>>
>>> (Whole ideal graph under
>>> http://cr.openjdk.java.net/~goetz/c2_membar_issue/)
>>>
>>> The question we now have is the following: is there a mechanism
>>> (e.g. in local code motion) which guarantees that the MemBarVolatile
>>> gets scheduled before the LoadI?
>>>
>>> If we change the local code motion, we can get the schedule
>>> StoreI-LoadI-MemBarVolatile, causing this test to fail.
>>>
>>> Lcm changes may make sense for architectures with many registers,
>>> which may benefit from longer live ranges.
>>>
>>> Kind regards,
>>>
>>> Martin and Götz
>>>