RFR (S) 8019968: Reference CAS induces GC store barrier even on failure

Vladimir Kozlov vladimir.kozlov at oracle.com
Tue Aug 11 15:58:31 UTC 2015


Thank you for doing additional experiments, Aleksey, and explanation.
Now I agree with your changes. Reviewed.

Thanks,
Vladimir

On 8/11/15 2:22 AM, Aleksey Shipilev wrote:
> Hi Vladimir,
>
> My previous disassembly demonstrated the code generated for CAS
> spinloop. There, it's easy to confuse the "second" branch with a proper
> backbranch in the loop. Here is the disassembly for the "one-off"
> failing CAS with patched VM:
>
>                    │   0x00007fa3acba446c: lock cmpxchg %r11d,(%r10)
>   46.63%   83.18%  │   0x00007fa3acba4471: sete   %r8b
>    0.03%           │   0x00007fa3acba4475: movzbl %r8b,%r8d
>    2.23%           │   0x00007fa3acba4479: test   %r8d,%r8d <- removable?
>                    │╭  0x00007fa3acba447c: je     0x00007fa3acba4490
>                    ││  0x00007fa3acba447e: shr    $0x9,%r10
>                    ││  0x00007fa3acba4482: movabs $0x7fa3a0dbf000,%r11
>                    ││  0x00007fa3acba448c: mov    %r12b,(%r11,%r10,1)
>    0.93%           │↘  0x00007fa3acba4490: mov    %r8d,%eax
>    0.04%           │   0x00007fa3acba4493: add    $0x20,%rsp
>    1.05%           │   0x00007fa3acba4497: pop    %rbp
>    0.98%           │   0x00007fa3acba4498: test   %eax,0x11af2b62(%rip)
>>                    │   0x00007fa3acba449e: retq
>
> ...compare this to baseline VM that does an unconditional barrier:
>
>    2.31%    3.64%  │  0x00007fcf595fd4f9: lock cmpxchg %r10d,(%r11)
>   43.22%   78.37%  │  0x00007fcf595fd4fe: sete   %r8b
>    0.04%           │  0x00007fcf595fd502: movzbl %r8b,%r8d
>    2.20%           │  0x00007fcf595fd506: mov    %r11,%r10
>                    │  0x00007fcf595fd509: shr    $0x9,%r10
>                    │  0x00007fcf595fd50d: movabs $0x7fcf4dd0c000,%r11
>                    │  0x00007fcf595fd517: mov    %r12b,(%r11,%r10,1)
>    2.20%           │  0x00007fcf595fd51b: mov    %r8d,%eax
>                    │  0x00007fcf595fd51e: add    $0x20,%rsp
>                    │  0x00007fcf595fd522: pop    %rbp
>    1.82%           │  0x00007fcf595fd523: test   %eax,0x12383ad7(%rip)
>>                    │  0x00007fcf595fd529: retq
>
> Well, yeah, I can see that test at 0x00007fa3acba4479 is avoidable,
> since cmpxchg already sets the flag. But, I doubt it actually matters,
> since:
>   a) test-je are routinely macrofused into single uop on modern x86;
>   b) the flag is materialized in register anyway for method return;
>   c) as you predicted, my quick exploration blows up considerably;
>
> Notably, handling native oops require missing StoreNConditionalNode,
> which spreads all the way to AD and various places in compiler that
> match StorePConditionalNode. Also, my naive attempts of using Ideal to
> pick up StoreNConditional result and produce 0/1 yields full branches,
> not the "sete" that is coming from CompareAndSwapP AD encoding -- with
> terrible performance results.
>
> With that, I think we should play it safe, and push the existing
> obviously correct version that improves performance a lot, instead of
> blowing up the complexity for purely theoretical improvement.
>
> Thanks,
> -Aleksey
>
> On 08/11/2015 05:21 AM, Vladimir Kozlov wrote:
>> My bad, I forgot that CompareAndSwapP assembler code produces Boolean
>> value in register. I mistook it for StorePConditional which produces flag.
>>
>> But I think you can get better code since you want to generate test and
>> main point of having specialized CompareAndSwapP is to avoid test
>> instruction.
>>
>> If we use StorePConditional instead of CompareAndSwapP we may remove
>> second branch:
>>
>>>                     │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>>                          ; CAS fail, jump to respin
>>>                     │ ╰  0x00007fe618af412b: jeq   0x00007fe618af4082
>>>                          ; CAS success, jump to store barrier
>>>                          <UNCONDITIONAL STORE BARRIER HERE>
>>
>> But C2 changes will be much larger. We would need new Ideal::if_then()
>> which take in result of StorePConditional and set load_store on both
>> paths to 0/1.
>>
>> We may need to play with probability of if_then() to get barrier in
>> follow code.
>>
>> Thanks,
>> Vladimir
>>
>> On 8/10/15 2:13 AM, Aleksey Shipilev wrote:
>>> Hi Vladimir!
>>>
>>> On 07/31/2015 06:03 AM, Vladimir Kozlov wrote:
>>>> I think the test is wrong. It should be:
>>>>
>>>> if_then(load_store, BoolTest::eq, oldval, PROB_STATIC_FREQUENT);
>>>
>>> Um, no? I remember eyeballing the assembly to confirm this.
>>>
>>> For LS_cmpxchg, we are inlining "*boolean* cas(...)", so the load_store
>>> seems to have a boolean value, but "oldval" is oop. In other words,
>>> "load_store != 0" tests "(boolean)load_store != false".
>>>
>>> Current VM produces:
>>>
>>>    13.46%   45.85%    │││  0x00007fa06809cdaf: lock cmpxchg %r8d,(%rdi)
>>>    14.91%    4.41%    │││  0x00007fa06809cdb4: sete   %r11b
>>>     0.07%             │││  0x00007fa06809cdb8: movzbl %r11b,%r11d
>>>                                 <UNCONDITIONAL STORE BARRIER HERE>
>>>     0.06%             │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>>
>>>                            ; CAS fail, jump to respin
>>>                       │╰│  0x00007fa06809cdd4: je     0x00007fa06809ccf0
>>>
>>>
>>> Patched VM piggybacks on the same result:
>>>
>>>     1.97%    0.05%  │││  0x00007fe618af4115: lock cmpxchg %ebx,(%r9)
>>>    50.59%   90.01%  │││  0x00007fe618af411a: sete   %r11b
>>>     0.05%    0.01%  │││  0x00007fe618af411e: movzbl %r11b,%r11d
>>>     3.02%    1.80%  │││  0x00007fe618af4122: test   %r11d,%r11d
>>>
>>>                          ; CAS success, jump to store barrier
>>>                     │╰│  0x00007fe618af4125: jne    0x00007fe618af4070
>>>
>>>                          ; CAS fail, jump to respin
>>>                     │ ╰  0x00007fe618af412b: jmpq   0x00007fe618af4082
>>>
>>>
>>> Your suggestion seems to ignore the test completely (GVN helped?), and
>>> while it's still technically correct with emitting the barrier always,
>>> it defeats the purpose of the change:
>>>
>>>     2.35%    4.97%  │ ││  0x00007f7790aefd26: lock cmpxchg %r10d,(%rbx)
>>>    48.54%   86.32%  │ ││  0x00007f7790aefd2b: mov    $0x1,%eax
>>>     0.03%           │╭││  0x00007f7790aefd30: je     0x00007f7790aefd3b
>>>                     ││││  0x00007f7790aefd36: mov    $0x0,%eax
>>>                                 <UNCONDITIONAL STORE BARRIER HERE>
>>>     2.16%    0.01%  │ ││  0x00007f7790aefd53: cmp    $0x0,%eax
>>>
>>>                           ; CAS fail, jump back to respin
>>>                     │ ╰│  0x00007f7790aefd56: je     0x00007f7790aefd10
>>>
>>>                           ; CAS success, follow to exit
>>>
>>> Also, AFAIU, performance results would look different if we screwed the
>>> success check. But they seem to be coherent with our expectations: when
>>> CAS fails, either the conditional card marking or this change helps, and
>>> the change does not help when CAS succeeds.
>>>
>>> Thanks,
>>> -Aleksey
>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 7/29/15 2:57 AM, Aleksey Shipilev wrote:
>>>>> On 07/29/2015 12:24 PM, Andrew Dinn wrote:
>>>>>> On 29/07/15 09:58, Aleksey Shipilev wrote:
>>>>>>> I would like to suggest a fix for:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8019968
>>>>>>
>>>>>>> In short, current reference CAS intrinsic blindly emits
>>>>>>> post_barrier, ignoring the CAS result. In some cases, notably
>>>>>>> contended CAS spin-loops, we fail the CAS a lot, and thus go for a
>>>>>>> post_barrier excessively. Instead, we can conditionalize on the
>>>>>>> result of the store itself, and put the post_barrier only on
>>>>>>> success path: http://cr.openjdk.java.net/~shade/8019968/webrev.01/
>>>>>>
>>>>>>> More performance results here:
>>>>>>> http://cr.openjdk.java.net/~shade/8019968/notes.txt
>>>>>>
>>>>>> Nice! The code looks fine and your test results are very convincing.
>>>>>> I'll be interested to see how this looks on AArch64.
>>>>>
>>>>> Thanks Andrew!
>>>>>
>>>>> The change passes JPRT, so AArch64 build is available. The benchmark
>>>>> JAR
>>>>> mentioned in the issue comments would run without intervention, taking
>>>>> around 40 minutes. You are very welcome to try, while Reviewers are
>>>>> taking a look. I can do that only next week.
>>>>>
>>>>>> That said, I am afraid you still need a Reviewer!
>>>>>
>>>>> That reminds me I haven't spelled out what testing was done:
>>>>>
>>>>>     * JPRT on all open platforms
>>>>>     * Targeted benchmarks
>>>>>     * Eyeballing the generated x86 assembly
>>>>>
>>>>> Thanks,
>>>>> -Aleksey
>>>>>
>>>>>
>>>
>>>
>
>


More information about the hotspot-compiler-dev mailing list