RFR (S) 8019968: Reference CAS induces GC store barrier even on failure

Aleksey Shipilev aleksey.shipilev at oracle.com
Wed Aug 12 07:04:12 UTC 2015


Thanks, Vladimir!

Here's a changeset:
 http://cr.openjdk.java.net/~shade/8019968/8019968.changeset

Please sponsor!

-Aleksey

On 08/11/2015 06:58 PM, Vladimir Kozlov wrote:
> Thank you for doing additional experiments, Aleksey, and explanation.
> Now I agree with your changes. Reviewed.
> 
> Thanks,
> Vladimir
> 
> On 8/11/15 2:22 AM, Aleksey Shipilev wrote:
>> Hi Vladimir,
>>
>> My previous disassembly demonstrated the code generated for CAS
>> spinloop. There, it's easy to confuse the "second" branch with a proper
>> backbranch in the loop. Here is the disassembly for the "one-off"
>> failing CAS with patched VM:
>>
>>                    │   0x00007fa3acba446c: lock cmpxchg %r11d,(%r10)
>>   46.63%   83.18%  │   0x00007fa3acba4471: sete   %r8b
>>    0.03%           │   0x00007fa3acba4475: movzbl %r8b,%r8d
>>    2.23%           │   0x00007fa3acba4479: test   %r8d,%r8d <- removable?
>>                    │╭  0x00007fa3acba447c: je     0x00007fa3acba4490
>>                    ││  0x00007fa3acba447e: shr    $0x9,%r10
>>                    ││  0x00007fa3acba4482: movabs $0x7fa3a0dbf000,%r11
>>                    ││  0x00007fa3acba448c: mov    %r12b,(%r11,%r10,1)
>>    0.93%           │↘  0x00007fa3acba4490: mov    %r8d,%eax
>>    0.04%           │   0x00007fa3acba4493: add    $0x20,%rsp
>>    1.05%           │   0x00007fa3acba4497: pop    %rbp
>>    0.98%           │   0x00007fa3acba4498: test   %eax,0x11af2b62(%rip)
>>>>                    │   0x00007fa3acba449e: retq
>>
>> ...compare this to baseline VM that does an unconditional barrier:
>>
>>    2.31%    3.64%  │  0x00007fcf595fd4f9: lock cmpxchg %r10d,(%r11)
>>   43.22%   78.37%  │  0x00007fcf595fd4fe: sete   %r8b
>>    0.04%           │  0x00007fcf595fd502: movzbl %r8b,%r8d
>>    2.20%           │  0x00007fcf595fd506: mov    %r11,%r10
>>                    │  0x00007fcf595fd509: shr    $0x9,%r10
>>                    │  0x00007fcf595fd50d: movabs $0x7fcf4dd0c000,%r11
>>                    │  0x00007fcf595fd517: mov    %r12b,(%r11,%r10,1)
>>    2.20%           │  0x00007fcf595fd51b: mov    %r8d,%eax
>>                    │  0x00007fcf595fd51e: add    $0x20,%rsp
>>                    │  0x00007fcf595fd522: pop    %rbp
>>    1.82%           │  0x00007fcf595fd523: test   %eax,0x12383ad7(%rip)
>>>>                    │  0x00007fcf595fd529: retq
>>
>> Well, yeah, I can see that test at 0x00007fa3acba4479 is avoidable,
>> since cmpxchg already sets the flag. But, I doubt it actually matters,
>> since:
>>   a) test-je are routinely macrofused into single uop on modern x86;
>>   b) the flag is materialized in register anyway for method return;
>>   c) as you predicted, my quick exploration blows up considerably;
>>
>> Notably, handling native oops require missing StoreNConditionalNode,
>> which spreads all the way to AD and various places in compiler that
>> match StorePConditionalNode. Also, my naive attempts of using Ideal to
>> pick up StoreNConditional result and produce 0/1 yields full branches,
>> not the "sete" that is coming from CompareAndSwapP AD encoding -- with
>> terrible performance results.
>>
>> With that, I think we should play it safe, and push the existing
>> obviously correct version that improves performance a lot, instead of
>> blowing up the complexity for purely theoretical improvement.
>>
>> Thanks,
>> -Aleksey
>>
>> On 08/11/2015 05:21 AM, Vladimir Kozlov wrote:
>>> My bad, I forgot that CompareAndSwapP assembler code produces Boolean
>>> value in register. I mistook it for StorePConditional which produces
>>> flag.
>>>
>>> But I think you can get better code since you want to generate test and
>>> main point of having specialized CompareAndSwapP is to avoid test
>>> instruction.
>>>
>>> If we use StorePConditional instead of CompareAndSwapP we may remove
>>> second branch:
>>>
>>>>                     │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>>>                          ; CAS fail, jump to respin
>>>>                     │ ╰  0x00007fe618af412b: jeq   0x00007fe618af4082
>>>>                          ; CAS success, jump to store barrier
>>>>                          <UNCONDITIONAL STORE BARRIER HERE>
>>>
>>> But C2 changes will be much larger. We would need new Ideal::if_then()
>>> which take in result of StorePConditional and set load_store on both
>>> paths to 0/1.
>>>
>>> We may need to play with probability of if_then() to get barrier in
>>> follow code.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 8/10/15 2:13 AM, Aleksey Shipilev wrote:
>>>> Hi Vladimir!
>>>>
>>>> On 07/31/2015 06:03 AM, Vladimir Kozlov wrote:
>>>>> I think the test is wrong. It should be:
>>>>>
>>>>> if_then(load_store, BoolTest::eq, oldval, PROB_STATIC_FREQUENT);
>>>>
>>>> Um, no? I remember eyeballing the assembly to confirm this.
>>>>
>>>> For LS_cmpxchg, we are inlining "*boolean* cas(...)", so the load_store
>>>> seems to have a boolean value, but "oldval" is oop. In other words,
>>>> "load_store != 0" tests "(boolean)load_store != false".
>>>>
>>>> Current VM produces:
>>>>
>>>>    13.46%   45.85%    │││  0x00007fa06809cdaf: lock cmpxchg %r8d,(%rdi)
>>>>    14.91%    4.41%    │││  0x00007fa06809cdb4: sete   %r11b
>>>>     0.07%             │││  0x00007fa06809cdb8: movzbl %r11b,%r11d
>>>>                                 <UNCONDITIONAL STORE BARRIER HERE>
>>>>     0.06%             │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>>>
>>>>                            ; CAS fail, jump to respin
>>>>                       │╰│  0x00007fa06809cdd4: je    
>>>> 0x00007fa06809ccf0
>>>>
>>>>
>>>> Patched VM piggybacks on the same result:
>>>>
>>>>     1.97%    0.05%  │││  0x00007fe618af4115: lock cmpxchg %ebx,(%r9)
>>>>    50.59%   90.01%  │││  0x00007fe618af411a: sete   %r11b
>>>>     0.05%    0.01%  │││  0x00007fe618af411e: movzbl %r11b,%r11d
>>>>     3.02%    1.80%  │││  0x00007fe618af4122: test   %r11d,%r11d
>>>>
>>>>                          ; CAS success, jump to store barrier
>>>>                     │╰│  0x00007fe618af4125: jne    0x00007fe618af4070
>>>>
>>>>                          ; CAS fail, jump to respin
>>>>                     │ ╰  0x00007fe618af412b: jmpq   0x00007fe618af4082
>>>>
>>>>
>>>> Your suggestion seems to ignore the test completely (GVN helped?), and
>>>> while it's still technically correct with emitting the barrier always,
>>>> it defeats the purpose of the change:
>>>>
>>>>     2.35%    4.97%  │ ││  0x00007f7790aefd26: lock cmpxchg %r10d,(%rbx)
>>>>    48.54%   86.32%  │ ││  0x00007f7790aefd2b: mov    $0x1,%eax
>>>>     0.03%           │╭││  0x00007f7790aefd30: je     0x00007f7790aefd3b
>>>>                     ││││  0x00007f7790aefd36: mov    $0x0,%eax
>>>>                                 <UNCONDITIONAL STORE BARRIER HERE>
>>>>     2.16%    0.01%  │ ││  0x00007f7790aefd53: cmp    $0x0,%eax
>>>>
>>>>                           ; CAS fail, jump back to respin
>>>>                     │ ╰│  0x00007f7790aefd56: je     0x00007f7790aefd10
>>>>
>>>>                           ; CAS success, follow to exit
>>>>
>>>> Also, AFAIU, performance results would look different if we screwed the
>>>> success check. But they seem to be coherent with our expectations: when
>>>> CAS fails, either the conditional card marking or this change helps,
>>>> and
>>>> the change does not help when CAS succeeds.
>>>>
>>>> Thanks,
>>>> -Aleksey
>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 7/29/15 2:57 AM, Aleksey Shipilev wrote:
>>>>>> On 07/29/2015 12:24 PM, Andrew Dinn wrote:
>>>>>>> On 29/07/15 09:58, Aleksey Shipilev wrote:
>>>>>>>> I would like to suggest a fix for:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8019968
>>>>>>>
>>>>>>>> In short, current reference CAS intrinsic blindly emits
>>>>>>>> post_barrier, ignoring the CAS result. In some cases, notably
>>>>>>>> contended CAS spin-loops, we fail the CAS a lot, and thus go for a
>>>>>>>> post_barrier excessively. Instead, we can conditionalize on the
>>>>>>>> result of the store itself, and put the post_barrier only on
>>>>>>>> success path: http://cr.openjdk.java.net/~shade/8019968/webrev.01/
>>>>>>>
>>>>>>>> More performance results here:
>>>>>>>> http://cr.openjdk.java.net/~shade/8019968/notes.txt
>>>>>>>
>>>>>>> Nice! The code looks fine and your test results are very convincing.
>>>>>>> I'll be interested to see how this looks on AArch64.
>>>>>>
>>>>>> Thanks Andrew!
>>>>>>
>>>>>> The change passes JPRT, so AArch64 build is available. The benchmark
>>>>>> JAR
>>>>>> mentioned in the issue comments would run without intervention,
>>>>>> taking
>>>>>> around 40 minutes. You are very welcome to try, while Reviewers are
>>>>>> taking a look. I can do that only next week.
>>>>>>
>>>>>>> That said, I am afraid you still need a Reviewer!
>>>>>>
>>>>>> That reminds me I haven't spelled out what testing was done:
>>>>>>
>>>>>>     * JPRT on all open platforms
>>>>>>     * Targeted benchmarks
>>>>>>     * Eyeballing the generated x86 assembly
>>>>>>
>>>>>> Thanks,
>>>>>> -Aleksey
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20150812/12680595/signature.asc>


More information about the hotspot-compiler-dev mailing list