RFR (S) 8019968: Reference CAS induces GC store barrier even on failure

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Aug 11 09:22:58 UTC 2015


Hi Vladimir,

My previous disassembly demonstrated the code generated for CAS
spinloop. There, it's easy to confuse the "second" branch with a proper
backbranch in the loop. Here is the disassembly for the "one-off"
failing CAS with patched VM:

                  │   0x00007fa3acba446c: lock cmpxchg %r11d,(%r10)
 46.63%   83.18%  │   0x00007fa3acba4471: sete   %r8b
  0.03%           │   0x00007fa3acba4475: movzbl %r8b,%r8d
  2.23%           │   0x00007fa3acba4479: test   %r8d,%r8d <- removable?
                  │╭  0x00007fa3acba447c: je     0x00007fa3acba4490
                  ││  0x00007fa3acba447e: shr    $0x9,%r10
                  ││  0x00007fa3acba4482: movabs $0x7fa3a0dbf000,%r11
                  ││  0x00007fa3acba448c: mov    %r12b,(%r11,%r10,1)
  0.93%           │↘  0x00007fa3acba4490: mov    %r8d,%eax
  0.04%           │   0x00007fa3acba4493: add    $0x20,%rsp
  1.05%           │   0x00007fa3acba4497: pop    %rbp
  0.98%           │   0x00007fa3acba4498: test   %eax,0x11af2b62(%rip)
                  │
                  │   0x00007fa3acba449e: retq

...compare this to baseline VM that does an unconditional barrier:

  2.31%    3.64%  │  0x00007fcf595fd4f9: lock cmpxchg %r10d,(%r11)
 43.22%   78.37%  │  0x00007fcf595fd4fe: sete   %r8b
  0.04%           │  0x00007fcf595fd502: movzbl %r8b,%r8d
  2.20%           │  0x00007fcf595fd506: mov    %r11,%r10
                  │  0x00007fcf595fd509: shr    $0x9,%r10
                  │  0x00007fcf595fd50d: movabs $0x7fcf4dd0c000,%r11
                  │  0x00007fcf595fd517: mov    %r12b,(%r11,%r10,1)
  2.20%           │  0x00007fcf595fd51b: mov    %r8d,%eax
                  │  0x00007fcf595fd51e: add    $0x20,%rsp
                  │  0x00007fcf595fd522: pop    %rbp
  1.82%           │  0x00007fcf595fd523: test   %eax,0x12383ad7(%rip)
                  │
                  │  0x00007fcf595fd529: retq

Well, yeah, I can see that test at 0x00007fa3acba4479 is avoidable,
since cmpxchg already sets the flag. But, I doubt it actually matters,
since:
 a) test-je are routinely macrofused into single uop on modern x86;
 b) the flag is materialized in register anyway for method return;
 c) as you predicted, my quick exploration blows up considerably;

Notably, handling native oops require missing StoreNConditionalNode,
which spreads all the way to AD and various places in compiler that
match StorePConditionalNode. Also, my naive attempts of using Ideal to
pick up StoreNConditional result and produce 0/1 yields full branches,
not the "sete" that is coming from CompareAndSwapP AD encoding -- with
terrible performance results.

With that, I think we should play it safe, and push the existing
obviously correct version that improves performance a lot, instead of
blowing up the complexity for purely theoretical improvement.

Thanks,
-Aleksey

On 08/11/2015 05:21 AM, Vladimir Kozlov wrote:
> My bad, I forgot that CompareAndSwapP assembler code produces Boolean
> value in register. I mistook it for StorePConditional which produces flag.
> 
> But I think you can get better code since you want to generate test and
> main point of having specialized CompareAndSwapP is to avoid test
> instruction.
> 
> If we use StorePConditional instead of CompareAndSwapP we may remove
> second branch:
> 
>>                    │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>                         ; CAS fail, jump to respin
>>                    │ ╰  0x00007fe618af412b: jeq   0x00007fe618af4082
>>                         ; CAS success, jump to store barrier
>>                         <UNCONDITIONAL STORE BARRIER HERE>
> 
> But C2 changes will be much larger. We would need new Ideal::if_then()
> which take in result of StorePConditional and set load_store on both
> paths to 0/1.
> 
> We may need to play with probability of if_then() to get barrier in
> follow code.
> 
> Thanks,
> Vladimir
> 
> On 8/10/15 2:13 AM, Aleksey Shipilev wrote:
>> Hi Vladimir!
>>
>> On 07/31/2015 06:03 AM, Vladimir Kozlov wrote:
>>> I think the test is wrong. It should be:
>>>
>>> if_then(load_store, BoolTest::eq, oldval, PROB_STATIC_FREQUENT);
>>
>> Um, no? I remember eyeballing the assembly to confirm this.
>>
>> For LS_cmpxchg, we are inlining "*boolean* cas(...)", so the load_store
>> seems to have a boolean value, but "oldval" is oop. In other words,
>> "load_store != 0" tests "(boolean)load_store != false".
>>
>> Current VM produces:
>>
>>   13.46%   45.85%    │││  0x00007fa06809cdaf: lock cmpxchg %r8d,(%rdi)
>>   14.91%    4.41%    │││  0x00007fa06809cdb4: sete   %r11b
>>    0.07%             │││  0x00007fa06809cdb8: movzbl %r11b,%r11d
>>                                <UNCONDITIONAL STORE BARRIER HERE>
>>    0.06%             │││  0x00007fa06809cdd1: test   %r11d,%r11d
>>
>>                           ; CAS fail, jump to respin
>>                      │╰│  0x00007fa06809cdd4: je     0x00007fa06809ccf0
>>
>>
>> Patched VM piggybacks on the same result:
>>
>>    1.97%    0.05%  │││  0x00007fe618af4115: lock cmpxchg %ebx,(%r9)
>>   50.59%   90.01%  │││  0x00007fe618af411a: sete   %r11b
>>    0.05%    0.01%  │││  0x00007fe618af411e: movzbl %r11b,%r11d
>>    3.02%    1.80%  │││  0x00007fe618af4122: test   %r11d,%r11d
>>
>>                         ; CAS success, jump to store barrier
>>                    │╰│  0x00007fe618af4125: jne    0x00007fe618af4070
>>
>>                         ; CAS fail, jump to respin
>>                    │ ╰  0x00007fe618af412b: jmpq   0x00007fe618af4082
>>
>>
>> Your suggestion seems to ignore the test completely (GVN helped?), and
>> while it's still technically correct with emitting the barrier always,
>> it defeats the purpose of the change:
>>
>>    2.35%    4.97%  │ ││  0x00007f7790aefd26: lock cmpxchg %r10d,(%rbx)
>>   48.54%   86.32%  │ ││  0x00007f7790aefd2b: mov    $0x1,%eax
>>    0.03%           │╭││  0x00007f7790aefd30: je     0x00007f7790aefd3b
>>                    ││││  0x00007f7790aefd36: mov    $0x0,%eax
>>                                <UNCONDITIONAL STORE BARRIER HERE>
>>    2.16%    0.01%  │ ││  0x00007f7790aefd53: cmp    $0x0,%eax
>>
>>                          ; CAS fail, jump back to respin
>>                    │ ╰│  0x00007f7790aefd56: je     0x00007f7790aefd10
>>
>>                          ; CAS success, follow to exit
>>
>> Also, AFAIU, performance results would look different if we screwed the
>> success check. But they seem to be coherent with our expectations: when
>> CAS fails, either the conditional card marking or this change helps, and
>> the change does not help when CAS succeeds.
>>
>> Thanks,
>> -Aleksey
>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 7/29/15 2:57 AM, Aleksey Shipilev wrote:
>>>> On 07/29/2015 12:24 PM, Andrew Dinn wrote:
>>>>> On 29/07/15 09:58, Aleksey Shipilev wrote:
>>>>>> I would like to suggest a fix for:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8019968
>>>>>
>>>>>> In short, current reference CAS intrinsic blindly emits
>>>>>> post_barrier, ignoring the CAS result. In some cases, notably
>>>>>> contended CAS spin-loops, we fail the CAS a lot, and thus go for a
>>>>>> post_barrier excessively. Instead, we can conditionalize on the
>>>>>> result of the store itself, and put the post_barrier only on
>>>>>> success path: http://cr.openjdk.java.net/~shade/8019968/webrev.01/
>>>>>
>>>>>> More performance results here:
>>>>>> http://cr.openjdk.java.net/~shade/8019968/notes.txt
>>>>>
>>>>> Nice! The code looks fine and your test results are very convincing.
>>>>> I'll be interested to see how this looks on AArch64.
>>>>
>>>> Thanks Andrew!
>>>>
>>>> The change passes JPRT, so AArch64 build is available. The benchmark
>>>> JAR
>>>> mentioned in the issue comments would run without intervention, taking
>>>> around 40 minutes. You are very welcome to try, while Reviewers are
>>>> taking a look. I can do that only next week.
>>>>
>>>>> That said, I am afraid you still need a Reviewer!
>>>>
>>>> That reminds me I haven't spelled out what testing was done:
>>>>
>>>>    * JPRT on all open platforms
>>>>    * Targeted benchmarks
>>>>    * Eyeballing the generated x86 assembly
>>>>
>>>> Thanks,
>>>> -Aleksey
>>>>
>>>>
>>
>>


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20150811/61fc33e4/signature.asc>


More information about the hotspot-compiler-dev mailing list