RFR (S) 8019968: Reference CAS induces GC store barrier even on failure
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Aug 11 15:58:31 UTC 2015
Thank you for doing additional experiments, Aleksey, and explanation.
Now I agree with your changes. Reviewed.
Thanks,
Vladimir
On 8/11/15 2:22 AM, Aleksey Shipilev wrote:
> Hi Vladimir,
>
> My previous disassembly demonstrated the code generated for CAS
> spinloop. There, it's easy to confuse the "second" branch with a proper
> backbranch in the loop. Here is the disassembly for the "one-off"
> failing CAS with patched VM:
>
> │ 0x00007fa3acba446c: lock cmpxchg %r11d,(%r10)
> 46.63% 83.18% │ 0x00007fa3acba4471: sete %r8b
> 0.03% │ 0x00007fa3acba4475: movzbl %r8b,%r8d
> 2.23% │ 0x00007fa3acba4479: test %r8d,%r8d <- removable?
> │╭ 0x00007fa3acba447c: je 0x00007fa3acba4490
> ││ 0x00007fa3acba447e: shr $0x9,%r10
> ││ 0x00007fa3acba4482: movabs $0x7fa3a0dbf000,%r11
> ││ 0x00007fa3acba448c: mov %r12b,(%r11,%r10,1)
> 0.93% │↘ 0x00007fa3acba4490: mov %r8d,%eax
> 0.04% │ 0x00007fa3acba4493: add $0x20,%rsp
> 1.05% │ 0x00007fa3acba4497: pop %rbp
> 0.98% │ 0x00007fa3acba4498: test %eax,0x11af2b62(%rip)
> │
> │ 0x00007fa3acba449e: retq
>
> ...compare this to baseline VM that does an unconditional barrier:
>
> 2.31% 3.64% │ 0x00007fcf595fd4f9: lock cmpxchg %r10d,(%r11)
> 43.22% 78.37% │ 0x00007fcf595fd4fe: sete %r8b
> 0.04% │ 0x00007fcf595fd502: movzbl %r8b,%r8d
> 2.20% │ 0x00007fcf595fd506: mov %r11,%r10
> │ 0x00007fcf595fd509: shr $0x9,%r10
> │ 0x00007fcf595fd50d: movabs $0x7fcf4dd0c000,%r11
> │ 0x00007fcf595fd517: mov %r12b,(%r11,%r10,1)
> 2.20% │ 0x00007fcf595fd51b: mov %r8d,%eax
> │ 0x00007fcf595fd51e: add $0x20,%rsp
> │ 0x00007fcf595fd522: pop %rbp
> 1.82% │ 0x00007fcf595fd523: test %eax,0x12383ad7(%rip)
> │
> │ 0x00007fcf595fd529: retq
>
> Well, yeah, I can see that test at 0x00007fa3acba4479 is avoidable,
> since cmpxchg already sets the flag. But, I doubt it actually matters,
> since:
> a) test-je are routinely macrofused into single uop on modern x86;
> b) the flag is materialized in register anyway for method return;
> c) as you predicted, my quick exploration blows up considerably;
>
> Notably, handling native oops require missing StoreNConditionalNode,
> which spreads all the way to AD and various places in compiler that
> match StorePConditionalNode. Also, my naive attempts of using Ideal to
> pick up StoreNConditional result and produce 0/1 yields full branches,
> not the "sete" that is coming from CompareAndSwapP AD encoding -- with
> terrible performance results.
>
> With that, I think we should play it safe, and push the existing
> obviously correct version that improves performance a lot, instead of
> blowing up the complexity for purely theoretical improvement.
>
> Thanks,
> -Aleksey
>
> On 08/11/2015 05:21 AM, Vladimir Kozlov wrote:
>> My bad, I forgot that CompareAndSwapP assembler code produces Boolean
>> value in register. I mistook it for StorePConditional which produces flag.
>>
>> But I think you can get better code since you want to generate test and
>> main point of having specialized CompareAndSwapP is to avoid test
>> instruction.
>>
>> If we use StorePConditional instead of CompareAndSwapP we may remove
>> second branch:
>>
>>> │││ 0x00007fa06809cdd1: test %r11d,%r11d
>>> ; CAS fail, jump to respin
>>> │ ╰ 0x00007fe618af412b: jeq 0x00007fe618af4082
>>> ; CAS success, jump to store barrier
>>> <UNCONDITIONAL STORE BARRIER HERE>
>>
>> But C2 changes will be much larger. We would need new Ideal::if_then()
>> which take in result of StorePConditional and set load_store on both
>> paths to 0/1.
>>
>> We may need to play with probability of if_then() to get barrier in
>> follow code.
>>
>> Thanks,
>> Vladimir
>>
>> On 8/10/15 2:13 AM, Aleksey Shipilev wrote:
>>> Hi Vladimir!
>>>
>>> On 07/31/2015 06:03 AM, Vladimir Kozlov wrote:
>>>> I think the test is wrong. It should be:
>>>>
>>>> if_then(load_store, BoolTest::eq, oldval, PROB_STATIC_FREQUENT);
>>>
>>> Um, no? I remember eyeballing the assembly to confirm this.
>>>
>>> For LS_cmpxchg, we are inlining "*boolean* cas(...)", so the load_store
>>> seems to have a boolean value, but "oldval" is oop. In other words,
>>> "load_store != 0" tests "(boolean)load_store != false".
>>>
>>> Current VM produces:
>>>
>>> 13.46% 45.85% │││ 0x00007fa06809cdaf: lock cmpxchg %r8d,(%rdi)
>>> 14.91% 4.41% │││ 0x00007fa06809cdb4: sete %r11b
>>> 0.07% │││ 0x00007fa06809cdb8: movzbl %r11b,%r11d
>>> <UNCONDITIONAL STORE BARRIER HERE>
>>> 0.06% │││ 0x00007fa06809cdd1: test %r11d,%r11d
>>>
>>> ; CAS fail, jump to respin
>>> │╰│ 0x00007fa06809cdd4: je 0x00007fa06809ccf0
>>>
>>>
>>> Patched VM piggybacks on the same result:
>>>
>>> 1.97% 0.05% │││ 0x00007fe618af4115: lock cmpxchg %ebx,(%r9)
>>> 50.59% 90.01% │││ 0x00007fe618af411a: sete %r11b
>>> 0.05% 0.01% │││ 0x00007fe618af411e: movzbl %r11b,%r11d
>>> 3.02% 1.80% │││ 0x00007fe618af4122: test %r11d,%r11d
>>>
>>> ; CAS success, jump to store barrier
>>> │╰│ 0x00007fe618af4125: jne 0x00007fe618af4070
>>>
>>> ; CAS fail, jump to respin
>>> │ ╰ 0x00007fe618af412b: jmpq 0x00007fe618af4082
>>>
>>>
>>> Your suggestion seems to ignore the test completely (GVN helped?), and
>>> while it's still technically correct with emitting the barrier always,
>>> it defeats the purpose of the change:
>>>
>>> 2.35% 4.97% │ ││ 0x00007f7790aefd26: lock cmpxchg %r10d,(%rbx)
>>> 48.54% 86.32% │ ││ 0x00007f7790aefd2b: mov $0x1,%eax
>>> 0.03% │╭││ 0x00007f7790aefd30: je 0x00007f7790aefd3b
>>> ││││ 0x00007f7790aefd36: mov $0x0,%eax
>>> <UNCONDITIONAL STORE BARRIER HERE>
>>> 2.16% 0.01% │ ││ 0x00007f7790aefd53: cmp $0x0,%eax
>>>
>>> ; CAS fail, jump back to respin
>>> │ ╰│ 0x00007f7790aefd56: je 0x00007f7790aefd10
>>>
>>> ; CAS success, follow to exit
>>>
>>> Also, AFAIU, performance results would look different if we screwed the
>>> success check. But they seem to be coherent with our expectations: when
>>> CAS fails, either the conditional card marking or this change helps, and
>>> the change does not help when CAS succeeds.
>>>
>>> Thanks,
>>> -Aleksey
>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 7/29/15 2:57 AM, Aleksey Shipilev wrote:
>>>>> On 07/29/2015 12:24 PM, Andrew Dinn wrote:
>>>>>> On 29/07/15 09:58, Aleksey Shipilev wrote:
>>>>>>> I would like to suggest a fix for:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8019968
>>>>>>
>>>>>>> In short, current reference CAS intrinsic blindly emits
>>>>>>> post_barrier, ignoring the CAS result. In some cases, notably
>>>>>>> contended CAS spin-loops, we fail the CAS a lot, and thus go for a
>>>>>>> post_barrier excessively. Instead, we can conditionalize on the
>>>>>>> result of the store itself, and put the post_barrier only on
>>>>>>> success path: http://cr.openjdk.java.net/~shade/8019968/webrev.01/
>>>>>>
>>>>>>> More performance results here:
>>>>>>> http://cr.openjdk.java.net/~shade/8019968/notes.txt
>>>>>>
>>>>>> Nice! The code looks fine and your test results are very convincing.
>>>>>> I'll be interested to see how this looks on AArch64.
>>>>>
>>>>> Thanks Andrew!
>>>>>
>>>>> The change passes JPRT, so AArch64 build is available. The benchmark
>>>>> JAR
>>>>> mentioned in the issue comments would run without intervention, taking
>>>>> around 40 minutes. You are very welcome to try, while Reviewers are
>>>>> taking a look. I can do that only next week.
>>>>>
>>>>>> That said, I am afraid you still need a Reviewer!
>>>>>
>>>>> That reminds me I haven't spelled out what testing was done:
>>>>>
>>>>> * JPRT on all open platforms
>>>>> * Targeted benchmarks
>>>>> * Eyeballing the generated x86 assembly
>>>>>
>>>>> Thanks,
>>>>> -Aleksey
>>>>>
>>>>>
>>>
>>>
>
>
More information about the hotspot-compiler-dev
mailing list