RFR: 8272138: ZGC: Adopt release ordering for self-healing

Hao Tang github.com+7947546+tanghaoth90 at openjdk.java.net
Tue Aug 10 07:46:29 UTC 2021


On Mon, 9 Aug 2021 13:43:31 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> ZGC utilizes self-healing in its load barrier to fix bad references. Currently, this healing (ZBarrier::self_heal) adopts memory_order_conservative to guarantee that (1) the slow path (relocate, mark, etc., where addresses get healed) always happens before self-healing, and (2) another thread that accesses the same reference is able to access the healed address.
>> Let us consider memory_order_release for ZBarrier::self_heal. For example, Thread 1 is fixing a reference and Thread 2 attempts to access the same reference. There is a data dependency in Thread 2: the pointer is accessed before the object's content, which is equivalent to acquire semantics. Paired with the release semantics in self-healing, this establishes inter-thread acquire-release ordering. As a result, the two guarantees mentioned above are preserved by the acquire-release ordering.
>> We performed an experiment with the corretto/heapothesys benchmark on AArch64. The optimized version results in both (1) shorter average concurrent mark time and (2) shorter average concurrent relocation time. Furthermore, we observe shorter average latency in almost all test cases.
>> 
>> 
>> [root@localhost corretto]# grep "00.*Phase: Concurrent Mark           " *.log
>> baseline.log:[100.412s][info][gc,stats    ]       Phase: Concurrent Mark                             960.359 / 960.359     587.203 / 1248.362    587.203 / 1248.362    587.203 / 1248.362    ms
>> baseline.log:[200.411s][info][gc,stats    ]       Phase: Concurrent Mark                             116.748 / 116.748     656.777 / 1736.469    656.777 / 1736.469    656.777 / 1736.469    ms
>> baseline.log:[300.411s][info][gc,stats    ]       Phase: Concurrent Mark                             125.460 / 125.460     620.440 / 1736.469    620.440 / 1736.469    620.440 / 1736.469    ms
>> baseline.log:[400.411s][info][gc,stats    ]       Phase: Concurrent Mark                             935.295 / 935.295     673.080 / 1736.469    673.080 / 1736.469    673.080 / 1736.469    ms
>> baseline.log:[500.411s][info][gc,stats    ]       Phase: Concurrent Mark                            1448.705 / 1448.705    723.484 / 1814.849    723.484 / 1814.849    723.484 / 1814.849    ms
>> baseline.log:[600.411s][info][gc,stats    ]       Phase: Concurrent Mark                            1490.123 / 1490.123    796.960 / 1842.794    796.960 / 1842.794    796.960 / 1842.794    ms
>> baseline.log:[700.411s][info][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000       912.439 / 2183.065    867.799 / 2183.065    867.799 / 2183.065    ms
>> baseline.log:[800.412s][info][gc,stats    ]       Phase: Concurrent Mark                            1468.594 / 1468.594    990.044 / 2183.065    912.281 / 2183.065    912.281 / 2183.065    ms
>> baseline.log:[900.411s][info][gc,stats    ]       Phase: Concurrent Mark                             137.435 / 137.435    1109.116 / 2276.535    967.470 / 2276.535    967.470 / 2276.535    ms
>> baseline.log:[1000.411s][info][gc,stats    ]       Phase: Concurrent Mark                             184.093 / 184.093    1172.446 / 2276.535    997.343 / 2276.535    997.343 / 2276.535    ms
>> baseline.log:[1100.411s][info][gc,stats    ]       Phase: Concurrent Mark                            1537.673 / 1537.673   1211.815 / 2276.535   1013.076 / 2276.535   1013.076 / 2276.535    ms
>> baseline.log:[1200.412s][info][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000      1218.085 / 2276.535   1025.443 / 2276.535   1025.443 / 2276.535    ms
>> optimized.log:[100.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1053.065 / 1053.065    581.646 / 1249.822    581.646 / 1249.822    581.646 / 1249.822    ms
>> optimized.log:[200.423s][info][gc,stats    ]       Phase: Concurrent Mark                             885.795 / 885.795     573.650 / 1277.782    573.650 / 1277.782    573.650 / 1277.782    ms
>> optimized.log:[300.423s][info][gc,stats    ]       Phase: Concurrent Mark                             124.236 / 124.236     641.028 / 1828.124    641.028 / 1828.124    641.028 / 1828.124    ms
>> optimized.log:[400.423s][info][gc,stats    ]       Phase: Concurrent Mark                             875.383 / 875.383     666.465 / 1828.124    666.465 / 1828.124    666.465 / 1828.124    ms
>> optimized.log:[500.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1937.305 / 1937.305    754.228 / 1937.305    754.228 / 1937.305    754.228 / 1937.305    ms
>> optimized.log:[600.423s][info][gc,stats    ]       Phase: Concurrent Mark                             173.064 / 173.064     771.387 / 1937.305    771.387 / 1937.305    771.387 / 1937.305    ms
>> optimized.log:[700.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1832.584 / 1832.584    899.646 / 2048.471    856.838 / 2048.471    856.838 / 2048.471    ms
>> optimized.log:[800.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1510.755 / 1510.755    981.807 / 2048.471    893.373 / 2048.471    893.373 / 2048.471    ms
>> optimized.log:[900.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1472.737 / 1472.737   1044.755 / 2089.539    927.733 / 2089.539    927.733 / 2089.539    ms
>> optimized.log:[1000.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1513.077 / 1513.077   1095.827 / 2089.539    947.202 / 2089.539    947.202 / 2089.539    ms
>> optimized.log:[1100.423s][info][gc,stats    ]       Phase: Concurrent Mark                               0.000 / 0.000      1073.703 / 2089.539    943.684 / 2089.539    943.684 / 2089.539    ms
>> optimized.log:[1200.423s][info][gc,stats    ]       Phase: Concurrent Mark                            1337.865 / 1337.865   1119.936 / 2113.895    962.172 / 2113.895    962.172 / 2113.895    ms
>> 
>> [root@localhost corretto]# grep "00.*Phase: Concurrent Relocate           " *.log
>> baseline.log:[100.412s][info][gc,stats    ]       Phase: Concurrent Relocate                         196.522 / 196.522     114.318 / 245.371     114.318 / 245.371     114.318 / 245.371     ms
>> baseline.log:[200.411s][info][gc,stats    ]       Phase: Concurrent Relocate                          47.748 / 47.748      130.861 / 331.948     130.861 / 331.948     130.861 / 331.948     ms
>> baseline.log:[300.411s][info][gc,stats    ]       Phase: Concurrent Relocate                          56.922 / 56.922      129.174 / 331.948     129.174 / 331.948     129.174 / 331.948     ms
>> baseline.log:[400.411s][info][gc,stats    ]       Phase: Concurrent Relocate                         218.707 / 218.707     137.495 / 331.948     137.495 / 331.948     137.495 / 331.948     ms
>> baseline.log:[500.411s][info][gc,stats    ]       Phase: Concurrent Relocate                         197.166 / 197.166     144.216 / 359.644     144.216 / 359.644     144.216 / 359.644     ms
>> baseline.log:[600.411s][info][gc,stats    ]       Phase: Concurrent Relocate                         202.118 / 202.118     153.507 / 373.447     153.507 / 373.447     153.507 / 373.447     ms
>> baseline.log:[700.411s][info][gc,stats    ]       Phase: Concurrent Relocate                           0.000 / 0.000       172.241 / 395.113     164.291 / 395.113     164.291 / 395.113     ms
>> baseline.log:[800.412s][info][gc,stats    ]       Phase: Concurrent Relocate                         215.121 / 215.121     186.007 / 421.039     173.139 / 421.039     173.139 / 421.039     ms
>> baseline.log:[900.411s][info][gc,stats    ]       Phase: Concurrent Relocate                          48.550 / 48.550      203.420 / 421.982     181.899 / 421.982     181.899 / 421.982     ms
>> baseline.log:[1000.411s][info][gc,stats    ]       Phase: Concurrent Relocate                          53.847 / 53.847      211.774 / 421.982     185.728 / 421.982     185.728 / 421.982     ms
>> baseline.log:[1100.411s][info][gc,stats    ]       Phase: Concurrent Relocate                         224.489 / 224.489     218.195 / 431.088     188.087 / 431.088     188.087 / 431.088     ms
>> baseline.log:[1200.412s][info][gc,stats    ]       Phase: Concurrent Relocate                           0.000 / 0.000       222.852 / 431.088     191.130 / 431.088     191.130 / 431.088     ms
>> optimized.log:[100.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         193.811 / 193.811     113.043 / 248.471     113.043 / 248.471     113.043 / 248.471     ms
>> optimized.log:[200.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         196.220 / 196.220     117.810 / 248.471     117.810 / 248.471     117.810 / 248.471     ms
>> optimized.log:[300.423s][info][gc,stats    ]       Phase: Concurrent Relocate                          48.786 / 48.786      131.753 / 351.890     131.753 / 351.890     131.753 / 351.890     ms
>> optimized.log:[400.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         195.302 / 195.302     139.115 / 351.890     139.115 / 351.890     139.115 / 351.890     ms
>> optimized.log:[500.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         374.022 / 374.022     155.204 / 374.022     155.204 / 374.022     155.204 / 374.022     ms
>> optimized.log:[600.423s][info][gc,stats    ]       Phase: Concurrent Relocate                          49.222 / 49.222      159.444 / 400.795     159.444 / 400.795     159.444 / 400.795     ms
>> optimized.log:[700.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         381.072 / 381.072     182.488 / 409.086     173.140 / 409.086     173.140 / 409.086     ms
>> optimized.log:[800.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         223.399 / 223.399     191.774 / 409.086     175.748 / 409.086     175.748 / 409.086     ms
>> optimized.log:[900.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         214.184 / 214.184     201.526 / 409.086     181.302 / 409.086     181.302 / 409.086     ms
>> optimized.log:[1000.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         208.600 / 208.600     207.389 / 420.479     183.756 / 420.479     183.756 / 420.479     ms
>> optimized.log:[1100.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         209.444 / 209.444     202.367 / 420.479     183.173 / 420.479     183.173 / 420.479     ms
>> optimized.log:[1200.423s][info][gc,stats    ]       Phase: Concurrent Relocate                         223.841 / 223.841     206.268 / 420.479     185.074 / 420.479     185.074 / 420.479     ms
>> 
>> [root@localhost corretto]# grep "average latency:" nohup_baseline.out
>>         average latency: 2ms:40us
>>         average latency: 6ms:550us
>>         average latency: 6ms:543us
>>         average latency: 6ms:493us
>>         average latency: 928us
>>         average latency: 794us
>>         average latency: 1ms:403us
>>         average latency: 23ms:216us
>>         average latency: 775us
>> [root@localhost corretto]# grep "average latency:" nohup_optimized.out
>>         average latency: 2ms:48us
>>         average latency: 5ms:948us
>>         average latency: 5ms:940us
>>         average latency: 5ms:875us
>>         average latency: 850us
>>         average latency: 723us
>>         average latency: 1ms:221us
>>         average latency: 22ms:653us
>>         average latency: 693us
>
> A thread that copies the object and self heals will conceptually do the following, assuming relaxed memory ordering:
> 
> copy();
> release();
> cas_forwarding_table();
> cas_self_heal();
> 
> The release before CASing in the forwarding table acts as a release for both accesses in the scenario where the copy is being published. So in the scenario you describe, the release in the forwarding table is already enough to ensure that anyone reading the self-healed pointer is guaranteed not to observe bytes from before the copy. In the scenario where one thread performs the copy that gets published to the forwarding table, and another thread self-heals the pointer with the value acquired from the forwarding table, we will indeed not have a release to publish the pointer, only an acquire used to read from the forwarding table. However, this is fine, as the new MCA ARMv8 architecture does not allow causal consistency violations like WRC (cf. https://dl.acm.org/doi/pdf/10.1145/3158107 section 4). So we no longer need to use acquire/release to guarantee causal consistency across threads. This would naturally not hold for PPC, but there is no PPC port for ZGC yet.
> 
> It is interesting though that when loading a self-healed pointer, we do not perform any acquire. That is fine when dereferencing the loaded pointer, as a dependent load, eliding the need for an acquire. And that is indeed fine for the JIT compiled code, because we know it is always a dependent load (or safe in other ways). However, for the C++ code, we can not *guarantee* that there will be a dependent load in a spec-conforming way. That might be something to look into. In practice, there isn't any good reason why reading an oop and then dereferencing it wouldn't yield a dependent load, but the spec doesn't promise anything and could in theory allow compilers to mess this up. However, having an acquire for every oop load in the runtime does sound a bit costly. The memory_order_consume semantics were supposed to solve this, but I'm not sure if the compilers have yet become good at doing something useful with that, other than just having it be equivalent to acquire. Might be something to check out in the disassembly to see what it yields. But that is an exercise for another day, as this isn't an issue you are introducing with this patch.
> 
> Hope this helps explain my thoughts in more detail.

@fisk Hi, Erik. We are wondering whether a thread loading a healed pointer can observe that the corresponding copy has not finished yet. Assume relaxed ordering for `cas_self_heal`, and suppose both Thread A and Thread B load the same reference.

**Thread A**:  `load obj.fld; // will relocate the object referenced by obj.fld`
Thread A will do the following:

1    copy();
2    cas_forwarding_table(); // release
3    cas_self_heal(); // relaxed


**Thread B**:  `load obj.fld; // load the same reference`
Thread B may observe the following reordering of **Thread A**:

3    cas_self_heal(); // relaxed
1    copy();
2    cas_forwarding_table(); // release

To our knowledge, release ordering on _line 2_ does not prevent _line 3_ from being reordered before _line 1_, which indicates that the release in the forwarding table is not enough. Perhaps we need to add acquire ordering to _line 2_ or release ordering to _line 3_.
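
For concreteness, here is a minimal C++ sketch of the ordering concern, reasoning purely in terms of the C++ memory model. The names (`object_copy`, `forwarding_entry`, `field`) and the use of `std::atomic` are simplified, hypothetical stand-ins for illustration, not the actual ZGC data structures or HotSpot `Atomic` calls:

#include <atomic>
#include <cstdint>

// Hypothetical, simplified stand-ins for the real ZGC structures.
static int object_copy[16];                  // destination of copy()
std::atomic<uintptr_t> forwarding_entry{0};  // one forwarding table entry
std::atomic<uintptr_t> field{0};             // obj.fld, initially holds a "bad" pointer

// Thread A: relocates the object and self-heals the reference.
void thread_a(uintptr_t bad_ptr) {
  uintptr_t good_ptr = reinterpret_cast<uintptr_t>(object_copy);

  object_copy[0] = 42;                               // 1: copy()

  uintptr_t expected = 0;
  forwarding_entry.compare_exchange_strong(          // 2: cas_forwarding_table(), release
      expected, good_ptr, std::memory_order_release);

  field.compare_exchange_strong(                     // 3: cas_self_heal(), relaxed
      bad_ptr, good_ptr, std::memory_order_relaxed);
  // With relaxed ordering on 3, the release on 2 only pairs with an acquire
  // (or dependency) on forwarding_entry, not on field. Per the C++ model,
  // nothing orders 1 before 3 for a thread that only reads 'field'.
}

// Thread B: loads the same reference and dereferences it.
int thread_b(uintptr_t bad_ptr) {
  uintptr_t p = field.load(std::memory_order_relaxed);
  if (p != 0 && p != bad_ptr) {
    // Address-dependent load of the healed object. If 3 above is relaxed,
    // this read is not ordered after 1 and may observe pre-copy bytes.
    return reinterpret_cast<int*>(p)[0];
  }
  return -1;
}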

Put another way, as @weixlu said,
> Instead, it maybe serves as membar to block all the CASes afterwards.

relaxed ordering on _line 2_ combined with release ordering on _line 3_ can indeed ensure that Thread B always observes the object copy.
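
A sketch of that alternative, reusing the declarations from the previous sketch (again, simplified stand-ins rather than the actual ZBarrier::self_heal code):

// Thread A with the ordering discussed above: the release is moved from the
// forwarding-table CAS to the self-heal CAS.
void thread_a_alternative(uintptr_t bad_ptr) {
  uintptr_t good_ptr = reinterpret_cast<uintptr_t>(object_copy);

  object_copy[0] = 42;                               // 1: copy()

  uintptr_t expected = 0;
  forwarding_entry.compare_exchange_strong(          // 2: cas_forwarding_table(), relaxed
      expected, good_ptr, std::memory_order_relaxed);

  field.compare_exchange_strong(                     // 3: cas_self_heal(), release
      bad_ptr, good_ptr, std::memory_order_release);
  // The release on 3 orders both 1 and 2 before the self-heal store becomes
  // visible, so a thread that loads the healed pointer and then performs an
  // address-dependent load of the object cannot observe pre-copy bytes.
}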

Looking forward to your advice.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5046


