RFR: 8272138: ZGC: Adopt release ordering for self-healing
Hao Tang
github.com+7947546+tanghaoth90 at openjdk.java.net
Tue Aug 10 07:46:29 UTC 2021
On Mon, 9 Aug 2021 13:43:31 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:
>> ZGC utilizes self-healing in its load barrier to fix bad references. Currently, this fixing (`ZBarrier::self_heal`) adopts `memory_order_conservative` to guarantee that (1) the slow path (relocate, mark, etc., where addresses get healed) always happens before self-healing, and (2) any other thread that accesses the same reference is able to access the healed address.
>> Let us consider `memory_order_release` for `ZBarrier::self_heal`. For example, Thread 1 is fixing a reference while Thread 2 attempts to access the same reference. There is a data dependency in Thread 2: the access of the pointer happens before the access of the object's content, which is equivalent to acquire semantics. Paired with the release semantics in self-healing, this establishes inter-thread acquire-release memory ordering. As a result, the two guarantees mentioned above are preserved by the acquire-release ordering.
>> We performed an experiment with the corretto/heapothesys benchmark on AArch64. The optimized version shows both (1) shorter average concurrent mark time and (2) shorter average concurrent relocation time. Furthermore, we observe shorter average latency in almost all test cases.
>>
>>
>> [root at localhost corretto]# grep "00.*Phase: Concurrent Mark " *.log
>> baseline.log:[100.412s][info][gc,stats ] Phase: Concurrent Mark 960.359 / 960.359 587.203 / 1248.362 587.203 / 1248.362 587.203 / 1248.362 ms
>> baseline.log:[200.411s][info][gc,stats ] Phase: Concurrent Mark 116.748 / 116.748 656.777 / 1736.469 656.777 / 1736.469 656.777 / 1736.469 ms
>> baseline.log:[300.411s][info][gc,stats ] Phase: Concurrent Mark 125.460 / 125.460 620.440 / 1736.469 620.440 / 1736.469 620.440 / 1736.469 ms
>> baseline.log:[400.411s][info][gc,stats ] Phase: Concurrent Mark 935.295 / 935.295 673.080 / 1736.469 673.080 / 1736.469 673.080 / 1736.469 ms
>> baseline.log:[500.411s][info][gc,stats ] Phase: Concurrent Mark 1448.705 / 1448.705 723.484 / 1814.849 723.484 / 1814.849 723.484 / 1814.849 ms
>> baseline.log:[600.411s][info][gc,stats ] Phase: Concurrent Mark 1490.123 / 1490.123 796.960 / 1842.794 796.960 / 1842.794 796.960 / 1842.794 ms
>> baseline.log:[700.411s][info][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 912.439 / 2183.065 867.799 / 2183.065 867.799 / 2183.065 ms
>> baseline.log:[800.412s][info][gc,stats ] Phase: Concurrent Mark 1468.594 / 1468.594 990.044 / 2183.065 912.281 / 2183.065 912.281 / 2183.065 ms
>> baseline.log:[900.411s][info][gc,stats ] Phase: Concurrent Mark 137.435 / 137.435 1109.116 / 2276.535 967.470 / 2276.535 967.470 / 2276.535 ms
>> baseline.log:[1000.411s][info][gc,stats ] Phase: Concurrent Mark 184.093 / 184.093 1172.446 / 2276.535 997.343 / 2276.535 997.343 / 2276.535 ms
>> baseline.log:[1100.411s][info][gc,stats ] Phase: Concurrent Mark 1537.673 / 1537.673 1211.815 / 2276.535 1013.076 / 2276.535 1013.076 / 2276.535 ms
>> baseline.log:[1200.412s][info][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 1218.085 / 2276.535 1025.443 / 2276.535 1025.443 / 2276.535 ms
>> optimized.log:[100.423s][info][gc,stats ] Phase: Concurrent Mark 1053.065 / 1053.065 581.646 / 1249.822 581.646 / 1249.822 581.646 / 1249.822 ms
>> optimized.log:[200.423s][info][gc,stats ] Phase: Concurrent Mark 885.795 / 885.795 573.650 / 1277.782 573.650 / 1277.782 573.650 / 1277.782 ms
>> optimized.log:[300.423s][info][gc,stats ] Phase: Concurrent Mark 124.236 / 124.236 641.028 / 1828.124 641.028 / 1828.124 641.028 / 1828.124 ms
>> optimized.log:[400.423s][info][gc,stats ] Phase: Concurrent Mark 875.383 / 875.383 666.465 / 1828.124 666.465 / 1828.124 666.465 / 1828.124 ms
>> optimized.log:[500.423s][info][gc,stats ] Phase: Concurrent Mark 1937.305 / 1937.305 754.228 / 1937.305 754.228 / 1937.305 754.228 / 1937.305 ms
>> optimized.log:[600.423s][info][gc,stats ] Phase: Concurrent Mark 173.064 / 173.064 771.387 / 1937.305 771.387 / 1937.305 771.387 / 1937.305 ms
>> optimized.log:[700.423s][info][gc,stats ] Phase: Concurrent Mark 1832.584 / 1832.584 899.646 / 2048.471 856.838 / 2048.471 856.838 / 2048.471 ms
>> optimized.log:[800.423s][info][gc,stats ] Phase: Concurrent Mark 1510.755 / 1510.755 981.807 / 2048.471 893.373 / 2048.471 893.373 / 2048.471 ms
>> optimized.log:[900.423s][info][gc,stats ] Phase: Concurrent Mark 1472.737 / 1472.737 1044.755 / 2089.539 927.733 / 2089.539 927.733 / 2089.539 ms
>> optimized.log:[1000.423s][info][gc,stats ] Phase: Concurrent Mark 1513.077 / 1513.077 1095.827 / 2089.539 947.202 / 2089.539 947.202 / 2089.539 ms
>> optimized.log:[1100.423s][info][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 1073.703 / 2089.539 943.684 / 2089.539 943.684 / 2089.539 ms
>> optimized.log:[1200.423s][info][gc,stats ] Phase: Concurrent Mark 1337.865 / 1337.865 1119.936 / 2113.895 962.172 / 2113.895 962.172 / 2113.895 ms
>>
>> [root at localhost corretto]# grep "00.*Phase: Concurrent Relocate " *.log
>> baseline.log:[100.412s][info][gc,stats ] Phase: Concurrent Relocate 196.522 / 196.522 114.318 / 245.371 114.318 / 245.371 114.318 / 245.371 ms
>> baseline.log:[200.411s][info][gc,stats ] Phase: Concurrent Relocate 47.748 / 47.748 130.861 / 331.948 130.861 / 331.948 130.861 / 331.948 ms
>> baseline.log:[300.411s][info][gc,stats ] Phase: Concurrent Relocate 56.922 / 56.922 129.174 / 331.948 129.174 / 331.948 129.174 / 331.948 ms
>> baseline.log:[400.411s][info][gc,stats ] Phase: Concurrent Relocate 218.707 / 218.707 137.495 / 331.948 137.495 / 331.948 137.495 / 331.948 ms
>> baseline.log:[500.411s][info][gc,stats ] Phase: Concurrent Relocate 197.166 / 197.166 144.216 / 359.644 144.216 / 359.644 144.216 / 359.644 ms
>> baseline.log:[600.411s][info][gc,stats ] Phase: Concurrent Relocate 202.118 / 202.118 153.507 / 373.447 153.507 / 373.447 153.507 / 373.447 ms
>> baseline.log:[700.411s][info][gc,stats ] Phase: Concurrent Relocate 0.000 / 0.000 172.241 / 395.113 164.291 / 395.113 164.291 / 395.113 ms
>> baseline.log:[800.412s][info][gc,stats ] Phase: Concurrent Relocate 215.121 / 215.121 186.007 / 421.039 173.139 / 421.039 173.139 / 421.039 ms
>> baseline.log:[900.411s][info][gc,stats ] Phase: Concurrent Relocate 48.550 / 48.550 203.420 / 421.982 181.899 / 421.982 181.899 / 421.982 ms
>> baseline.log:[1000.411s][info][gc,stats ] Phase: Concurrent Relocate 53.847 / 53.847 211.774 / 421.982 185.728 / 421.982 185.728 / 421.982 ms
>> baseline.log:[1100.411s][info][gc,stats ] Phase: Concurrent Relocate 224.489 / 224.489 218.195 / 431.088 188.087 / 431.088 188.087 / 431.088 ms
>> baseline.log:[1200.412s][info][gc,stats ] Phase: Concurrent Relocate 0.000 / 0.000 222.852 / 431.088 191.130 / 431.088 191.130 / 431.088 ms
>> optimized.log:[100.423s][info][gc,stats ] Phase: Concurrent Relocate 193.811 / 193.811 113.043 / 248.471 113.043 / 248.471 113.043 / 248.471 ms
>> optimized.log:[200.423s][info][gc,stats ] Phase: Concurrent Relocate 196.220 / 196.220 117.810 / 248.471 117.810 / 248.471 117.810 / 248.471 ms
>> optimized.log:[300.423s][info][gc,stats ] Phase: Concurrent Relocate 48.786 / 48.786 131.753 / 351.890 131.753 / 351.890 131.753 / 351.890 ms
>> optimized.log:[400.423s][info][gc,stats ] Phase: Concurrent Relocate 195.302 / 195.302 139.115 / 351.890 139.115 / 351.890 139.115 / 351.890 ms
>> optimized.log:[500.423s][info][gc,stats ] Phase: Concurrent Relocate 374.022 / 374.022 155.204 / 374.022 155.204 / 374.022 155.204 / 374.022 ms
>> optimized.log:[600.423s][info][gc,stats ] Phase: Concurrent Relocate 49.222 / 49.222 159.444 / 400.795 159.444 / 400.795 159.444 / 400.795 ms
>> optimized.log:[700.423s][info][gc,stats ] Phase: Concurrent Relocate 381.072 / 381.072 182.488 / 409.086 173.140 / 409.086 173.140 / 409.086 ms
>> optimized.log:[800.423s][info][gc,stats ] Phase: Concurrent Relocate 223.399 / 223.399 191.774 / 409.086 175.748 / 409.086 175.748 / 409.086 ms
>> optimized.log:[900.423s][info][gc,stats ] Phase: Concurrent Relocate 214.184 / 214.184 201.526 / 409.086 181.302 / 409.086 181.302 / 409.086 ms
>> optimized.log:[1000.423s][info][gc,stats ] Phase: Concurrent Relocate 208.600 / 208.600 207.389 / 420.479 183.756 / 420.479 183.756 / 420.479 ms
>> optimized.log:[1100.423s][info][gc,stats ] Phase: Concurrent Relocate 209.444 / 209.444 202.367 / 420.479 183.173 / 420.479 183.173 / 420.479 ms
>> optimized.log:[1200.423s][info][gc,stats ] Phase: Concurrent Relocate 223.841 / 223.841 206.268 / 420.479 185.074 / 420.479 185.074 / 420.479 ms
>>
>> [root at localhost corretto]# grep "average latency:" nohup_baseline.out
>> average latency: 2ms:40us
>> average latency: 6ms:550us
>> average latency: 6ms:543us
>> average latency: 6ms:493us
>> average latency: 928us
>> average latency: 794us
>> average latency: 1ms:403us
>> average latency: 23ms:216us
>> average latency: 775us
>> [root at localhost corretto]# grep "average latency:" nohup_optimized.out
>> average latency: 2ms:48us
>> average latency: 5ms:948us
>> average latency: 5ms:940us
>> average latency: 5ms:875us
>> average latency: 850us
>> average latency: 723us
>> average latency: 1ms:221us
>> average latency: 22ms:653us
>> average latency: 693us
>
> A thread that copies the object and self heals will conceptually do the following, assuming relaxed memory ordering:
>
> copy();
> release();
> cas_forwarding_table();
> cas_self_heal();
>
> The release before CASing into the forwarding table acts as a release for both accesses in the scenario where the copy is being published. So in the scenario you describe, the release in the forwarding table is already enough to ensure that anyone reading the self-healed pointer is guaranteed not to observe bytes from before the copy. In the scenario where one thread performs the copy that gets published to the forwarding table, and another thread self-heals the pointer with the value acquired from the forwarding table, we will indeed not have a release to publish the pointer, only an acquire used to read from the forwarding table. However, this is fine, as the new MCA ARMv8 architecture does not allow causal consistency violations like WRC (cf. https://dl.acm.org/doi/pdf/10.1145/3158107, section 4). So we no longer need to use acquire/release to guarantee causal consistency across threads. This would naturally not hold for PPC, but there is no PPC port for ZGC yet.
>
> It is interesting, though, that when loading a self-healed pointer, we do not perform any acquire. That is fine when dereferencing the loaded pointer, as it is a dependent load, eliding the need for an acquire. And that is indeed fine for the JIT-compiled code, because we know it is always a dependent load (or safe in other ways). However, for the C++ code, we can not *guarantee* that there will be a dependent load in a spec-conforming way. That might be something to look into. In practice, there isn't any good reason why reading an oop and then dereferencing it wouldn't yield a dependent load, but the spec doesn't promise anything and could in theory allow compilers to mess this up. However, having an acquire for every oop load in the runtime does sound a bit costly. The memory_order_consume semantics were supposed to solve this, but I'm not sure if the compilers have yet become good at doing something useful with that, other than just having it be equivalent to acquire. Might be something to check out in the disassembly to see what it yields. But that is an exercise for another day, as this isn't an issue you are introducing with this patch.
>
> Hope this helps explain my thoughts in more detail.
@fisk Hi, Erik. We are wondering whether a thread loading a healed pointer can observe that the corresponding copy has not yet finished. Assuming relaxed ordering for `cas_self_heal`, both Thread A and Thread B load the same reference.
**Thread A**: `load obj.fld; // will relocate the object referenced by obj.fld`
Thread A will do the following:
1 copy();
2 cas_forwarding_table(); // release
3 cas_self_heal(); // relaxed
**Thread B**: `load obj.fld; // load the same reference`
Thread B may observe the following reordering of **Thread A**'s operations:
3 cas_self_heal(); // relaxed
1 copy();
2 cas_forwarding_table(); // release
To our knowledge, the release ordering on _line 2_ does not prevent _line 3_ from being reordered before _line 1_, which indicates that the release in the forwarding table is not enough. Perhaps we need to add acquire ordering to _line 2_ or release ordering to _line 3_.
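The concern can be illustrated with a small standalone C++ sketch. The names here are hypothetical stand-ins for the structures discussed above, not the actual HotSpot code, and plain atomic stores model the CASes for simplicity:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins (not HotSpot code); plain stores model the CASes.
std::atomic<int> forwarding_entry{0}; // forwarding table slot
std::atomic<int> healed_ref{0};       // the self-healed reference
int object_copy = 0;                  // non-atomic object payload

// Thread A, mirroring steps 1-3 above. The release on step 2 orders step 1
// before step 2, but places no constraint on step 3: a release operation
// does not prevent *later* stores from being reordered before it.
void relocate_and_heal() {
  object_copy = 42;                                     // 1: copy()
  forwarding_entry.store(1, std::memory_order_release); // 2: cas_forwarding_table()
  healed_ref.store(1, std::memory_order_relaxed);       // 3: cas_self_heal()
}
```

On a weakly ordered machine such as AArch64, another thread may therefore observe `healed_ref == 1` while `object_copy` still holds its old value.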
Put another way, as @weixlu said,
> Instead, it maybe serves as membar to block all the CASes afterwards.
relaxed ordering on _line 2_ along with release ordering on _line 3_ can indeed ensure that Thread B always observes the object copy.
Looking forward to your advice.
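For reference, the ordering proposed above (relaxed on the forwarding-table CAS, release on the self-heal CAS) would look roughly like this in the same hypothetical sketch, again with plain stores standing in for the CASes:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins (not HotSpot code); plain stores model the CASes.
std::atomic<int> forwarding_entry{0}; // forwarding table slot
std::atomic<int> healed_ref{0};       // the self-healed reference
int object_copy = 0;                  // non-atomic object payload

// Thread A with the proposed ordering: the release on step 3 now orders
// both step 1 and step 2 before the self-heal becomes visible to others.
void relocate_and_heal_proposed() {
  object_copy = 42;                                     // 1: copy()
  forwarding_entry.store(1, std::memory_order_relaxed); // 2: cas_forwarding_table()
  healed_ref.store(1, std::memory_order_release);       // 3: cas_self_heal()
}

// Thread B's load of the healed reference followed by a dereference is a
// dependent load, which pairs with the release above, so observing the
// healed pointer guarantees that the copy (and the forwarding-table entry)
// is visible.
```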
-------------
PR: https://git.openjdk.java.net/jdk/pull/5046
More information about the hotspot-gc-dev mailing list