RFR: 8276096: Simplify Unsafe.{load|store}Fence fallbacks by delegating to fullFence
David Holmes
dholmes at openjdk.java.net
Mon Nov 1 02:18:13 UTC 2021
On Thu, 28 Oct 2021 08:47:31 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> `Unsafe.{load|store}Fence` fall back to `unsafe.cpp` for `OrderAccess::{acquire|release}Fence()`. It seems too heavy-handed (useless?) to call into the runtime for a single memory barrier. We can simplify the native `Unsafe` interface by falling back to `fullFence` when the `{load|store}Fence` intrinsics are not available. This would be similar to what `Unsafe.{loadLoad|storeStore}Fence` do.
>
> This is the behavior of these intrinsics now, on x86_64, using benchmarks from JDK-8276054:
>
>
> Benchmark          Mode  Cnt   Score   Error  Units
>
> # Default
> Single.acquire     avgt    3   0.407 ± 0.060  ns/op
> Single.full        avgt    3   4.693 ± 0.005  ns/op
> Single.loadLoad    avgt    3   0.415 ± 0.095  ns/op
> Single.plain       avgt    3   0.406 ± 0.002  ns/op
> Single.release     avgt    3   0.408 ± 0.047  ns/op
> Single.storeStore  avgt    3   0.408 ± 0.043  ns/op
>
> # -XX:DisableIntrinsic=_storeFence
> Single.acquire     avgt    3   0.408 ± 0.016  ns/op
> Single.full        avgt    3   4.694 ± 0.002  ns/op
> Single.loadLoad    avgt    3   0.406 ± 0.002  ns/op
> Single.plain       avgt    3   0.406 ± 0.001  ns/op
> Single.release     avgt    3   4.694 ± 0.003  ns/op  <--- upgraded to full
> Single.storeStore  avgt    3   4.690 ± 0.005  ns/op  <--- upgraded to full
>
> # -XX:DisableIntrinsic=_loadFence
> Single.acquire     avgt    3   4.691 ± 0.001  ns/op  <--- upgraded to full
> Single.full        avgt    3   4.693 ± 0.009  ns/op
> Single.loadLoad    avgt    3   4.693 ± 0.013  ns/op  <--- upgraded to full
> Single.plain       avgt    3   0.408 ± 0.072  ns/op
> Single.release     avgt    3   0.415 ± 0.016  ns/op
> Single.storeStore  avgt    3   0.416 ± 0.041  ns/op
>
> # -XX:DisableIntrinsic=_fullFence
> Single.acquire     avgt    3   0.406 ± 0.014  ns/op
> Single.full        avgt    3  15.836 ± 0.151  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ± 0.001  ns/op
> Single.plain       avgt    3   0.426 ± 0.361  ns/op
> Single.release     avgt    3   0.407 ± 0.021  ns/op
> Single.storeStore  avgt    3   0.410 ± 0.061  ns/op
>
> # -XX:DisableIntrinsic=_fullFence,_loadFence
> Single.acquire     avgt    3  15.822 ± 0.282  ns/op  <--- upgraded, calls runtime
> Single.full        avgt    3  15.851 ± 0.127  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3  15.829 ± 0.045  ns/op  <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ± 0.001  ns/op
> Single.release     avgt    3   0.414 ± 0.156  ns/op
> Single.storeStore  avgt    3   0.422 ± 0.452  ns/op
>
> # -XX:DisableIntrinsic=_fullFence,_storeFence
> Single.acquire     avgt    3   0.407 ± 0.016  ns/op
> Single.full        avgt    3  15.347 ± 6.783  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ± 0.001  ns/op
> Single.plain       avgt    3   0.406 ± 0.002  ns/op
> Single.release     avgt    3  15.828 ± 0.019  ns/op  <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.834 ± 0.045  ns/op  <--- upgraded, calls runtime
>
> # -XX:DisableIntrinsic=_fullFence,_loadFence,_storeFence
> Single.acquire     avgt    3  15.838 ± 0.030  ns/op  <--- upgraded, calls runtime
> Single.full        avgt    3  15.854 ± 0.277  ns/op  <--- calls runtime
> Single.loadLoad    avgt    3  15.826 ± 0.160  ns/op  <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ± 0.003  ns/op
> Single.release     avgt    3  15.838 ± 0.019  ns/op  <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.844 ± 0.104  ns/op  <--- upgraded, calls runtime
>
>
> Additional testing:
> - [x] Linux x86_64 fastdebug `tier1`
I'm not quite seeing the motivation here. Your claim is that the non-intrinsic implementations involve a native call, and that this is too expensive; yet the new code still relies on `fullFence` being intrinsified, otherwise it is still a native call, and now with a heavier barrier. If these fences were intrinsified piecemeal then perhaps this is an issue on some platform, but is that really the case? If you intrinsified one, wouldn't you intrinsify all?
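For reference, the delegation chain under discussion can be sketched roughly as below. This is only an illustrative model under stated assumptions, not the actual `Unsafe` source: the class name `FenceFallbackSketch` and the call counter are hypothetical, and the public `VarHandle.fullFence()` stands in for the real full barrier. It shows that, without intrinsics, every weaker fence ends up paying for whatever `fullFence` costs.

```java
import java.lang.invoke.VarHandle;

// Hypothetical model of the proposed fallbacks; not the real jdk.internal.misc.Unsafe.
public class FenceFallbackSketch {
    // Counts how often the full barrier is actually reached (illustration only).
    static int fullFenceCalls = 0;

    // Stand-in for Unsafe.fullFence(): the strongest barrier.
    static void fullFence() {
        fullFenceCalls++;
        VarHandle.fullFence();
    }

    // Proposed fallbacks: when not intrinsified, delegate to fullFence().
    static void loadFence()  { fullFence(); }
    static void storeFence() { fullFence(); }

    // The weaker fences delegate to loadFence()/storeFence() in turn.
    static void loadLoadFence()   { loadFence(); }
    static void storeStoreFence() { storeFence(); }

    public static void main(String[] args) {
        loadFence();
        storeFence();
        loadLoadFence();
        storeStoreFence();
        // All four calls bottom out in fullFence().
        System.out.println("fullFence calls: " + fullFenceCalls); // prints "fullFence calls: 4"
    }
}
```

In this model the cost of a non-intrinsified `loadFence` or `storeFence` is exactly the cost of `fullFence`, which is cheap only if `fullFence` is itself intrinsified.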
-------------
PR: https://git.openjdk.java.net/jdk/pull/6149