RFR: 8276096: Simplify Unsafe.{load|store}Fence fallbacks by delegating to fullFence

Mon Nov 1 02:18:13 UTC 2021

On Thu, 28 Oct 2021 08:47:31 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> `Unsafe.{load|store}Fence` falls back to `unsafe.cpp` for `OrderAccess::{acquire|release}Fence()`. It seems too heavy-handed (useless?) to call into the runtime for a single memory barrier. We can simplify the native `Unsafe` interface by falling back to `fullFence` when the `{load|store}Fence` intrinsics are not available. This would be similar to what `Unsafe.{loadLoad|storeStore}Fences` do.
> 
> This is the behavior of these intrinsics now, on x86_64, using benchmarks from JDK-8276054:
> 
> 
> Benchmark          Mode  Cnt  Score   Error  Units
> 
> # Default
> Single.acquire     avgt    3   0.407 ± 0.060  ns/op
> Single.full        avgt    3   4.693 ± 0.005  ns/op
> Single.loadLoad    avgt    3   0.415 ± 0.095  ns/op
> Single.plain       avgt    3   0.406 ± 0.002  ns/op
> Single.release     avgt    3   0.408 ± 0.047  ns/op
> Single.storeStore  avgt    3   0.408 ± 0.043  ns/op
> 
> # -XX:DisableIntrinsic=_storeFence
> Single.acquire     avgt    3   0.408 ± 0.016  ns/op
> Single.full        avgt    3   4.694 ± 0.002  ns/op
> Single.loadLoad    avgt    3   0.406 ± 0.002  ns/op
> Single.plain       avgt    3   0.406 ± 0.001  ns/op
> Single.release     avgt    3   4.694 ± 0.003  ns/op <--- upgraded to full
> Single.storeStore  avgt    3   4.690 ± 0.005  ns/op <--- upgraded to full
> 
> # -XX:DisableIntrinsic=_loadFence
> Single.acquire     avgt    3   4.691 ± 0.001  ns/op <--- upgraded to full
> Single.full        avgt    3   4.693 ± 0.009  ns/op
> Single.loadLoad    avgt    3   4.693 ± 0.013  ns/op <--- upgraded to full
> Single.plain       avgt    3   0.408 ± 0.072  ns/op
> Single.release     avgt    3   0.415 ± 0.016  ns/op
> Single.storeStore  avgt    3   0.416 ± 0.041  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence
> Single.acquire     avgt    3   0.406 ± 0.014  ns/op
> Single.full        avgt    3  15.836 ± 0.151  ns/op <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ± 0.001  ns/op
> Single.plain       avgt    3   0.426 ± 0.361  ns/op
> Single.release     avgt    3   0.407 ± 0.021  ns/op
> Single.storeStore  avgt    3   0.410 ± 0.061  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence,_loadFence
> Single.acquire     avgt    3  15.822 ± 0.282  ns/op <--- upgraded, calls runtime
> Single.full        avgt    3  15.851 ± 0.127  ns/op <--- calls runtime
> Single.loadLoad    avgt    3  15.829 ± 0.045  ns/op <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ± 0.001  ns/op
> Single.release     avgt    3   0.414 ± 0.156  ns/op
> Single.storeStore  avgt    3   0.422 ± 0.452  ns/op
> 
> # -XX:DisableIntrinsic=_fullFence,_storeFence
> Single.acquire     avgt    3   0.407 ± 0.016  ns/op
> Single.full        avgt    3  15.347 ± 6.783  ns/op <--- calls runtime
> Single.loadLoad    avgt    3   0.406 ± 0.001  ns/op
> Single.plain       avgt    3   0.406 ± 0.002  ns/op 
> Single.release     avgt    3  15.828 ± 0.019  ns/op <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.834 ± 0.045  ns/op <--- upgraded, calls runtime
> 
> # -XX:DisableIntrinsic=_fullFence,_loadFence,_storeFence
> Single.acquire     avgt    3  15.838 ± 0.030  ns/op <--- upgraded, calls runtime
> Single.full        avgt    3  15.854 ± 0.277  ns/op <--- calls runtime
> Single.loadLoad    avgt    3  15.826 ± 0.160  ns/op <--- upgraded, calls runtime
> Single.plain       avgt    3   0.406 ± 0.003  ns/op
> Single.release     avgt    3  15.838 ± 0.019  ns/op <--- upgraded, calls runtime
> Single.storeStore  avgt    3  15.844 ± 0.104  ns/op <--- upgraded, calls runtime
> 
> 
> Additional testing:
>  - [x] Linux x86_64 fastdebug `tier1`

I'm not quite seeing the motivation here. Your claim is that the non-intrinsic implementations involve a native call and so are too expensive; yet the new code still relies on fullFence being intrinsified, otherwise it is still a native call, and now with a heavier barrier. If these fences were intrinsified piecemeal then perhaps this is an issue on some platform, but is that really the case? If you intrinsified one, wouldn't you intrinsify all?
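
For reference, the fallback shape under discussion can be sketched roughly as below. This is only an illustration based on the PR description, not the actual `jdk.internal.misc.Unsafe` code: the class `FenceFallbackSketch` and its call counter are hypothetical, and `VarHandle.fullFence()` stands in for the real `Unsafe.fullFence()`.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for the Java-side fence fallbacks.
 * Not the actual jdk.internal.misc.Unsafe implementation.
 */
public final class FenceFallbackSketch {
    // Counts how often the full-fence path is taken (illustration only).
    private static final AtomicInteger FULL_FENCE_CALLS = new AtomicInteger();

    private FenceFallbackSketch() {}

    /** Stands in for Unsafe.fullFence(): either intrinsified or a native call. */
    public static void fullFence() {
        FULL_FENCE_CALLS.incrementAndGet();
        VarHandle.fullFence();
    }

    /** Non-intrinsified fallback: upgrade the acquire fence to a full fence. */
    public static void loadFence() {
        fullFence();
    }

    /** Non-intrinsified fallback: upgrade the release fence to a full fence. */
    public static void storeFence() {
        fullFence();
    }

    public static int fullFenceCalls() {
        return FULL_FENCE_CALLS.get();
    }
}
```

In this shape the fallbacks avoid a runtime call only if `fullFence` itself is intrinsified; if it is not, each `loadFence`/`storeFence` still reaches the native path, and with the stronger barrier.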

-------------

PR: https://git.openjdk.java.net/jdk/pull/6149