Official support for Unsafe proposal

Mon Jan 15 19:48:39 UTC 2024

Hi Panama folks,

I first announced this proposal on amber-dev, where it was suggested that I
move the discussion to this mailing list. Here is a summary of the
proposal.

Java has made much progress in providing powerful alternatives to
sun.misc.Unsafe that still ensure safe behaviour. However, one important
use case of Unsafe cannot be substituted with a safe alternative: the
ability to access memory in an unsafe manner.

For the vast majority of cases, bounds checks are either eliminated by the
compiler or negligibly cheap. However, there are always exceptions.

- The compiler can theoretically eliminate many bounds checks, but there
are cases where it cannot do anything. One example is a function that is
called inside a hot loop but is not inlined: from the perspective of the
function, each check happens only once and there is nowhere to hoist it to,
but from the perspective of the program, numerous bounds checks are
executed inside the hot loop. Another example is an access whose index
cannot be reasoned about from the surrounding context: the compiler must
then perform a bounds check.
- A bounds check is not always cheap. It often consists of a memory load, a
compare and jump, and an extra arithmetic instruction if the types of the
container and the access do not match. Although it may have no noticeable
effect if the program is latency-bound, it can result in a massive
regression if the program's bottleneck is in the decoder or the execution
ports. The issue is not only that the effect may be large, but also that it
is unpredictable.
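
To illustrate the second example in the first point, here is a minimal
sketch in Java. The class and method names are hypothetical (they come from
neither 1brc nor the proposal), and whether a particular JIT compiler
actually removes a given check depends on the JVM version and flags.

```java
// Hypothetical illustration; not taken from 1brc or from the proposal.
final class BoundsCheckSketch {
    private BoundsCheckSketch() {}

    // The index is the loop counter, which stays within [0, data.length).
    // A JIT compiler can typically prove this and drop or hoist the check.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // The index into data is itself loaded from memory, so nothing in the
    // surrounding code bounds it. Each data[...] access needs its own check.
    static long sumIndirect(int[] data, int[] indices) {
        long sum = 0;
        for (int i = 0; i < indices.length; i++) {
            sum += data[indices[i]];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {10, 20, 30};
        System.out.println(sumSequential(data));                   // 60
        System.out.println(sumIndirect(data, new int[] {2, 0, 2})); // 70
    }
}
```

In sumIndirect, the check on indices[i] is of the same kind as in
sumSequential, but the check on data[indices[i]] depends on values the
compiler cannot see, so it stays in the loop body.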

As a single data point, here are three variants of my 1brc submission. They
use the same approach and differ only in how the accesses are done:

- Using Unsafe [1]:
  Instruction count: 1.1e11 (1e9 lines)
  Compiled code run time: 7.422 ± 0.093 ms (1e6 lines)
- Using the "everything" segment trick [2]:
  Instruction count: 1.4e11 (1e9 lines)
  Compiled code run time: 7.686 ± 0.181 ms (1e6 lines)
- Using safe accesses [3]:
  Instruction count: 2e11 (1e9 lines)
  Compiled code run time: 9.009 ± 0.058 ms (1e6 lines)

Looking at other languages, C++ is unchecked by default, and C#, Go, and
even Rust all give programmers the ability to access memory in an unsafe
manner when the need arises. This suggests that unsafe accesses are a real
necessity.

My proposal is to introduce a class java.lang.Unsafe that provides utility
methods such as `static int arrayLoadUnchecked(int[], int)`. This method
loads an element of an array at the specified index, assuming that the
array is not null and the index is not out of bounds. By default, if one of
these assumptions is violated, the method throws an AssertionError.
However, if the flag --enable-unsafe-access=<module-name> is provided when
starting the program, the compiler is allowed to elide the checks, which
makes the access truly unchecked.
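
As a rough sketch of the default (checked) behaviour described above, the
method could look like the following in plain Java. The class name
UnsafeSketch is a hypothetical stand-in for the proposed java.lang.Unsafe,
the error messages are my own assumption, and nothing here elides any
check; that part would be up to the compiler.

```java
// Hypothetical stand-in for the proposed java.lang.Unsafe; not a JDK class.
final class UnsafeSketch {
    private UnsafeSketch() {}

    // Default behaviour per the proposal: a violated assumption raises an
    // AssertionError. The message texts are illustrative assumptions.
    static int arrayLoadUnchecked(int[] array, int index) {
        if (array == null) {
            throw new AssertionError("array is null");
        }
        if (index < 0 || index >= array.length) {
            throw new AssertionError(
                "index " + index + " out of bounds for length " + array.length);
        }
        return array[index];
    }

    public static void main(String[] args) {
        int[] arr = {10, 20, 30};
        System.out.println(arrayLoadUnchecked(arr, 1)); // same value as arr[1]
        try {
            arrayLoadUnchecked(arr, 3); // violates the bounds assumption
        } catch (AssertionError e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

For any valid access the result is the same as `arr[i]`, so a caller cannot
observe a difference unless one of the assumptions is violated.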

This is different from --enable-native-access because, functionally, a
valid unchecked access and a valid checked access are equivalent, which
makes it possible to replace an unchecked access with a checked access
without compromising the functionality of the program. This approach has
some benefits. Firstly, it allows libraries to avoid forcing the use of
--enable-unsafe-access on their users. Secondly, a library may be a
performance-critical component in some programs but not in others, and
this solution lets only the programs that need it enable the library's
unchecked accesses. From the perspective of a program, this minimises
risk, as modules outside the critical sections still perform bounds checks
as normal.

This proposal is not without concerns. The first is the unsafety of the
feature itself, as an unchecked access can potentially crash the program,
silently corrupt the process memory, or worse, result in program
miscompilation. This is unavoidable given the unsafe nature of the
proposal. However, the risk is minimised since this feature would only be
used in very limited circumstances, and even then, the risk is present only
in a limited range of applications. The second concern is cultural:
developers may use unsafe accesses recklessly and carelessly in their code.
I think this is a valid concern, but not one supported by evidence. As it
is much more readable and easier to write `arr[i]` than
`Unsafe.arrayLoadUnchecked(arr, i)`, there is little chance that developers
will recklessly use unchecked accesses, especially given Java's existing
culture of checked accesses. As evidence, other languages that provide
unsafe capabilities as non-default (C#, Go, and Rust) do not seem to have
issues with developers recklessly utilising them, even after many years of
use. The third concern is the burden of maintenance. I have thought about
it and made a very minimal prototype of the feature [4]. My idea is that
the accesses can be implemented purely in Java, and C2 will intercept them
and remove the checks. This will mostly be delegated to routines that
already exist in C2, which minimises the maintenance overhead.

I expect this feature to be used only in performance-sensitive situations
where every bounds check counts because the checks accumulate quickly, such
as in a JSON parsing library. This has cascading effects, as other
libraries and programs can benefit from the improved performance if and
only if the need arises.

This is my summary and rewrite of the proposal. Please let me know if you
have any ideas or concerns. Thanks a lot,
Quan Anh

[1]: https://github.com/merykitty/1brc/tree/main
[2]: https://github.com/merykitty/1brc/tree/removeunsafe
[3]: https://github.com/merykitty/1brc/tree/varhandles
[4]: https://github.com/openjdk/jdk/compare/master...merykitty:unsafe?expand=1