Official support for Unsafe

Quân Anh Mai anhmdq at gmail.com
Mon Jan 15 18:11:12 UTC 2024


Roughly speaking, throughput is determined by how many independent
instructions a CPU can execute at the same time, and latency by how many
cycles it takes from when an instruction starts executing to when its
result becomes available.

Consider this sequence:

x1 = a1 + b1
x2 = a2 + b2
x3 = a3 + b3
x4 = a4 + b4

Since these instructions are all independent, all four of them can be
issued at the same time, quadrupling the throughput of the whole
sequence. At this point, the limiting factor is the number of execution
units capable of executing adds.

On the other hand, if we consider this sequence:

x1 = x0 + a1
x2 = x1 + a2
x3 = x2 + a3
x4 = x3 + a4

Here the second instruction depends on the result of the first, the third
on the second, and so on. The CPU cannot execute them in parallel and
must run them sequentially, so the limiting factor is the overall latency
of the instruction chain.
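
To make this concrete in Java, here is a minimal sketch of the two
patterns (hypothetical helper methods, not from the original benchmark):
a single accumulator forms one long dependency chain, while four
independent accumulators let the adds proceed in parallel:

static long sumChained(long[] a) {
    long s = 0;
    for (long x : a) {
        s += x;          // each add depends on the previous value of s
    }
    return s;
}

static long sumUnrolled(long[] a) {
    // Four independent chains; the CPU can issue these adds in
    // parallel, limited only by its execution ports.
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < a.length; i++) {
        s0 += a[i];      // remaining tail elements
    }
    return s0 + s1 + s2 + s3;
}

(The JIT may unroll or vectorize the first loop on its own, so the
measured gap between the two varies.)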

In general, a bounds check does not create any additional dependency
chain (only the control flow depends on it, and the branch predictor will
have taken the right path well before the check resolves), which means it
imposes no additional pressure when the program is bound by the
dependency chains of its variables. On the other hand, a bounds check
consumes execution ports, which means performance can degrade massively
when the program is bound by execution throughput.
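
As a rough illustration (my sketch of what the JIT conceptually emits for
an array load, not actual compiler output), each a[i] carries this extra
work on top of the load itself:

static int loadChecked(int[] a, int i) {
    int len = a.length;                        // memory load of the length
    if (Integer.compareUnsigned(i, len) >= 0)  // one unsigned compare + branch
        throw new ArrayIndexOutOfBoundsException(i);
    return a[i];                               // the element load itself
}

The compare and branch occupy execution ports even though, thanks to the
branch predictor, they almost never redirect control flow.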

In this example, when moving from VarHandle to using the Universe
segment, the program is bound by the execution ports, which leads to a
massive decrease in execution time (9ms -> 7.6ms); looking at perf stat,
the IPC stays fairly consistent at around 2.2 - 2.4. Removing more bounds
checks relieves the pressure on the execution ports and moves the
bottleneck to the instruction latencies. As a result, although the
instruction count decreases by about 20%, the execution time is only
reduced by around 5%, and the IPC dips to around 1.8.
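
For reference, here is a minimal sketch of the three access styles being
compared. This is my reconstruction, assuming the Universe segment means
a global segment obtained via MemorySegment.NULL.reinterpret; the names
and setup are hypothetical, not the original benchmark:

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AccessStyles {
    // 1. VarHandle: every access is bounds-checked against the array.
    static final VarHandle INT_ARRAY =
            MethodHandles.arrayElementVarHandle(int[].class);

    static int viaVarHandle(int[] a, int i) {
        return (int) INT_ARRAY.get(a, i);
    }

    // 2. "Universe" segment: spans all addressable memory, so the bounds
    //    check still executes but can never fail for a valid address.
    //    (reinterpret is a restricted method: --enable-native-access.)
    static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    static int viaSegment(long address) {
        return EVERYTHING.get(ValueLayout.JAVA_INT, address);
    }

    // 3. Unsafe: no checks at all (unsupported API, shown for contrast).
    static int viaUnsafe(sun.misc.Unsafe u, long address) {
        return u.getInt(address);
    }
}

Each step removes checking work, so fewer instructions compete for the
same execution ports.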

Of course, this is machine-dependent. My machine is a Zen 4, which is
more likely to have higher overall throughput: instructions have
generally been improved to run on more ports, while the shortest latency
for an instruction is already 1 cycle and cannot get any lower. The
turning point can come earlier on other machines, which would make Unsafe
more advantageous in comparison to the Universe segment approach.

PS1: This is simplified; in practice a program can have multiple parts
that are bound by different things, ranging from the decoder to the
scheduler and so on, and bounds checks will affect performance when the
resources they consume are the main bottleneck in most of those parts.
PS2: A bounds check is also not cheap: it often requires a memory load
(of the length), a compare-and-branch, and an extra arithmetic
instruction when the types of the array and the access do not match
(e.g., reading an int out of a byte[]).
PS3: This is of course my speculation, which may be completely incorrect,
so please take it with a grain of salt.

Best regards,
Quan Anh