<div dir="ltr">Roughly speaking, throughput is determined by how many instructions a CPU can execute at the same time, and latency is determined by how many cycles it takes from when an instruction is executed to when its results are available.<div><br></div><div>Consider this sequence:</div><div><br></div><div>x1 = a1 + b1</div><div>x2 = a2 + b2</div><div>x3 = a3 + b3</div><div>x4 = a4 + b4</div><div><br></div><div>Since all these instructions are independent, all 4 of these can be issued at the same time, which leads to quadrupling the throughput of the whole sequence. At this point, the limiting factor is the number of execution units that are capable of executing adds.</div><div><br></div><div>On the other hand, if we consider this sequence:</div><div><br></div><div>x1 = x0 + a1</div><div>x2 = x1 + a2</div><div>x3 = x2 + a3</div><div>x4 = x3 + a4</div><div><br></div><div>The second instruction depends on the result of the first, and the third depends on the sequence. In this case, the CPU cannot execute them in parallel and must do so sequentially. As a result, the limiting factor is the overall latency of the instruction sequence.<br></div><div><br></div><div>In general, a bound check does not create any additional dependency chain (only the control flow depends on it, but the branch predictor will take the right path way before that), which means that it will not impose any additional pressure if the program is bounded by the dependencies of the variables. On the other hand, a bound check consumes the execution ports, which means that the performance can degrade massively if the program is bounded by the execution throughput.</div><div><br></div><div>In this example, when moving from VarHandle to using Universe segment, the program is bounded by the execution ports, which leads to a massive decrease in execution time (9ms -> 7.6ms), looking at perf stat the IPC stays pretty consistent at around 2.2 - 2.4. Removing more bound checks relieves the pressure on the execution ports and moves the bottleneck to the instruction latencies. As a result, although the instruction count decreases by about 20%, the execution time only is only reduced by around 5%. The IPC dips to around 1.8.</div><div><br></div><div>Of course, this is machine-dependent, my machine is a Zen 4, which means that it is more likely to have higher overall throughput due to instructions generally improved to be able to run on more ports, while the shortest latency for an instruction is already 1 cycle. The turning point can be lower on other machines, which means that using Unsafe can be more advantageous in comparison to the universe segment approach.</div><div><br></div><div>PS1: This is simplified, normally a program can have multiple parts that are bounded by different things ranging from the decoder, scheduler, etc, and bound checks will affect the performance when most of them are the main bottleneck.</div><div>PS2: A bound check is also not cheap, it often requires a memory load, a compare and branch, and an arithmetic instruction when the types of the array and the access do not match.</div><div>PS3: This is of course my speculation, which may be completely incorrect, so please take it with a grain of salt.</div><div><br></div><div>Best regards,</div><div>Quan Anh</div></div>