Official support for Unsafe

Mon Jan 15 18:23:04 UTC 2024

Thanks for taking the time to explain. I know these concepts, I just 
needed to map them to the terminology you were using.

But, I have questions: you say that bound check does not add 
dependencies, but consumes execution ports (thanks to branch predictor 
getting it right most of the time).

If that was the case, we should see no difference between original 
version and everything segment version, given both are bounded by 
execution ports, and not by dependencies.

So, why the massive decrease in execution time? (given instruction count 
didn't change that much?)

I'm not necessarily saying your explanation is wrong, but it doesn't 
seem to account for the difference

Maurizio

On 15/01/2024 18:11, Quân Anh Mai wrote:
> Roughly speaking, throughput is determined by how many instructions a 
> CPU can execute at the same time, and latency is determined by how 
> many cycles it takes from when an instruction is executed to when its 
> results are available.
>
> Consider this sequence:
>
> x1 = a1 + b1
> x2 = a2 + b2
> x3 = a3 + b3
> x4 = a4 + b4
>
> Since all these instructions are independent, all 4 of these can be 
> issued at the same time, which leads to quadrupling the throughput of 
> the whole sequence. At this point, the limiting factor is the number 
> of execution units that are capable of executing adds.
>
> On the other hand, if we consider this sequence:
>
> x1 = x0 + a1
> x2 = x1 + a2
> x3 = x2 + a3
> x4 = x3 + a4
>
> The second instruction depends on the result of the first, and the 
> third depends on the sequence. In this case, the CPU cannot execute 
> them in parallel and must do so sequentially. As a result, the 
> limiting factor is the overall latency of the instruction sequence.
>
> In general, a bound check does not create any additional dependency 
> chain (only the control flow depends on it, but the branch 
> predictor will take the right path way before that), which means that 
> it will not impose any additional pressure if the program is bounded 
> by the dependencies of the variables. On the other hand, a bound check 
> consumes the execution ports, which means that the performance can 
> degrade massively if the program is bounded by the execution throughput.
>
> In this example, when moving from VarHandle to using Universe segment, 
> the program is bounded by the execution ports, which leads to a 
> massive decrease in execution time (9ms -> 7.6ms), looking at perf 
> stat the IPC stays pretty consistent at around 2.2 - 2.4. Removing 
> more bound checks relieves the pressure on the execution ports and 
> moves the bottleneck to the instruction latencies. As a result, 
> although the instruction count decreases by about 20%, the execution 
> time only is only reduced by around 5%. The IPC dips to around 1.8.
>
> Of course, this is machine-dependent, my machine is a Zen 4, which 
> means that it is more likely to have higher overall throughput due to 
> instructions generally improved to be able to run on more ports, while 
> the shortest latency for an instruction is already 1 cycle. The 
> turning point can be lower on other machines, which means that using 
> Unsafe can be more advantageous in comparison to the universe segment 
> approach.
>
> PS1: This is simplified, normally a program can have multiple parts 
> that are bounded by different things ranging from the decoder, 
> scheduler, etc, and bound checks will affect the performance when most 
> of them are the main bottleneck.
> PS2: A bound check is also not cheap, it often requires a memory load, 
> a compare and branch, and an arithmetic instruction when the types of 
> the array and the access do not match.
> PS3: This is of course my speculation, which may be completely 
> incorrect, so please take it with a grain of salt.
>
> Best regards,
> Quan Anh