Official support for Unsafe

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Jan 15 18:24:33 UTC 2024


Whoops - I did not realize this was still on amber-dev.

I suggest we take this offline, as I don't think this discussion has any 
relationship with Project Amber.

Maurizio

On 15/01/2024 18:23, Maurizio Cimadamore wrote:
> Thanks for taking the time to explain. I know these concepts, I just 
> needed to map them to the terminology you were using.
>
> But, I have questions: you say that a bound check does not add 
> dependencies (thanks to the branch predictor getting it right most of 
> the time), but consumes execution ports.
>
> If that was the case, we should see no difference between the original 
> version and the everything-segment version, given both are bounded by 
> execution ports, and not by dependencies.
>
> So, why the massive decrease in execution time? (given instruction 
> count didn't change that much?)
>
> I'm not necessarily saying your explanation is wrong, but it doesn't 
> seem to account for the difference.
>
> Maurizio
>
> On 15/01/2024 18:11, Quân Anh Mai wrote:
>> Roughly speaking, throughput is determined by how many instructions a 
>> CPU can execute at the same time, and latency is determined by how 
>> many cycles it takes from when an instruction is executed to when its 
>> results are available.
>>
>> Consider this sequence:
>>
>> x1 = a1 + b1
>> x2 = a2 + b2
>> x3 = a3 + b3
>> x4 = a4 + b4
>>
>> Since all these instructions are independent, all 4 of these can be 
>> issued at the same time, which leads to quadrupling the throughput of 
>> the whole sequence. At this point, the limiting factor is the number 
>> of execution units that are capable of executing adds.
>>
>> On the other hand, if we consider this sequence:
>>
>> x1 = x0 + a1
>> x2 = x1 + a2
>> x3 = x2 + a3
>> x4 = x3 + a4
>>
>> The second instruction depends on the result of the first, and the 
>> third depends on the second. In this case, the CPU cannot execute 
>> them in parallel and must do so sequentially. As a result, the 
>> limiting factor is the overall latency of the instruction sequence.
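The two sequences above can be sketched in Java as a single-accumulator sum versus a four-accumulator sum. This is only an illustration: the class and method names are hypothetical, and a JIT compiler may unroll or vectorize either loop, so the machine code that actually runs may not show the same dependency structure.

```java
final class DependencyChains {
    // One accumulator: each add needs the result of the previous add,
    // so the loop forms a single dependency chain.
    static long chainedSum(long[] a) {
        long x = 0;
        for (int i = 0; i < a.length; i++) {
            x += a[i];
        }
        return x;
    }

    // Four accumulators: the four adds inside one iteration do not depend
    // on each other, so they can in principle be issued together.
    static long independentSum(long[] a) {
        long x1 = 0, x2 = 0, x3 = 0, x4 = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            x1 += a[i];
            x2 += a[i + 1];
            x3 += a[i + 2];
            x4 += a[i + 3];
        }
        // Handle the elements left over when the length is not a multiple of 4.
        long tail = 0;
        for (; i < a.length; i++) {
            tail += a[i];
        }
        return x1 + x2 + x3 + x4 + tail;
    }
}
```

Both methods return the same value; only the shape of the dependency graph differs.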
>>
>> In general, a bound check does not create any additional dependency 
>> chain (only the control flow depends on it, but the branch 
>> predictor will take the right path way before that), which means that 
>> it will not impose any additional pressure if the program is bounded 
>> by the dependencies of the variables. On the other hand, a bound 
>> check consumes the execution ports, which means that the performance 
>> can degrade massively if the program is bounded by the execution 
>> throughput.
>>
>> In this example, when moving from VarHandle to the universe segment, 
>> the program is bounded by the execution ports, which leads to a 
>> massive decrease in execution time (9ms -> 7.6ms). Looking at perf 
>> stat, the IPC stays pretty consistent at around 2.2 - 2.4. Removing 
>> more bound checks relieves the pressure on the execution ports and 
>> moves the bottleneck to the instruction latencies. As a result, 
>> although the instruction count decreases by about 20%, the execution 
>> time is only reduced by around 5%, and the IPC dips to around 1.8.
>>
>> Of course, this is machine-dependent. My machine is a Zen 4, which 
>> is likely to have higher overall throughput because instructions can 
>> generally run on more ports, while the shortest latency for an 
>> instruction is already 1 cycle. The turning point can be lower on 
>> other machines, which means that using Unsafe can be more 
>> advantageous there in comparison to the universe segment approach.
>>
>> PS1: This is simplified. Normally a program can have multiple parts 
>> that are bounded by different things, such as the decoder, the 
>> scheduler, etc., and bound checks will affect the performance when 
>> most of these are the main bottleneck.
>> PS2: A bound check is also not cheap, it often requires a memory 
>> load, a compare and branch, and an arithmetic instruction when the 
>> types of the array and the access do not match.
>> PS3: This is of course my speculation, which may be completely 
>> incorrect, so please take it with a grain of salt.
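To make the cost listed in PS2 concrete, here is a minimal hand-written sketch of a checked 4-byte read from a byte[]. The class and method names are hypothetical, and this is not the code the JIT emits for VarHandle or memory segment accesses; it only spells out the load, arithmetic, and compare-and-branch steps in source form.

```java
final class CheckedRead {
    // Reads a little-endian int starting at byte offset `off`.
    static int getIntChecked(byte[] arr, int off) {
        // Memory load of the array length, plus an arithmetic instruction
        // because the access width (4 bytes) differs from the element width (1 byte).
        int limit = arr.length - Integer.BYTES;
        // Compare and branch; normally predicted as not taken.
        if (off < 0 || off > limit) {
            throw new IndexOutOfBoundsException("offset " + off);
        }
        // The individual byte accesses below are also range-checked by the JVM,
        // although a JIT compiler may be able to eliminate those checks.
        return (arr[off] & 0xFF)
             | (arr[off + 1] & 0xFF) << 8
             | (arr[off + 2] & 0xFF) << 16
             | (arr[off + 3] & 0xFF) << 24;
    }
}
```

None of the check's instructions feed into the returned value, which matches the point that the check occupies execution ports without lengthening the data dependency chain.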
>>
>> Best regards,
>> Quan Anh

