Official support for Unsafe
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jan 15 18:24:33 UTC 2024
Whoops - I did not realize this was still on amber-dev.
I suggest we take this offline, as I don't think this discussion has any
relationship with Project Amber.
Maurizio
On 15/01/2024 18:23, Maurizio Cimadamore wrote:
> Thanks for taking the time to explain. I know these concepts, I just
> needed to map them to the terminology you were using.
>
> But, I have questions: you say that bound check does not add
> dependencies, but consumes execution ports (thanks to branch predictor
> getting it right most of the time).
>
> If that was the case, we should see no difference between original
> version and everything segment version, given both are bounded by
> execution ports, and not by dependencies.
>
> So, why the massive decrease in execution time? (given instruction
> count didn't change that much?)
>
> I'm not necessarily saying your explanation is wrong, but it doesn't
> seem to account for the difference
>
> Maurizio
>
> On 15/01/2024 18:11, Quân Anh Mai wrote:
>> Roughly speaking, throughput is determined by how many instructions a
>> CPU can execute at the same time, and latency is determined by how
>> many cycles it takes from when an instruction is executed to when its
>> results are available.
>>
>> Consider this sequence:
>>
>> x1 = a1 + b1
>> x2 = a2 + b2
>> x3 = a3 + b3
>> x4 = a4 + b4
>>
>> Since all these instructions are independent, all 4 of these can be
>> issued at the same time, which leads to quadrupling the throughput of
>> the whole sequence. At this point, the limiting factor is the number
>> of execution units that are capable of executing adds.
>>
>> On the other hand, if we consider this sequence:
>>
>> x1 = x0 + a1
>> x2 = x1 + a2
>> x3 = x2 + a3
>> x4 = x3 + a4
>>
>> The second instruction depends on the result of the first, and the
>> third depends on the sequence. In this case, the CPU cannot execute
>> them in parallel and must do so sequentially. As a result, the
>> limiting factor is the overall latency of the instruction sequence.
>>
>> In general, a bound check does not create any additional dependency
>> chain (only the control flow depends on it, but the branch
>> predictor will take the right path way before that), which means that
>> it will not impose any additional pressure if the program is bounded
>> by the dependencies of the variables. On the other hand, a bound
>> check consumes the execution ports, which means that the performance
>> can degrade massively if the program is bounded by the execution
>> throughput.
>>
>> In this example, when moving from VarHandle to using Universe
>> segment, the program is bounded by the execution ports, which leads
>> to a massive decrease in execution time (9ms -> 7.6ms), looking at
>> perf stat the IPC stays pretty consistent at around 2.2 - 2.4.
>> Removing more bound checks relieves the pressure on the execution
>> ports and moves the bottleneck to the instruction latencies. As a
>> result, although the instruction count decreases by about 20%, the
>> execution time only is only reduced by around 5%. The IPC dips to
>> around 1.8.
>>
>> Of course, this is machine-dependent, my machine is a Zen 4, which
>> means that it is more likely to have higher overall throughput due to
>> instructions generally improved to be able to run on more ports,
>> while the shortest latency for an instruction is already 1 cycle. The
>> turning point can be lower on other machines, which means that using
>> Unsafe can be more advantageous in comparison to the universe segment
>> approach.
>>
>> PS1: This is simplified, normally a program can have multiple parts
>> that are bounded by different things ranging from the decoder,
>> scheduler, etc, and bound checks will affect the performance when
>> most of them are the main bottleneck.
>> PS2: A bound check is also not cheap, it often requires a memory
>> load, a compare and branch, and an arithmetic instruction when the
>> types of the array and the access do not match.
>> PS3: This is of course my speculation, which may be completely
>> incorrect, so please take it with a grain of salt.
>>
>> Best regards,
>> Quan Anh
More information about the amber-dev
mailing list