Official support for Unsafe
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jan 15 17:34:28 UTC 2024
Ok, the numbers in your benchmark match my expectation.
So, if that approach doesn't match Unsafe performance in the 1brc
challenge (or come very close to it), I'm afraid the culprit is not so
much the bound checks as the time it takes for the var handle machinery
to warm up (inline, unroll and drop the checks).
We're aware of the startup/warmup advantage of Unsafe vs. FFM and we
will be doing more in order to bridge the gap (a similar argument holds
for JNI calls vs. FFM linker calls).
Maurizio
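
To make the "drop the checks" point concrete, here is a minimal sketch of a
bounds-checked read through a var handle. It is not taken from the 1brc
submissions: the class and method names are hypothetical, and it uses a
byte-array view var handle instead of a memory segment so that it runs on
any JDK from 9 onwards. Every call checks the offset against the array
length; only after C2 has inlined and optimized a hot loop like sum() can
those checks be hoisted or removed, which is the warmup cost discussed above.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration; not code from the thread or from 1brc.
final class CheckedRead {
    // Views a byte[] as little-endian longs; each access is bounds-checked.
    static final VarHandle LONG_LE =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Reads 8 bytes starting at byteOffset. Throws IndexOutOfBoundsException
    // if the read would run past the end of the array.
    static long readLong(byte[] data, int byteOffset) {
        return (long) LONG_LE.get(data, byteOffset);
    }

    // Sums every complete 8-byte group in data through the checked accessor.
    static long sum(byte[] data) {
        long total = 0;
        for (int i = 0; i + Long.BYTES <= data.length; i += Long.BYTES) {
            total += readLong(data, i);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        data[0] = 1; // first long is 1 (little-endian)
        data[8] = 2; // second long is 2
        System.out.println(sum(data)); // prints 3
    }
}
```

A harness such as JMH measures sum() after warmup, whereas a short-lived
run such as a 1brc submission also pays for the time before the compiled
code is ready.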
On 15/01/2024 17:05, Quân Anh Mai wrote:
> Sure, I just thought that looking at the instruction count would be
> more helpful, since each machine exhibits different performance
> behaviour. For example, my machine is dependency-bound when going from
> [2] to [1] below, which leads to a much smaller difference in execution
> time than the difference measured on other machines (such as the
> test machine). The third implementation is similar to the first one,
> except I use safe accesses in the form of bounded memory segment
> accesses and varhandles.
>
> The JMH numbers for these versions are shown below. I define an
> execute function as follows:
>
> @Benchmark
> public PoorManMap execute() throws IOException {
>     try (var file = FileChannel.open(Path.of(FILE), StandardOpenOption.READ);
>          var arena = Arena.ofShared()) {
>         var data = file.map(MapMode.READ_ONLY, 0, file.size(), arena);
>         return processFile(data, 0, data.byteSize());
>     }
> }
>
> CalculateAverage_merykitty.execute  avgt  5  7.422 ± 0.093  ms/op  // unsafe [1]
> CalculateAverage_merykitty.execute  avgt  5  7.686 ± 0.181  ms/op  // universe segment [2]
> CalculateAverage_merykitty.execute  avgt  5  9.009 ± 0.058  ms/op  // varhandle [3]
>
> [1]: https://github.com/merykitty/1brc/tree/main
> [2]: https://github.com/merykitty/1brc/tree/removeunsafe
> [3]: https://github.com/merykitty/1brc/tree/varhandles
>
> Best regards,
> Quan Anh
>
> On Tue, 16 Jan 2024 at 00:29, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
>
> On 15/01/2024 15:44, Quân Anh Mai wrote:
>> Running the same program on 1e6 lines results in only 9e9
>> instructions, so I think the vast majority of the instruction
>> count is of the compiled code. Not using the universe segment is
>> roughly equivalent to my previous version, which would result in
>> around 50% more instructions compared to using one, and almost
>> double the instruction count of using Unsafe.
>
> Without looking at the program some more, it's hard for me to make
> some sense of these numbers. I'm surprised that you don't see any
> difference when using unbounded segment compared to regular ones.
> I wonder if the gap you are seeing is due to the JVM warming up,
> rather than peak performance being worse. Have you tried
> measuring peak performance with e.g. JMH? I would not expect to
> see a 20% difference there...
>
> Maurizio
>
>>
>> Regards,
>> Quan Anh
>>
>> On Mon, 15 Jan 2024 at 23:09, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> I think the increased instruction count is normal, as C2 had
>> to do more work to optimize the bound checks away?
>>
>> Is there any difference compared to the version that doesn't
>> use the universe segment?
>>
>> Maurizio
>>
>> On 15/01/2024 13:52, Quân Anh Mai wrote:
>>> Hi,
>>>
>>> I have tried using a universe segment instead of Unsafe, and
>>> storing the custom hashmap buffer off-heap instead of in
>>> a byte array. The output of perf stat on the program is:
>>>
>>> Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':
>>>
>>>      13573.70 msec task-clock:u                # 10.942 CPUs utilized
>>>                  0 context-switches:u          # 0.000 /sec
>>>                  0 cpu-migrations:u            # 0.000 /sec
>>>             238460 page-faults:u               # 17.568 K/sec
>>>        61995179870 cycles:u                    # 4.567 GHz
>>>          261830581 stalled-cycles-frontend:u   # 0.42% frontend cycles idle
>>>           93823680 stalled-cycles-backend:u    # 0.15% backend cycles idle
>>>       137976098809 instructions:u              # 2.23 insn per cycle
>>>                                                # 0.00 stalled cycles per insn
>>>        18373313803 branches:u                  # 1.354 G/sec
>>>           43579782 branch-misses:u             # 0.24% of all branches
>>>
>>> 1.240504612 seconds time elapsed
>>>
>>> 12.841563000 seconds user
>>> 0.652428000 seconds sys
>>>
>>> For comparison, this is the unsafe version:
>>>
>>> Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':
>>>
>>>      13327.46 msec task-clock:u                # 11.202 CPUs utilized
>>>                  0 context-switches:u          # 0.000 /sec
>>>                  0 cpu-migrations:u            # 0.000 /sec
>>>             269896 page-faults:u               # 20.251 K/sec
>>>        61258348752 cycles:u                    # 4.596 GHz
>>>          639839262 stalled-cycles-frontend:u   # 1.04% frontend cycles idle
>>>          108018676 stalled-cycles-backend:u    # 0.18% backend cycles idle
>>>       113476168983 instructions:u              # 1.85 insn per cycle
>>>                                                # 0.01 stalled cycles per insn
>>>        11442665370 branches:u                  # 858.578 M/sec
>>>           44590172 branch-misses:u             # 0.39% of all branches
>>>
>>> 1.189768677 seconds time elapsed
>>>
>>> 12.628512000 seconds user
>>> 0.620083000 seconds sys
>>>
>>> On my machine this program is dependency-bound, so the
>>> difference in execution time is not as significant as on
>>> the test machine, but it can be seen that removing Unsafe
>>> results in an increase of over 21% in instruction count.
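
The "over 21%" figure can be checked directly from the two perf stat
outputs above. The following is a small sketch (the class and method names
are hypothetical) that recomputes the relative increase in instructions and
branches, and the instructions-per-cycle values, from the raw counters
quoted in this message.

```java
import java.util.Locale;

// Hypothetical helper; the counter values are copied from the perf stat
// outputs quoted above.
final class PerfRatios {
    // Relative increase of 'changed' over 'baseline', in percent.
    static double percentIncrease(long baseline, long changed) {
        return 100.0 * (changed - baseline) / baseline;
    }

    // Instructions retired per cycle.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        long unsafeInsn = 113_476_168_983L, segmentInsn = 137_976_098_809L;
        long unsafeBranches = 11_442_665_370L, segmentBranches = 18_373_313_803L;
        long unsafeCycles = 61_258_348_752L, segmentCycles = 61_995_179_870L;

        System.out.printf(Locale.ROOT, "instructions: +%.1f%%%n",
                percentIncrease(unsafeInsn, segmentInsn));
        System.out.printf(Locale.ROOT, "branches:     +%.1f%%%n",
                percentIncrease(unsafeBranches, segmentBranches));
        System.out.printf(Locale.ROOT, "IPC unsafe:   %.2f%n", ipc(unsafeInsn, unsafeCycles));
        System.out.printf(Locale.ROOT, "IPC segment:  %.2f%n", ipc(segmentInsn, segmentCycles));
    }
}
```

The cycle counts differ by only about 1% while the instruction count grows
by more than 21%, which is consistent with the observation that the extra
instructions cost little wall-clock time on this machine.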
>>>
>>> Regards,
>>> Quan Anh
>>>
>>> On Sat, 13 Jan 2024 at 01:29, Maurizio Cimadamore
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>
>>> On 12/01/2024 17:26, Quân Anh Mai wrote:
>>> > FYI, in my submission to 1brc, using Unsafe decreases the execution
>>> > time from 3.25s to 2.57s on the test machine.
>>>
>>> Just curious - what is the difference compared with the everything
>>> segment trick?
>>>
>>> (While I know it can't do on-heap access, perhaps you can tweak the
>>> code to be all off-heap?)
>>>
>>> Maurizio
>>>
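
For readers unfamiliar with the "everything segment" (or "universe
segment") trick mentioned in this thread, the following is a minimal sketch
under stated assumptions: it targets JDK 22 or later, where the FFM API is
final, and the class and method names are hypothetical. A zero-based native
segment is reinterpreted to span the whole address space, so a raw native
address can be used as an offset, much like Unsafe.getLong(address). As
noted above, this only reaches off-heap memory.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration, assuming JDK 22+; not code from the thread.
final class EverythingSegment {
    // Segment starting at address 0 and covering Long.MAX_VALUE bytes.
    // reinterpret is a restricted method, so the JVM may print a warning
    // unless native access is enabled for this code.
    static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    // Reads a long at a raw native address. The bounds check still runs,
    // but it always passes for any valid native address.
    static long readLongAt(long address) {
        return EVERYTHING.get(ValueLayout.JAVA_LONG_UNALIGNED, address);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buf = arena.allocate(16);
            buf.set(ValueLayout.JAVA_LONG_UNALIGNED, 8, 42L);
            // Read the same value back through the everything segment.
            System.out.println(readLongAt(buf.address() + 8)); // prints 42
        }
    }
}
```

Reading through such a segment gives up the spatial and temporal safety
that ordinary bounded segments provide, so an invalid address can crash the
JVM just as it would with Unsafe.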