[foreign-jextract] RFR: MemorySegmentPool + Allocator [v7]
Maurizio Cimadamore
mcimadamore at openjdk.java.net
Thu Apr 22 01:20:35 UTC 2021
On Thu, 22 Apr 2021 00:33:58 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>> Tweaked the loop and numbers didn't change much from yours:
>>
>>
>> Benchmark                                      (allocations)  Mode  Cnt      Score     Error  Units
>> AllocatorsForLongRun.arena                                 1  avgt   30    185.742 ±   4.349  ns/op
>> AllocatorsForLongRun.arena                                16  avgt   30    616.261 ±  14.974  ns/op
>> AllocatorsForLongRun.arena                               200  avgt   30   6574.496 ±  55.602  ns/op
>> AllocatorsForLongRun.malloc_free                           1  avgt   30     25.888 ±   0.272  ns/op
>> AllocatorsForLongRun.malloc_free                          16  avgt   30    602.258 ±  11.776  ns/op
>> AllocatorsForLongRun.malloc_free                         200  avgt   30  10126.972 ± 151.182  ns/op
>> AllocatorsForLongRun.pool_allocator                        1  avgt   30     35.907 ±   0.474  ns/op
>> AllocatorsForLongRun.pool_allocator                       16  avgt   30    378.874 ±   8.533  ns/op
>> AllocatorsForLongRun.pool_allocator                      200  avgt   30   4489.656 ±  40.615  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted              1  avgt   30     65.074 ±   3.399  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted             16  avgt   30    994.809 ±  22.971  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted            200  avgt   30  16247.051 ± 223.768  ns/op
>> AllocatorsForLongRun.pool_direct                           1  avgt   30     15.827 ±   0.398  ns/op
>> AllocatorsForLongRun.pool_direct                          16  avgt   30    269.499 ±   3.384  ns/op
>> AllocatorsForLongRun.pool_direct                         200  avgt   30   3491.204 ±  35.959  ns/op
>>
>>
>> Seems like 16 allocations is the break even for arena - after which (on 200) arena is better than malloc (I can only imagine that advantage of arena will keep growing with number of allocations). Malloc/free is still surprisingly good, all things considered, especially hard to beat on single shot allocations.
>>
>> The problem I see with the pool strategy is that it's faster than malloc - but not in a radical way (there's no 10x here). And you have to consider best case, and worst case (the best case is better than malloc, the worst case, exhausted, is worse). So it looks like something that can be a great thing, if that's what a program needs, and provided it's used as intended - but it doesn't seem (yet?) to deliver that kind of horizontal, across the board, scaling that would justify its inclusion in the API (although it's great to see that such an allocator can be written on top of the API).
>>
>> What I like about the pool, though, is the approach you took for the API. When we look at allocators again (as I said, we did have some other allocators we were looking at, not too different from what you are trying to do here), I think the API we offer will probably be very similar to what you have here, as I think it's spot on and plays to the advantages of the new memory API.
>
> btw, things improve considerably if locking code is removed:
>
>
> Benchmark                                      (allocations)  Mode  Cnt      Score     Error  Units
> AllocatorsForLongRun.pool_allocator                        1  avgt   30     32.553 ±   0.583  ns/op
> AllocatorsForLongRun.pool_allocator                       16  avgt   30    333.564 ±   3.942  ns/op
> AllocatorsForLongRun.pool_allocator                      200  avgt   30   3552.176 ±  43.852  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted              1  avgt   30     62.226 ±   1.408  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted             16  avgt   30    890.346 ±  15.812  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted            200  avgt   30  16232.458 ± 562.777  ns/op
> AllocatorsForLongRun.pool_direct                           1  avgt   30     14.983 ±   0.134  ns/op
> AllocatorsForLongRun.pool_direct                          16  avgt   30    246.688 ±   3.411  ns/op
> AllocatorsForLongRun.pool_direct                         200  avgt   30   2990.063 ±  34.049  ns/op
>
>
> So I think providing an alternate queue implementation to use if the pool scope is confined could make sense.
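
To make the suggestion in the quoted text concrete, here is a minimal sketch of a pool that picks its queue implementation based on confinement. It is not the pool from the PR: BufferPool, confined(), shared() and available() are hypothetical names, plain ByteBuffers stand in for memory segments, and the use of ConcurrentLinkedQueue for the shared case is an assumption (the PR's own locking scheme may differ).

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: BufferPool is NOT the pool from the PR.
final class BufferPool {
    private final Queue<ByteBuffer> free;  // buffers ready for reuse
    private final int bufferSize;
    private final Thread owner;            // non-null only for confined pools

    private BufferPool(Queue<ByteBuffer> free, int bufferSize, int capacity, Thread owner) {
        this.free = free;
        this.bufferSize = bufferSize;
        this.owner = owner;
        for (int i = 0; i < capacity; i++) {
            free.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Confined: only the creating thread may use the pool, so an
    // unsynchronized ArrayDeque is enough.
    static BufferPool confined(int bufferSize, int capacity) {
        return new BufferPool(new ArrayDeque<>(), bufferSize, capacity, Thread.currentThread());
    }

    // Shared: any thread may acquire/release, so use a thread-safe queue.
    static BufferPool shared(int bufferSize, int capacity) {
        return new BufferPool(new ConcurrentLinkedQueue<>(), bufferSize, capacity, null);
    }

    ByteBuffer acquire() {
        checkOwner();
        ByteBuffer b = free.poll();
        // Pool exhausted: fall back to a fresh allocation.
        return b != null ? b : ByteBuffer.allocateDirect(bufferSize);
    }

    void release(ByteBuffer b) {
        checkOwner();
        b.clear();      // resets position/limit only; contents are not zeroed
        free.offer(b);
    }

    int available() {
        return free.size();
    }

    private void checkOwner() {
        if (owner != null && owner != Thread.currentThread()) {
            throw new IllegalStateException("confined pool used from another thread");
        }
    }
}
```

Callers see the same acquire/release API either way; only the factory method decides whether synchronization cost is paid.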
I did some other experiments comparing pool_direct vs. arena (and malloc_free) with different numbers of allocations:

Benchmark                         (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.pool_direct              1  avgt   30     17.313 ±   0.413  ns/op
AllocatorsForLongRun.pool_direct             10  avgt   30    181.496 ±   2.163  ns/op
AllocatorsForLongRun.pool_direct            100  avgt   30   1759.230 ±  24.107  ns/op
AllocatorsForLongRun.pool_direct           1000  avgt   30  32862.773 ± 235.450  ns/op

Benchmark                         (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.arena                    1  avgt   30    183.450 ±   4.108  ns/op
AllocatorsForLongRun.arena                   10  avgt   30    462.525 ±   7.531  ns/op
AllocatorsForLongRun.arena                  100  avgt   30   3208.686 ±  41.565  ns/op
AllocatorsForLongRun.arena                 1000  avgt   30  40230.111 ± 678.682  ns/op

Benchmark                         (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.malloc_free              1  avgt   30     26.440 ±   0.286  ns/op
AllocatorsForLongRun.malloc_free             10  avgt   30    349.220 ±   6.760  ns/op
AllocatorsForLongRun.malloc_free            100  avgt   30   4755.480 ± 124.909  ns/op
AllocatorsForLongRun.malloc_free           1000  avgt   30  55986.975 ± 764.037  ns/op

Admittedly, this doesn't look too good for arena allocation, and at first I couldn't understand why. Then I realized that the arena allocator is based on MemorySegment::allocateNative, not the fast/dirty CLinker::allocate/free. For fun, I tweaked the arena allocator to use those instead:

Benchmark                   (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.arena              1  avgt   30     73.263 ±   6.777  ns/op
AllocatorsForLongRun.arena             10  avgt   30    236.181 ±  23.822  ns/op
AllocatorsForLongRun.arena            100  avgt   30   1902.569 ±  70.081  ns/op
AllocatorsForLongRun.arena           1000  avgt   30  22995.554 ± 593.434  ns/op

Now, apart from the one-shot allocation, arena performs similarly to the (direct) pool allocator, and considerably better than malloc/free. It is a bit sad that arena loses so much to "secondary" work like memory zeroing, reserving memory, etc.
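
One way to read these numbers is as amortized cost per allocation, i.e. the score divided by the allocation count. The sketch below does that for the 1000-allocation rows reported in this message; PerAllocationCost is just an illustrative helper, and the result folds any per-run setup and teardown into the average. It works out to roughly 33 ns (pool_direct), 40 ns (arena on allocateNative), 23 ns (tweaked arena) and 56 ns (malloc_free) per allocation.

```java
// Illustrative helper, not part of the benchmark in the PR.
final class PerAllocationCost {
    // Amortized cost of one allocation, given a JMH score (ns/op)
    // for an operation that performs `allocations` allocations.
    static double perAllocation(double scoreNs, int allocations) {
        return scoreNs / allocations;
    }

    public static void main(String[] args) {
        // Scores for 1000 allocations, copied from the results in this message.
        String[] names = {
            "pool_direct", "arena (allocateNative)", "arena (tweaked)", "malloc_free"
        };
        double[] scores = {32862.773, 40230.111, 22995.554, 55986.975};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-24s %.1f ns/allocation%n",
                    names[i], perAllocation(scores[i], 1000));
        }
    }
}
```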
Perhaps this opens the door to an additional arena allocator factory which takes a segment allocator instead of a ResourceScope (so that users who are OK with losing initialization have a way to opt out).
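
A rough sketch of what such a factory could look like, written against plain java.nio rather than the Panama API so that it runs anywhere. Everything here is an assumption for illustration: BumpArena and blocksRequested() are hypothetical names, an IntFunction<ByteBuffer> stands in for the pluggable segment allocator, and alignment and scope handling are ignored.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

// Hypothetical sketch: names are illustrative, not Panama API.
final class BumpArena {
    private final IntFunction<ByteBuffer> source;  // stand-in for a segment allocator
    private final int blockSize;
    private ByteBuffer block;                      // current block being carved up
    private int blocksRequested;

    BumpArena(IntFunction<ByteBuffer> source, int blockSize) {
        this.source = source;
        this.blockSize = blockSize;
    }

    ByteBuffer allocate(int bytes) {
        if (bytes > blockSize) {
            // Oversized request: hand it straight to the backing allocator.
            blocksRequested++;
            return source.apply(bytes);
        }
        if (block == null || block.remaining() < bytes) {
            // Current block is missing or full: ask the backing allocator for a new one.
            block = source.apply(blockSize);
            blocksRequested++;
        }
        ByteBuffer slice = block.slice();          // view starting at the current position
        slice.limit(bytes);
        block.position(block.position() + bytes);  // bump the pointer
        return slice;
    }

    int blocksRequested() {
        return blocksRequested;
    }

    public static void main(String[] args) {
        // The caller chooses the backing allocator; this one returns zeroed memory.
        BumpArena arena = new BumpArena(ByteBuffer::allocateDirect, 4096);
        for (int i = 0; i < 1000; i++) {
            arena.allocate(16);
        }
        System.out.println("blocks requested: " + arena.blocksRequested());
    }
}
```

The point of the design is that whether memory is zeroed, reserved, or recycled becomes a property of the backing allocator passed in, not of the arena itself.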
(btw - this reminds me that, in its current state - the pool allocator doesn't initialize memory either...)
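
To illustrate what "doesn't initialize memory" means for a pool, here is a small sketch in which a recycled buffer still carries whatever its previous user wrote, plus an opt-in zeroing variant. RecyclingPool and acquireZeroed() are hypothetical names and heap ByteBuffers stand in for native segments; this is not the pool allocator from the PR.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical sketch, not the pool allocator from the PR.
final class RecyclingPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int size;

    RecyclingPool(int size) {
        this.size = size;
    }

    // Fast path: a recycled buffer is returned as-is, stale contents included.
    ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        return b != null ? b : ByteBuffer.allocate(size);
    }

    // Opt-in initialization: pays for a pass over the buffer on every acquire.
    ByteBuffer acquireZeroed() {
        ByteBuffer b = acquire();
        for (int i = 0; i < b.capacity(); i++) {
            b.put(i, (byte) 0);
        }
        return b;
    }

    void release(ByteBuffer b) {
        b.clear();      // resets position/limit; does not touch the contents
        free.push(b);
    }
}
```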
-------------
PR: https://git.openjdk.java.net/panama-foreign/pull/509