[foreign-jextract] RFR: MemorySegmentPool + Allocator [v7]

Thu Apr 22 01:20:35 UTC 2021

On Thu, 22 Apr 2021 00:33:58 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> Tweaked the loop and numbers didn't change much from yours:
>> 
>> 
>> Benchmark                                      (allocations)  Mode  Cnt      Score     Error  Units
>> AllocatorsForLongRun.arena                                 1  avgt   30    185.742 ±   4.349  ns/op
>> AllocatorsForLongRun.arena                                16  avgt   30    616.261 ±  14.974  ns/op
>> AllocatorsForLongRun.arena                               200  avgt   30   6574.496 ±  55.602  ns/op
>> AllocatorsForLongRun.malloc_free                           1  avgt   30     25.888 ±   0.272  ns/op
>> AllocatorsForLongRun.malloc_free                          16  avgt   30    602.258 ±  11.776  ns/op
>> AllocatorsForLongRun.malloc_free                         200  avgt   30  10126.972 ± 151.182  ns/op
>> AllocatorsForLongRun.pool_allocator                        1  avgt   30     35.907 ±   0.474  ns/op
>> AllocatorsForLongRun.pool_allocator                       16  avgt   30    378.874 ±   8.533  ns/op
>> AllocatorsForLongRun.pool_allocator                      200  avgt   30   4489.656 ±  40.615  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted              1  avgt   30     65.074 ±   3.399  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted             16  avgt   30    994.809 ±  22.971  ns/op
>> AllocatorsForLongRun.pool_allocator_exhausted            200  avgt   30  16247.051 ± 223.768  ns/op
>> AllocatorsForLongRun.pool_direct                           1  avgt   30     15.827 ±   0.398  ns/op
>> AllocatorsForLongRun.pool_direct                          16  avgt   30    269.499 ±   3.384  ns/op
>> AllocatorsForLongRun.pool_direct                         200  avgt   30   3491.204 ±  35.959  ns/op
>> 
>> 
>> Seems like 16 allocations is the break even for arena - after which (on 200) arena is better than malloc (I can only imagine that advantage of arena will keep growing with number of allocations). Malloc/free is still surprisingly good, all things considered, especially hard to beat on single shot allocations.
>> 
>> The problem I see with the pool strategy is that it's faster than malloc - but not in a radical way (there's no 10x here). And you have to consider best case, and worst case (the best case is better than malloc, the worst case, exhausted, is worse). So it looks like something that can be a great thing, if that's what a program needs, and provided it's used as intended - but it doesn't seem (yet?) to deliver that kind of horizontal, across the board, scaling that would justify its inclusion in the API (although it's great to see that such an allocator can be written on top of the API).
>> 
>> What I like about the pool though, is the approach you had for the API - I think that when we will look at allocators again (as I said, we did have some other allocators we were looking at, not too different from what you are trying to do here), I think the API that will be offered will probably be very similar to what you have in here - as I think it's spot on, and plays to the advantages of the new memory API.
>
> btw, things improve considerably if locking code is removed:
> 
> 
> Benchmark                                      (allocations)  Mode  Cnt      Score     Error  Units
> AllocatorsForLongRun.pool_allocator                        1  avgt   30     32.553 ±   0.583  ns/op
> AllocatorsForLongRun.pool_allocator                       16  avgt   30    333.564 ±   3.942  ns/op
> AllocatorsForLongRun.pool_allocator                      200  avgt   30   3552.176 ±  43.852  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted              1  avgt   30     62.226 ±   1.408  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted             16  avgt   30    890.346 ±  15.812  ns/op
> AllocatorsForLongRun.pool_allocator_exhausted            200  avgt   30  16232.458 ± 562.777  ns/op
> AllocatorsForLongRun.pool_direct                           1  avgt   30     14.983 ±   0.134  ns/op
> AllocatorsForLongRun.pool_direct                          16  avgt   30    246.688 ±   3.411  ns/op
> AllocatorsForLongRun.pool_direct                         200  avgt   30   2990.063 ±  34.049  ns/op
> 
> 
> So I think providing an alternate queue implementation to use if the pool scope is confined could make sense.
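
To make the confined-queue idea above concrete, here is a rough sketch of how the free list could be selected at construction time. This is not the code from the PR: ByteBuffer stands in for MemorySegment, the `confined` flag stands in for a check on the pool's scope, the class and method names are made up, and falling back to a fresh allocation when the pool is exhausted is only an assumption.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: ByteBuffer stands in for MemorySegment, and the
// "confined" flag stands in for a check on the pool's ResourceScope.
final class PoolQueueSketch {
    private final Queue<ByteBuffer> free;
    private final int segmentSize;
    private final Thread owner; // non-null only for confined pools

    PoolQueueSketch(int segments, int segmentSize, boolean confined) {
        this.segmentSize = segmentSize;
        // A confined pool is only touched by one thread, so a plain ArrayDeque
        // is enough; a shared pool needs a thread-safe queue.
        this.free = confined ? new ArrayDeque<>() : new ConcurrentLinkedQueue<>();
        this.owner = confined ? Thread.currentThread() : null;
        for (int i = 0; i < segments; i++) {
            free.add(ByteBuffer.allocateDirect(segmentSize));
        }
    }

    private void checkOwner() {
        if (owner != null && owner != Thread.currentThread()) {
            throw new IllegalStateException("confined pool used from another thread");
        }
    }

    // Takes a pooled segment; allocates a fresh one if the pool is exhausted
    // (assumed fallback, for illustration only).
    ByteBuffer acquire() {
        checkOwner();
        ByteBuffer segment = free.poll();
        return segment != null ? segment : ByteBuffer.allocateDirect(segmentSize);
    }

    // Returns a segment to the pool without clearing its contents.
    void release(ByteBuffer segment) {
        checkOwner();
        segment.clear(); // resets position/limit only; the bytes are untouched
        free.add(segment);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        PoolQueueSketch pool = new PoolQueueSketch(2, 64, true);
        ByteBuffer a = pool.acquire();
        ByteBuffer b = pool.acquire();
        ByteBuffer c = pool.acquire(); // pool exhausted: freshly allocated
        System.out.println("available after 3 acquires: " + pool.available());
        pool.release(a);
        pool.release(b);
        System.out.println("available after 2 releases: " + pool.available());
    }
}
```

The ownership check is there only so that the unsynchronized queue cannot be misused silently; in the real API the scope would presumably enforce confinement.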

Did some other experiments comparing pool_direct vs. arena with different allocation counts:

Benchmark                         (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.pool_direct              1  avgt   30     17.313 ±   0.413  ns/op
AllocatorsForLongRun.pool_direct             10  avgt   30    181.496 ±   2.163  ns/op
AllocatorsForLongRun.pool_direct            100  avgt   30   1759.230 ±  24.107  ns/op
AllocatorsForLongRun.pool_direct           1000  avgt   30  32862.773 ± 235.450  ns/op

Benchmark                   (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.arena              1  avgt   30    183.450 ±   4.108  ns/op
AllocatorsForLongRun.arena             10  avgt   30    462.525 ±   7.531  ns/op
AllocatorsForLongRun.arena            100  avgt   30   3208.686 ±  41.565  ns/op
AllocatorsForLongRun.arena           1000  avgt   30  40230.111 ± 678.682  ns/op

Benchmark                         (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.malloc_free              1  avgt   30     26.440 ±   0.286  ns/op
AllocatorsForLongRun.malloc_free             10  avgt   30    349.220 ±   6.760  ns/op
AllocatorsForLongRun.malloc_free            100  avgt   30   4755.480 ± 124.909  ns/op
AllocatorsForLongRun.malloc_free           1000  avgt   30  55986.975 ± 764.037  ns/op

Admittedly, this doesn't look too good for arena allocation, and at first I couldn't understand why. But then I realized that the arena allocator is based on MemorySegment::allocateNative, not the fast/dirty CLinker::allocate/free. For fun I tried to tweak the arena allocator to use those instead:

Benchmark                   (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.arena              1  avgt   30     73.263 ±   6.777  ns/op
AllocatorsForLongRun.arena             10  avgt   30    236.181 ±  23.822  ns/op
AllocatorsForLongRun.arena            100  avgt   30   1902.569 ±  70.081  ns/op
AllocatorsForLongRun.arena           1000  avgt   30  22995.554 ± 593.434  ns/op

Now, apart from the one-shot allocation, arena performs similarly to the (direct) pool allocator, and considerably better than malloc/free. It is a bit sad that arena loses so much due to "secondary" stuff like memory zeroing, reserving memory etc.
Perhaps this opens the door to an additional arena allocator factory which takes a segment allocator instead of a ResourceScope (so that users who are ok with the loss of initialization have a way to opt out).

(btw - this reminds me that, in its current state - the pool allocator doesn't initialize memory either...)
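
A rough sketch of what an arena built on top of a pluggable allocator could look like. Everything here is hypothetical: ByteBuffer stands in for MemorySegment, an IntFunction<ByteBuffer> stands in for the segment allocator, and the names are invented for illustration. The arena only bumps a position inside the current block, so whether memory comes back initialized depends entirely on the block source passed in. Alignment and freeing of retired blocks are left out.

```java
import java.nio.ByteBuffer;
import java.util.function.IntFunction;

// Hypothetical sketch: ByteBuffer stands in for MemorySegment, and
// IntFunction<ByteBuffer> stands in for a segment allocator.
final class ArenaOverAllocatorSketch {
    private final IntFunction<ByteBuffer> blockSource;
    private final int blockSize;
    private ByteBuffer block;

    ArenaOverAllocatorSketch(IntFunction<ByteBuffer> blockSource, int blockSize) {
        this.blockSource = blockSource;
        this.blockSize = blockSize;
        this.block = blockSource.apply(blockSize);
    }

    // Bump-allocates `bytes` from the current block, switching to a new block
    // when the current one is full. Retired blocks are not tracked here; a
    // real arena would release them when its scope is closed.
    ByteBuffer allocate(int bytes) {
        if (bytes > blockSize) {
            return blockSource.apply(bytes); // oversized request gets its own block
        }
        if (block.remaining() < bytes) {
            block = blockSource.apply(blockSize);
        }
        int start = block.position();
        ByteBuffer view = block.duplicate();
        view.position(start);
        view.limit(start + bytes);
        block.position(start + bytes);
        return view.slice();
    }

    public static void main(String[] args) {
        // Initializing source: allocateDirect always returns zeroed memory.
        ArenaOverAllocatorSketch zeroed =
                new ArenaOverAllocatorSketch(ByteBuffer::allocateDirect, 1024);
        System.out.println("zeroed source, first byte: " + zeroed.allocate(16).get(0));

        // Non-initializing source: hands out a previously used block as-is.
        // (It always returns the same memory, which is fine for this demo only.)
        ByteBuffer recycled = ByteBuffer.allocateDirect(1024);
        for (int i = 0; i < recycled.capacity(); i++) {
            recycled.put(i, (byte) 0x5A);
        }
        ArenaOverAllocatorSketch dirty =
                new ArenaOverAllocatorSketch(size -> recycled.duplicate(), 1024);
        System.out.println("recycled source, first byte: " + dirty.allocate(16).get(0));
    }
}
```

With this shape, the existing ResourceScope-based factory would just be the special case where the block source is the initializing native allocator.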

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/509