[foreign-jextract] RFR: MemorySegmentPool + Allocator [v7]

Thu Apr 22 00:27:30 UTC 2021

On Wed, 21 Apr 2021 23:54:24 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> Radoslaw Smogura has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - Better release of resources on scope close
>>    New benchmarks for bulk allocations
>>  - Minor fixes to SpinLockQueue
>>    Remove redundant LOCK release (set to 0) in put entry.
>>    Fixed maxSize comparison
>>    Removed unused imports
>>    
>>    Q: Could we reduce setRelease to just set in a few places?
>>    Q: Should this method be private and moved to Entry.close to prevent accidentally adding an element from another queue?
>
> test/micro/org/openjdk/bench/jdk/incubator/foreign/AllocatorsForLongRun.java line 149:
> 
>> 147:   private void readSegment(MemorySegment s) {
>> 148:     final var size = s.byteSize();
>> 149:     for (long l = 0; l <  size; l += 256) {
> 
> this is a loop on `long`; as things stand, this is going to defeat all optimizations. For the time being, replace it with an `int` loop (and cast the coordinate back to `long`), or just use `MemoryAccess.setByteAtIndex`.
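A minimal sketch of the suggested rewrite, written in plain Java so it runs without the incubator module. `LongIndexedBytes`, `touchLongLoop` and `touchIntLoop` are hypothetical names standing in for the benchmark's segment read. The only point is the loop shape: the induction variable is an `int` and is widened to `long` at the access site. It assumes the segment is smaller than 2 GiB, which `Math.toIntExact` enforces.

```java
class IntLoopSketch {

    /** Hypothetical long-indexed byte source, standing in for a memory segment. */
    interface LongIndexedBytes {
        long byteSize();
        byte get(long offset);
    }

    /** Original shape: the loop induction variable is a long. */
    static long touchLongLoop(LongIndexedBytes s, int stride) {
        final long size = s.byteSize();
        long sum = 0;
        for (long l = 0; l < size; l += stride) {
            sum += s.get(l);
        }
        return sum;
    }

    /** Suggested shape: iterate with an int, widen to long only at the access. */
    static long touchIntLoop(LongIndexedBytes s, int stride) {
        // Throws ArithmeticException if the size does not fit in an int.
        final int size = Math.toIntExact(s.byteSize());
        // Guard so that i += stride can never overflow past Integer.MAX_VALUE.
        if (stride <= 0 || size > Integer.MAX_VALUE - stride) {
            throw new IllegalArgumentException("size/stride not supported by the int loop");
        }
        long sum = 0;
        for (int i = 0; i < size; i += stride) {
            sum += s.get((long) i); // cast the coordinate back to long
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i / 256); // values 0..15, one per 256-byte block
        }
        LongIndexedBytes s = new LongIndexedBytes() {
            public long byteSize() { return data.length; }
            public byte get(long offset) { return data[(int) offset]; }
        };
        System.out.println(touchLongLoop(s, 256) + " " + touchIntLoop(s, 256)); // prints "120 120"
    }
}
```

Both versions visit the same offsets. The rewrite only changes the type of the counter that the JIT sees.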

I tweaked the loop, and the numbers didn't change much from yours:

Benchmark                                      (allocations)  Mode  Cnt      Score     Error  Units
AllocatorsForLongRun.arena                                 1  avgt   30    185.742 ±   4.349  ns/op
AllocatorsForLongRun.arena                                16  avgt   30    616.261 ±  14.974  ns/op
AllocatorsForLongRun.arena                               200  avgt   30   6574.496 ±  55.602  ns/op
AllocatorsForLongRun.malloc_free                           1  avgt   30     25.888 ±   0.272  ns/op
AllocatorsForLongRun.malloc_free                          16  avgt   30    602.258 ±  11.776  ns/op
AllocatorsForLongRun.malloc_free                         200  avgt   30  10126.972 ± 151.182  ns/op
AllocatorsForLongRun.pool_allocator                        1  avgt   30     35.907 ±   0.474  ns/op
AllocatorsForLongRun.pool_allocator                       16  avgt   30    378.874 ±   8.533  ns/op
AllocatorsForLongRun.pool_allocator                      200  avgt   30   4489.656 ±  40.615  ns/op
AllocatorsForLongRun.pool_allocator_exhausted              1  avgt   30     65.074 ±   3.399  ns/op
AllocatorsForLongRun.pool_allocator_exhausted             16  avgt   30    994.809 ±  22.971  ns/op
AllocatorsForLongRun.pool_allocator_exhausted            200  avgt   30  16247.051 ± 223.768  ns/op
AllocatorsForLongRun.pool_direct                           1  avgt   30     15.827 ±   0.398  ns/op
AllocatorsForLongRun.pool_direct                          16  avgt   30    269.499 ±   3.384  ns/op
AllocatorsForLongRun.pool_direct                         200  avgt   30   3491.204 ±  35.959  ns/op

It seems that 16 allocations is roughly the break-even point for the arena (616 vs 602 ns/op). Beyond that, at 200 allocations, the arena beats malloc (6574 vs 10127 ns/op), and I would expect the arena's advantage to keep growing with the number of allocations. Malloc/free is still surprisingly good, all things considered, and is especially hard to beat for single-shot allocations.
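The break-even reasoning is easier to see per allocation. This sketch divides each avgt score by the `allocations` parameter, using only the arena and malloc_free rows copied from the table. It assumes that each benchmark op performs exactly `allocations` allocations. `PerAllocationCost` is a hypothetical helper and is not part of the benchmark.

```java
import java.util.Locale;

class PerAllocationCost {

    // Scores (ns/op) copied from the JMH table; index i matches ALLOCATIONS[i].
    static final int[] ALLOCATIONS = {1, 16, 200};
    static final double[] ARENA_NS = {185.742, 616.261, 6574.496};
    static final double[] MALLOC_FREE_NS = {25.888, 602.258, 10126.972};

    /** Average cost of one allocation within an op, assuming n allocations per op. */
    static double perAllocation(double scoreNs, int allocations) {
        return scoreNs / allocations;
    }

    public static void main(String[] args) {
        for (int i = 0; i < ALLOCATIONS.length; i++) {
            int n = ALLOCATIONS[i];
            // Locale.ROOT keeps '.' as the decimal separator regardless of platform locale.
            System.out.printf(Locale.ROOT, "n=%3d  arena=%7.2f ns  malloc_free=%7.2f ns%n",
                    n, perAllocation(ARENA_NS[i], n), perAllocation(MALLOC_FREE_NS[i], n));
        }
    }
}
```

On these numbers the arena's per-allocation cost falls as the count grows, because its fixed setup cost is amortized, while malloc_free's per-allocation cost does not fall.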

The problem I see with the pool strategy is that it's faster than malloc, but not radically so (there's no 10x here; pool_allocator is about 2.3x faster at 200 allocations). You also have to consider both the best case and the worst case: the best case is better than malloc, while the worst case (exhausted) is worse. So it looks like something that can be great if that's what a program needs, provided it's used as intended. But it doesn't (yet?) seem to deliver the kind of horizontal, across-the-board scaling that would justify its inclusion in the API (although it's great to see that such an allocator can be written on top of the API).
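The best-case versus worst-case comparison can be expressed as the ratio of the malloc_free score to each pool variant's score. A value above 1 means the variant was faster than malloc/free, and a value below 1 means it was slower. This is a sketch with the scores copied from the table. `PoolVsMalloc` is a hypothetical helper, and error bars are ignored, so ratios close to 1 should not be over-interpreted.

```java
import java.util.Locale;

class PoolVsMalloc {

    // Scores (ns/op) copied from the JMH table; index i matches ALLOCATIONS[i].
    static final int[] ALLOCATIONS = {1, 16, 200};
    static final double[] MALLOC_FREE = {25.888, 602.258, 10126.972};
    static final double[] POOL_ALLOCATOR = {35.907, 378.874, 4489.656};
    static final double[] POOL_EXHAUSTED = {65.074, 994.809, 16247.051};
    static final double[] POOL_DIRECT = {15.827, 269.499, 3491.204};

    /** How many times faster the candidate is than malloc/free (lower ns/op is better). */
    static double speedup(double mallocNs, double candidateNs) {
        return mallocNs / candidateNs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < ALLOCATIONS.length; i++) {
            System.out.printf(Locale.ROOT,
                    "n=%3d  pool_allocator=%.2fx  exhausted=%.2fx  pool_direct=%.2fx%n",
                    ALLOCATIONS[i],
                    speedup(MALLOC_FREE[i], POOL_ALLOCATOR[i]),
                    speedup(MALLOC_FREE[i], POOL_EXHAUSTED[i]),
                    speedup(MALLOC_FREE[i], POOL_DIRECT[i]));
        }
    }
}
```

Every ratio stays well below 10, and the exhausted variant stays below 1 at every allocation count. This is the "faster, but not radically, and worse when exhausted" shape described above.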

What I like about the pool, though, is the approach you took for the API. When we look at allocators again (as I said, we were looking at some other allocators, not too different from what you are trying to do here), I think the API we end up offering will probably be very similar to what you have here: it's spot on, and it plays to the strengths of the new memory API.

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/509