RFR: 8371260: Improve scaling of downcalls using MemorySegments allocated with shared arenas, take 2 [v2]

Peter Levart plevart at openjdk.org
Sun Feb 22 15:01:11 UTC 2026


On Sun, 22 Feb 2026 14:54:57 GMT, Peter Levart <plevart at openjdk.org> wrote:

>> Hi,
>> 
>> When administering my mailing lists, my attention was drawn to this pull request: https://github.com/openjdk/jdk/pull/28575, which tries to tackle this scaling problem. Although it was dismissed, I remembered that I was dealing with a similar problem in the past, so I looked closely...
>> 
>> Here's an alternative take on the problem. It reuses a maintained public component of the JDK, LongAdder, so in this respect it does not add complexity or a maintenance burden. It also does not change the internal API of MemorySessionImpl. The patch is also smaller.
>> 
>> For experimenting and benchmarking, I created a separate implementation of just the acquire/release/close logic, with both the existing "simple" and this new "striped" implementation, here:
>> 
>> https://github.com/plevart/acquire-release-close
>> 
>> Running it on my 8-core (16-thread) Linux PC, it gives promising results without regression for single-threaded use:
>> 
>> 
>> ** Simple, measure run #1...
>> concurrency: 1, nanos: 39909697 (x 1.0)
>> concurrency: 2, nanos: 164735444 (x 4.127704702944751)
>> concurrency: 4, nanos: 394283724 (x 9.87939657873123)
>> concurrency: 8, nanos: 672278915 (x 16.84500172978011)
>> concurrency: 16, nanos: 2169282886 (x 54.3547821473062)
>> ** Simple, measure run #2...
>> concurrency: 1, nanos: 40318379 (x 1.0)
>> concurrency: 2, nanos: 163438657 (x 4.053701092496799)
>> concurrency: 4, nanos: 399382210 (x 9.905710991009832)
>> concurrency: 8, nanos: 694862623 (x 17.23438888750959)
>> concurrency: 16, nanos: 2182386494 (x 54.12882531810121)
>> ** Simple, measure run #3...
>> concurrency: 1, nanos: 39871197 (x 1.0)
>> concurrency: 2, nanos: 168843686 (x 4.234728292707139)
>> concurrency: 4, nanos: 375489497 (x 9.417562683156966)
>> concurrency: 8, nanos: 675885694 (x 16.951728186138983)
>> concurrency: 16, nanos: 2083500812 (x 52.255787856080666)
>> ** end.
>> 
>> ** Striped, measure run #1...
>> concurrency: 1, nanos: 36698350 (x 1.0)
>> concurrency: 2, nanos: 47349695 (x 1.290240433152989)
>> concurrency: 4, nanos: 58622304 (x 1.5974098018030782)
>> concurrency: 8, nanos: 60548173 (x 1.6498881557345222)
>> concurrency: 16, nanos: 70607406 (x 1.9239940215295783)
>> ** Striped, measure run #2...
>> concurrency: 1, nanos: 37217044 (x 1.0)
>> concurrency: 2, nanos: 38610020 (x 1.0374284427317764)
>> concurrency: 4, nanos: 39166893 (x 1.0523912914738742)
>> concurrency: 8, nanos: 51778829 (x 1.3912665659314587)
>> concurrency: 16, nanos: 70277394 (x 1.8883120862581133)
>> ** Striped, measu...
>
> Peter Levart has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8371260: Prevent two theoretical reorderings of volatile write beyond volatile read

So after applying JMM theory, some faults were discovered. I mitigated them with two explicit fullFence calls. This does introduce some overhead for single-threaded usage, though. Here's the comparison report again:


** Simple, measure run #1...
concurrency: 1, nanos: 39909697 (x 1.0)
concurrency: 2, nanos: 164735444 (x 4.127704702944751)
concurrency: 4, nanos: 394283724 (x 9.87939657873123)
concurrency: 8, nanos: 672278915 (x 16.84500172978011)
concurrency: 16, nanos: 2169282886 (x 54.3547821473062)
** Simple, measure run #2...
concurrency: 1, nanos: 40318379 (x 1.0)
concurrency: 2, nanos: 163438657 (x 4.053701092496799)
concurrency: 4, nanos: 399382210 (x 9.905710991009832)
concurrency: 8, nanos: 694862623 (x 17.23438888750959)
concurrency: 16, nanos: 2182386494 (x 54.12882531810121)
** Simple, measure run #3...
concurrency: 1, nanos: 39871197 (x 1.0)
concurrency: 2, nanos: 168843686 (x 4.234728292707139)
concurrency: 4, nanos: 375489497 (x 9.417562683156966)
concurrency: 8, nanos: 675885694 (x 16.951728186138983)
concurrency: 16, nanos: 2083500812 (x 52.255787856080666)
** end.

** Striped, measure run #1...
concurrency: 1, nanos: 58248553 (x 1.0)
concurrency: 2, nanos: 77375592 (x 1.3283693416384095)
concurrency: 4, nanos: 70015083 (x 1.2020055330816544)
concurrency: 8, nanos: 60701425 (x 1.0421104366317906)
concurrency: 16, nanos: 65387340 (x 1.1225573277331027)
** Striped, measure run #2...
concurrency: 1, nanos: 58836025 (x 1.0)
concurrency: 2, nanos: 78600629 (x 1.3359269087264138)
concurrency: 4, nanos: 63892822 (x 1.085947291646572)
concurrency: 8, nanos: 62594145 (x 1.063874471465399)
concurrency: 16, nanos: 89972108 (x 1.5292009954785355)
** Striped, measure run #3...
concurrency: 1, nanos: 59242988 (x 1.0)
concurrency: 2, nanos: 63316159 (x 1.0687536388272652)
concurrency: 4, nanos: 60279613 (x 1.0174978513912905)
concurrency: 8, nanos: 66596046 (x 1.1241169334672991)
concurrency: 16, nanos: 107654519 (x 1.8171689618356184)
** end.


There is a ~50% increase in latency for single-threaded usage, which is paid off whenever there is contention, where the Simple implementation degrades badly. I wonder what the results would look like on other hardware.
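To make the trade-off concrete, here is a minimal, hypothetical sketch of the striped acquire/release/close idea described above. It is not the actual patch: the class name `StripedSession` and the simplified close-failure handling are illustrative only. Acquires and releases go through a `LongAdder`, so contending threads hit different cells; `close` flips a volatile flag and then sums the adder. The two explicit `VarHandle.fullFence()` calls are the mitigation mentioned above: they prevent the striped write from being reordered past the subsequent volatile read (and vice versa on the closing side).

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only -- not the code from the PR.
class StripedSession {
    private final LongAdder acquires = new LongAdder();
    private volatile boolean closed;

    void acquire() {
        acquires.increment();      // striped write: cheap under contention
        VarHandle.fullFence();     // keep the increment ordered before the read of 'closed'
        if (closed) {
            acquires.decrement();  // roll back the speculative acquire
            throw new IllegalStateException("Already closed");
        }
    }

    void release() {
        acquires.decrement();
    }

    boolean tryClose() {
        closed = true;             // volatile write: no new acquires may succeed past here
        VarHandle.fullFence();     // keep the write ordered before summing the cells
        if (acquires.sum() != 0) { // outstanding acquires still in flight
            closed = false;        // simplified: the real logic must handle this race carefully
            return false;
        }
        return true;
    }
}
```

The fences are exactly where the single-threaded cost comes from: a `fullFence` on every acquire is what the ~50% latency increase above is buying, in exchange for acquires that scale across cells instead of serializing on one counter.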

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29866#issuecomment-3941129435
