RFR: 8347408: Create an internal method handle adapter for system calls with errno [v16]
Matthias Ernst
duke at openjdk.org
Thu Jan 16 15:21:40 UTC 2025
On Thu, 16 Jan 2025 11:58:20 GMT, Per Minborg <pminborg at openjdk.org> wrote:
>> Going forward, converting older JDK code to use the relatively new FFM API requires system calls that can provide `errno` and the likes to explicitly allocate a MemorySegment to capture potential error states. This can lead to negative performance implications if not designed carefully and also introduces unnecessary code complexity.
>>
>> Hence, this PR proposes to add a _JDK internal_ method handle adapter that can be used to handle system calls with `errno`, `GetLastError`, and `WSAGetLastError`.
>>
>> It currently relies on a thread-local cache of MemorySegments to elide allocations. If, in the future, a more efficient thread-associated allocation scheme becomes available, we could easily migrate to that one.
>>
>> Here are some benchmarks:
>>
>>
>> Benchmark                                        Mode  Cnt   Score   Error  Units
>> CaptureStateUtilBench.explicitAllocationFail     avgt   30  41.615 ± 1.203  ns/op
>> CaptureStateUtilBench.explicitAllocationSuccess  avgt   30  23.094 ± 0.580  ns/op
>> CaptureStateUtilBench.threadLocalFail            avgt   30  14.760 ± 0.078  ns/op
>> CaptureStateUtilBench.threadLocalReuseSuccess    avgt   30   7.189 ± 0.151  ns/op
>>
>>
>> Explicit allocation:
>>
>> try (var arena = Arena.ofConfined()) {
>>     return (int) HANDLE.invoke(arena.allocate(4), 0, 0);
>> }
>>
>>
>> Thread Local (tl):
>>
>> return (int) ADAPTED_HANDLE.invoke(arena.allocate(4), 0, 0);
>>
>>
>> The graph below shows the difference in latency for a successful call:
>>
>> [Graph omitted: latency of a successful call, explicit allocation vs. thread-local reuse]
>>
>> This is a ~3x improvement for both the happy and the error path.
>>
>>
>> Tested and passed tiers 1-3.
>
> Per Minborg has updated the pull request incrementally with one additional commit since the last revision:
>
> Remove unused class
As mentioned in https://mail.openjdk.org/pipermail/panama-dev/2025-January/020882.html, I've been looking at pretty much exactly the same thing for another buffer required in the call sequence, with slightly different takeaways:
* the implementation in your PR always leaves ownership of the segments inside the thread-local. Borrowing/releasing threads need to check "can I take this?" and "am I still on the right carrier, returning this to the right slot?". I find it easier to actually remove the element from the cache slot, and potentially return it to the cache slot of a different carrier thread. Ownership and responsibilities seem clearer to me that way.
* I've built on the assumption that the "CTL.get/test/modify" sequences are atomic with respect to virtual-thread scheduling (NIO code appears to rely on the same) and that no mounting/unmounting will happen inside them (though it _may_ happen within the acquire-invoke-release bracket; I have actually observed this between invoke and release). If rescheduling can happen inside these sequences and multiple threads contend for the carrier-thread-local element, I'm not sure there is a point to using a CTL in the first place.
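To illustrate the ownership-transfer design described above, here is a minimal sketch (not the PR's code; names and the element type are placeholders): take() removes the element from the slot entirely, so the borrower owns it outright and release() may legally hand it to a *different* carrier's slot.

```java
// Hypothetical sketch of a per-carrier-thread slot with ownership transfer.
// Thread-safety is intentionally ignored here; the point is the ownership
// protocol, not the synchronization.
final class OwnershipSlot<T> {
    private T element; // null while the element is borrowed (or never created)

    // Removes the element from the slot; the caller now owns it exclusively
    // and never needs to ask "am I still on the right carrier?" on release.
    T take() {
        T e = element;
        element = null;
        return e; // may be null: the caller must then allocate a fresh element
    }

    // Returns an element to this slot, which may belong to a different
    // carrier thread than the one it was taken from.
    // Returns false if the slot is already occupied (caller frees the element).
    boolean release(T e) {
        if (element != null) {
            return false;
        }
        element = e;
        return true;
    }
}
```

With this protocol, a borrowed element is simply absent from every slot, so the "taken" flag and the carrier-identity check disappear.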
src/java.base/share/classes/jdk/internal/foreign/CaptureStateUtil.java line 333:
> 331: SegmentPool cache = TL_POOLS.get();
> 332: if (cache == null) {
> 333: TL_POOLS.set(cache = new SegmentPool());
Why not use `initialValue`?
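For reference, the null-check-and-set pattern in the quoted lines could be written with `ThreadLocal.withInitial` (a sketch with a placeholder pool type; the real `SegmentPool` lives in the PR, and there may be a deliberate reason the PR avoids this):

```java
final class PoolHolder {
    // Placeholder stand-in for the PR's SegmentPool type.
    static final class SegmentPool { }

    // withInitial creates the pool lazily on each thread's first get(),
    // removing the explicit "if (cache == null) set(...)" dance.
    static final ThreadLocal<SegmentPool> TL_POOLS =
            ThreadLocal.withInitial(SegmentPool::new);
}
```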
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 124:
> 122: public void close() {
> 123: if (UNSAFE.compareAndSetInt(this, CLOSED_OFFSET, FALSE, TRUE)) {
> 124: if (UNSAFE.getIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET) == FALSE) {
If we follow the premise that "any scheduling can happen", then this `close` could be racing with another thread trying to borrow the pooled element of this already-dead carrier thread:

    VT at CT1:
        pool = tl.get()
        unmount
    CT1 dies =>
        pool.close()
        taken == false => recycle
    VT at CT2:
        mount
        pool.take() succeeds since `taken` is still false
        BOOM

I believe the close action would need to "take" the element before recycling it.
Wouldn't this obviate the need for a separate "closed" slot?
Again, I'm not sure whether this scheduling is actually possible at this time.
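A minimal sketch of the "close must take first" idea (hypothetical code, not the PR's; `recycled` stands in for freeing the segment): close() claims the element through the same `taken` flag that take() uses, so a racing take() on another carrier can never win after close() has, and the separate closed slot becomes unnecessary for this race.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ClosablePool {
    private final AtomicBoolean taken = new AtomicBoolean(false);
    private boolean recycled;

    // Succeeds only if the element was free; the winner owns the element.
    boolean take() {
        return taken.compareAndSet(false, true);
    }

    void release() {
        taken.set(false);
    }

    void close() {
        // Winning this CAS doubles as both "closed" and "taken":
        // any later take() fails, and we recycle exactly once.
        if (taken.compareAndSet(false, true)) {
            recycled = true; // stand-in for actually freeing the segment
        }
        // If the CAS fails, a borrower holds the element; a full
        // implementation would arrange for the releasing side to detect
        // the closed state and recycle on release instead.
    }

    boolean wasRecycled() {
        return recycled;
    }
}
```

The leftover case (borrower releases after close) still needs handling, but the take/close race itself is resolved by a single flag.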
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 132:
> 130: @ForceInline
> 131: private boolean takePooledElement() {
> 132: return UNSAFE.getAndSetInt(this, POOLED_ELEMENT_TAKEN_OFFSET, TRUE) == FALSE;
Does this need to use atomics? Similar code in sun.nio.ch.Util.BufferCache seems to work without them. My interpretation is that BufferCache relies on a virtual thread not being preempted inside the { CTL.get().takeElement } sequence. Would the same reasoning not work here?
I.e.

    element = carrierThreadLocal.get()   [1]
    if (element.x)                       [2]
    element.y                            [3]

a) can this virtual thread be moved to another carrier at [2] or [3]?
b) can this virtual thread be suspended and another one scheduled between [1] and [2]/[3]?

My understanding is that in the current implementation of virtual threads this does not happen, and we can manage the cache element with straight non-atomic code. Is that correct? If not, how does BufferCache manage (is it pinned somewhere)?
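Spelled out, the non-atomic variant the questions above assume would look like this sketch (hypothetical code; `Object` stands in for a MemorySegment). It is correct only under the stated assumption that no carrier switch or preemption can occur between the plain read at [2] and the plain write at [3]:

```java
final class PlainSlot {
    private boolean taken;                       // plain field, no volatile/CAS
    private final Object element = new Object(); // stand-in for a MemorySegment

    // Returns the element if free, else null.
    Object tryTake() {
        if (taken) {      // [2] plain read: safe only if no carrier switch
            return null;  //     can happen between [1]/[2] and [3]
        }
        taken = true;     // [3] plain write
        return element;
    }

    void release() {
        taken = false;
    }
}
```

If the scheduling assumption ever breaks, two virtual threads on the same carrier-thread-local could both pass [2] before either reaches [3] and hand out the same element.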
src/java.base/share/classes/jdk/internal/util/SingleElementPool.java line 137:
> 135: @ForceInline
> 136: private void releasePooledElement() {
> 137: UNSAFE.putIntVolatile(this, POOLED_ELEMENT_TAKEN_OFFSET, FALSE);
Similar to the above: the use of volatile suggests that you're concerned about a virtual-thread switch. It would be good to have clarification on this.
-------------
PR Review: https://git.openjdk.org/jdk/pull/22391#pullrequestreview-2556247903
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918610681
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918706463
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918637436
PR Review Comment: https://git.openjdk.org/jdk/pull/22391#discussion_r1918648090