RFR: 8344332: (bf) Migrate DirectByteBuffer to use java.lang.ref.Cleaner [v13]

Kim Barrett kbarrett at openjdk.org
Sun Feb 23 10:23:05 UTC 2025


On Tue, 18 Feb 2025 14:33:16 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> DirectByteBuffers are still using old `jdk.internal.ref.Cleaner` implementation. That implementation carries a doubly-linked list, and so makes DBB suffer from the same issue fixed for generic `java.lang.ref.Cleaner` users with [JDK-8343704](https://bugs.openjdk.org/browse/JDK-8343704). See the bug for the reproducer.
>> 
>> We can migrate DBBs to use `java.lang.ref.Cleaner`.
>> 
>> There are two peculiarities during this rewrite.
>> 
>> First, the old ad-hoc `Cleaner` implementation used to exit the VM when cleaning action failed. I presume it was to avoid memory leak / accidental reuse of the buffer. I moved the relevant block to `Deallocator` directly. Unfortunately, I cannot easily test it.
>> 
>> Second is quite a bit hairy. Old DBB cleaning code was hooked straight into `Reference` processing loop. This was possible because we could infer that the weak references we are processing were DBB cleaning actions, since old `Cleaner` was the only use of this code. With standard `Cleaner`, we have lost this association, and so we cannot really do this from the reference processing loop. With the patched version, we now rely on normal `Cleaner` thread to do cleanups for us.
>> 
>> Because of this, there is a new outpacing opportunity window where reference processing might have been over, but cleaner thread has not reacted yet. This is why we need another way to check progress that involves checking if cleaner has acted.
>> 
>> Additional testing:
>>  - [x] Linux x86_64 server fastdebug, `java/nio java/io`
>>  - [x] Linux AArch64 server fastdebug, `java/nio java/io`
>>  - [x] Linux x86_64 server fastdebug, `all`
>>  - [x] Linux AArch64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
> 
>   A bit more comments

I've made a prototype that adds Cleaner.waitForCleaning() and changes
Bits.reserveMemory() to use it.

https://github.com/openjdk/jdk/compare/master...kimbarrett:openjdk-jdk:wait-for-cleaning?expand=1

It's based on the shipilev commits 1-19, for ease of comparison and
prototyping. There are also some shortcuts (like making some things public
that likely shouldn't be), again for ease of prototyping.
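The idea can be sketched outside java.base with a stand-in: count
registered-but-not-yet-run cleaning actions, and have a waitForCleaning that
requests GCs until the count drains. All names here (track, waitForCleaning,
PENDING) are hypothetical scaffolding, not the prototype's actual API:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

public class WaitForCleaningSketch {
    static final Cleaner CLEANER = Cleaner.create();
    static final AtomicInteger PENDING = new AtomicInteger();

    // Register a cleaning action and count it as pending until it runs.
    static void track(Object referent, Runnable action) {
        PENDING.incrementAndGet();
        CLEANER.register(referent, () -> {
            try {
                action.run();
            } finally {
                PENDING.decrementAndGet();
                synchronized (PENDING) { PENDING.notifyAll(); }
            }
        });
    }

    // Wait (bounded) until every registered action has run, nudging the
    // collector along the way, roughly as Bits.reserveMemory would.
    static boolean waitForCleaning(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (PENDING) {
            while (PENDING.get() > 0) {
                long remainMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainMs <= 0) return false;
                System.gc();
                PENDING.wait(Math.min(remainMs, 50));
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        track(new byte[1024], () -> {});  // referent is unreachable immediately
        System.out.println("cleaned: " + waitForCleaning(10_000));
    }
}
```

Because cleaning is GC-driven, the wait is bounded rather than open-ended; the
real prototype hooks into Cleaner internals rather than counting externally.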

I ran the DirectBufferAllocTest test a few hundred times without any
problems.  I also ran it directly, configured with a 1/2 hour runtime, again
without problems.  I ran the two new micros against baseline, the shipilev
changes only, and the waitForCleaning changes.  Results below.

DirectByteBufferGC
baseline
Benchmark                (count)  Mode  Cnt    Score    Error  Units
DirectByteBufferGC.test    16384  avgt    9    4.036 ±  0.244  ms/op
DirectByteBufferGC.test    65536  avgt    9    7.117 ±  0.184  ms/op
DirectByteBufferGC.test   262144  avgt    9   18.482 ±  1.064  ms/op
DirectByteBufferGC.test  1048576  avgt    9  194.590 ± 17.639  ms/op
DirectByteBufferGC.test  4194304  avgt    9  802.354 ± 57.576  ms/op

shipilev
Benchmark                (count)  Mode  Cnt    Score    Error  Units
DirectByteBufferGC.test    16384  avgt    9    4.104 ±  0.045  ms/op
DirectByteBufferGC.test    65536  avgt    9    6.875 ±  0.244  ms/op
DirectByteBufferGC.test   262144  avgt    9   17.943 ±  0.446  ms/op
DirectByteBufferGC.test  1048576  avgt    9   59.002 ±  3.616  ms/op
DirectByteBufferGC.test  4194304  avgt    9  331.328 ± 36.580  ms/op

waitForCleaning
Benchmark                (count)  Mode  Cnt    Score    Error  Units
DirectByteBufferGC.test    16384  avgt    9    4.058 ±  0.167  ms/op
DirectByteBufferGC.test    65536  avgt    9    6.775 ±  0.281  ms/op
DirectByteBufferGC.test   262144  avgt    9   17.724 ±  0.602  ms/op
DirectByteBufferGC.test  1048576  avgt    9   57.284 ±  2.679  ms/op
DirectByteBufferGC.test  4194304  avgt    9  330.002 ± 44.200  ms/op

DirectByteBufferChurn
baseline
Benchmark                   (recipFreq)  Mode  Cnt   Score   Error  Units
DirectByteBufferChurn.test          128  avgt    9  15.693 ± 0.991  ns/op
DirectByteBufferChurn.test          256  avgt    9  11.255 ± 0.369  ns/op
DirectByteBufferChurn.test          512  avgt    9   9.422 ± 0.382  ns/op
DirectByteBufferChurn.test         1024  avgt    9   8.529 ± 0.311  ns/op
DirectByteBufferChurn.test         2048  avgt    9   7.833 ± 0.287  ns/op

shipilev
Benchmark                   (recipFreq)  Mode  Cnt   Score   Error  Units
DirectByteBufferChurn.test          128  avgt    9  12.524 ± 0.476  ns/op
DirectByteBufferChurn.test          256  avgt    9   9.248 ± 0.175  ns/op
DirectByteBufferChurn.test          512  avgt    9   8.583 ± 0.197  ns/op
DirectByteBufferChurn.test         1024  avgt    9   8.227 ± 0.274  ns/op
DirectByteBufferChurn.test         2048  avgt    9   7.526 ± 0.394  ns/op

waitForCleaning
Benchmark                   (recipFreq)  Mode  Cnt   Score   Error  Units
DirectByteBufferChurn.test          128  avgt    9  11.235 ± 1.998  ns/op
DirectByteBufferChurn.test          256  avgt    9   9.129 ± 0.324  ns/op
DirectByteBufferChurn.test          512  avgt    9   8.446 ± 0.115  ns/op
DirectByteBufferChurn.test         1024  avgt    9   7.786 ± 0.425  ns/op
DirectByteBufferChurn.test         2048  avgt    9   7.477 ± 0.150  ns/op

The baseline DirectByteBufferGC suffers significantly at higher counts,
probably because of the long doubly-linked list of registered
jdk.internal.ref.Cleaner objects.

By eye, the performance of both shipilev and waitForCleaning on the DBBChurn
benchmark seems consistently slightly better than baseline, and the two are
within the noise of each other on both benchmarks.

This waitForCleaning approach raises several questions.

(1) Is the improved reliability of this approach worth the additional effort?
My inclination is to say yes.  But note that the effort should perhaps be
spent differently - see item (2).

(2) Should Cleaner be responsible for providing the additional functionality?
An alternative is to have Bits implement its own cleaning, directly using a
private PhantomReference derivative, ReferenceQueue, and Thread.
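A minimal sketch of what such a bespoke implementation might look like,
assuming a PhantomReference subclass that carries the deallocation state and a
dedicated daemon thread draining the queue (all names here are hypothetical):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class BitsStyleCleaner {
    static final AtomicLong RESERVED = new AtomicLong();
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Keep the references strongly reachable until they are processed.
    static final Set<Deallocator> LIVE = ConcurrentHashMap.newKeySet();

    static final class Deallocator extends PhantomReference<Object> {
        final long size;
        Deallocator(Object buffer, long size) {
            super(buffer, QUEUE);
            this.size = size;
        }
        void clean() {
            RESERVED.addAndGet(-size);  // stands in for freeing native memory
        }
    }

    static {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    Deallocator d = (Deallocator) QUEUE.remove();
                    LIVE.remove(d);
                    d.clean();
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "bits-cleaner");
        t.setDaemon(true);
        t.start();
    }

    // Stands in for allocating a direct buffer of the given size.
    static Object allocate(long size) {
        Object buffer = new Object();
        RESERVED.addAndGet(size);
        LIVE.add(new Deallocator(buffer, size));
        return buffer;
    }
}
```

With cleaning under Bits' own control, a waitForCleaning equivalent could
consult QUEUE and LIVE directly, with no canary needed.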

If there were multiple (potential) clients for this sort of thing then putting
it in Cleaner may make sense, so long as their requirements are similar.  But
we're somewhat fighting with it, especially with the canary approach, and even
with the waitForCleaning approach.  I think things could be simpler and easier to
understand with a direct implementation.  And Cleaner was not intended to be
the last word in the area of reference-based cleanup.  Rather, it's a
convenient package for addressing some common patterns (esp. with
finalization).  It provides some useful infrastructure that would need to be
replaced in a bespoke implementation, but I don't think there's anything all
that large or complicated there, and some streamlining is possible.

Note that I haven't done any work in that direction yet.

(3) Do we need the retry stuff, and how should it be done?  I did some
monitoring of the retries, and waitForCleaning DBBA does need the retries
(though rarely more than 1, and I never saw more than 2).  I kind of feel like
the retries are a kludgy workaround for what seems like an unreasonable test,
but oh well.

The point of the retries is to allow other threads to drop some DBBs that can
become reclaimable during the associated wait, giving the waiting thread an
opportunity to make use of that now available memory.

I experimented with a retry-style more similar to the current code, where each
thread on the slow path will do at most one GC and then go into a wait/sleep
loop if reservation continues to fail.  (I added some locking to avoid the
concurrent requests for GC issue with the current code.) That didn't seem to
make any difference compared to something like the proposed serialized GC and
sleep loop.  I can construct theoretical cases that seem to favor either, but
the style in the proposal seems simpler.

Of course, the question of when to do GCs and how many to do is moot if
-XX:+DisableExplicitGC is used.  (The default for that option is false.)
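The retry shape described above - one serialized GC request followed by a
bounded sleep/retry loop - can be sketched as follows; the reservation
bookkeeping is simulated with an atomic counter, and all names and constants
are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ReserveRetrySketch {
    static final long MAX = 1_000_000;
    static final AtomicLong RESERVED = new AtomicLong();
    static final Object GC_LOCK = new Object();  // serializes GC requests

    // Fast path: try to reserve without blocking.
    static boolean tryReserve(long size) {
        while (true) {
            long cur = RESERVED.get();
            if (cur + size > MAX) return false;
            if (RESERVED.compareAndSet(cur, cur + size)) return true;
        }
    }

    // Slow path: one (serialized) GC request, then a bounded sleep/retry
    // loop that gives the cleaner thread time to release memory.
    static boolean reserve(long size, int maxRetries, long sleepMs)
            throws InterruptedException {
        if (tryReserve(size)) return true;
        synchronized (GC_LOCK) {
            System.gc();  // no-op under -XX:+DisableExplicitGC
        }
        for (int i = 0; i < maxRetries; i++) {
            if (tryReserve(size)) return true;
            Thread.sleep(sleepMs);  // let cleaning catch up
        }
        return tryReserve(size);
    }
}
```

The GC_LOCK serialization avoids the concurrent-GC-request issue mentioned
above; each retry gives other threads' dropped buffers a chance to be cleaned
and their memory reused by the waiting thread.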

src/java.base/share/classes/java/nio/Bits.java line 149:

> 147:         //   1. GC needs to discover dead references and hand them over to Reference
> 148:         //      processing thread. This activity can be asynchronous and can complete after
> 149:         //      we unblock from System.gc().

Adding cleared references to the pending list is always completed before the
GC invocation completes.  Doing otherwise would break or significantly
complicate Reference.waitForReferenceProcessing(), and to no good purpose.
That function should only return false when all references cleared by a prior
GC have been enqueued in their respective queues.  There are tests that depend
on that.  (I looked, and so far as I can tell, none of the extant GCs violate
this.)

src/java.base/share/classes/java/nio/Bits.java line 166:

> 164:         // install the canary, call System.gc(), wait for canary to get processed (dead). This
> 165:         // signals that since our last call to System.gc(), steps (1) and (2) have finished, and
> 166:         // step (3) is currently in progress.

The canary having been processed doesn't tell us anything definitive about
step (2), because steps (2) and (3) occur concurrently.  The reference
processing thread transfers references to their respective queues, and the
cleaner thread processes references that have been placed in its queue, both
running at the same time.  All we know is that the canary got transferred and
cleaned.  There may or may not have been other references similarly
transferred and cleaned.  There may or may not be more references in the
pending list, in the cleaner queue, or both.  That the canary has been cleaned
doesn't give us any information about either the pending list or the cleaner
queue.
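To make the canary pattern concrete, here is a sketch (names hypothetical).
Note what the return value does and does not mean: the wait ends as soon as
the canary's own action has run, which says nothing about other entries still
sitting in the pending list or the cleaner queue:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CanarySketch {
    static final Cleaner CLEANER = Cleaner.create();

    // Plant a canary, request GCs, and wait for the canary's cleaning
    // action to run. This proves the reference pipeline made progress
    // past the canary - not that the pending list or queue is empty.
    static boolean gcAndWaitForCanary(long timeoutMs) throws InterruptedException {
        CountDownLatch cleaned = new CountDownLatch(1);
        CLEANER.register(new Object(), cleaned::countDown);  // canary dies at once
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!cleaned.await(50, TimeUnit.MILLISECONDS)) {
            if (System.nanoTime() > deadline) return false;
            System.gc();
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("canary cleaned: " + gcAndWaitForCanary(10_000));
    }
}
```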

src/java.base/share/classes/java/nio/Bits.java line 172:

> 170:         // corner case would be handled with a normal retry attempt, after trying to allocate.
> 171:         // If allocation succeeds even after partial cleanup, we are done. If it does not, we get
> 172:         // to try again, this time reliably getting the results of the first cleanup run. Not

After trying again, all we know is that both the previous and the new canary
have been processed.  We don't know anything about other references, either
from the latest or preceding GCs.  Consider this unfortunate scenario.  The
first canary ended up at the head of the pending list.  The reference
processing thread transferred it to the cleaner queue and then stalled.  The
cleaner thread processed the first canary.  The waiting thread noted that,
retried and failed the reservation, and retried the GC and wait for the
canary.  The retry canary also happened to end up at the front of the updated
pending list.  The reference processor transferred it and again stalled.  The
cleaner thread processed the retry canary.  No real cleaning has happened,
i.e. we have not reliably gotten the results of the first cleanup run.

-------------

Changes requested by kbarrett (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22165#pullrequestreview-2635485651
PR Review Comment: https://git.openjdk.org/jdk/pull/22165#discussion_r1966720234
PR Review Comment: https://git.openjdk.org/jdk/pull/22165#discussion_r1966720361
PR Review Comment: https://git.openjdk.org/jdk/pull/22165#discussion_r1966720509


More information about the core-libs-dev mailing list