Analysis on JDK-8022321 java/lang/ref/OOMEInReferenceHandler.java fails intermittently

Wed Mar 23 20:49:47 UTC 2016

On 03/23/2016 09:40 PM, Kim Barrett wrote:
>> On Mar 23, 2016, at 3:33 PM, Peter Levart <peter.levart at gmail.com> wrote:
>>
>> Hi Kim,
>>
>> On 03/23/2016 07:55 PM, Kim Barrett wrote:
>>>> On Mar 23, 2016, at 10:02 AM, Peter Levart <peter.levart at gmail.com>
>>>>   wrote:
>>>> ...so I checked what it would be needed if there was such getPendingReferences() native method. It turns out that a single native method would not be enough to support the precise direct ByteBuffer allocation. Here's a refactored webrev that introduces a getPendingReferences() method which could be turned into a native equivalent one day. There is another native method needed - int awaitEnqueuePhaseStart():
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~plevart/jdk9-dev/removeInternalCleaner/webrev.09.part2/
>>> I don't think the Reference.awaitEnqueuePhaseStart thing is needed.
>>>
>>> Rather, I think the Direct-X-Buffer allocation should conspire with
>>> the the Direct-X-Buffer cleanups directly to manage that sort of
>>> thing, and not add anything to Reference and the reference processing
>>> thread.  E.g. the phase and signal/wait are purely part of
>>> Direct-X-Buffer.  (I also think something like that could/should have
>>> been done instead of providing Direct-X-Buffer with access to
>>> Reference.tryHandlePending, but that's likely water under the bridge
>>> now.)
>>>
>>> Something very roughly like this:
>>>
>>> allocating thread, after allocation failed
>>>
>>> bool waitForCleanups() {
>>>    int epoch = DXB.getCleanupCounter();
>>>    long start = startTime();
>>>    long timeout = calcTimeout(start)
>>>    synchronized (DXB.getCleanupMonitor()) {
>>>      while (epoch == DBX.getCleanupCounter()) {
>>>        wait(timeout);
>>>        timeout = calcTimeout(start);
>>>        if (timeout <= 0) break;
>>>      }
>>>      return epoch != DBX.getCleanupCounter();
>>>    }
>>> }
>>>
>>> cleanup function, after freeing memory
>>>
>>>    synchronized (DBX.getCleanupMonitor()) {
>>>      DBX.incCleanupCounter();
>>>      DBX.getCleanupMonitor().notify_all();
>>>    }
>>>
>>> Actually, epoch should probably have been obtained *before* the failed
>>> allocation attempt, and should be an argument to waitForCleanups.
>>>
>>> That's all quite sketchy, but I need to do other things today.
>>>
>>> Peter, care to try filling this in?
>>>
>>>
>> There's no need to maintain a special cleanup counter as java.nio.Bits already maintains the amount of currently allocated direct memory (in bytes). What your suggestion leads to is similar to one of previous versions of java.nio.Bits which waited for some 'timeout' time after invoking System.gc() and then re-tried reservation, failing if it didn't succeed. The problem with such "asynchronous" approach is that there's no right value of 'timeout' for all situations. If you wait for to short time, you might get OOME although there are plenty unreachable but still uncleaned direct buffers. If you wait for to long, your throughput will suffer. There has to be some "feedback" from reference processing to know when there's still beneficial to wait and when there's no point in waiting any more.
>>
>> Regards, Peter
> I don't think there's any throughput penalty for a long timeout.  The
> proper response to waitForCleanups returning false (assuming the epoch
> was obtained early and passed as an argument) is OOME.  I really doubt
> the latency for reporting OOME is of critical importance.
>
> That is, the caller looks something like (not even pretending to write
> Java)
>
>    alloc = tryAllocatation(allocSize)
>    if alloc != NULL
>      return alloc
>    endif
>    // Maybe add a retry+wait with a short timeout here,
>    // to allow existing cleanups to run before requesting
>    // another gc.  Not clear that's really worthwhile, as
>    // it only comes up when we get here just after a gc
>    // and the resulting cleanups are not yet all processed.
>    System.gc()
>    while true
>      epoch = getEpoch()
>      alloc = tryAllocation(allocSize)
>      if alloc != NULL
>        return alloc
>      elif !waitForCleanup(epoch)
>        throw OOME  // No cleanup progress for a while
>      endif
>    end
>

Right, this is easier to understand. I already figured out what you 
wanted to say the 1st time. I'll try to prepare a prototype along this 
idea tomorrow.

Regards, Peter