RFR (S) CR 6857566: (bf) DirectByteBuffer garbage creation can outpace reclamation

Peter Levart peter.levart at gmail.com
Sun Oct 6 22:56:13 UTC 2013


Hi Again,

The result of my experimentation is as follows:

Letting the ReferenceHandler thread alone en-queue References and execute 
Cleaners is not enough to prevent OOMEs when allocation is performed by a 
large number of threads, even if I let the Cleaners do only a synchronous 
announcement of what will be freed (very fast), delegate the actual 
de-allocation to a background thread, and base the reservation waiting on 
the announced free space (still waiting until the space is actually 
deallocated and unreserved before satisfying a reservation request, but 
waiting as long as it takes if the announced free space is enough for the 
request).
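
For illustration, the Cleaner thunk in that experiment did something along 
these lines (class, field and queue names here are mine, chosen for the 
sketch, not taken from any webrev):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // sketch of the "announce synchronously, free in the background" thunk
    class Deallocator implements Runnable {
        // space that has been announced as "about to be freed"
        static final AtomicLong ANNOUNCED_FREE = new AtomicLong();
        // a background thread drains this queue, calls unsafe.freeMemory(address)
        // and only then unreserves 'size'
        static final BlockingQueue<Deallocator> PENDING_FREE = new LinkedBlockingQueue<>();

        private final long address;
        private final long size;

        Deallocator(long address, long size) {
            this.address = address;
            this.size = size;
        }

        // invoked synchronously by the ReferenceHandler thread via Cleaner:
        // a cheap announcement only; the expensive free happens later
        public void run() {
            ANNOUNCED_FREE.addAndGet(size);
            PENDING_FREE.add(this);
        }
    }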

The ReferenceHandler thread, when it finds that it has no more pending 
References, parks and waits for notification from the VM. The VM promptly 
processes references (hooks them on the pending list), but with saturated 
CPUs, waking up the ReferenceHandler thread and re-acquiring the lock 
takes too much time. During that time the allocating threads can reserve 
the whole permitted space and an OOME must be thrown. So I'm back to 
strategy #1 - helping the ReferenceHandler thread.

It's not so much about helping to achieve better throughput (as I noted, 
deallocating cannot be effectively parallelized) but about overcoming the 
latency of waking up the ReferenceHandler thread. Here's my attempt at 
doing this:

http://cr.openjdk.java.net/~plevart/jdk8-tl/DyrectBufferAlloc/webrev.01/

This is much simplified compared to my first submission of a similar 
strategy. I tried to be as undisruptive to the current logic of Reference 
processing as possible, but of course you decide whether this is still too 
risky for inclusion into JDK8. Cleaner is unchanged - it processes its 
thunk synchronously and the ReferenceHandler thread invokes it directly. 
The ReferenceHandler logic is the same - I just factored out the content 
of the loop into a private method so that it can be called from nio Bits, 
where the bulk of the change lies.
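
For context, the factored-out method looks roughly like this. It is a 
fragment meant to live inside java.lang.ref.Reference, modeled on the 
existing ReferenceHandler loop; the name tryHandlePending and the exact 
shape are my approximation - the webrev is authoritative:

    // inside java.lang.ref.Reference (sketch) - one round of the
    // ReferenceHandler loop: take one Reference off the pending chain and
    // either run its Cleaner directly or enqueue it; returns false when
    // there was nothing pending (and we chose not to wait)
    private static boolean tryHandlePending(boolean waitForNotify) {
        Reference<Object> r;
        Cleaner c;
        try {
            synchronized (lock) {
                if (pending != null) {
                    r = pending;
                    // 'instanceof' may throw OOME, hence the OOME handler below
                    c = r instanceof Cleaner ? (Cleaner) r : null;
                    // unlink 'r' from the pending chain
                    pending = r.discovered;
                    r.discovered = null;
                } else {
                    if (waitForNotify) {
                        lock.wait();   // only the ReferenceHandler thread waits
                    }
                    return waitForNotify;
                }
            }
        } catch (OutOfMemoryError x) {
            Thread.yield();            // give other threads a chance, then retry
            return true;
        } catch (InterruptedException x) {
            return true;
        }

        if (c != null) {
            c.clean();                 // Cleaner thunk runs synchronously, as before
            return true;
        }

        ReferenceQueue<? super Object> q = r.queue;
        if (q != ReferenceQueue.NULL) q.enqueue(r);
        return true;
    }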

The (un)reservation logic is re-implemented with atomic operations - no 
locks. When a large number of threads compete for reservation, locking 
overhead can be huge and can slow down unreservation (which must use the 
same lock as reservation). The reservation retry logic first tries to 
satisfy the reservation request while helping the ReferenceHandler thread 
en-queue References and execute Cleaners, until the list of pending 
references is exhausted. If this does not succeed, it triggers the VM to 
process references (System.gc()) and then enters a similar retry loop, but 
introduces an exponentially increasing back-off delay every time the chain 
of pending references is exhausted, starting with a 1 ms sleep and 
doubling. This gives the VM time to process the references. The maximum 
number of sleeps is 9, giving a maximum accumulated sleep time of about 
0.5 s. This means that a request that rightfully throws OOME will do so 
after about 0.5 s of sleeping.
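
To make the retry structure concrete, here is a rough sketch of its shape 
as it would sit in java.nio.Bits. Field and helper names (tryReserveMemory, 
tryHandlePendingReference, MAX_SLEEPS) are approximate and simplified; the 
webrev is authoritative:

    import java.util.concurrent.atomic.AtomicLong;

    // fields in Bits (sketch): atomics instead of a lock
    static final long maxMemory = sun.misc.VM.maxDirectMemory(); // -XX:MaxDirectMemorySize
    static final AtomicLong totalCapacity  = new AtomicLong();   // capacity in use
    static final AtomicLong reservedMemory = new AtomicLong();   // bytes reserved
    static final AtomicLong count          = new AtomicLong();   // number of buffers
    static final int MAX_SLEEPS = 9;                             // 1+2+...+256 ms ~ 0.5 s

    // lock-free reservation attempt: CAS on the total capacity
    private static boolean tryReserveMemory(long size, int cap) {
        long totalCap;
        while (cap <= maxMemory - (totalCap = totalCapacity.get())) {
            if (totalCapacity.compareAndSet(totalCap, totalCap + cap)) {
                reservedMemory.addAndGet(size);
                count.incrementAndGet();
                return true;
            }
        }
        return false;
    }

    static void reserveMemory(long size, int cap) {
        // 1) optimistic attempt, no help needed
        if (tryReserveMemory(size, cap)) {
            return;
        }

        boolean interrupted = false;
        try {
            // 2) help the ReferenceHandler thread: handle pending References
            //    (running Cleaners directly) until the pending chain is empty;
            //    tryHandlePendingReference() calls the factored-out helper in
            //    java.lang.ref.Reference
            while (tryHandlePendingReference()) {
                if (tryReserveMemory(size, cap)) {
                    return;
                }
            }

            // 3) trigger the VM to hook unreachable buffers' Cleaners onto the
            //    pending chain, then retry with exponential back-off
            System.gc();
            long sleepTime = 1;
            int sleeps = 0;
            while (true) {
                if (tryReserveMemory(size, cap)) {
                    return;
                }
                if (sleeps >= MAX_SLEEPS) {
                    break;
                }
                if (!tryHandlePendingReference()) {
                    try {
                        Thread.sleep(sleepTime);
                        sleepTime <<= 1;
                        sleeps++;
                    } catch (InterruptedException e) {
                        interrupted = true;
                    }
                }
            }

            // 4) still no space: the request rightfully fails
            throw new OutOfMemoryError("Direct buffer memory");
        } finally {
            if (interrupted) {
                // don't swallow interrupts taken while sleeping
                Thread.currentThread().interrupt();
            }
        }
    }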

I did the following measurement: using LongAdders (to avoid perturbing the 
measurement; a small sketch of the counters follows the results below), I 
counted the various exit paths from Bits.reserveMemory() during a test 
that spawned 128 allocating threads on a 4-core i7 machine, allocating 
direct buffers randomly sized between 256KB and 1MB for 60 seconds, using 
-XX:MaxDirectMemorySize=512m:

A total of 909960 allocations were performed:

- 247993 were satisfied before attempting to help ReferenceHandler thread
- 660184 were satisfied while helping ReferenceHandler thread but before 
triggering System.gc()
- 1783 were satisfied after triggering System.gc() but before doing any 
sleep
- no sleeping has been performed

The same test, just changing to -XX:MaxDirectMemorySize=128m (which means 
1MB per thread, with each thread allocating direct buffers randomly sized 
between 256KB and 1MB):

A total of 579943 allocations were performed:

- 131547 were satisfied before attempting to help ReferenceHandler thread
- 438345 were satisfied while helping ReferenceHandler thread but before 
triggering System.gc()
- 10016 were satisfied after triggering System.gc() but before doing any 
sleep
- 34 were satisfied after sleep(1)
- 1 was satisfied after sleep(1) followed by sleep(2)
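
For reference, the counting itself is trivial; something along these lines, 
with counter names of my own choosing (they are not part of the webrev):

    import java.util.concurrent.atomic.LongAdder;

    // measurement-only counters, bumped on each exit path of reserveMemory();
    // LongAdder keeps contention low, so the instrumentation itself does not
    // noticeably perturb the behavior being measured
    static final LongAdder SATISFIED_OPTIMISTICALLY = new LongAdder();
    static final LongAdder SATISFIED_WHILE_HELPING  = new LongAdder();
    static final LongAdder SATISFIED_AFTER_GC       = new LongAdder();
    static final LongAdder SATISFIED_AFTER_SLEEP    = new LongAdder();

    // e.g. in the optimistic path:
    //     if (tryReserveMemory(size, cap)) { SATISFIED_OPTIMISTICALLY.increment(); return; }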


That's it. I think this is good enough for testing on a larger scale. I 
have also included a modified DirectBufferAllocTest as a unit test, but I 
don't know whether it's suitable since it takes 60 s to run. The run time 
could be lowered, at the cost of a lower probability of catching OOMEs.
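
For reference, the shape of such a stress test is roughly the following 
(parameter values taken from the description above; the actual 
DirectBufferAllocTest in the webrev may differ):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ThreadLocalRandom;

    // run with e.g. -XX:MaxDirectMemorySize=512m (or 128m)
    public class DirectBufferAllocStress {
        static final int THREADS = 128;
        static final int MIN = 256 * 1024, MAX = 1024 * 1024;
        static final long RUN_MILLIS = 60_000;

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[THREADS];
            for (int i = 0; i < THREADS; i++) {
                workers[i] = new Thread(() -> {
                    long deadline = System.currentTimeMillis() + RUN_MILLIS;
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    while (System.currentTimeMillis() < deadline) {
                        // allocate and immediately drop the reference; reclamation
                        // depends entirely on Cleaner processing keeping up
                        ByteBuffer.allocateDirect(rnd.nextInt(MIN, MAX + 1));
                    }
                });
                workers[i].start();
            }
            for (Thread w : workers) {
                w.join();   // an OOME in any worker thread indicates failure
            }
        }
    }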

So what do you think? Is this still too risky for JDK8?


Regards, Peter

On 10/06/2013 01:19 PM, Peter Levart wrote:
> Hi,
>
> I agree the problem with de-allocation of native memory blocks should 
> be studied deeply and this takes time.
>
> What I have observed so far on Linux platform (other platforms may 
> behave differently) is the following:
>
> Deallocation of native memory with Unsafe.freeMemory(address) can take 
> varying amounts of time. It can grow to a constant several 
> milliseconds to free a 1MB block, for example, when there are already 
> lots of blocks allocated and multiple threads are constantly 
> allocating more. I'm not sure yet about the main reasons for that, but 
> it could be contention with allocation from multiple threads, 
> interaction with GC, or even the algorithm used in the native 
> allocator. Deallocation is also not very parallelizable. My 
> observation is that deallocating with 2 threads (on a 4-core CPU) does 
> not help much.
>
> The current scheme of deallocating in the ReferenceHandler thread means 
> that a lot of "pending" Cleaner objects can accumulate, and although the 
> VM has promptly processed the Cleaner PhantomReferences (hooked them on 
> the pending list), a lot of work still remains to actually free the 
> native blocks. This clogs the ReferenceHandler thread and affects other 
> Reference processing. It also presents difficulties for a back-off 
> strategy when allocating native memory: the strategy has no information 
> on which to decide whether to wait longer or to fail with OOME.
>
> I'm currently experimenting with an approach where the Cleaner and 
> ReferenceHandler code stays as is, but the Cleaner's thunk (the 
> Deallocator in DirectByteBuffer) is modified so that it performs some 
> actions synchronously (announcing what will be de-allocated) and 
> delegates the actual deallocation and unreservation to a background 
> thread. The reservation strategy has more information on which to base 
> its back-off decisions that way. I'll let you know if I get some results 
> from that.
>
> Regards, Peter
>
> On 10/04/2013 08:39 PM, mark.reinhold at oracle.com wrote:
>> 2013/10/2 15:13 -0700, alan.bateman at oracle.com:
>>> BTW: Is this important enough to attempt to do this late in 8? I just
>>> wonder about a significant change like switching to weak references and
>>> whether it would be more sensible to hold it back to do early in 9.
>> I share your concern.  This is extraordinarily sensitive code.
>> Now is not the time to rewrite it for JDK 8.
>>
>> - Mark
>



