RFR: JDK-8282405: Make thread resource areas signal safe [v5]

Sat Mar 5 07:34:59 UTC 2022

On Thu, 3 Mar 2022 07:38:48 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> In the context of signal handlers, we may allocate RA memory. That is not ideal but may happen. One example is error reporting - even if we are careful, some code down the stack may use RA. Another example is code running in the context of AsyncGetCallTrace. I'm sure there may be more examples.
>> 
>> The problem is that the signal may (rarely) leave the current thread's RA in an inconsistent state, especially if it got interrupted in the middle of a chunk turnover. Subsequent allocations from it inside the signal handler then would malfunction.
>> 
>> A simple solution would be double buffering. Let each thread have a second resource area, to be used only in signal handling. At the entrance of the hotspot signal handler (which everyone goes through, even in chain scenarios like with AsyncGetCallTrace) we would switch over to the secondary resource area, and switch back when leaving the hotspot signal handler.
>> 
>> Note that I proposed this on hs-runtime-dev [1] but I am actually not sure if the mailing lists work, since I did not see that mail delivered to subscribers. Therefore I went ahead and implemented a simple prototype. 
>> 
>> The prototype keeps matters simple:
>> - we just use two resource areas: the normal one and an alternate one for signal handling. So we don't handle recursive calls to signal handlers, see comment in signals_posix.cpp.
>> - we preallocate both resource area at thread creation time. For the pros and cons of pre-allocating them vs creating them on demand, and possible further improvements, pls see [1].
>> 
>> Tests:
>> - SAP nightlies
>> - GHAs
>> - I tested this manually by corrupting the resource area of a thread, then faulting, and inside the signal handler, I was able to use the secondary resource area as expected.
>> - Automated tests are somewhat more difficult, akin to the existing SafeFetchInErrorHandlerTest. I am not sure if its worth the complexity.
>> 
>> [1] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-February/054126.html
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
> 
>   ACGT: guard the correct (current) thread

Hi Kim, thanks for looking at this. Answers below.

> It seems like this perhaps narrows but does not fix the problem.
> 
> What if we're in the Chunk allocator when a signal is received, and the signal handler allocates enough that it needs another chunk? The Chunk allocator uses ThreadCritical to protect it's critical regions, which does not make it re-entrant either.
> 
> And if the Chunk allocator finds the associated ChunkPool free-list to be empty, then we need to go to malloc, which has the same problem of not being async-signal safe.
> 
> I'm also not thrilled by the idea of devoting 1K (or more, if we discover 1K isn't enough for all the allocations we might need) for each thread for this purpose. That seems like quite a bit of memory being (mostly) wasted.
> 
> So basically I'm not convinced the proposed design is right yet.
> 
> One thing I was wondering (because I don't know the answer) is whether there is any bound on the number of in-flight signals that might need to be handled at the same time by different threads.

There are some ideas I have to address your concerns:

1) Let's arenas not allocate a chunk up front, but only on first use. I never liked that upfront allocation anyway, it did not make much sense to me. We pay for what we may never need. (I always suspected that the small initial chunk size was a workaround for the unconditional allocation costs and that if we delay allocation all chunks could be uni-sized).

By delaying the first chunk allocation, we prevent memory from being wasted if the secondary RA is not being used. All we would pay is the base cost for sizeof(ResourceArea) in Thread, 40 bytes. And that could be reduced too if we were to groom Arena a bit, but that's a different story.

2) Don't use ChunkPool, for the secondary RA. Allocate the chunks directly via malloc.

That would mean we don't need to pull a central lock. We still use malloc() if someone uses RA inside signal handling - it is used on the first allocation, and if subsequent allocations expand the secondary RA.

Side note, I am curious if ChunkPool is even needed nowadays. I often find that the libc keeps returned memory close by. Re-acquisition via malloc is fast and often handled thread-local without synchronization. Especially compared to ChunkPool, where we 1) always pull a central lock and 2) free in a different thread from where the memory was malloced, which may not be ideal. So maybe removing ChunkPool would be okay in general.

3) Maybe don't release the chunks in the secondary RA on stack unwinding. Keep them locally around in the secondary RA in a freelist.

That reduces follow-up mallocs. We now only call malloc for the first dive down the stack.

4) Maybe don't allow expanding the secondary RA at all. Instead, make the first chunk - that we allocate on-demand only - large enough to handle what we think is reasonable for signal handling and during ACGT. Let's say 64K or so.

That reduces the number of dangerous malloc calls to exactly one: the initial large chunk. We keep that in the secondary RA and never release it.

The idea behind (4) is that we don't expect a ton of RA memory to be allocated in signal handling or in ACGT. We only expect small utility functions to be called, that don't need much space.

A variant of (4) could be to place the one big chunk we allow the secondary RA to have on the thread stack instead of using malloc(). Basically, we would use ResourceArea as an interface for normal stack memory allocated in the frame above us where we switched to the secondary RA. Unfortunately, this would reduce its usefulness for error reporting, which has to work in low-stack scenarios. But whoever calls ACGT at least could make sure it gets called on a large enough stack.

5) If we use C-heap for the one big chunk, and are unhappy with this one remaining malloc: Pre-allocate the initial large chunk for the secondary RA. Since that would again increase the per-thread costs, just preallocate a bunch of chunks (2-4?) at VM start - whatever parallel use we expect - and keep them around in a global freelist. That list would have to be lock-free but only needs to implement take, not return, since we don't return the chunks.

The idea behind this is that we don't need this for every thread, we don't expect a ton of threads to use RA during signal handling. We may be able to just use this for a bunch of threads (the one (?) that does ACGT sampling, and maybe the one that crashes and writes a hs-err file)

I admit (5) is not really fleshed out yet in my head. But even with (1)..(4) we could have a solution that is a lot more signal safe than what we do today.

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/7624