RFR: 8301403: Eliminate memory allocations in JVMFlag::printFlags during signal handling [v3]

Thomas Stuefe stuefe at openjdk.org
Thu Jul 18 05:37:37 UTC 2024


On Wed, 17 Jul 2024 19:53:56 GMT, Gerard Ziemski <gziemski at openjdk.org> wrote:

>> Allocating memory while handling an error should be avoided, however `JVMFlag::printFlags()` is used by crash log and it allocates memory to print flags sorted (using qsort).
>> 
>> We avoid the memory allocation by using a simple in-place algorithm that uses 1 unused bit of JVMFlag's private `Flags` enum data member. It is an O(n^2) algorithm, compared to O(n*log(n)) for qsort; however, it is called while writing an error log, so speed is not paramount here. Also, I measured the real impact on a simple test case and actually observed a performance improvement of about 2.5x (2,885,973 ns for in-place ordered printing vs 7,389,456 ns for qsort). This is because we previously had to qsort all flags even though only a fraction of them actually end up being printed.
>> 
>> Running MACH5 tests...
>
> Gerard Ziemski has updated the pull request incrementally with one additional commit since the last revision:
> 
>   remove unused header

I still dislike the misuse of the enum. Among other things, this is not thread-safe, and the flag meta information should be constant after JVM initialization. It just feels wrong.

It's also unnecessary. How about this:
- the number of flags is predictable and sits at ~1500, right?
- use a stack-allocated bitmap with 1 bit per flag. 1500 bits is only ~188 bytes, which you can easily afford via stack allocation. If you don't have ~188 bytes left on the stack, you have other problems.
- Use Kim's `BitMap` - in fact, use `BitMapView`, the variant that works with externally supplied memory. Let that memory be a raw array allocated on the stack, so you can be sure you don't malloc (see the sketch after this list)
- Clearing is now just a memset of said raw array
- it is thread-safe, as the bitmap sits on the thread's stack
- no need to piggy-back on the enum
- it should even be a tiny bit faster, since you exchange bit operations on spread-out words (JVMFlag is 32 bytes, so the flag info is spread over ~750 cache lines) for operations on a tight bit array that fits completely into ~3 cache lines.
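
Roughly what I have in mind - an untested sketch only. I am writing the JVMFlag accessors (`flags[]`, `numFlags`, `name()`, `print_on()`), the `MAX_FLAGS` bound and the helper name from memory, so treat those as placeholders rather than the exact current API; the `BitMapView`/`BitMap` calls are the real ones from `utilities/bitMap.hpp`:

```
#include "runtime/flags/jvmFlag.hpp"
#include "utilities/bitMap.hpp"
#include "utilities/globalDefinitions.hpp"
#include "utilities/ostream.hpp"

// Untested sketch: stack-backed BitMapView instead of piggy-backing on the
// Flags enum. Flag filtering (e.g. skipping defaults) omitted for brevity.
static void print_flags_sorted_without_malloc(outputStream* st) {
  const int num = (int)JVMFlag::numFlags;          // ~1500 today (placeholder accessor)
  const int MAX_FLAGS = 2048;                      // generous upper bound
  assert(num <= MAX_FLAGS, "grow the stack buffer");

  // Backing words live on the stack; clearing is just this initializer
  // (or a memset). No malloc anywhere, so this is safe in the error handler.
  BitMap::bm_word_t storage[MAX_FLAGS / BitsPerWord] = { 0 };
  BitMapView printed(storage, num);

  // O(n^2) selection-style print: each outer pass finds the alphabetically
  // smallest flag not yet printed, prints it, and marks its bit.
  for (int done = 0; done < num; done++) {
    int best = -1;
    for (int i = 0; i < num; i++) {
      if (printed.at(i)) continue;
      if (best < 0 || strcmp(JVMFlag::flags[i].name(), JVMFlag::flags[best].name()) < 0) {
        best = i;
      }
    }
    if (best < 0) break;                           // nothing left to print
    JVMFlag::flags[best].print_on(st);             // placeholder printing call
    printed.set_bit(best);
  }
}
```

The raw array is sized for `MAX_FLAGS` while the view only covers `num` bits, so adding a handful of flags later does not require touching the buffer; only the assert would ever fire.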

> I checked the performance when we print all the flags. With qsort it takes 21,882,569 ns and with my fix it takes 25,426,861 ns, which is a slowdown of about 1.16x.
> Personally I would keep the code simpler and have just one code path. What do you think?

Okay. I wonder why this is. qsort should be a lot faster than O(n^2). 
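
Back-of-the-envelope: with n at roughly 1500 flags, an O(n^2) pass is on the order of 1500^2 ≈ 2.25 million name comparisons, while n*log2(n) is only about 16,000, so the comparison counts alone differ by two orders of magnitude, which makes a measured gap of only 1.16x surprising.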

> > The next spot to look at with very raised eyebrows would be the elf decoder used to print stack frames ;)
> 
> David said "The Decoder was flagged as potentially problematic due to allocations when it was added. It was considered the benefit outweighed the risk." in [JDK-8301403](https://bugs.openjdk.org/browse/JDK-8301403)
> 
> The next spot actually looks to be NMT - see [NMT: Eliminate memory allocations during signal handling](https://bugs.openjdk.org/browse/JDK-8336399)

I don't get the reasoning. Nobody talks about removing the decoder; I meant removing malloc from the decoder.

I consider the decoder a lot more important. If malloc deadlocks in the decoder, we get no stack. If malloc deadlocks in NMT, we get no NMT trace. As important as NMT is, the stack is a *lot* more important.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20202#issuecomment-2235484794

