RFR: 8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism [v4]

Sat Feb 10 14:21:11 UTC 2024

On Sat, 3 Feb 2024 20:33:53 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> @vnkozlov is there a way to measure memory fragmentation? I don't know how to answer that question.
>> And is this really different from how it was done before? Previously, we put everything on the `comp_arena` and never recovered the memory until that arena was released. Now we have a new `autovectorization_arena` for each AutoVectorization pass over all loops. I guess there could be multiple such passes in a single compilation. Before this patch, each such pass created its own SuperWord object and allocated memory on the `comp_arena`, so the memory usage of all these passes added up. With this patch, we can at least release the memory after every pass.
>> 
>> Of course this is only helpful if the malloc/free'd chunks can properly be reused.
>> I think that chunks can be properly reused: that is what the `ChunkPool::take_from_pool/return_to_pool` are for. They are called from `ChunkPool::allocate_chunk/deallocate_chunk`, if there is a pool for the requested chunk-size/length.
>> 
>> I did some `rr` debugging. And I see that the first (`init_size`) chunk is allocated from a pool. But subsequent chunks (`grow`) have "non-standard" length, and are malloc/free'd. The reason is that `get_pool_for_size` does an exact length comparison, and a grow of arbitrary length will match one of the pre-defined lengths only with very low probability. I suppose that could eventually lead to fragmentation. We quickly get "non-standard" lengths when, for example, we pre-allocate a specific size that is not a power of 2.
>> 
>> I wonder if we could round up the chunk size to the next bigger size for which we have a pool. Of course, this would mean some padding in the chunks, but if they are short-lived, at least the whole memory could be reclaimed. @jdksjolen you have done some work on Arenas. Do you have any wisdom to offer here?
>> 
>> An alternative: we could put the `autovectorization_arena` in `Compile`. That way, the chunks are kept until the end of the compilation and can be reused across the different AutoVectorization passes/phases.
>
>> And I see that the first (init_size) chunk is allocated from a pool. But subsequent chunks (grow) have "non-standard" length, and are malloc/free'd.
> 
> Yes, that was my concern.
> 
> There are chunks with different sizes: [arena.hpp#L66](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/memory/arena.hpp#L66). Are your allocation sizes larger than 32*K (`Chunk::size`, the "Default size of an Arena chunk")? `Arena::grow()` uses `MAX2(ARENA_ALIGN(x), (size_t) Chunk::size);`.
> Which of the SuperWord allocations are big? Can we split them to fit into 32K?
> 
> I think this should not stop you from doing this refactoring. Yes, it will allow returning memory sooner, and it is up to the OS how to optimize that. I read your offline discussion with Johan. He has an interesting suggestion for growable arrays (use the C heap).
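To make the pool-matching question in the quoted comments concrete, here is a minimal C++ model of the two mechanisms being discussed: the grow length computed as `MAX2(ARENA_ALIGN(x), Chunk::size)` and an exact-length pool lookup in the spirit of `get_pool_for_size`. It is a sketch, not HotSpot code: the constants are round-number approximations of the arena.hpp size classes, the 8-byte alignment is assumed, and `grow_length` and `pool_index_exact` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative constants (assumptions): loosely modeled on arena.hpp, where the
// pooled chunk lengths sit slightly below these round numbers.
constexpr std::size_t K = 1024;
constexpr std::size_t kPooledLengths[] = {32 * K, 10 * K, 1 * K, 256};
constexpr std::size_t kNumPools = sizeof(kPooledLengths) / sizeof(kPooledLengths[0]);
constexpr std::size_t kChunkSize = 32 * K;  // stands in for Chunk::size
constexpr std::size_t kArenaAlign = 8;      // assumed ARENA_ALIGN granularity

// Round x up to the assumed arena alignment.
constexpr std::size_t arena_align(std::size_t x) {
  return (x + kArenaAlign - 1) & ~(kArenaAlign - 1);
}

// Length of the chunk requested when the arena must grow for an allocation
// of x bytes, mirroring MAX2(ARENA_ALIGN(x), (size_t) Chunk::size).
constexpr std::size_t grow_length(std::size_t x) {
  return std::max(arena_align(x), kChunkSize);
}

// Exact-length pool lookup: returns the pool index, or -1 when the length is
// "non-standard", in which case the chunk is malloc'd and later free'd.
int pool_index_exact(std::size_t length) {
  for (std::size_t i = 0; i < kNumPools; i++) {
    if (kPooledLengths[i] == length) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

For example, `grow_length(100)` and `grow_length(20 * K)` both produce the pooled 32K length, while `grow_length(40000)` produces 40000, which matches no pool. Under these assumptions, a grow chunk is "non-standard" only when a single allocation exceeds `Chunk::size`, which is what the question about allocation sizes is probing.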

Thanks @vnkozlov @rwestrel for the review!
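The round-up idea from the quoted discussion can be sketched the same way. This is a hypothetical C++ illustration, not a proposed patch: `round_up_to_pool`, `PoolChoice` and the size-class table are invented here. The first four classes stand in for the existing pooled lengths; the classes above 32K are assumptions added because, if the largest pool is `Chunk::size`, a larger grow chunk has no bigger pool to round up to.

```cpp
#include <cstddef>

// Hypothetical size classes: the first four stand in for the existing pooled
// lengths; the larger ones are invented for this sketch.
constexpr std::size_t K = 1024;
constexpr std::size_t kSizeClasses[] = {256, 1 * K, 10 * K, 32 * K, 64 * K, 128 * K, 256 * K};
constexpr std::size_t kNumClasses = sizeof(kSizeClasses) / sizeof(kSizeClasses[0]);

struct PoolChoice {
  int pool_index;            // -1 if no size class is large enough
  std::size_t chunk_length;  // length actually allocated
  std::size_t padding;       // bytes wasted by rounding up
};

// Round a requested chunk length up to the smallest size class that fits,
// so that short-lived chunks can be returned to (and reused from) a pool.
PoolChoice round_up_to_pool(std::size_t length) {
  for (std::size_t i = 0; i < kNumClasses; i++) {
    if (length <= kSizeClasses[i]) {
      return PoolChoice{static_cast<int>(i), kSizeClasses[i], kSizeClasses[i] - length};
    }
  }
  // Larger than every class: fall back to an exact-length malloc/free.
  return PoolChoice{-1, length, 0};
}
```

The trade-off is the one named in the comment: padding is accepted so that short-lived chunks go back to a pool instead of being freed. With doubling classes above 32K, the padding can approach half of the chunk (a request of 32K + 1 bytes lands in the 64K class).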

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17624#issuecomment-1937018349