RFR: 8324890: C2 SuperWord: refactor out VLoop, make unrolling_analysis static, remove init/reset mechanism [v4]
Emanuel Peter
epeter at openjdk.org
Sun Feb 4 15:18:06 UTC 2024
On Fri, 2 Feb 2024 10:19:36 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Subtask of https://github.com/openjdk/jdk/pull/16620
>> (The basic goal is to break SuperWord into different modules. This makes the code more maintainable and extensible. And eventually this allows some modules to be reused by other/new vectorizers.)
>>
>> 1. Move out the shared code between `SuperWord::SLP_extract` (where we do vectorization) and `SuperWord::unrolling_analysis`, and move it to a new class `VLoop`. This allows us to decouple `unrolling_analysis` from the SuperWord object, and we can make it static.
>> 2. So far, SuperWord was reused for all loops in a compilation, and then "reset" (with `SuperWord::init`) for every loop. This is a bit of a nasty pattern. I now make a new `VLoop` and a new `SuperWord` object per loop.
>> 3. Since we now make more `SuperWord` objects, we allocate the internal data structures more often. Therefore, I now pre-allocate/reserve sufficient space on initialization.
>>
>> Side-note about https://github.com/openjdk/jdk/pull/17604 (already integrated, no need to read further):
>> I would like to remove the use of `SuperWord::is_marked_reduction` from `SuperWord::unrolling_analysis`. For starters: it is not clear what it was ever good for. Second: it requires us to do reduction marking/analysis before `unrolling_analysis`, and hence makes the reduction marking shared between `unrolling_analysis` and vectorization. I could move the reduction marking to `VLoop` now. But the `_loop_reductions` set would have to be put on an arena, and I would like to avoid creating an arena for the `unrolling_analysis`. Plus, it would just be nicer code to have reduction analysis together with body analysis, type analysis, etc., and all of them only in `SLP_extract`.
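To make point 1 above concrete, here is a rough sketch of the shape of the refactoring (member names and signatures are my assumptions for illustration, not copied from the patch):

```c++
// Sketch only: names/signatures assumed for illustration.
// VLoop owns the state shared by vectorization and unrolling analysis.
class VLoop : public StackObj {
private:
  IdealLoopTree*   _lpt; // the loop we analyze
  CountedLoopNode* _cl;  // its counted-loop head
public:
  VLoop(IdealLoopTree* lpt) :
    _lpt(lpt),
    _cl(lpt->_head->as_CountedLoop()) {}

  IdealLoopTree*   lpt() const { return _lpt; }
  CountedLoopNode* cl()  const { return _cl; }
};

class SuperWord {
public:
  // All loop state comes in through the VLoop, so no SuperWord
  // instance (and no init/reset) is needed for unrolling analysis.
  static void unrolling_analysis(const VLoop& vloop,
                                 int& local_loop_unroll_factor);
};
```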
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
> timing code from JDK-8325159
I think the only really big array is `_bb_idx`, which maps a node's `_idx` to its `bb_idx`, basically so that we get smaller, denser indices and can address into more compact arrays. So we need one array that can be as large as the node count (which can be many thousands, hence often more than `32K` bytes), while all other arrays only need to be the size of the loop body.
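For illustration, the indirection looks roughly like this (hypothetical names; only the idea matters):

```c++
// Sketch (hypothetical names): _bb_idx is the one array sized by the
// global node count; it maps a sparse node->_idx to a dense index in
// [0, _body.length()), so all other per-node tables stay compact.
GrowableArray<int>   _bb_idx; // length == C->unique(): the big one
GrowableArray<Node*> _body;   // the loop-body nodes, in order
GrowableArray<int>   _depth;  // length == _body.length(): compact

int bb_idx(const Node* n) const { return _bb_idx.at(n->_idx); }
int depth(const Node* n)  const { return _depth.at(bb_idx(n)); }
```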
But does using CHeap directly really improve the situation? That would just mean we malloc directly instead of going through the Arena, right? It would not make a difference for memory fragmentation, or am I wrong?
But I think it might be a good idea to at least have the large `_bb_idx` array allocated in `Compile`, and we can decide whether that is through the `comp_arena` or directly with `CHeap`. That would at least mean we only pay "one unit of fragmentation" per compilation, rather than per SuperWord loop or per pass over all loops.
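A sketch of what such per-compilation ownership could look like (the member and accessor are hypothetical; `comp_arena()`, `unique()` and `NEW_ARENA_ARRAY` are existing HotSpot facilities):

```c++
// Sketch (hypothetical): Compile hands out one shared buffer that all
// SuperWord/VLoop instances of this compilation reuse, so we pay for
// the large allocation once per compilation, not once per loop.
int* Compile::shared_bb_idx() {
  if (_shared_bb_idx == nullptr || _shared_bb_idx_len < unique()) {
    _shared_bb_idx_len = unique();
    // Alternatively CHeap (mtCompiler) with an explicit free at the
    // end of the compilation; either way: one unit of fragmentation.
    _shared_bb_idx = NEW_ARENA_ARRAY(comp_arena(), int, _shared_bb_idx_len);
  }
  return _shared_bb_idx;
}
```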
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17624#issuecomment-1925789790