RFR: 8326962: C2 SuperWord: cache VPointer

Christian Hagedorn chagedorn at openjdk.org
Tue Apr 2 14:11:11 UTC 2024


On Tue, 2 Apr 2024 09:04:45 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> This is a subtask of [JDK-8315361](https://bugs.openjdk.org/browse/JDK-8315361).
> 
> Parsing `VPointer` currently happens all over SuperWord. And often in quadratic loops, where we compare all-with-all loads/stores.
> 
> I propose to cache the `VPointer`s, then we can do a constant-time cache lookup rather than parsing the pointer subgraph every time.
> 
> There are now only a few cases where we cannot use the cached `VPointer`:
> - `SuperWord::unrolling_analysis`: we have no `VLoopAnalyzer`, and so no submodules like `VLoopPointers`. We don't need to cache, since we only iterate over the loop body once, and create only a single `VPointer` per memop.
> - `SuperWord::output`: when we have a `Load`, and try to bypass `StoreVector` nodes. The `StoreVector` nodes are new, and so we have no cached `VPointer` for them. This could be fixed somehow, but I don't want to deal with it now. I intend to refactor `SuperWord::output` soon, and can look into options at that point (either I bypass before we insert the vector nodes, or I remember what scalar memop the vector was created from, and then get the cached pointer this way).
> 
> This changeset is also a preparation step for [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I will have a list of pointers, and sort them such that creating adjacent refs is much more efficient.
> 
> **Benchmarking SuperWord Compile Time**
> 
> I use the same benchmark from https://github.com/openjdk/jdk/pull/18532.
> 
> On master:
> 
>     C2 Compile Time:       56.816 s
>          IdealLoop:            56.604 s
>            AutoVectorize:      56.192 s
> 
> 
> With this patch:
> 
>     C2 Compile Time:       49.719 s
>          IdealLoop:            49.509 s
>            AutoVectorize:      49.106 s
> 
> 
> This saves us about `7 sec`, which is significant. I will have to see what effect it has once we also apply https://github.com/openjdk/jdk/pull/18532, but I expect the combined effect to be substantial.

That's a nice improvement, and it makes sense to just compute them once and re-use them. I only have a few comments, but it generally looks good!
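
To make the caching idea concrete: the cache boils down to an arena-allocated array of `VPointer` plus a mapping from each memop's loop-body index to its slot in that array. A minimal sketch of that state (the names follow the fields that appear in the review comments below; the exact types are my assumption):

    // Sketch only, not the actual patch:
    VPointer*          _pointers;          // arena-allocated array, one VPointer per cached memop
    GrowableArray<int> _bb_idx_to_pointer; // loop-body index -> index into _pointers (-1 if none)

A lookup is then just two array accesses, `_pointers[_bb_idx_to_pointer.at(_body.bb_idx(mem))]`, instead of re-parsing the pointer subgraph every time.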

src/hotspot/share/opto/superword.hpp line 498:

> 496:   }
> 497: 
> 498:   // VLoopDependencyGraph Accessors

Suggestion:

  // VLoopDependencyGraph accessors

src/hotspot/share/opto/vectorization.cpp line 184:

> 182: }
> 183: 
> 184: void VLoopPointers::compute_and_cache() {

Could be split into something like:

    allocate_pointer_memory();
    initialize_pointers();
    trace_pointers();

where allocate_pointer_memory() does:

    int number_of_pointers = compute_number_of_pointers();
    uint bytes = number_of_pointers * sizeof(VPointer);
    _pointers = (VPointer*)_arena->Amalloc(bytes);

src/hotspot/share/opto/vectorization.cpp line 214:

> 212:   int bb_idx = _body.bb_idx(mem);
> 213:   int pointers_idx = _bb_idx_to_pointer.at(bb_idx);
> 214:   assert(pointers_idx >= 0, "mem node must have a cached pointer");

Should we also assert here that `pointers_idx` is within the array range? You could cache the length of the `_pointers` array when you allocate/initialize it above in `compute_and_cache()`.
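
For example, with a field (name hypothetical) that remembers the allocated length:

    // Suggested stronger assert: the index must also be within the allocated array.
    assert(pointers_idx >= 0 && pointers_idx < _num_pointers,
           "mem node must have a cached pointer within range");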

src/hotspot/share/opto/vectorization.cpp line 224:

> 222:   for (int i = 0; i < _body.body().length(); i++) {
> 223:     MemNode* mem = _body.body().at(i)->isa_Mem();
> 224:     if (mem != nullptr && _vloop.in_bb(mem)) {

I see that you use this pattern twice. Maybe we could provide a `for_each_mem(lambda)` in `VLoopBody`? But that could also be done separately.
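
A rough sketch of what such a helper could look like, assuming `VLoopBody` keeps a reference to the `VLoop` (so it can call `in_bb`) and passes the memop plus its body index to the callback:

    // Hypothetical helper inside VLoopBody: iterate over all memops in the loop.
    template<typename Callback>
    void for_each_mem(Callback callback) const {
      for (int i = 0; i < body().length(); i++) {
        MemNode* mem = body().at(i)->isa_Mem();
        if (mem != nullptr && _vloop.in_bb(mem)) {
          callback(mem, i);
        }
      }
    }

The call sites in `VLoopPointers` would then shrink to a lambda, e.g. `_body.for_each_mem([&] (MemNode* mem, int bb_idx) { /* create or look up the VPointer */ });`.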

src/hotspot/share/opto/vectorization.hpp line 456:

> 454: // Submodule of VLoopAnalyzer.
> 455: // We compute and cache the VPointer for every load and store.
> 456: class VLoopPointers : public StackObj {

Nit: Should we call this `VLoopVPointers` to make the link to `VPointer` clear, and not just to some generic pointers?

src/hotspot/share/opto/vectorization.hpp line 462:

> 460:   const VLoopBody&         _body;
> 461: 
> 462:   // Array of cached pointers

Maybe make a note that we allocate/cache them lazily upon request.

-------------

PR Review: https://git.openjdk.org/jdk/pull/18577#pullrequestreview-1973894052
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547931473
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547939073
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547946451
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547952871
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547958314
PR Review Comment: https://git.openjdk.org/jdk/pull/18577#discussion_r1547960511

